Grounded Chess Reasoning - Coolwei AI Lab

Frontier LLMs can play chess, but they rarely explain their moves the way a player would. C1 is a 4B-parameter open model that does both: it solves 48.1% of theme-balanced chess puzzles in roughly 178 tokens per solution, beating its own teacher (Gemini-3-Flash, 40.8%) and every open-source model, while emitting roughly 18× to 68× fewer tokens than the reasoning-LLM baselines reported below. The recipe, Master Distillation, is pitched as a general way to bootstrap LLM reasoning in any domain where a deterministic expert system already exists.

Background

In domains where bespoke solvers already exist — chess engines, theorem provers, chemistry simulators — LLMs sit awkwardly between two failure modes. Pure neural chess players answer "which move?" but can't explain their choice. Pure LLMs explain confidently but invent illegal moves. The gap is grounded reasoning: answers a player would accept, written in language a player would write.

Prior work on chess + LLMs has tried to close this gap with chain-of-thought prompting on top of Stockfish move probabilities, or with bare-action models that win games but produce no human-readable rationale. Each approach delivers one half of the goal, but neither produces reasoning that is both grounded and readable.

Method

Master Distillation

Two specialists collaborate on the data. Stockfish at depth 24 supplies ground-truth principal variations. Gemini-3-Flash verbalizes them as natural-language chains of thought. The student model — Qwen3-4B-Instruct-2507 — sees only the verbalized traces. Stockfish is the source of truth; Gemini is the verbalizer; neither is "the teacher" by itself.
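The division of labour above can be sketched as a small data-building step. This is a minimal illustration, not the authors' pipeline: the `Engine` and `Verbalizer` interfaces, the prompt wording, and the `Answer:` line are all assumptions, and the stubs stand in for a real Stockfish call and a real Gemini call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: in practice `engine` would wrap Stockfish at depth 24
# and `verbalizer` would wrap a call to the verbalizing LLM.
Engine = Callable[[str], List[str]]            # FEN -> principal variation (UCI moves)
Verbalizer = Callable[[str, List[str]], str]   # (FEN, PV) -> natural-language reasoning


@dataclass
class TrainingExample:
    prompt: str   # what the student is asked
    target: str   # what the student is trained to produce


def build_example(fen: str, engine: Engine, verbalizer: Verbalizer) -> TrainingExample:
    """Turn one position into one SFT example for the student."""
    pv = engine(fen)                      # source of truth
    if not pv:
        raise ValueError("engine returned an empty principal variation")
    reasoning = verbalizer(fen, pv)       # verbalization of that truth
    prompt = f"Position (FEN): {fen}\nFind the best move and explain it."
    # The student sees only the verbalized reasoning and the final move,
    # never the raw engine line or its evaluation.
    target = f"{reasoning}\nAnswer: {pv[0]}"
    return TrainingExample(prompt=prompt, target=target)


# Stubs so the sketch runs without Stockfish or an LLM API.
def fake_engine(fen: str) -> List[str]:
    return ["d1d8"]


def fake_verbalizer(fen: str, pv: List[str]) -> str:
    return "The black king on g8 is boxed in by its own pawns, so a rook check on d8 ends the game."


example = build_example("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", fake_engine, fake_verbalizer)
print(example.target)
```

Passing the engine and verbalizer in as callables keeps the two roles separate, which mirrors the point that neither is "the teacher" by itself.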

Feigned Discovery Prompting

The teacher LLM is told to reason as if the answer were unknown, while covertly tracking the master trace. The prompt enforces length (4–10 sentences scaled to puzzle difficulty), explicit board-coordinate grounding, objective voice, and no leakage of engine scores or theme labels. The result reads like a player thinking through a position, not a post-hoc rationalization of an answer key.
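A prompt builder along these lines might look as follows. It is a sketch under stated assumptions: the source says only that length is scaled to puzzle difficulty, so the linear mapping from a rating in the range 600 to 2600 is invented for illustration, as are the prompt wording and the leak pattern. The check covers engine scores only; screening for theme labels is omitted.

```python
import re
from typing import Sequence

MIN_SENTENCES, MAX_SENTENCES = 4, 10


def target_sentences(rating: int, lo: int = 600, hi: int = 2600) -> int:
    """Map a puzzle rating linearly onto the 4-10 sentence range (clamped).

    The rating bounds are assumptions, not values from the source.
    """
    frac = (min(max(rating, lo), hi) - lo) / (hi - lo)
    return MIN_SENTENCES + round(frac * (MAX_SENTENCES - MIN_SENTENCES))


def build_prompt(fen: str, pv: Sequence[str], rating: int) -> str:
    """Ask the verbalizer to reason as if the answer were unknown."""
    n = target_sentences(rating)
    return (
        "You are analysing a chess position and do not yet know the best move.\n"
        f"Position (FEN): {fen}\n"
        f"[Hidden reference line, never quote or mention it: {' '.join(pv)}]\n"
        f"Write {n} sentences that reason toward the move. "
        "Name pieces and squares explicitly, use an objective voice, "
        "and do not mention engine evaluations or puzzle theme names."
    )


# Hypothetical post-filter: reject traces that mention the engine or a numeric score.
LEAK_PATTERN = re.compile(r"stockfish|engine|centipawn|\beval\b|[+-]\d+\.\d+", re.IGNORECASE)


def leaks_engine_info(text: str) -> bool:
    return bool(LEAK_PATTERN.search(text))


print(build_prompt("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", ["d1d8"], rating=1600))
```

A filter like `leaks_engine_info` would run on the verbalizer's output, since a prompt instruction alone does not guarantee that scores stay out of the trace.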

The teacher reasons as if it doesn't know the answer — and the student inherits the habit.

Two-stage training

The recipe is conventional but tightly tuned. Phase one is supervised fine-tuning (SFT) on the verbalized traces with theme-balanced sampling across openings, middlegames, endgames, and tactical motifs. Phase two is reinforcement learning with verifiable rewards (RLVR) using DAPO, a GRPO variant tuned for concise outputs (KL retained, overlong-reward shaping removed). Skipping phase one and going straight to RL barely works: SFT seeds the capability that RLVR then sharpens.
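What makes the second phase "verifiable" is that a completion can be scored mechanically against the known solution. The sketch below shows one plausible binary reward; the `Answer: <uci move>` output format and the exact-match rule are assumptions, and the actual reward used for C1 may differ.

```python
import re
from typing import Optional

# Assumed output convention: the completion ends with "Answer: <move in UCI>".
ANSWER_RE = re.compile(r"Answer:\s*([a-h][1-8][a-h][1-8][qrbn]?)\s*$")


def extract_move(completion: str) -> Optional[str]:
    """Return the final UCI move if the completion ends with an answer line."""
    match = ANSWER_RE.search(completion.strip())
    return match.group(1) if match else None


def verifiable_reward(completion: str, solution_move: str) -> float:
    """1.0 if the stated move equals the reference move, else 0.0."""
    return 1.0 if extract_move(completion) == solution_move else 0.0


good = "The back rank is undefended, so the rook invades.\nAnswer: d1d8"
bad = "Pushing the h-pawn gives the king air.\nAnswer: h2h3"
print(verifiable_reward(good, "d1d8"), verifiable_reward(bad, "d1d8"))  # 1.0 0.0
```

A completion with no parseable answer line scores 0.0, so malformed outputs are penalized the same way as wrong moves.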

Results

On a theme-balanced puzzle suite, C1-4B solves 48.1% of positions in an average of 178 tokens. That beats the Gemini-3-Flash teacher (40.8%) and every open-source baseline; it trails only the top proprietary frontier systems, which spend thousands of tokens per solution. The story is not "SOTA on chess"; it's "best by far at a given budget."

Model	Accuracy	Avg. tokens
C1-4B (ours)	48.1%	178
Gemini-3-Flash (verbalizer / "teacher")	40.8%	—
GPT-5	85.2%	12,193
Gemini-3-Pro	78.2%	3,182
DeepSeek-Chat-v3.1	20.0%	11,249

Theme-balanced puzzle accuracy and average output length. C1-4B is roughly 68× more compact than GPT-5 (12,193 vs. 178 tokens) and 18× more compact than Gemini-3-Pro (3,182 vs. 178 tokens) per solution; an 8B variant exists but is omitted from this table.
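The compactness ratios follow directly from the token counts in the table. A quick check, using only the figures reported above (Gemini-3-Flash is left out because no token count is given for it):

```python
# (accuracy %, average output tokens) as reported in the table
results = {
    "C1-4B": (48.1, 178),
    "GPT-5": (85.2, 12193),
    "Gemini-3-Pro": (78.2, 3182),
    "DeepSeek-Chat-v3.1": (20.0, 11249),
}


def token_ratio(model: str, baseline: str = "C1-4B") -> float:
    """How many times more tokens `model` emits than `baseline`."""
    return results[model][1] / results[baseline][1]


for name in ("GPT-5", "Gemini-3-Pro", "DeepSeek-Chat-v3.1"):
    print(f"{name}: {token_ratio(name):.1f}x the tokens of C1-4B")
# GPT-5: 68.5x the tokens of C1-4B
# Gemini-3-Pro: 17.9x the tokens of C1-4B
# DeepSeek-Chat-v3.1: 63.2x the tokens of C1-4B
```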

What we found

+7.2 pp from RLVR over SFT-only (40.9% → 48.1%) — the gain from the second stage is real but only available if the first stage seeds it.
A student can surpass its verbalizer. Distillation isn't capability compression here; it's capability transfer plus reinforcement.
Theme-balanced sampling matters. Without it, RL overfits a narrow set of tactical motifs.
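Theme-balanced sampling can be sketched as a round-robin over themes, so a rare motif is drawn as often as a common one. This is an illustration only: it assumes each puzzle carries a single `theme` label and samples with replacement within a theme, neither of which the source specifies.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def theme_balanced_sample(puzzles: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Draw k puzzles so that every theme is (nearly) equally represented."""
    rng = random.Random(seed)
    by_theme = defaultdict(list)
    for puzzle in puzzles:
        by_theme[puzzle["theme"]].append(puzzle)
    themes = sorted(by_theme)
    sample = []
    for i in range(k):
        theme = themes[i % len(themes)]              # round-robin over themes
        sample.append(rng.choice(by_theme[theme]))   # with replacement within a theme
    rng.shuffle(sample)                              # avoid a fixed theme order in batches
    return sample


# A skewed pool: 90 fork puzzles, 10 endgame puzzles.
pool = [{"id": i, "theme": "fork"} for i in range(90)]
pool += [{"id": 90 + i, "theme": "endgame"} for i in range(10)]

batch = theme_balanced_sample(pool, k=20)
print(Counter(p["theme"] for p in batch))  # 10 of each theme despite the 9:1 pool
```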

Why it matters

A general recipe, demonstrated on chess. Anywhere there's a deterministic expert system — theorem provers, protein design, medical decision support — Master Distillation gives a path to compact, explainable LLM reasoning. Chess happens to be a clean testbed; the method is the contribution.

A way to unlock RLVR when the base model is too weak. SFT on verbalized expert traces seeds enough capability that RLVR has a signal to climb. Skip the SFT and RL has nothing to amplify — that's the most actionable finding for anyone doing RL with verifiable rewards.
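One way to see why RL has nothing to amplify is through the group-relative advantage used in GRPO-style methods: each sampled answer is scored against the mean reward of its own group. The sketch below illustrates that general mechanism and is not the authors' analysis; it uses the population standard deviation, and implementations differ on that detail.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A weak base model: every rollout for this puzzle is wrong.
weak_group = [0.0, 0.0, 0.0, 0.0]
# An SFT-seeded model: one rollout in four finds the move.
seeded_group = [1.0, 0.0, 0.0, 0.0]

print(group_advantages(weak_group))    # all zeros: no learning signal
print(group_advantages(seeded_group))  # positive for the correct rollout, negative for the rest
```

When every rollout in a group earns the same reward, all advantages are zero and the policy update for that puzzle vanishes. A model that almost never solves a puzzle produces mostly such groups, which is consistent with the observation that SFT has to seed the capability first.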

Compact and cheap, on purpose. A 4B open model is not the SOTA on chess. But it is the SOTA given a budget, and "given a budget" is what most production deployments actually look like.