
Grounded Chess Reasoning

A chess engine is like a master craftsman: flawless work, but unable to teach. Master Distillation gets a 4B model to both play correctly and explain clearly.

arXiv GitHub
4B Qwen3 base
48.1% puzzle accuracy
+7.2 pp RLVR over SFT
~178 tokens per solution
~100× more compact than GPT-5

Chess engines have outplayed world champions for years, but ask one why a move is best and you get an evaluation number. Human coaches can explain; large language models are articulate but blunder on concrete positions. Grounded Chess Reasoning studies exactly that gap: can a 4B model learn to be right like the engine and explain like a coach?

Engines play well but cannot explain

The gap is not unique to chess. Many professional fields rely on solvers, programs that produce guaranteed-correct answers but cannot teach: the answer is reliable, yet there is no human-readable justification behind it. Such a system resembles a master craftsman whose work is flawless but who cannot say why he does what he does; an apprentice can watch for ten years and still not learn the trade. Translating that deterministic expertise into reasoning a person can read and a model can learn from is a fairly general problem.

Chess happens to be an ideal place to study it. Puzzles have a single best first move that an engine can verify instantly; at the same time, chess reasoning is unfriendly territory for large models — applying RLVR directly (reinforcement learning with verifiable rewards — scoring only whether the final answer is right) barely gets training off the ground. That combination of difficulty and verifiability makes chess a fitting testbed for “distill the expert process first, then reinforce it with correctness.”

Master Distillation: translating expert knowledge into reasoning

Master Distillation combines two kinds of systems. Stockfish (the strongest open-source chess engine, far beyond human champions) supplies deterministic ground truth — which move is best. Gemini-3-Flash verbalizes the engine’s judgment into natural-language reasoning traces. A 4B-parameter student model, C1, then learns from those traces.

Training has two stages. The first performs supervised fine-tuning (SFT), which teaches the model directly from demonstration text, on the verbalized expert traces; the second applies RLVR, rewarding the model only when its final answer is verified correct. The order matters: skipping straight to reinforcement learning barely works, because a base model that is too weak almost never earns a reward and there is no signal to amplify (the so-called cold-start problem). SFT seeds the capability; RLVR then amplifies it.
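
As a rough illustration of what a verifiable reward can look like, the sketch below returns 1.0 when the last move named in a solution matches the engine's best move and 0.0 otherwise. The answer format (a line of the form "Best move: f3e5"), the function name verifiable_reward, and the use of UCI move notation are assumptions made for this example; the source does not describe how C1's answers are parsed or scored.

```python
import re

# Hypothetical answer format (an assumption): the solution states its answer
# as "Best move: <uci>", where a UCI move looks like e2e4 or e7e8q.
_MOVE_RE = re.compile(r"best move:\s*([a-h][1-8][a-h][1-8][qrbn]?)", re.IGNORECASE)


def verifiable_reward(solution_text: str, engine_best_move: str) -> float:
    """Return 1.0 if the last stated move equals the engine's move, else 0.0."""
    matches = _MOVE_RE.findall(solution_text)
    if not matches:
        # No parseable answer counts as wrong.
        return 0.0
    return 1.0 if matches[-1].lower() == engine_best_move.lower() else 0.0


if __name__ == "__main__":
    text = "The knight jump attacks king and queen at once.\nBest move: f3e5"
    print(verifiable_reward(text, "f3e5"))
```

A weak base model would receive 0.0 from a function like this on almost every sample, which is the missing signal behind the cold-start problem described above.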

The composition of the training data matters too. Tactical themes in chess puzzles are naturally imbalanced, and common motifs would drown out rare ones; the paper uses theme-balanced sampling (Algorithm 1) to keep rare themes represented in the training set.

Algorithm 1 Theme-Balanced Data Sampling
Require: Dataset D where each puzzle p has theme set T(p)
Require: Number of rare themes to balance K
Require: Maximum samples per theme M
Ensure: Balanced subset D_bal
  1. Compute theme frequencies: f(t) ← |{p ∈ D : t ∈ T(p)}| for all themes t
  2. Select rare themes: T_rare ← arg min_K f(t) (the K themes with the smallest frequency)
  3. Initialize selected IDs: S ← ∅
  4. Initialize output: D_bal ← ∅
  5. for each theme t ∈ T_rare do
  6.     C_t ← {p ∈ D : t ∈ T(p) ∧ id(p) ∉ S}
  7.     Sample min(M, |C_t|) puzzles from C_t without replacement
  8.     D_bal ← D_bal ∪ sampled puzzles
  9.     S ← S ∪ {id(p) : p ∈ sampled puzzles}
  10. end for
  11. return D_bal
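
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the puzzle representation (a dict with "id" and "themes" keys), the function name, the seed parameter, and the alphabetical tie-breaking between equally rare themes are all assumptions.

```python
import random
from collections import Counter


def theme_balanced_sample(dataset, k, m, seed=0):
    """Sample up to m puzzles for each of the k rarest themes.

    dataset: list of dicts with 'id' and 'themes' (an iterable of strings).
    """
    rng = random.Random(seed)

    # Step 1: count how many puzzles carry each theme.
    freq = Counter(t for p in dataset for t in set(p["themes"]))

    # Step 2: the k rarest themes (ties broken by name, an assumption).
    rare_themes = sorted(freq, key=lambda t: (freq[t], t))[:k]

    selected_ids = set()  # Step 3: S
    balanced = []         # Step 4: D_bal

    for theme in rare_themes:
        # Step 6: candidates with this theme that are not yet selected.
        candidates = [
            p for p in dataset
            if theme in p["themes"] and p["id"] not in selected_ids
        ]
        # Step 7: sample without replacement, capped at m.
        sampled = rng.sample(candidates, min(m, len(candidates)))
        balanced.extend(sampled)                        # Step 8
        selected_ids.update(p["id"] for p in sampled)   # Step 9

    return balanced
```

Tracking selected IDs matters because a puzzle can carry several themes; without it, a puzzle tagged with two rare themes could be drawn twice.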

What a 4B model achieves

C1-4B reaches 48.1% average puzzle accuracy (the Avg Acc column of Table 1; 53.6% on the theme split), above most frontier models in the table and above Gemini-3-Flash (40.8%), the very model that verbalized its training traces. The reinforcement stage adds 7.2 percentage points on top of SFT (40.9% for C1-SFT-4B).

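Table 1. Performance comparison across difficulty levels and models (accuracy in %; Avg Tokens is the average output length per solution)
Model | Beginner | Intermediate | Advanced | Expert | Theme-Split | Avg Acc | Avg Tokens
gpt-5 | 95.0 | 84.0 | 54.0 | 31.0 | 85.2 | 76.7 | 12,193
gemini-3-pro | 88.0 | 86.0 | 70.0 | 44.0 | 78.2 | 75.4 | 3,182
gemini-3-flash | 65.0 | 59.0 | 34.0 | 19.0 | 38.0 | 40.8 | 6,418
gpt-5-chat | 52.0 | 39.0 | 27.0 | 18.0 | 41.8 | 38.3 | 925
gemini-2.5-pro | 37.0 | 31.0 | 29.0 | 19.0 | 31.0 | 30.1 | 9,668
claude-sonnet-4.5 | 32.0 | 29.0 | 15.0 | 11.0 | 28.6 | 25.6 | 3,227
claude-sonnet-4 | 35.0 | 19.0 | 16.0 | 10.0 | 26.8 | 23.8 | 8,028
claude-haiku-4.5 | 33.0 | 24.0 | 14.0 | 11.0 | 25.6 | 23.3 | 8,111
gemini-2.5-flash | 9.0 | 4.0 | 6.0 | 5.0 | 8.2 | 7.2 | 9,991
deepseek-chat-v3.1 | 27.0 | 21.0 | 6.0 | 16.0 | 22.0 | 20.0 | 11,249
qwen3-next-80b-a3b | 24.0 | 14.0 | 14.0 | 8.0 | 17.6 | 16.4 | 13,938
deepseek-r1-0528 | 11.0 | 10.0 | 14.0 | 16.0 | 16.0 | 14.6 | 14,442
qwen3-max | 22.0 | 15.0 | 3.0 | 16.0 | 13.8 | 13.9 | 3,393
llama-4-maverick | 12.0 | 8.0 | 5.0 | 10.0 | 8.6 | 8.7 | 1,092
mistral-medium-3.1 | 9.0 | 6.0 | 7.0 | 4.0 | 8.0 | 7.3 | 2,818
llama-4-scout | 0.0 | 0.0 | 0.0 | 1.0 | 0.4 | 0.3 | 806
gemma-3-27b | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 705
C1-SFT-4B | 51.0 | 30.0 | 30.0 | 26.0 | 46.2 | 40.9 | 188
C1-SFT-8B | 57.0 | 36.0 | 27.0 | 27.0 | 46.6 | 42.2 | 189
C1-4B | 65.0 | 39.0 | 39.0 | 22.0 | 53.6 | 48.1 | 178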
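
The headline figures can be re-derived from Table 1 with simple arithmetic. The snippet below hard-codes four rows copied from the table. The 1:1:1:1:5 weighting used to reproduce the Avg Acc column is an inference from the numbers themselves (it matches the rows shown), not something the page states, and the names TABLE1 and weighted_avg are made up for this example.

```python
# Rows copied from Table 1:
# (beginner, intermediate, advanced, expert, theme_split, avg_acc, avg_tokens)
TABLE1 = {
    "gpt-5":          (95.0, 84.0, 54.0, 31.0, 85.2, 76.7, 12193),
    "gemini-3-flash": (65.0, 59.0, 34.0, 19.0, 38.0, 40.8, 6418),
    "C1-SFT-4B":      (51.0, 30.0, 30.0, 26.0, 46.2, 40.9, 188),
    "C1-4B":          (65.0, 39.0, 39.0, 22.0, 53.6, 48.1, 178),
}


def weighted_avg(row, theme_weight=5):
    """Average the four difficulty tiers and the theme split.

    Assumption: the theme split counts five times as much as each tier,
    a weighting inferred from the table rather than stated in the source.
    """
    *tiers, theme, _avg_acc, _avg_tokens = row
    return (sum(tiers) + theme_weight * theme) / (len(tiers) + theme_weight)


# Gain of the RLVR stage over SFT, in percentage points of Avg Acc.
rlvr_gain_pp = TABLE1["C1-4B"][5] - TABLE1["C1-SFT-4B"][5]

# How many times longer gpt-5's average solution is than C1-4B's.
token_ratio_vs_gpt5 = TABLE1["gpt-5"][6] / TABLE1["C1-4B"][6]

if __name__ == "__main__":
    print(round(rlvr_gain_pp, 1), round(token_ratio_vs_gpt5, 1))
    for name, row in TABLE1.items():
        print(name, round(weighted_avg(row), 1), row[5])
```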

Output length is just as notable: C1 averages about 178 tokens per solution (a token is the unit of model output, roughly half a word to a word), one to two orders of magnitude shorter than the reasoning-LLM baselines in Table 1 and closer to how a human coach explains a move in a few sentences. The ablations (Table 2) also show that data scale and theme balance both affect the final capability: at 8k examples, balanced sampling scores 20.1 against 19.3 for random sampling, and growing the balanced set to 16k and then 39k lifts SFT accuracy to 29.7 and 40.9.

Table 2. Ablation study on SFT data configurations
Scale | Distribution | Quality | Context | SFT accuracy (%)
8k | random | flash | full | 19.3
8k | hard | flash | full | 16.2
8k | balanced | pro | full | 22.8
8k | balanced | flash | full | 20.1
16k | balanced | flash | full | 29.7
8k | balanced | flash | Multi PVs | 17.6
8k | balanced | flash | w/o Theme | 17.3
8k | balanced | flash | w/o Feigned | 16.3
39k | balanced | flash | full | 40.9

Figure (from the paper). C1 accuracy versus model size: C1-4B is competitive with much larger frontier systems.
Figure (from the paper). C1 average accuracy results: C1-4B beats all open-source baselines and surpasses the Gemini-3-Flash verbalizer.

  • Small model, strong reasoning. C1-4B reaches 48.1% average puzzle accuracy (53.6% on the theme split).
  • Explanations are compact. It emits about 178 tokens per solution, one to two orders of magnitude fewer than the reasoning-LLM baselines.
  • The student can exceed the verbalizer. Stockfish truth plus RLVR reward lets C1 outperform Gemini-3-Flash.

Beyond the chessboard

Master Distillation is not chess-specific. Any domain with a strong expert system and verifiable answers can potentially use the recipe.
A solution to RLVR cold start. When the base model is too weak, expert traces seed enough capability for reward learning to climb.
Built for production budgets. Many deployments cannot use the largest model; specialized reasoning at 4B scale is practically relevant.

Open the code and paper

Code and data are available for reproduction or adaptation to other solver-backed domains.
