Frontier LLMs can play chess, but they rarely explain their moves the way a player would. C1 is a 4B-parameter open model that does both — solving 48.1% of theme-balanced chess puzzles in roughly 178 tokens per solution, beating its own teacher (Gemini-3-Flash, 40.8%) and every open-source model, while emitting one to two orders of magnitude fewer tokens than reasoning-LLM baselines. The recipe — Master Distillation — is pitched as a general way to bootstrap LLM reasoning in any domain where a deterministic expert system already exists.
Background
In domains where bespoke solvers already exist — chess engines, theorem provers, chemistry simulators — LLMs sit awkwardly between two failure modes. Pure neural chess players answer "which move?" but can't explain their choice. Pure LLMs explain confidently but invent illegal moves. The gap is grounded reasoning: answers a player would accept, written in language a player would write.
Prior work on chess + LLMs has tried to close this gap with chain-of-thought prompting on top of Stockfish move probabilities, or with bare-action models that win games but produce no human-readable rationale. Neither delivers both at once: the prompted explanations are legible but not grounded, and the bare-action moves are grounded but not legible.
Method
Master Distillation
Two specialists collaborate on the data. Stockfish at depth 24 supplies ground-truth principal variations. Gemini-3-Flash verbalizes them as natural-language chains of thought. The student model — Qwen3-4B-Instruct-2507 — sees only the verbalized traces. Stockfish is the source of truth; Gemini is the verbalizer; neither is "the teacher" by itself.
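A minimal sketch of the engine half of that pipeline, using python-chess to pull the depth-24 principal variation that the verbalizer then narrates. The engine path, PV length, and function name are illustrative assumptions; the paper only specifies the search depth.

```python
import chess
import chess.engine

# Illustrative sketch of the Stockfish side of Master Distillation.
# ENGINE_PATH and pv_plies are assumptions; the paper specifies only depth 24.
ENGINE_PATH = "/usr/local/bin/stockfish"

def principal_variation(fen: str, depth: int = 24, pv_plies: int = 6) -> str:
    """Return Stockfish's principal variation from `fen` as a numbered SAN line."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
    finally:
        engine.quit()
    return board.variation_san(info["pv"][:pv_plies])

# The resulting line (e.g. "1. Ng5 d5 2. exd5 Nxd5") is what Gemini-3-Flash is
# asked to verbalize; the student model sees only the verbalized trace.
```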
Feigned Discovery Prompting
The teacher LLM is told to reason as if the answer were unknown, while covertly tracking the master trace. The prompt enforces length (4–10 sentences scaled to puzzle difficulty), explicit board-coordinate grounding, objective voice, and no leakage of engine scores or theme labels. The result reads like a player thinking through a position, not a post-hoc rationalization of an answer key.
The teacher reasons as if it doesn't know the answer — and the student inherits the habit.
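A paraphrase of how those constraints might be packaged into a prompt template. The wording, placeholder names, and example position are assumptions for illustration; the paper's exact prompt is not reproduced here.

```python
# Illustrative paraphrase of a Feigned Discovery prompt; not the paper's exact wording.
FEIGNED_DISCOVERY_TEMPLATE = """\
You are annotating a chess position as if solving it yourself for the first time.
Position (FEN): {fen}
Master line (never reveal that you were given it): {master_line}

Write a reasoning trace that arrives at the master line's first move:
- Reason as though the answer were unknown; discover it step by step.
- Use {min_sentences} to {max_sentences} sentences, scaled to puzzle difficulty.
- Ground every claim in explicit board coordinates (e.g. "the knight on g5").
- Keep an objective, impersonal voice.
- Never mention engine evaluations, centipawn scores, or puzzle theme labels.
Finish with the chosen move in standard algebraic notation."""

prompt = FEIGNED_DISCOVERY_TEMPLATE.format(
    fen="r1bqkb1r/pppp1ppp/2n2n2/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 4 4",
    master_line="1. Ng5 d5 2. exd5 Nxd5 3. Nxf7",  # illustrative line, not an engine claim
    min_sentences=4,
    max_sentences=10,
)
```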
Two-stage training
The recipe is conventional but tightly tuned. Phase one is supervised fine-tuning on the verbalized traces with theme-balanced sampling across openings, middle games, endgames, and tactical motifs. Phase two is RLVR with DAPO, a GRPO variant tuned for concise outputs (KL retained, overlong-reward shaping removed). Skipping phase one and going straight to RL barely works — SFT seeds the capability that RLVR then sharpens.
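For the RLVR stage, the verifiable reward can be as simple as an exact-match check of the trace's final move against the engine solution. The sketch below is one plausible shape, assuming completions end with a SAN move as the SFT traces do; the reward actually used for C1 may differ in its details.

```python
import chess

def puzzle_reward(fen: str, solution_san: str, completion: str) -> float:
    """Binary verifiable reward: 1.0 if the trace's final token parses to the
    puzzle's best move from `fen`, else 0.0 (illegal or unparseable moves score 0).
    Assumes completions end with the chosen move in SAN, as the SFT traces do."""
    words = completion.strip().split()
    if not words:
        return 0.0
    board = chess.Board(fen)
    try:
        predicted = board.parse_san(words[-1].rstrip(".!"))  # tolerate trailing punctuation
        target = board.parse_san(solution_san)
    except ValueError:  # python-chess move errors subclass ValueError
        return 0.0
    return 1.0 if predicted == target else 0.0
```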
Results
On a theme-balanced puzzle suite, C1-4B solves 48.1% of positions in an average of 178 tokens. That beats the Gemini-3-Flash teacher and every open-source baseline; it trails only top frontier proprietary systems. The story is not "SOTA on chess" — it's "best by far at a given budget."
| Model | Accuracy | Avg. tokens |
|---|---|---|
| C1-4B (ours) | 48.1% | 178 |
| Gemini-3-Flash (verbalizer / "teacher") | 40.8% | — |
| GPT-5 | 85.2% | 12,193 |
| Gemini-3-Pro | 78.2% | 3,182 |
| DeepSeek-Chat-v3.1 | 20.0% | 11,249 |
Theme-balanced puzzle accuracy and average output length. C1-4B is roughly 68× more compact than GPT-5 and 18× more compact than Gemini-3-Pro per solution; an 8B variant exists but is omitted from this table.
What we found
- +7.2 pp from RLVR over SFT-only (40.9% → 48.1%) — the gain from the second stage is real but only available if the first stage seeds it.
- A student can surpass its verbalizer. Distillation isn't capability compression here; it's capability transfer plus reinforcement.
- Theme-balanced sampling matters. Without it, RL overfits a narrow set of tactical motifs.
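On the last point, theme balancing can be as simple as stratified sampling over puzzle-theme labels so that no single motif dominates a batch. A minimal sketch, assuming each puzzle record carries a `theme` field; the paper does not spell out its exact balancing scheme.

```python
import random
from collections import defaultdict

def theme_balanced_sample(puzzles: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Draw roughly equal numbers of puzzles per theme (fork, endgame, opening, ...)."""
    if not puzzles:
        return []
    rng = random.Random(seed)
    by_theme = defaultdict(list)
    for puzzle in puzzles:
        by_theme[puzzle["theme"]].append(puzzle)  # assumes a 'theme' label per puzzle
    per_theme = max(1, n // len(by_theme))
    sample = []
    for group in by_theme.values():
        sample.extend(rng.sample(group, min(per_theme, len(group))))
    rng.shuffle(sample)
    return sample[:n]
```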