RLVR · Self-refinement

ThinkTwice

A student who grinds problem sets but never checks their paper won’t fix mistakes in the exam, and neither will a model. ThinkTwice builds “checking your work” into the training objective, at about 3% extra cost.

arXiv GitHub Hugging Face
+11.5 pt AIME pass@4
5 math benchmarks
2 model families
1 reward signal
+3% training overhead

When you ask a model to double-check its answer, it usually repeats a wrong answer or talks itself out of a correct one. ThinkTwice starts from that observation: if taking a second look is a skill, can it be trained directly? The experiments answer yes: with no extra supervision signal and about 3% additional training cost, ThinkTwice gains 11.5 points of pass@4 on AIME 2024.

ThinkTwice overview figure
Paper-native figure. ThinkTwice jointly trains base reasoning and self-refinement instead of treating refinement as an inference-time trick.

Why “check your work” rarely works

The dominant way to train reasoning models today is RLVR (reinforcement learning with verifiable rewards — scoring only whether the final answer is right): the model solves large volumes of math and code problems, earning reward when the answer checks out. This steadily sharpens the model’s first attempt, which is why competition-math scores have risen so quickly.

But nowhere in that process does the model practice reviewing itself. It has seen millions of first-pass solutions and almost no examples of taking its own answer and making it right. It is much like a student who only grinds through problem sets and never checks their own paper: told to “look it over again” in the exam hall, they mostly just copy out the same answers. Hence the scene above: asked to double-check, the model either repeats its previous answer or rewrites it with no sense of direction.

Existing remedies mostly require extra resources: a separately trained critic model that spots errors, process rewards that grade every reasoning step, or human-written critique data. ThinkTwice takes a different route: introduce no new signal, and let the model practice both behaviors — solving and self-correcting — under the same right-or-wrong reward.

One reward, two behaviors

Training steps come in pairs. The first step is standard GRPO (a widely used RLVR algorithm: sample several solutions to the same problem and update the model on their relative quality within the group). The model produces several solutions and receives a binary correctness reward — right or wrong, no partial credit.
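The group-relative update can be illustrated with a minimal sketch. It assumes an exact-match correctness check and the commonly used mean/standard-deviation normalization of GRPO; the function names are hypothetical, and the paper's exact reward check and normalization may differ.

```python
from statistics import mean, pstdev


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 for an exact final-answer match, else 0.0 (no partial credit)."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sample relative to its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled final answers to one problem whose reference answer is "42".
answers = ["42", "41", "42", "7"]
rewards = [binary_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)

# A group where every sample earns the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct samples receive positive advantages and incorrect ones negative, so the update pushes probability toward the solutions that beat their own group's average.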

The second step hands those solutions back to the model, asks it to refine them, and applies the same reward to the result. The reward function never changes; what changes is that it now covers both solving and fixing. The whole pipeline needs no critic model, no process reward, and no human critique data, at about 3% extra training cost.
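A rough sketch of one paired step, under stated assumptions: `generate` stands in for sampling from the policy and returns only a final answer, `REFINE_TEMPLATE` is a hypothetical prompt (not the paper's wording), and the policy update itself is omitted. The point is only that both phases are scored by the same function.

```python
from typing import Callable

# Hypothetical refinement prompt; the paper's actual wording is not given here.
REFINE_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Previous solution:\n{solution}\n\n"
    "Review the solution and give a corrected final answer."
)


def correctness_reward(answer: str, reference: str) -> float:
    """Binary reward shared by both phases: 1.0 if correct, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def paired_step(
    problem: str,
    reference: str,
    generate: Callable[[str], str],
    group_size: int = 4,
) -> tuple[list[float], list[float]]:
    """Collect rewards for a solve phase and a refine phase on one problem."""
    # Phase 1: sample a group of first-pass solutions and score them.
    solutions = [generate(problem) for _ in range(group_size)]
    solve_rewards = [correctness_reward(s, reference) for s in solutions]

    # Phase 2: hand each solution back, ask for a refinement, score it the same way.
    prompts = [REFINE_TEMPLATE.format(problem=problem, solution=s) for s in solutions]
    refinements = [generate(p) for p in prompts]
    refine_rewards = [correctness_reward(r, reference) for r in refinements]
    return solve_rewards, refine_rewards


def toy_policy(prompt: str) -> str:
    """Toy stand-in for a model: wrong on the first pass, right when refining."""
    return "4" if "Previous solution" in prompt else "5"


solve_rewards, refine_rewards = paired_step("What is 2 + 2?", "4", toy_policy)
```

In a real training loop each reward list would feed its own group-relative policy update; how the two updates are scheduled and weighted is not specified in this summary.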

ThinkTwice method diagram
Paper-native figure. The same correctness reward is used twice, keeping supervision cost low.

Evidence across five math benchmarks

Experiments cover five math benchmarks — MATH500, AIME 2024, AMC, Minerva Math, OlympiadBench — and two model families, Qwen3-4B and OLMo-3-7B. The representative result is Qwen3-4B on AIME 2024, measured as pass@4 (four attempts allowed; any correct one counts):

Setting | AIME pass@4 gain
Single-pass GRPO | baseline
ThinkTwice, no refinement at inference | +5.0
ThinkTwice, one refinement pass | +11.5

The two gains come from different places. Even when refinement is never invoked, the ThinkTwice-trained model scores 5.0 points above the baseline, which indicates that practicing correction also improves first-pass solving. Invoking one refinement pass adds a further 6.5 points, for a total gain of 11.5.
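For readers unfamiliar with pass@k, here is a minimal sketch of two common ways to compute it. The first applies the definition literally (k attempts, any correct one counts). The second is the widely used unbiased estimator for the case where n ≥ k attempts are sampled per problem. Which variant the paper uses is not stated in this summary, and the function names are illustrative.

```python
from math import comb


def solved_within_k(attempts: list[bool]) -> bool:
    """Literal pass@k for one problem: solved if any of the k attempts is correct."""
    return any(attempts)


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Three problems, four attempts each (True means the attempt was correct).
per_problem = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, True, True],
]
benchmark_pass_at_4 = sum(solved_within_k(a) for a in per_problem) / len(per_problem)

# Eight samples on one problem, two of them correct, evaluated at k = 4.
estimate = pass_at_k_estimate(n=8, c=2, k=4)
```

Because pass@4 credits a problem if any attempt succeeds, it is always at least as high as single-attempt accuracy on the same samples.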

ThinkTwice training transition
Paper-native figure. Training exhibits an implicit curriculum: first fixing wrong answers, then preserving correct ones.

ThinkTwice refinement curves
Paper-native figure. Refinement adds gains across multiple pass@k settings.

Training also reveals an interesting dynamic: the model first learns to turn wrong answers into right ones, then gradually learns to preserve answers that are already correct — an implicit curriculum nobody designed. The refinement skill transfers across models, too: it fixes other models’ solutions just as well, suggesting it has learned more than its own output style.

ThinkTwice cross-model refinement heatmap
Paper-native figure. Refinement transfers across models, suggesting the behavior is not only memorizing the model’s own output style.
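One way to observe the implicit curriculum described above is to track two rates over training: how often refinement fixes a wrong answer, and how often it preserves a correct one. The sketch below is an illustrative diagnostic with hypothetical names, not the paper's evaluation code.

```python
def refinement_transitions(before: list[bool], after: list[bool]) -> dict[str, float]:
    """Compute fix and preserve rates from correctness before and after refinement."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    # Outcomes after refinement, split by whether the first pass was correct.
    from_wrong = [a for b, a in zip(before, after) if not b]
    from_right = [a for b, a in zip(before, after) if b]
    return {
        # Share of initially wrong answers that refinement turned right.
        "fix_rate": sum(from_wrong) / len(from_wrong) if from_wrong else 0.0,
        # Share of initially right answers that stayed right.
        "preserve_rate": sum(from_right) / len(from_right) if from_right else 0.0,
    }


# Five problems: two start wrong and three start right.
before = [False, False, True, True, True]
after = [True, False, True, True, False]
rates = refinement_transitions(before, after)
```

Under the dynamic the article describes, `fix_rate` would be expected to rise earlier in training and `preserve_rate` to catch up later. The same function applies to cross-model refinement if `before` comes from another model's solutions.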

What it suggests for reasoning training

“Think twice” can be trained directly. Self-refinement can be learned by the RL objective itself, with no reliance on prompt engineering.
The cost structure is clean. The method still needs only verifiable final answers, not a critique model or critique data.
Reasoning and refinement reinforce each other. The one-shot model improves, and refinement adds further gains on top.

Open the paper and code

ThinkTwice code and paper links are available for reproduction or adaptation to other verifiable tasks.

arXiv GitHub Hugging Face