Most RLVR-trained reasoners flatline — or regress — when you ask them to revise their own answer. ThinkTwice trains the revision skill directly: each step samples a solution, then asks the model to refine it, and updates both passes with the same binary correctness reward and the same GRPO objective. No critic model. No human-written critiques. No process reward model. Just one signal — is the final answer right — applied twice.
Background
Self-refine prompts look great in inference-time demos and break in production. The reason is simple: the model was never trained to make the second attempt better than the first. Vanilla RLVR optimizes single-shot correctness; the resulting policy has no incentive to use a second pass productively, and often the second pass actively erases what was right about the first.
Method
ThinkTwice extends Group Relative Policy Optimization (GRPO) with paired training steps. Phase A is standard RLVR: sample solutions to a math problem, score with a binary correct/incorrect reward, update with GRPO. Phase B feeds the model its own phase-A outputs and asks it to refine them; refinements are scored with the same binary reward and updated with the same GRPO objective. There is no auxiliary supervision: no teacher critique, no labeled error trace, no learned reward model.
The only label needed is "right answer" — applied twice.
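The paired step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `policy_sample`, `policy_refine`, and `verify` are hypothetical stand-ins for the model's two passes and the binary answer checker, and the returned group-relative advantages are what a real GRPO policy-gradient update would consume.

```python
def group_advantages(rewards):
    """GRPO-style group-relative advantage: each reward minus the group
    mean, normalized by the group std (epsilon for the all-equal case)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

def think_twice_step(problem, policy_sample, policy_refine, verify, group_size=4):
    """One paired ThinkTwice step (sketch).

    Phase A: sample a group of solutions, score with the binary reward.
    Phase B: refine each phase-A output, score with the SAME binary reward.
    Both passes yield group-relative advantages for the same GRPO objective.
    """
    drafts = [policy_sample(problem) for _ in range(group_size)]
    adv_a = group_advantages([float(verify(problem, d)) for d in drafts])

    refined = [policy_refine(problem, d) for d in drafts]
    adv_b = group_advantages([float(verify(problem, r)) for r in refined])
    return adv_a, adv_b
```

Note the symmetry: phase B reuses the same verifier and the same advantage computation, so no extra supervision enters the loop.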
Training-dynamics analysis surfaces an emergent "rectify-then-fortify" curriculum: early in training the refinement step mostly fixes wrong answers; later, as base accuracy climbs, it shifts to preserving correct ones. The authors argue this is what gives ThinkTwice a cleaner reward signal than pure single-pass GRPO — the second-pass reward distribution is shaped by the first pass.
Results
Five math benchmarks (MATH500, AIME 2024, AMC, Minerva Math, OlympiadBench) × two model families (Qwen3-4B-Instruct-2507, OLMo-3-7B-Instruct), pass@k across k = 1, 2, 4, 8, 16, 32+. The headline number is on AIME 2024 with Qwen3-4B:
| Setup | Δ pass@4 vs. GRPO (pp) |
|---|---|
| GRPO (single pass) | baseline |
| ThinkTwice, no refinement at inference | +5.0 |
| ThinkTwice, one refinement step | +11.5 |
AIME 2024, pass@4, Qwen3-4B-Instruct-2507. The +5.0 row is what you get from ThinkTwice's training even without using the refinement at inference; +11.5 is what you get when you use it.
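For reference, pass@k is typically computed with the standard unbiased estimator from the code-generation literature: given n samples of which c are correct, the probability that a random size-k subset contains at least one correct sample. This is an assumption about the evaluation setup, not a detail confirmed by the source.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).
    n = total samples, c = correct samples, k = subset size."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 1 correct answer out of 2 samples gives pass@1 = 0.5, and any problem with at least one correct sample in n = k samples gives pass@k = 1.0.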
What we found
- Reasoning improves even without using refinement at inference. The training objective itself sharpens single-shot performance.
- Refinement compounds the gain. One refinement pass adds another ~6.5 pp on top of that.
- Cross-model refinement also works. A ThinkTwice-trained model can refine outputs from a different base model, suggesting the skill generalizes beyond its own self-distribution.