Research Method · Preprint, April 2026

ThinkTwice

Jointly optimizing reasoning and self-refinement under a single binary reward.

+11.5 pt AIME pass@4 (Qwen3-4B) · 5 math benchmarks · 2 model families · 1 reward signal

Most RLVR-trained reasoners flatline — or regress — when you ask them to revise their own answer. ThinkTwice trains the revision skill directly: each step samples a solution, then asks the model to refine it, and updates both passes with the same binary correctness reward and the same GRPO objective. No critic model. No human-written critiques. No process reward model. Just one signal — is the final answer right — applied twice.

Background

Self-refine prompts look great in inference-time demos and break in production. The reason is simple: the model was never trained to make the second attempt better than the first. Vanilla RLVR optimizes single-shot correctness; the resulting policy has no incentive to use a second pass productively, and often the second pass actively erases what was right about the first.

Method

ThinkTwice extends Group Relative Policy Optimization (GRPO) with paired training steps. Phase A is standard RLVR: sample solutions to a math problem, score with a binary correct/incorrect reward, update with GRPO. Phase B feeds the model its own phase-A outputs and asks it to refine them; refinements are scored with the same binary reward and updated with the same GRPO objective. There is no auxiliary supervision: no teacher critique, no labeled error trace, no learned reward model.

The only label needed is "right answer" — applied twice.
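
A minimal sketch of one paired training step is below. The names grpo_update, is_correct, and extract_answer are hypothetical stand-ins for the project's actual GRPO update, answer verifier, and answer parser, and the prompt wording and group size are illustrative, not taken from the paper.

```python
# Sketch of one ThinkTwice training step. grpo_update, is_correct, and
# extract_answer are hypothetical placeholders for the real sampling,
# verification, and GRPO-update code.

def format_solve_prompt(problem: str) -> str:
    return (
        "Solve the following problem. Put the final answer in \\boxed{}.\n\n"
        + problem
    )

def format_refine_prompt(problem: str, draft: str) -> str:
    return (
        "Below is a problem and a previous attempt. Review the attempt, fix any "
        "mistakes, and give the final answer in \\boxed{}.\n\n"
        f"Problem: {problem}\n\nAttempt: {draft}"
    )

def think_twice_step(policy, problem: str, gold_answer: str, group_size: int = 8):
    # Phase A: vanilla RLVR. Sample a group of solutions, score each with the
    # binary correctness reward, apply one GRPO update.
    prompts_a = [format_solve_prompt(problem)] * group_size
    drafts = policy.generate(prompts_a)
    rewards_a = [float(is_correct(extract_answer(d), gold_answer)) for d in drafts]
    grpo_update(policy, prompts_a, drafts, rewards_a)

    # Phase B: feed the model its own phase-A drafts and ask it to refine them.
    # Refinements get the same binary reward and the same GRPO objective.
    prompts_b = [format_refine_prompt(problem, d) for d in drafts]
    refinements = policy.generate(prompts_b)
    rewards_b = [float(is_correct(extract_answer(r), gold_answer)) for r in refinements]
    grpo_update(policy, prompts_b, refinements, rewards_b)
```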

Training-dynamics analysis surfaces an emergent "rectify-then-fortify" curriculum: early in training the refinement step mostly fixes wrong answers; later, as base accuracy climbs, it shifts to preserving correct ones. The authors argue this is what gives ThinkTwice a cleaner reward signal than pure single-pass GRPO — the second-pass reward distribution is shaped by the first pass.
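
One way to watch this curriculum in a run of your own (a rough diagnostic, not the paper's analysis) is to track two transition rates per training step, as in the sketch below.

```python
# Rough diagnostic for the rectify-then-fortify dynamic: given parallel 0/1
# correctness flags for drafts and their refinements at some training step,
# measure how often refinement fixes a wrong draft vs. preserves a correct one.

def refinement_transition_rates(draft_correct, refined_correct):
    pairs = list(zip(draft_correct, refined_correct))
    wrong = [r for d, r in pairs if d == 0]   # drafts that were wrong
    right = [r for d, r in pairs if d == 1]   # drafts that were right
    rectify = sum(wrong) / len(wrong) if wrong else 0.0   # wrong -> right rate
    fortify = sum(right) / len(right) if right else 0.0   # right -> right rate
    return {"rectify_rate": rectify, "fortify_rate": fortify}
```

Early in training, most phase-B reward should arrive through the rectify path; as base accuracy climbs, the fortify path dominates.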

Results

Five math benchmarks (MATH500, AIME 2024, AMC, Minerva Math, OlympiadBench) × two model families (Qwen3-4B-Instruct-2507, OLMo-3-7B-Instruct), pass@k across k = 1, 2, 4, 8, 16, 32+. The headline number is on AIME 2024 with Qwen3-4B:

Setup                               Δ pass@4
GRPO (single pass)                  baseline
ThinkTwice (no refinement)          +5.0
ThinkTwice (one refinement step)    +11.5

AIME 2024, pass@4, Qwen3-4B-Instruct-2507. The +5.0 row is what you get from ThinkTwice's training even without using the refinement at inference; +11.5 is what you get when you use it.
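
For reference, pass@k numbers like these are usually computed with the standard unbiased estimator (sample n ≥ k completions per problem, count the c correct ones). The sketch below shows that estimator; it is the conventional formula, not anything specific to ThinkTwice's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions drawn
    without replacement from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. with 16 samples per AIME problem, report the mean of pass_at_k(16, c_i, 4)
# over problems i, where c_i is the number of correct samples for problem i.
```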

What we found

  • Reasoning improves even without using refinement at inference. The training objective itself sharpens single-shot performance.
  • Refinement compounds the gain. One refinement pass adds another ~6.5 pp on top of that.
  • Cross-model refinement also works. A ThinkTwice-trained model can refine outputs from a different base model, suggesting the skill generalizes beyond its own self-distribution (see the sketch after this list).
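
A minimal sketch of that inference-time lever, with hypothetical generate calls and illustrative prompt wording: draft with one model, then hand the draft to the ThinkTwice-trained model for a single refinement pass.

```python
# Inference-time refinement sketch (hypothetical .generate API). draft_model and
# refiner_model can be the same ThinkTwice checkpoint, or different models, per
# the cross-model finding above.

def answer_with_refinement(draft_model, refiner_model, problem: str) -> str:
    draft = draft_model.generate(
        "Solve the following problem. Put the final answer in \\boxed{}.\n\n" + problem
    )
    return refiner_model.generate(
        "Below is a problem and a previous attempt. Review the attempt, fix any "
        "mistakes, and give the final answer in \\boxed{}.\n\n"
        f"Problem: {problem}\n\nAttempt: {draft}"
    )
```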

Why it matters

Self-refinement that survives RL. The recipe gives back the inference-time "try again" lever that vanilla RLVR usually destroys. That's the practical payoff: you can deploy the model with a refinement pass and get strictly more accuracy.
No extra supervision. No critic, no PRM, no critique annotations — just verifiable rewards. Cheap and reproducible, which matters for any team trying to do RL training without a reward-model factory behind them.
One objective, two skills. Reasoning and revision improve together rather than trading off. That suggests they share representational structure that joint training can exploit — and that the field's habit of treating "self-refinement" as a separate inference trick is leaving capability on the table.