People expect an AI to fix its answer when asked to double-check; standard training does not produce that ability. ThinkTwice writes it directly into the training objective.
Launch Highlights
- Problem:Standard reinforcement learning rewards only the first attempt; the model never practices reviewing and correcting itself.
- Method:Solve once, feed the answer back, refine once; both steps use only final correctness.
- Finding:Qwen3-4B gains +11.5pt on AIME pass@4, with refinement adding on top of one-shot gains.
- Why it matters:No critic, no process reward, no human critique data: self-refinement can be trained directly.
Continue reading
The research page covers the background, method, key figures, and paper links; for quick sharing, use the illustrated promo copy.