SWE-Bench Mobile takes the agent-coding evaluation paradigm out of open-source GitHub repos and into a real, shipping mobile product. Fifty engineering tasks lifted from Xiaohongshu's production iOS app — each paired with the original PRD, Figma design, and a hand-authored test suite — force agents to read multi-modal specs and edit a large mixed Swift/Objective-C codebase the way a real iOS engineer would. Even the strongest commercial agent–model combos solve only 12% of the tasks. And which agent you pick matters as much as which model you pick: the same model can vary up to 6× in pass rate depending on its scaffold.
Background
Prior agent-coding benchmarks under-test real-world engineering in four ways at once: open-source repos leak into pretraining; tasks are bug fixes rather than new features; specs are GitHub issues rather than design documents; and the tests already exist. SWE-Bench Mobile flips all four. The codebase is a real production iOS app, the tasks are feature additions specified by PRDs and Figma designs, every task's tests were hand-authored for the benchmark, and the evaluation harness is hosted-only — so test sets never leak into training data.
What's in the benchmark
| Attribute | Value |
|---|---|
| Source | Xiaohongshu (Little Red Book) production iOS app |
| Languages | Swift + Objective-C (mixed) |
| Tasks | 50 |
| Test cases | 449 (~9 per task) |
| Inputs per task | PRD + Figma design + codebase snapshot (multi-modal) |
| Output | unified diff |
| Task mix | UI Components 18 · Data Management 10 · Gestures 8 · Media 7 · Networking 4 · Other 3 |
| Task type | feature additions (not bug fixes) |
| Evaluation | hosted-only (anti-contamination) |
Benchmark composition. Task type is the most consequential difference from prior agent benchmarks: feature additions force the agent to build something, not just fix something.
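For concreteness on the output row: an agent's deliverable is a single unified diff against the codebase snapshot. Here is a toy example of the shape (the file path, class, and feature are invented for illustration, not taken from the benchmark):

```diff
--- a/Features/Profile/ProfileHeaderView.swift
+++ b/Features/Profile/ProfileHeaderView.swift
@@ -12,4 +12,12 @@ final class ProfileHeaderView: UIView {
     private let avatarView = UIImageView()
 
+    // New per the (hypothetical) PRD: a follow button in the header.
+    private let followButton = UIButton(type: .system)
+
+    private func setUpFollowButton() {
+        followButton.setTitle("Follow", for: .normal)
+        addSubview(followButton)
+    }
+
     override init(frame: CGRect) {
         super.init(frame: frame)
```

The hand-authored tests then exercise the new behavior, and the diff is scored by how many of them pass.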
Results
Twenty-two agent–model configurations were evaluated: four agents (Cursor, Codex, Claude Code, OpenCode) crossed with leading commercial and open models. The top of the leaderboard:
| Agent + Model | Tasks solved | Tests passed |
|---|---|---|
| Cursor + Claude Opus 4.5 | 12.0% | 28.1% |
| Cursor + Claude Sonnet 4.5 | 12.0% | 26.7% |
| Codex + GLM 4.6 | 12.0% | 19.6% |
Top of the SWE-Bench Mobile leaderboard. The three leaders are tied at 12% task pass rate but separated by 8.5 pp on tests — a coarse vs. fine-grained capability gap. Live results: swebenchmobile.com.
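To make the two columns concrete, here is a minimal scoring sketch (not the official harness; the type and function names are ours). A task counts as solved only if every one of its tests passes, while the test-level rate credits partial progress:

```swift
/// One benchmark task and its hand-authored tests (hypothetical tallies).
struct TaskResult {
    let passed: Int   // tests the agent's diff passes
    let total: Int    // tests authored for this task (~9 on average)
}

/// Task-level rate: a task is "solved" only when all of its tests pass.
/// Test-level rate: every individual test counts, so partial progress shows up.
func scores(_ results: [TaskResult]) -> (taskRate: Double, testRate: Double) {
    let solved = results.filter { $0.passed == $0.total }.count
    let testsPassed = results.reduce(0) { $0 + $1.passed }
    let testsTotal = results.reduce(0) { $0 + $1.total }
    return (Double(solved) / Double(results.count),
            Double(testsPassed) / Double(testsTotal))
}

// Example: three tasks, nine tests each; one fully solved, two partial.
let run = [TaskResult(passed: 9, total: 9),
           TaskResult(passed: 5, total: 9),
           TaskResult(passed: 2, total: 9)]
let (taskRate, testRate) = scores(run)
// taskRate = 0.33 (1 of 3 solved), testRate = 0.59 (16 of 27 tests)
```

Two runs can tie on the first number and differ widely on the second, which is how the three leaders all land at 12.0% on tasks while spanning 19.6% to 28.1% on tests.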
What we found
- Same model, different agent — up to 6× spread. Scaffolding rivals model choice in importance (sketched at the end of this section).
- Simple beats elaborate. A "Defensive Programming" prompt outperforms more elaborate prompting strategies by +7.4 pp.
- The tests-passed column matters. Tasks that look "failed" at the binary level often pass a meaningful fraction of their tests — useful signal that's invisible if you only track pass@1 (the scoring sketch under the leaderboard makes this concrete).
The same model can vary up to 6× across agents. Reports that name only the LLM are missing half the story.
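As a coda, a minimal sketch of what the 6× claim measures (model names, agent names, and rates below are invented; the real per-configuration numbers are on the leaderboard): group leaderboard entries by model and compare the best and worst task pass rates across agents.

```swift
// Hypothetical leaderboard slice: model -> (agent -> task pass rate).
// Values are made up purely to show the shape of the computation.
let passRates: [String: [String: Double]] = [
    "some-model": ["agent-a": 0.12, "agent-b": 0.02, "agent-c": 0.06],
]

for (model, byAgent) in passRates {
    let rates = byAgent.values.sorted()
    if let lo = rates.first, let hi = rates.last, lo > 0 {
        // 0.12 / 0.02 = 6.0: the same model, six times the pass rate,
        // depending solely on the agent scaffold around it.
        print("\(model): \(hi / lo)x spread across agents")
    }
}
```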