Research Benchmark · Preprint, February 2026

SWE-Bench Mobile

Can large-language-model agents develop industry-level mobile applications?

50 tasks
449 test cases
22 agent–model configs
12% top score (tasks solved)

SWE-Bench Mobile takes the agent-coding evaluation paradigm out of open-source GitHub repos and into a real, shipping mobile product. Fifty engineering tasks lifted from Xiaohongshu's production iOS app — each paired with the original PRD, Figma design, and a hand-authored test suite — force agents to read multi-modal specs and edit a large mixed Swift/Objective-C codebase the way a real iOS engineer would. Even the strongest commercial agent–model combos solve only 12% of tasks. And which agent you pick matters as much as which model you pick — the same model can vary up to 6× in pass rate depending on its scaffold.

Background

Prior agent-coding benchmarks under-test real-world engineering in four ways at once. Open-source repos leak into pretraining; tasks are bug fixes rather than new features; specs are GitHub issues rather than design documents; and the tests already exist. SWE-Bench Mobile flips all four: the codebase is a real production iOS app rather than a public open-source repo; the tasks are feature additions, not bug fixes; the specs are PRDs and Figma designs, not issue text; and the tests are hand-authored for the benchmark and run only on a hosted harness, so they never leak into training data.

What's in the benchmark

Source: Xiaohongshu (Little Red Book) production iOS app
Languages: Swift + Objective-C (mixed)
Tasks: 50
Test cases: 449 (~9 per task)
Inputs per task: PRD + Figma design + codebase snapshot (multi-modal)
Output: unified diff
Task mix: UI Components 18 · Data Management 10 · Gestures 8 · Media 7 · Networking 4 · Other 3
Task type: feature additions (not bug fixes)
Evaluation: hosted-only (anti-contamination)

Benchmark composition. Task type is the most consequential difference from prior agent benchmarks: feature additions force the agent to build something, not just fix something.
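
To make the composition concrete, here is a minimal Swift sketch of what a single task bundle could look like as a value type. The type and field names (MobileTask, prdURL, figmaURL, and so on) are illustrative assumptions, not the benchmark's published schema.

```swift
import Foundation

// Hypothetical shape of one SWE-Bench Mobile task bundle.
// Names and fields are assumptions for illustration; the benchmark's real schema is not shown here.
struct MobileTask: Codable {
    enum Category: String, Codable {
        case uiComponents, dataManagement, gestures, media, networking, other
    }

    let id: String
    let category: Category
    let prdURL: URL            // product requirements document (text spec)
    let figmaURL: URL          // design spec, the visual half of the multi-modal input
    let codebaseSnapshot: URL  // snapshot of the mixed Swift/Objective-C repo
    let hiddenTestCount: Int   // roughly 9 hand-authored tests, executed server-side only
}
```

Whatever the real schema looks like, the deliverable against each bundle is the same: a unified diff over the codebase snapshot, as the Output row above states.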

Results

Twenty-two agent–model configurations were evaluated: four agents (Cursor, Codex, Claude Code, OpenCode) crossed with leading commercial and open models. Top of the leaderboard:

Agent + Model · Tasks solved · Tests passed
Cursor + Claude Opus 4.5 · 12.0% · 28.1%
Cursor + Claude Sonnet 4.5 · 12.0% · 26.7%
Codex + GLM 4.6 · 12.0% · 19.6%
Top of the SWE-Bench Mobile leaderboard. The three leaders are tied at 12% task pass rate but separated by 8.5 pp on tests — a coarse vs. fine-grained capability gap. Live results: swebenchmobile.com.

What we found

  • Same model, different agent — up to 6× spread. Scaffolding rivals model choice in importance.
  • Simple beats elaborate. A "Defensive Programming" prompt outperforms more elaborate prompting strategies by +7.4 pp.
  • The tests-passed column matters. Tasks that look "failed" at the binary level often pass a meaningful fraction of their tests — useful signal that's invisible if you only track pass@1.
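
As a rough illustration of why that finer-grained column matters, the Swift sketch below computes both metrics from per-task results. The data are made-up placeholders, and the micro-averaged aggregation is one plausible reading of the tests-passed column, not necessarily how the leaderboard computes it.

```swift
import Foundation

// Illustrative per-task results (made-up data; ~9 tests per task, as in the benchmark).
struct TaskResult {
    let testsPassed: Int
    let testsTotal: Int
    var solved: Bool { testsPassed == testsTotal }  // binary: every test must pass
}

let results: [TaskResult] = [
    TaskResult(testsPassed: 9, testsTotal: 9),  // solved outright
    TaskResult(testsPassed: 5, testsTotal: 9),  // "failed" at the task level, yet most tests pass
    TaskResult(testsPassed: 0, testsTotal: 8),  // failed with no partial credit
]

// Coarse metric: fraction of tasks where all tests pass (what pass@1-style reporting tracks).
let tasksSolved = Double(results.filter(\.solved).count) / Double(results.count)

// Finer metric: fraction of all test cases passed, which keeps the partial credit.
let totalPassed = results.map(\.testsPassed).reduce(0, +)
let totalTests  = results.map(\.testsTotal).reduce(0, +)
let testsPassedRate = Double(totalPassed) / Double(totalTests)

print(String(format: "tasks solved: %.1f%%", tasksSolved * 100))      // 33.3
print(String(format: "tests passed: %.1f%%", testsPassedRate * 100))  // 53.8, signal pass@1 discards
```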

The same model can vary up to 6× across agents. Reports that name only the LLM are missing half the story.

Why it matters

First production-grounded agent benchmark. Tasks come from a shipping app, not curated open-source issues. The codebase, the specs (PRD + Figma), and the tests are all real, which means the 12% top score is real, too, and if anything it overstates, rather than understates, how close agents are to industry-level mobile engineering.
The agent matters, not just the model. A 6× spread on the same model means the field's habit of evaluating "model X" as if the scaffold were transparent is misleading. Agent + model is the unit of comparison.
Hosted-only by design. Submissions run server-side so test sets never leak into training data — a deliberately uncomfortable but contamination-resistant template for industry benchmarks more broadly.
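
For teams considering the same hosted-only pattern, here is a rough Swift sketch of the round trip it implies: the submitter uploads only a diff, and only aggregate scores come back. The endpoint, payload, and response fields are hypothetical assumptions, not the benchmark's actual API.

```swift
import Foundation

// Hypothetical client-side view of a hosted-only evaluation round trip.
// Field names and types are assumptions; the point is the flow: the agent's unified diff
// goes up, the harness applies and tests it remotely, and the hidden test suite never comes down.
struct EvalRequest: Codable {
    let taskID: String
    let unifiedDiff: String   // the only artifact the agent submits
}

struct EvalResponse: Codable {
    let taskSolved: Bool      // true only if every hidden test passes
    let testsPassed: Int      // partial credit, without revealing test contents
    let testsTotal: Int
}

func submit(_ request: EvalRequest, to endpoint: URL) async throws -> EvalResponse {
    var urlRequest = URLRequest(url: endpoint)
    urlRequest.httpMethod = "POST"
    urlRequest.setValue("application/json", forHTTPHeaderField: "Content-Type")
    urlRequest.httpBody = try JSONEncoder().encode(request)
    let (data, _) = try await URLSession.shared.data(for: urlRequest)
    return try JSONDecoder().decode(EvalResponse.self, from: data)
}
```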