Research Benchmark · KDD 2026 Main Conference (CCF-A) · February 2026

SWE-Bench Mobile

Coding agents placed inside a real production iOS engineering workflow reveal a clear capability boundary. Accepted to KDD 2026 — Applied Data Science Track, Main Conference (CCF-A).

50 tasks
449 test cases
22 agent–model configs
~500K lines of production code
12% top score (tasks solved)

SWE-Bench Mobile moves agent-coding evaluation out of open-source GitHub repos and into a real, shipping mobile product. Fifty engineering tasks from Xiaohongshu's production iOS app — each with its original PRD, Figma design, and a hand-written test suite — ask agents to read multi-modal specs and work inside a roughly 500K-line mixed Swift/Objective-C codebase the way a real iOS engineer does.

The results draw a clear capability boundary: even the strongest commercial agent + model combinations solve only 12% of tasks. Which agent you pick matters as much as which model: with the same model under a different scaffold (the agent framework that drives the model to read code, edit, and run tests), the pass rate can swing by up to 6×. The same chef cooks differently in a different kitchen; the stove and the workflow matter as much as the craft. The paper appears in KDD 2026's Applied Data Science Track at the main conference (CCF-A).

SWE-Bench Mobile pipeline: input (PRD + Figma + codebase) → runtime (agent + model) → evaluation (entrypoint, functionality, configuration)
The end-to-end loop. A real product ticket flows in as a PRD plus Figma plus codebase snapshot; the agent writes a patch; the patch is graded by entrypoint, functionality, and configuration evaluators against a hand-written test suite.

Why we built it

Prior agent-coding benchmarks miss real-world engineering in four ways at once. Open-source repos may leak into pretraining; tasks are bug fixes rather than new features; specs are GitHub issues rather than design documents; and the tests usually already exist. SWE-Bench Mobile reverses all four. The codebase is a real production iOS app, the tasks are feature additions backed by PRDs and Figma designs, and the evaluation harness is hosted-only — keeping test sets out of training data by design.

In one sentence: SWE-Bench Mobile asks whether an agent can understand a real product requirement, read the design, locate the right modules, and land a coherent patch in a production mobile app.

What's in the benchmark

Source: Xiaohongshu (Little Red Book) production iOS app
Languages: Swift + Objective-C (mixed)
Tasks: 50
Test cases: 449 (~9 per task)
Codebase size: ~500K lines of code
Inputs per task: PRD + Figma design + codebase snapshot (multi-modal)
Output: unified diff
Visual assets: 35 tasks include Figma designs; 46 include reference images
Task mix: UI Components 18 · Data Management 10 · Gestures 8 · Media 7 · Networking 4 · Other 3
Task type: feature additions (not bug fixes)
Evaluation: hosted-only (anti-contamination)

Benchmark composition. Task type is the most consequential difference from prior agent benchmarks: feature additions force the agent to build something, not just fix something.

Task distribution by category and difficulty: 18 UI Components, 10 Data Mgmt, 8 Gesture, 7 Media, 4 Networking, 3 Other; 15 Easy, 25 Medium, 10 Hard.
Task mix. UI components, data management and gesture interactions cover most of the set, with a calibrated easy/medium/hard split.
An example task: a before/after of the screen, the Figma reference, and the resulting code diff in a Swift file.
A representative task. Each one comes with a before/after screenshot, the Figma reference, and a ground-truth code change — the agent has to produce a diff that survives the test suite.

Results

Twenty-two agent–model configurations were evaluated across four agents (Cursor, Codex, Claude Code, OpenCode) crossed with leading commercial and open models. Top of the leaderboard:

Agent + Model | Tasks solved | Tests passed
Cursor + Claude Opus 4.5 | 12.0% | 28.1%
Cursor + Claude Sonnet 4.5 | 12.0% | 26.7%
Codex + GLM 4.6 | 12.0% | 19.6%

Top of the SWE-Bench Mobile leaderboard. The three leaders are tied at 12% task pass rate but separated by 8.5 pp on tests — a coarse vs. fine-grained capability gap. Live results: swebenchmobile.com.
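The gap between the two columns is easier to see when both metrics are computed from the same per-task results. The following is a minimal sketch, not the benchmark's harness: it assumes a task counts as solved only when every one of its tests passes, and that the tests-passed figure pools all test cases across tasks. Neither aggregation rule is spelled out on this page, and the `TaskResult` type, the task IDs, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str       # hypothetical identifier
    tests_passed: int
    tests_total: int


def task_success_rate(results):
    """Fraction of tasks whose every test passed (assumed meaning of 'solved')."""
    solved = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    return solved / len(results)


def pooled_test_pass_rate(results):
    """Fraction of all test cases passed, pooled across tasks (assumed aggregation)."""
    passed = sum(r.tests_passed for r in results)
    total = sum(r.tests_total for r in results)
    return passed / total


# Made-up results for four tasks with nine tests each; not benchmark data.
results = [
    TaskResult("task-a", 9, 9),
    TaskResult("task-b", 7, 9),
    TaskResult("task-c", 4, 9),
    TaskResult("task-d", 0, 9),
]

print(f"tasks solved: {task_success_rate(results):.1%}")      # 25.0%
print(f"tests passed: {pooled_test_pass_rate(results):.1%}")  # 55.6%
```

In this toy set, only one task is fully solved, yet more than half of the test cases pass. That is the kind of partial progress a binary task score hides.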

Bar chart of task success rate across 22 agent + model combos; Cursor + Opus 4.5, Cursor + Sonnet 4.5, and Codex + GLM 4.6 tie at 12% at the top.
Full leaderboard. Twenty-two combos, three tied at the top, a long tail in the low single digits. The ceiling is the headline; the floor is the warning.
Grouped bars showing the same model in different agents. Opus 4.5 ranges from 12% in Cursor to 2% in OpenCode — a 6× spread.
Same model, four agents. Opus 4.5 goes from 12% (Cursor) to 2% (OpenCode). The scaffold decides as much of the outcome as the model itself.
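The 6× figure is simply the ratio of the best to the worst task success rate for one model across agents. A small sketch of that arithmetic follows, using only the two Opus 4.5 endpoints quoted above; the function name `scaffold_spread` is an illustrative choice, not something taken from the paper.

```python
def scaffold_spread(rate_by_agent):
    """Ratio of the best to the worst success rate for a single model."""
    best = max(rate_by_agent.values())
    worst = min(rate_by_agent.values())
    if worst == 0:
        # A ratio is undefined when the weakest agent solves nothing.
        return float("inf")
    return best / worst


# Task success rates (%) for Opus 4.5, as quoted in the text above.
opus_45 = {"Cursor": 12.0, "OpenCode": 2.0}

print(scaffold_spread(opus_45))  # 6.0
```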

What we found

  • Same model, different agent — up to 6× spread. Scaffolding rivals model choice in importance.
  • Simple beats elaborate. A "Defensive Programming" prompt outperforms more elaborate prompting strategies by 7.4 pp.
  • The tests-passed column matters. Tasks that look "failed" at the binary level often pass a meaningful fraction of their tests — useful signal that's invisible if you only track pass@1.
  • Complex engineering remains hard. Tasks touching 7+ files fall to 2% success, while small localized changes are far easier.
  • Production deployment practices trip agents up. Common failures include missing feature flags, data models, files, UI components, and required methods.

The same model can vary up to 6× across agents. Reports that name only the LLM are missing half the story.

Two bar charts: success rate drops with number of files modified (1-2 ~18%, 7+ ~2%) and with patch size.
Where success rates drop fastest. The more files or lines a patch has to span, the lower the success rate. Cross-module work is the unsolved part.
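Because agents return a unified diff, the number of files a patch touches can be read straight from its file headers. The sketch below shows one way to do that and to assign a size bucket. It is an assumption-laden illustration rather than the paper's analysis code: the file paths are invented, the `3-6` middle bucket is a guess (only the `1-2` and `7+` buckets are named above), and the parser relies on each file section starting with a `---` line immediately followed by a `+++` line.

```python
def _clean(path_field):
    """Drop a trailing timestamp and git's a/ or b/ prefix from a header path."""
    path = path_field.split("\t")[0].strip()
    if path.startswith(("a/", "b/")):
        path = path[2:]
    return path


def files_touched(diff_text):
    """Return the set of file paths named in a unified diff's file headers."""
    files = set()
    lines = diff_text.splitlines()
    for prev, cur in zip(lines, lines[1:]):
        if prev.startswith("--- ") and cur.startswith("+++ "):
            old, new = _clean(prev[4:]), _clean(cur[4:])
            # For deleted files the new side is /dev/null, so fall back to old.
            files.add(new if new != "/dev/null" else old)
    return files


def size_bucket(n_files):
    """Bucket a patch by file count; the middle bucket is an assumption."""
    if n_files < 1:
        raise ValueError("a patch must touch at least one file")
    if n_files <= 2:
        return "1-2"
    if n_files >= 7:
        return "7+"
    return "3-6"


# Invented two-file patch; the paths are hypothetical.
sample_diff = """\
diff --git a/Feed/FeedCell.swift b/Feed/FeedCell.swift
--- a/Feed/FeedCell.swift
+++ b/Feed/FeedCell.swift
@@ -1,2 +1,3 @@
 import UIKit
+import Combine
 final class FeedCell {}
diff --git a/Feed/FeedFlag.swift b/Feed/FeedFlag.swift
--- /dev/null
+++ b/Feed/FeedFlag.swift
@@ -0,0 +1 @@
+let isEnabled = false
"""

touched = files_touched(sample_diff)
print(sorted(touched))            # ['Feed/FeedCell.swift', 'Feed/FeedFlag.swift']
print(size_bucket(len(touched)))  # 1-2
```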
Heatmap of task success rate by task category vs agent. Data management and UI components are highest; gesture is lowest.
Category × agent heatmap. Each agent has its own shape of strength; none is uniformly best, none is uniformly worst.
Bar chart showing variance across multiple runs: Claude Code + Opus 4.5 averages 6.7% (σ=1.15), Codex + Opus 4.5 averages 4.0% (σ=0).
Stability across reruns. Numbers move between runs, but ordering between agent + model combos stays stable enough to compare.
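Rerun stability of this kind is usually summarized as a mean and a standard deviation over per-run success rates. Here is a minimal sketch using Python's `statistics` module. The per-run values are made up: they were picked only because they happen to give a mean and sample standard deviation close to the 6.7% and 1.15 quoted above. The actual per-run numbers, the number of runs, and whether the paper reports a sample or population σ are not stated on this page.

```python
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) of per-run success rates."""
    mean = statistics.mean(runs)
    # stdev needs at least two data points; treat a single run as zero spread.
    spread = statistics.stdev(runs) if len(runs) > 1 else 0.0
    return mean, spread


# Hypothetical per-run task success rates in percent; not the paper's data.
hypothetical_runs = [6.0, 6.0, 8.0]

mean, spread = summarize(hypothetical_runs)
print(f"mean={mean:.1f}% sd={spread:.2f}")  # mean=6.7% sd=1.15
```

With only a handful of reruns, a single extra solved task moves the mean by a visible amount, which is one reason to compare orderings across configurations rather than small absolute differences.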

Why it matters

First production-grounded agent benchmark. Tasks come from a shipping app, not curated open-source issues. The codebase, the specs (PRD + Figma), and the tests are all real — which means a 12% top score is real, too, and probably overstates rather than understates how much engineering remains for agents to do.
The agent matters, not just the model. A 6× spread on the same model means the field's habit of evaluating "model X" as if the scaffold were transparent is misleading. Agent + model is the unit of comparison.
Hosted-only by design. Submissions run server-side so test sets never leak into training data — a deliberately uncomfortable but contamination-resistant template for industry benchmarks more broadly.

Signal for the field

SWE-Bench Mobile is not a pessimistic result. A 12% strict task success rate shows that today's agents are still far from autonomous mobile engineers; a 28.1% top test-pass rate shows they already do meaningful partial work inside real code. The practical read is sharper: coding agents are useful copilots, but not yet ready to independently own complex mobile feature delivery.

Evaluate on SWE-Bench Mobile

Hosted challenge. Public leaderboard. Submit your agent + model and see where you land on real production iOS work. Paper now in KDD 2026 Main Conference (CCF-A).
