Paper Release · KDD 2026 Main Conference (CCF-A) · February 2026

SWE-Bench Mobile: We Asked Coding Agents to Ship a Real iOS Feature. The Best Score Is 12%.

Accepted to KDD 2026's Applied Data Science Track at the main conference (CCF-A). Built with Xiaohongshu on a real, shipping iOS app.

50 real iOS feature tickets
449 human-written tests
~500K lines of shipping code
12% best task pass rate

Can today's strongest coding agents actually ship a real mobile feature? We took 50 feature tickets straight out of Xiaohongshu's production iOS app, complete with PRDs, Figma designs, and ~500K lines of mixed Swift / Objective-C, and graded the results with 449 human-written tests. We then pointed 22 leading agent + model combos at the work. The best score is 12%. The paper has been accepted to KDD 2026's Applied Data Science Track at the main conference (CCF-A).

The takeaway: today's coding agents can complete meaningful parts of real production work, but independent delivery of industrial mobile features remains clearly out of reach. Even the strongest commercial agent + model combos clear only 12% of tasks under the strict all-tests-pass bar.
Figure: What one task looks like. A product change (before/after screen), a Figma reference, and the Swift code diff the agent has to land, graded against a hand-written test suite.

Why mobile is the honest hard test

Most coding-agent benchmarks ask a model to fix a bug it can see, against a test that already exists, inside a repo that may well have been in its pretraining data. Real mobile work is none of that. Tasks come in as feature tickets, not bug reports. The spec lives in a PRD and a Figma file, not a GitHub issue. The change usually touches UI, data, interaction, feature flags, and engineering conventions at the same time. And the codebase is hundreds of thousands of lines that the model has not seen.

SWE-Bench Mobile puts all of that in one evaluation. The question it answers: can a complete agent system take a real product requirement and land it cleanly inside a production iOS app?

What's actually in it

Task source: Xiaohongshu's production iOS app
Task type: Feature additions, not bug fixes
Codebase size: ~500K lines of mixed Swift / Objective-C
Task inputs: PRD + Figma design + reference images + codebase snapshot
Evaluation scale: 50 tasks, 449 human-written tests
Release format: Hosted challenge with a public leaderboard
Venue: KDD 2026 Main Conference (CCF-A)

The leaderboard, in one chart

Figure: Task success rate across 22 agent + model combos. The top three are tied at 12%, followed by a long tail in the low single digits; the bottom rows hover around 2%.
Figure: Same model, four agents. Opus 4.5 goes from 12% in Cursor to 2% in OpenCode, a 6× spread. The scaffold decides as much of the outcome as the model itself.
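A quick back-of-the-envelope check on what these percentages mean in whole tasks. This is only a sketch: it assumes every combo was scored on all 50 tasks, which the post does not state explicitly, and the function name is illustrative.

```python
TOTAL_TASKS = 50  # benchmark size stated in the post

def tasks_solved(pass_rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a task pass rate in percent into a whole number of tasks."""
    return round(pass_rate_percent / 100 * total)

best = tasks_solved(12)   # Opus 4.5 in Cursor
worst = tasks_solved(2)   # Opus 4.5 in OpenCode

print(best, worst, best / worst)  # 6 1 6.0
```

Under that assumption each task is worth 2 percentage points, so a 2% row corresponds to a single solved task and the 6× spread is the difference between six tasks and one.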

Five findings backed by the data

  • Same model, different agent — up to 6× spread. It's the scaffold, not just the LLM. Reports that name only the model are missing half the story.
  • Top score 12%. Top test pass rate 28.1%. Agents land partial wins, then trip over production details.
  • Simpler beats fancier. A plain "Defensive Programming" prompt outperforms more elaborate strategies by +7.4 pp on test pass rate.
  • Cross-module work remains the weakest link. Tasks touching 7+ files drop to 2% success. Localized changes are far easier.
  • What breaks the patch isn't the code — it's the conventions. Missing feature flags, half-built data models, the one file no one remembered, UI components that don't match the rest of the app.
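To make the gap between the 12% task pass rate and the 28.1% test pass rate concrete, here is a minimal sketch of the two metrics. The per-task results are made up for illustration, and pooling tests across tasks is an assumption; the post does not spell out how the test pass rate is aggregated.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    passed: int  # tests that passed for this task
    total: int   # tests written for this task

def task_pass_rate(results):
    """Share of tasks where every test passes (the strict bar)."""
    solved = sum(1 for r in results if r.passed == r.total)
    return solved / len(results)

def pooled_test_pass_rate(results):
    """Share of individual tests that pass, pooled across tasks (assumed)."""
    return sum(r.passed for r in results) / sum(r.total for r in results)

# Hypothetical results, not taken from the benchmark.
results = [
    TaskResult("task-a", 8, 8),   # fully solved
    TaskResult("task-b", 7, 10),  # partial win: counts toward tests, not tasks
    TaskResult("task-c", 0, 6),
    TaskResult("task-d", 3, 12),
]

print(f"task pass rate: {task_pass_rate(results):.1%}")         # 25.0%
print(f"test pass rate: {pooled_test_pass_rate(results):.1%}")  # 50.0%
```

A patch that passes most of a task's tests still scores zero on the strict metric, which is why the test pass rate can sit well above the task pass rate.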

Submit your stack to the leaderboard

SWE-Bench Mobile is a hosted challenge for coding-agent teams, foundation-model vendors, and mobile-development researchers. Submissions run server-side, so the test set never leaks into anyone's training data. It offers a common reference point for evaluation that stays close to real, shipping mobile engineering.


The hosted challenge takes submissions today; one submission places you alongside every other team. Project page and public leaderboard are live; the paper is on arXiv and now in the KDD 2026 Main Conference proceedings (CCF-A).

Project & Leaderboard · arXiv:2602.09540