Research

Pushing the frontier of reasoning, code, and Chinese-language AI.

We work on agentic coding, multimodal evaluation, reasoning training, low-resource NLP, and model auditing, writing each result up as a bilingual explainer for researchers and general readers alike.

7 papers · 6 research themes · Updated June 2026

2026

5 papers

arXiv submitted · link incoming

SafeGEO

Paid search once pushed dubious hospitals in front of patients; today sellers can rewrite pages so AI favors them. SafeGEO measures the real scale of that risk — and the room for defense — across 600 recommendation cases and 22 attack variants.

AI Safety · Recommendation Agents · GEO

KDD 2026 (CCF-A) · with Xiaohongshu

SWE-Bench Mobile

50 real iOS feature tasks, 449 human-written tests, ~500K lines of production code, and a best score of 12%.

Coding Agents · Benchmark · Xiaohongshu

arXiv · 2026

OasisSimp

Turning “remitted prior to the commencement date” into “pay before the start date” decides whether public information is readable, yet outside English this kind of simplification has barely been evaluated. OasisSimp has native speakers write multi-reference simplifications for 9,519 sentences in five languages, all openly released.

Multilingual NLP · Dataset

arXiv · 2026

Grounded Chess Reasoning

Engines are like master craftsmen who cannot teach: accurate, but unable to explain. Master Distillation gets a 4B model to 48.1% puzzle accuracy, above most frontier models, with far shorter explanations.

Distillation · RLVR

arXiv · 2026

ThinkTwice

A student who never checks their paper won’t fix mistakes; neither will models trained the standard way. ThinkTwice trains solving and revision with one correctness reward, gaining 11.5 points on AIME pass@4.

RLVR · Self-refinement

2025

1 paper

COLM 2025

SEAM

A medical report should read the same in any format; yet the same question as text or image often gets different model answers. SEAM quantifies cross-modal consistency for 21 models across four domains and 9,600 evaluations.

Multimodal · Benchmark

2024

1 paper

NeurIPS SoLaR 2024 · Spotlight

Report Cards

Two students with the same total can have completely different weak spots — so can models. Report Cards write a teacher’s-comments-style report for each model, verified with contrastive accuracy, Card Elo, and human scoring.

Evaluation · Interpretability

Milestones

2026 · Five research lines released
- SafeGEO · GEO risk evaluation for recommendation agents
- SWE-Bench Mobile · KDD 2026
- ThinkTwice · self-refinement RLVR
- Grounded Chess Reasoning · Master Distillation
- OasisSimp · low-resource simplification dataset
August 2025 · SEAM accepted at COLM 2025
December 2024 · Report Cards receives NeurIPS SoLaR Spotlight