We work on agentic coding, multimodal evaluation, reasoning training, low-resource NLP, and model auditing, writing each result up as a bilingual explainer for researchers and general readers alike.
arXiv submitted · link incoming
Paid search once pushed dubious hospitals in front of patients; today sellers can rewrite pages so AI favors them. SafeGEO measures the real scale of that risk — and the room for defense — across 600 recommendation cases and 22 attack variants.
AI Safety · Recommendation Agents · GEO
KDD 2026 (CCF-A) · with Xiaohongshu
50 real iOS feature tasks, 449 human-written tests, ~500K lines of production code, and a best score of 12%.
Coding Agents · Benchmark · Xiaohongshu
arXiv · 2026
Turning “remitted prior to the commencement date” into “pay before the start date” decides whether public information is readable, yet outside English this kind of simplification has had almost no evaluation. OasisSimp has native speakers write multi-reference simplifications for 9,519 sentences in five languages, all released openly.
Multilingual NLP · Dataset
arXiv · 2026
Engines are like master craftsmen who cannot teach: accurate, but unable to explain. Master Distillation gets a 4B model to 48.1% puzzle accuracy, above most frontier models, with far shorter explanations.
Distillation · RLVR
arXiv · 2026
A student who never checks their paper won’t fix their mistakes, and neither will a model trained the standard way. ThinkTwice trains solving and revision with a single correctness reward, gaining 11.5 points on AIME pass@4.
RLVR · Self-refinement