Yanlan V3.0 is live — and it catches almost half the errors V2.0 missed
V3.0 sweeps all 8 headline metrics over V2.0 and takes 43 of 47 total comparisons.
Updates from Coolwei AI Lab: research published, products shipped, and moments worth recording.
By winning all 8 headline metrics and 43 of 47 total comparisons against V2.0, V3.0 sets a stronger bar for pre-publication Chinese correction.
Models are like students who grind through problem sets but never check their work: they solve, but they don’t fix. ThinkTwice trains “checking” into a skill, with an 11.5-point gain on AIME pass@4.
Engines are like master craftsmen who cannot teach: accurate, but unable to explain. Master Distillation gives a 4B model concise puzzle commentary that surpasses its teacher.
Rewriting official prose into plain language has no yardstick in most languages. Five languages, 9,519 sentences, written by native speakers — the first open evaluation for low-resource simplification.
50 real iOS feature tasks, 449 human-written tests, ~500K lines of production code, and a best task pass rate of 12%.
A report should read the same in any format; models often disagree with themselves. SEAM quantifies cross-modal inconsistency in 21 vision-language models across chess, molecules, scores, and graphs.
The start of a long-running collaboration to evaluate coding agents on real mobile production codebases.
Focused on safe deployment, evaluation, and real-world applications of large language models.
Like a teacher writing comments, Report Cards auto-write behavior reports for models — verified to genuinely help people tell models apart. A NeurIPS SoLaR Spotlight.