Models with similar average scores can fail in completely different ways. Report Cards automatically summarize model behavior in natural-language reports, then evaluate the reports themselves.
Launch Highlights
- Problem: A single average score hides where a model succeeds, fails, and changes behavior.
- Method: Report Cards generate natural-language behavior summaries and evaluate them with contrastive, Elo, and human scoring.
- Finding: Strong reports compress many examples into evidence that helps people tell models apart.
- Why it matters: Evaluation results can feed product review, model choice, and safe deployment, rather than stopping at a leaderboard.
Continue reading
The research page covers the background, method, key figures, and paper links; for quick sharing, use the illustrated promo copy.