Report Cards receives NeurIPS SoLaR Spotlight

Models with similar average scores can fail in completely different ways. Report Cards automatically write up model behavior as natural-language reports and verify the reports themselves.

Launch Highlights

Problem: A single average score hides where a model succeeds, fails, and changes behavior.
Method: Report Cards generate natural-language behavior summaries and evaluate them with contrastive, Elo, and human scoring.
Finding: Strong reports compress many examples into evidence that helps people tell models apart.
Why it matters: Evaluation results can feed product review, model choice, and safe deployment, rather than stopping at a leaderboard.
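
For readers unfamiliar with Elo scoring, the sketch below shows the standard Elo update applied to pairwise preferences between two reports. It is only an illustration under stated assumptions: the names `report_x` and `report_y`, the starting rating of 1000, the K-factor of 32, and the list of judge outcomes are all hypothetical, and the exact procedure used by Report Cards may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A when compared against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    The K-factor of 32 is an assumed default, not a value from the source.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical reports, both starting from the same assumed rating.
ratings = {"report_x": 1000.0, "report_y": 1000.0}

# Hypothetical judge outcomes as (preferred, not preferred) pairs.
outcomes = [
    ("report_x", "report_y"),
    ("report_x", "report_y"),
    ("report_y", "report_x"),
]

for winner, loser in outcomes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating is conserved, and a report that is preferred more often ends up with the higher rating.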

Continue reading

The research page covers the background, method, key figures, and paper links; for quick sharing, use the illustrated promo copy.

