
Report Cards

Qualitative Evaluation of Language Models

NeurIPS 2024 SoLaR Workshop Spotlight | Best Paper Nomination

Paper Details

Liwei Yang, First Author
Jimmy Ba, Corresponding Author
Other Collaborators
University of Toronto

Abstract

Traditional quantitative benchmarks struggle to capture the true capabilities of large language models. We propose Report Cards, a novel evaluation framework that generates human-interpretable natural language summaries of a model's behavior on specific skills. Built around three core criteria, Specificity, Faithfulness, and Interpretability, the framework uses a fully automated iterative algorithm to produce detailed, reliable analyses of model behavior without human supervision.
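As a rough illustration of what such a fully automated iterative loop might look like in practice, the minimal Python sketch below drafts a report from a first batch of question-completion pairs and then revises it against each subsequent batch. Everything in it is an assumption made for illustration: the generate_text callable stands in for any text-generation backend, and the prompts and batch size are placeholders rather than the paper's actual procedure.

```python
# Minimal sketch of an iterative report-generation loop (hypothetical;
# function names, prompts, and batch size are illustrative assumptions,
# not the paper's actual algorithm).

from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, model completion)


def build_report(
    samples: List[QAPair],
    generate_text: Callable[[str], str],  # any text-generation backend
    batch_size: int = 8,
) -> str:
    """Draft a skill report from the first batch, then refine it on later batches."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    report = ""
    for batch in batches:
        transcript = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in batch)
        if not report:
            # First pass: draft an initial summary from the opening batch.
            prompt = (
                "Write a specific, faithful, human-readable summary of the "
                f"model's skill, based on these examples:\n{transcript}"
            )
        else:
            # Later passes: revise the current report against new evidence.
            prompt = (
                f"Current report:\n{report}\n\n"
                f"New examples:\n{transcript}\n\n"
                "Revise the report so it stays accurate and specific for all "
                "examples seen so far, avoiding vague or generic statements."
            )
        report = generate_text(prompt)
    return report
```

The point of the sketch is the loop structure: no human labels are required, since the only inputs are the evaluated model's completions and a second language model used as the summarizer.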

Core Methodology

Specificity

Generated reports must describe model performance on specific tasks precisely, avoiding vague or generic statements.

Faithfulness

Reports must accurately reflect the model's actual behavior, verified through adversarial testing of report accuracy (see the sketch after this section).

Interpretability

Reports must be easy for humans to understand, offering clear insights rather than technical jargon.
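One way to make the adversarial verification of Faithfulness concrete is a contrastive guessing test: a judge sees two report cards and a completion produced by one of the two models, and must attribute the completion to the correct model. The sketch below is an illustrative scoring harness under that assumption; the judge callable, the prompts, and the data format are hypothetical and not necessarily the paper's exact protocol.

```python
# Illustrative contrastive faithfulness check (hypothetical harness;
# the judge, prompts, and data format are assumptions for illustration).

import random
from typing import Callable, List, Tuple


def contrastive_accuracy(
    report_a: str,
    report_b: str,
    completions: List[Tuple[str, str]],  # (completion text, true source: "A" or "B")
    judge: Callable[[str], str],         # returns an answer starting with "A" or "B"
) -> float:
    """Fraction of completions the judge attributes to the correct model,
    using only the two report cards as evidence."""
    reports = {"A": report_a, "B": report_b}
    correct = 0
    for text, true_label in completions:
        # Randomize display order so the judge cannot exploit positional bias.
        first, second = ("A", "B") if random.random() < 0.5 else ("B", "A")
        prompt = (
            f"Report {first}:\n{reports[first]}\n\n"
            f"Report {second}:\n{reports[second]}\n\n"
            f"Completion:\n{text}\n\n"
            "Which model most likely produced this completion? Answer A or B."
        )
        if judge(prompt).strip().upper().startswith(true_label):
            correct += 1
    return correct / len(completions) if completions else 0.0
```

Under this framing, a faithful pair of reports should push the judge's accuracy well above chance (0.5), whereas vague or inaccurate reports leave the judge guessing.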

Key Innovation

The first fully automated qualitative evaluation framework
Goes beyond traditional quantitative benchmarks by providing deeper behavioral insights
Establishes a new paradigm for evaluating large language models

Experimental Results & Impact

Performance Improvements

Compared with traditional evaluation methods, Report Cards uncover finer-grained patterns of model behavior, offering insights that numerical scores alone cannot provide.

Wide Applicability

The framework has been validated on mainstream large language models, including the GPT series, Claude, and LLaMA, demonstrating strong generalization.

Future Outlook

This work opens a new direction for model evaluation and has the potential to become a standard approach for assessing next-generation AI systems.
