Qualitative Evaluation of Language Models
NeurIPS 2024 SoLaR Workshop Spotlight | Best Paper Nomination
Traditional quantitative benchmarks struggle to capture the true capabilities of large language models. We propose Report Cards, an evaluation framework that generates human-interpretable natural-language summaries of a model's behavior on specific skills. Built on three core criteria, Specificity, Faithfulness, and Interpretability, the framework uses a fully automated iterative algorithm to produce detailed, reliable reports of model behavior without human supervision.
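The iterative generation loop can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the LLM revision step is stubbed out by a hypothetical `draft_update` function, and the batch size is an arbitrary assumption.

```python
# Sketch of iterative report-card generation: the report is revised as
# successive batches of (question, answer) transcripts are observed.
# `draft_update` stands in for an LLM call that rewrites the report.

def draft_update(report: str, batch: list[tuple[str, str]]) -> str:
    """Stub for an LLM revision step (hypothetical; real system prompts an LLM)."""
    observed = "; ".join(f"Q: {q} -> A: {a}" for q, a in batch)
    return (report + " | " if report else "") + f"Observed: {observed}"

def generate_report_card(transcripts: list[tuple[str, str]], batch_size: int = 2) -> str:
    """Iteratively refine a natural-language report from batches of transcripts."""
    report = ""
    for i in range(0, len(transcripts), batch_size):
        batch = transcripts[i : i + batch_size]
        report = draft_update(report, batch)  # in practice: an LLM revision call
    return report
```

The key design point is that each batch only revises the existing summary, so the report accumulates behavioral evidence without any human in the loop.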
Specificity: a report must precisely describe the model's performance on the target skill, avoiding vague, generic descriptions.
Faithfulness: a report must reflect the model's actual behavior; its accuracy is verified through adversarial testing.
Interpretability: a report must be easily understood by humans, offering clear insight rather than technical jargon.
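The adversarial faithfulness check can be illustrated with a contrastive guessing game: shown two report cards and one unlabeled answer, a judge must identify which model produced the answer. Everything below is a hedged sketch under assumed names; the keyword-overlap `guess_author` merely stands in for an LLM judge.

```python
# Contrastive faithfulness sketch: if reports are faithful, a judge can
# match unlabeled answers to the model each report describes.

def guess_author(report_a: str, report_b: str, answer: str) -> str:
    """Pick the report whose vocabulary overlaps the answer more (stub for an LLM judge)."""
    words = set(answer.lower().split())
    score_a = len(words & set(report_a.lower().split()))
    score_b = len(words & set(report_b.lower().split()))
    return "A" if score_a >= score_b else "B"

def contrastive_accuracy(report_a: str, report_b: str,
                         labeled_answers: list[tuple[str, str]]) -> float:
    """Fraction of answers whose true author ('A' or 'B') the judge recovers."""
    correct = sum(guess_author(report_a, report_b, ans) == label
                  for ans, label in labeled_answers)
    return correct / len(labeled_answers)
```

Higher contrastive accuracy indicates that the reports discriminate between models based on real behavioral differences, which is the sense in which faithfulness is adversarially verified.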
First fully automated qualitative evaluation framework for language models
Complements traditional quantitative benchmarks with deeper behavioral insight
Offers a new paradigm for evaluating large language models
Compared with traditional evaluation methods, Report Cards surface finer-grained patterns of model behavior, yielding insights that a single numerical score cannot convey.
The framework has been validated on mainstream large models, including the GPT series, Claude, and LLaMA, demonstrating strong generalization across model families.
This work opens a new direction for model evaluation and may inform how next-generation AI systems are assessed.