
SimpleQA: Measuring Short-form Factuality in LLMs

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus
OpenAI
arXiv, 11/2024
Factuality, Benchmark, QA

📝 Paper Summary

Factuality Evaluation, Hallucination Suppression
SimpleQA is a benchmark of 4,326 short, fact-seeking questions, each with a single indisputable answer. It is designed to measure whether frontier language models answer correctly and decline to answer when they are unsure.
Core Problem
Measuring factuality in language models is difficult because evaluating arbitrary long-form claims is challenging and open-ended, making it hard to distinguish hallucinations from stylistic differences.
Why it matters:
  • Current frontier models frequently produce hallucinations (false outputs not substantiated by evidence), which hinders broader adoption of AI
  • Existing benchmarks like TriviaQA and Natural Questions are now saturated (too easy) for modern models
  • Evaluation of long-form responses is intractable; reducing scope to short, verifiable facts allows for precise measurement
Concrete Example: When asked 'Where did Barack and Michelle Obama meet?', a model might answer 'Chicago' or 'Sidley & Austin'. Without strict scoping criteria (e.g., 'which city'), evaluating the correctness of such open-ended answers is difficult and prone to noise.
Key Novelty
SimpleQA: Adversarial, Short-Form Factuality Benchmark
  • Restricts evaluation to short, fact-seeking questions with a single, indisputable answer (e.g., dates, names) to ensure high grading reliability
  • Uses adversarial collection: human trainers specifically wrote questions that GPT-4 answered incorrectly, ensuring the benchmark is challenging for frontier models
  • Evaluates not just accuracy but 'knowing what you know' by grading each answer as Correct, Incorrect, or Not Attempted, so that a hallucinated answer is scored differently from declining to answer
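The three-way grading above can be turned into summary numbers in a few lines. The following is a minimal sketch, not the benchmark's official scoring code: the label strings, the function name `summarize_grades`, and the choice of reported ratios are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label strings for the three grades named in the summary.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Aggregate per-question grades into simple rates.

    `grades` is a list of strings, each one of GRADES.
    """
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly.
        "overall_correct": counts["correct"] / total if total else 0.0,
        # Share of attempted questions answered correctly; abstaining
        # on hard questions raises this without changing the line above.
        "correct_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
        # Share of questions the model declined to answer.
        "not_attempted_rate": counts["not_attempted"] / total if total else 0.0,
    }


if __name__ == "__main__":
    demo = ["correct"] * 3 + ["incorrect"] * 1 + ["not_attempted"] * 4
    print(summarize_grades(demo))
```

Reporting both ratios makes the trade-off visible: a model that guesses on everything and a model that abstains when unsure can share the same overall-correct rate while differing sharply in how often their attempted answers are wrong.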
Evaluation Highlights
  • Frontier models GPT-4o and Claude-3.5 Sonnet both score less than 40% correct on SimpleQA, confirming the benchmark's difficulty
  • Larger models (o1-preview) are better calibrated than smaller ones (o1-mini), showing a stronger correlation between how frequently an answer is given across repeated samples and whether it is correct
  • Models consistently overstate their confidence: even when stating high confidence, actual accuracy falls well below the ideal y=x calibration line
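The comparison against the y=x line can be illustrated by bucketing answers by stated confidence and computing accuracy per bucket. This is a sketch under stated assumptions: the input format (pairs of stated confidence in [0, 1] and a correctness flag), the function name `calibration_table`, and the equal-width binning are all illustrative choices, not the paper's exact procedure.

```python
def calibration_table(records, n_bins=10):
    """Group (stated_confidence, is_correct) pairs into equal-width bins.

    Returns a list of (bin_low, bin_high, count, accuracy) tuples for
    non-empty bins. A perfectly calibrated model would have accuracy
    close to the confidence range of each bin (the y=x line).
    """
    # Each bin tracks [number of answers, number of correct answers].
    bins = [[0, 0] for _ in range(n_bins)]
    for confidence, is_correct in records:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index][0] += 1
        bins[index][1] += int(is_correct)

    rows = []
    for i, (count, num_correct) in enumerate(bins):
        if count == 0:
            continue
        rows.append((i / n_bins, (i + 1) / n_bins, count, num_correct / count))
    return rows


if __name__ == "__main__":
    # Made-up records for illustration only, not results from the paper.
    demo = [(0.95, True), (0.95, False), (0.92, False), (0.55, True)]
    for low, high, count, accuracy in calibration_table(demo):
        print(f"{low:.1f}-{high:.1f}: n={count}, accuracy={accuracy:.2f}")
```

An overconfident model shows up as bins whose accuracy sits below their confidence range, which is the pattern the bullet above describes.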
Breakthrough Assessment
8/10
Highly practical benchmark that solves the 'gradeability' crisis in factuality evals. While methodologically simple, its adversarial nature and focus on calibration provide a standard for the next generation of models.