SimpleQA: Measuring Short-form Factuality in LLMs

📝 Paper Summary

Factuality Evaluation, Hallucination Suppression

SimpleQA is a benchmark of 4,326 short, fact-seeking questions with single indisputable answers, designed to measure whether frontier language models can correctly answer or refuse to answer when unsure.

Core Problem

Measuring factuality in language models is difficult because arbitrary long-form responses contain many open-ended claims, which makes it hard to distinguish hallucinations from stylistic differences.

Why it matters:

Current frontier models frequently produce hallucinations (false outputs not substantiated by evidence), which hinders broader adoption of AI
Existing benchmarks like TriviaQA and Natural Questions are now saturated (too easy) for modern models
Long-form responses are hard to grade reliably; reducing scope to short, verifiable facts allows for precise measurement

Concrete Example: When asked 'Where did Barack and Michelle Obama meet?', a model might answer 'Chicago' or 'Sidley & Austin'. Without strict scoping criteria (e.g., 'which city'), evaluating the correctness of such open-ended answers is difficult and prone to noise.

Key Novelty

SimpleQA: Adversarial, Short-Form Factuality Benchmark

Restricts evaluation to short, fact-seeking questions with a SINGLE, INDISPUTABLE answer (e.g., dates, names) to ensure high grading reliability
Uses adversarial collection where human trainers specifically wrote questions that tricked GPT-4, ensuring the benchmark is challenging for frontier models
Evaluates not just accuracy but 'knowing what you know' by grading answers as Correct, Incorrect, or Not Attempted, penalizing hallucinations

Evaluation Highlights

Frontier models GPT-4o and Claude-3.5 Sonnet both score less than 40% correct on SimpleQA, confirming the benchmark's difficulty
Larger models (o1-preview) are better calibrated than smaller ones (o1-mini): how often they repeat the same answer correlates more strongly with whether that answer is correct
Models consistently overstate their confidence: even when stating high confidence, actual accuracy falls well below the ideal y=x calibration line

Breakthrough Assessment

8/10

Highly practical benchmark that solves the 'gradeability' crisis in factuality evals. While methodologically simple, its adversarial nature and focus on calibration provide a standard for the next generation of models.

⚙️ Technical Details

Problem Definition

Setting: Short-form Fact-seeking Question Answering

Inputs: Short question q asking for a specific fact (date, name, number)

Outputs: Short answer string a or refusal to answer

Pipeline Flow

Question Generation (Human AI Trainers)
Adversarial Filtering (Check against GPT-4)
Verification (2nd Independent Trainer)
Autograding (ChatGPT Classifier)
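
The data-creation half of this flow (the first three steps) can be pictured as a filter over candidate questions. The following is a minimal sketch, not the authors' tooling: `Candidate`, `build_dataset`, and the three callables are hypothetical names, and the exact failure criterion used against GPT-4 is not stated in this summary, so it is left as a pluggable predicate.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A trainer-written question with its reference answer (hypothetical type)."""
    question: str
    reference_answer: str


def build_dataset(
    candidates: Iterable[Candidate],
    model_fails: Callable[[Candidate], bool],
    second_trainer_answer: Callable[[str], str],
    answers_match: Callable[[str, str], bool],
) -> List[Candidate]:
    """Keep only questions that pass both the adversarial and verification steps."""
    kept: List[Candidate] = []
    for cand in candidates:
        # Adversarial filtering: drop questions the target model already answers.
        # The precise criterion is an assumption supplied by the caller.
        if not model_fails(cand):
            continue
        # Verification: a second trainer answers without seeing the reference.
        independent = second_trainer_answer(cand.question)
        if not answers_match(independent, cand.reference_answer):
            continue
        kept.append(cand)
    return kept
```

Passing the three steps as callables keeps the sketch independent of how the model is queried or how trainers are consulted, neither of which this summary specifies.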

System Modules

Human AI Trainers (Data Creation)

Create questions with single, timeless, evidence-backed answers

Model or implementation: Human Annotators

Adversarial Filter (Data Creation)

Ensure questions are challenging by checking if models fail them

Model or implementation: GPT-4 / GPT-3.5

Verifier

Independently answer questions to ensure single indisputable answer

Model or implementation: Human Annotators (Second Trainer)

Grader

Grade model responses against reference answers

Model or implementation: ChatGPT (Prompted Classifier)

Novel Architectural Elements

Adversarial data collection pipeline targeting specific model failures (GPT-4) to prevent benchmark saturation
Strict 'single indisputable answer' constraint enforced by independent verification from a second trainer

Modeling

Base Model: Evaluated multiple models: gpt-4o, gpt-4o-mini, o1-preview, o1-mini, claude-3-5-sonnet, claude-3-opus, claude-3-haiku

Comparison to Prior Work

vs. TriviaQA/Natural Questions: SimpleQA is adversarially collected to be harder (uncorrelated with training data frequency) and ensures single indisputable answers
vs. LongFact: SimpleQA focuses on short-form answers to isolate factuality from writing style and enable easier grading
vs. FreshQA: SimpleQA focuses on 'evergreen' facts (timeless answers) rather than changing news

Limitations

Limits factuality measurement to short, simple answers; may not correlate with long-form factuality capabilities
F-score metric incentivizes guessing if model confidence is >50%
Small error rate (~3%) remains due to ambiguous questions or contradictory sources despite verification
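
The F-score limitation above can be illustrated numerically. The sketch below is a toy calculation under stated assumptions, not an analysis from the paper: the function names and the example counts (100 questions, 60 attempted, 45 correct) are hypothetical. It compares the F-score when a model abstains on one remaining question with the expected F-score when it attempts that question with probability `p` of being right.

```python
def f_score(correct: int, attempted: int, total: int) -> float:
    """Harmonic mean of overall-correct and correct-given-attempted."""
    if correct == 0:
        return 0.0
    overall = correct / total
    given_attempted = correct / attempted
    return 2 * overall * given_attempted / (overall + given_attempted)


def expected_f_if_attempt(correct: int, attempted: int, total: int, p: float) -> float:
    """Expected F-score if one more question is attempted and is right with probability p."""
    f_right = f_score(correct + 1, attempted + 1, total)
    f_wrong = f_score(correct, attempted + 1, total)
    return p * f_right + (1 - p) * f_wrong


# Hypothetical state: `total` already counts the one question still undecided.
total, attempted, correct = 100, 60, 45
abstain = f_score(correct, attempted, total)
confident_guess = expected_f_if_attempt(correct, attempted, total, p=0.6)
unsure_guess = expected_f_if_attempt(correct, attempted, total, p=0.2)
print(f"abstain={abstain:.4f} guess@0.6={confident_guess:.4f} guess@0.2={unsure_guess:.4f}")
```

In this toy state, attempting at 60% confidence raises the expected F-score while attempting at 20% lowers it, so the metric rewards a guess even though a wrong answer is arguably worse than a refusal.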

Reproducibility

Code: https://github.com/openai/simple-evals

The evaluation code and the dataset of 4,326 questions are publicly available in the repository above. The grading prompt is provided in Appendix A of the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot question answering with autograding

Benchmarks:

SimpleQA (Short-form Fact-seeking QA) [New]

Metrics:

Overall Correct (%)
Correct Given Attempted (%)
F-score (harmonic mean of Correct and Correct Given Attempted)
Statistical methodology: Not explicitly reported in the paper
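
The three metrics above follow directly from per-question grades. Here is a minimal sketch, assuming each response has already been graded into one of three labels; the label strings and the function name `simpleqa_metrics` are illustrative choices, not identifiers from the released code.

```python
from collections import Counter
from typing import Dict, Iterable

VALID_GRADES = {"correct", "incorrect", "not_attempted"}  # assumed label names


def simpleqa_metrics(grades: Iterable[str]) -> Dict[str, float]:
    """Compute overall correct, correct given attempted, and their harmonic mean."""
    grades = list(grades)
    unknown = set(grades) - VALID_GRADES
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    overall_correct = correct / total if total else 0.0
    correct_given_attempted = correct / attempted if attempted else 0.0
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


example = ["correct"] * 4 + ["incorrect"] * 2 + ["not_attempted"] * 4
print(simpleqa_metrics(example))
```

With 4 correct, 2 incorrect, and 4 not attempted, overall correct is 0.4, correct given attempted is 4/6, and the F-score is 0.5.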

Key Results

OpenAI models: performance shows scaling behavior, with larger models outperforming smaller ones.

Benchmark	Metric	Baseline	This Paper	Δ
SimpleQA	Overall Correct (%)	8.6	38.2	+29.6
SimpleQA	Overall Correct (%)	24.7	41.6	+16.9

Anthropic models: Claude 3.5 Sonnet performs competitively but attempts fewer questions.

Benchmark	Metric	Baseline	This Paper	Δ
SimpleQA	Overall Correct (%)	6.8	24.4	+17.6
SimpleQA	Not Attempted (%)	12.8	41.6	+28.8

Experiment Figures

Calibration plot: Stated Confidence (x-axis) vs. Actual Accuracy (y-axis) for 4 OpenAI models

Calibration plot: Frequency of Same Answer (x-axis) vs. Accuracy (y-axis) over 100 samples
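
The x-axis of the second plot can be derived from repeated samples of the same question. The sketch below is one way to do it, under assumptions: the function names are hypothetical, and the light normalization (lowercasing, collapsing whitespace) before matching strings is a choice made here for illustration rather than something this summary specifies.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Assumed normalization before string matching: lowercase, collapse whitespace."""
    return " ".join(answer.lower().split())


def answer_frequency(samples: List[str]) -> Tuple[str, float]:
    """Return the most common sampled answer and the fraction of samples it covers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


def frequency_point(samples: List[str], is_correct: Callable[[str], bool]) -> Tuple[float, bool]:
    """One (frequency, correct?) point per question, ready to be grouped and plotted."""
    answer, freq = answer_frequency(samples)
    return freq, is_correct(answer)


# Hypothetical example: 100 samples of one question drawn at temperature 1.
samples = ["Paris"] * 70 + ["Lyon"] * 30
print(frequency_point(samples, is_correct=lambda a: a == "paris"))
```

Collecting one such point per question and grouping the points by frequency gives the accuracy-versus-frequency curve; a well-calibrated model would be right about 70% of the time on questions where its top answer appears in 70% of samples.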

Main Takeaways

SimpleQA is challenging: even top models (o1-preview, gpt-4o) score below 45% correct, while older benchmarks are saturated.
Larger models are better calibrated: o1-preview and gpt-4o show better correlation between confidence and accuracy than their mini counterparts.
Models are generally overconfident: stated confidence consistently exceeds actual accuracy (performance below y=x line).
Claude models exhibit different behavior, attempting significantly fewer questions (higher refusal rates) compared to GPT-4o models.
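
The overconfidence takeaway corresponds to a simple computation: group answers by stated confidence and compare each group's mean confidence with its accuracy. The following is a minimal sketch under assumptions, with a hypothetical function name, equal-width bins, and made-up records; it is not the paper's plotting code.

```python
from typing import Iterable, List, Tuple


def calibration_bins(
    records: Iterable[Tuple[float, bool]], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Group (stated_confidence in [0, 1], is_correct) pairs into equal-width bins.

    Returns (mean_confidence, accuracy, count) for each non-empty bin.
    """
    conf_sum = [0.0] * n_bins
    n_correct = [0] * n_bins
    n_items = [0] * n_bins
    for confidence, is_correct in records:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        conf_sum[idx] += confidence
        n_correct[idx] += int(is_correct)
        n_items[idx] += 1
    return [
        (conf_sum[i] / n_items[i], n_correct[i] / n_items[i], n_items[i])
        for i in range(n_bins)
        if n_items[i]
    ]


# Made-up records for illustration only.
records = [(0.9, True), (0.9, False), (0.95, False), (0.95, False), (0.5, True), (0.5, False)]
for mean_conf, accuracy, count in calibration_bins(records):
    print(f"confidence={mean_conf:.3f} accuracy={accuracy:.2f} n={count}")
```

A bin whose accuracy is lower than its mean confidence lies below the y=x line, which is the pattern the takeaway describes.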

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) hallucinations
Familiarity with QA benchmarks (TriviaQA, NQ)
Basic probability concepts for calibration (confidence vs accuracy)

Key Terms

Hallucination: When a language model generates false information or answers not substantiated by evidence

Calibration: The alignment between a model's predicted confidence (e.g., 'I am 80% sure') and its actual accuracy (being correct 80% of the time)

Adversarial collection: Data collection method where annotators specifically create examples that cause a target model (here, GPT-4) to fail

F-score: In this paper, the harmonic mean of 'overall correct' percentage and 'correct given attempted' percentage

Frontier models: The most capable, state-of-the-art large language models currently available (e.g., GPT-4o, Claude 3.5 Sonnet)

Saturated benchmarks: Benchmarks where model performance has reached a ceiling (e.g., near human level), rendering them less useful for distinguishing between new, better models

AI trainers: Human annotators employed to create data and grade model outputs

Temperature 1: A sampling parameter for LLMs; setting it to 1 introduces randomness, allowing the model to generate different answers on repeated attempts

String match: Comparing two text strings character-by-character to see if they are identical