https://arxiv.org/abs/2501.14249

📝 Paper Summary

LLM Benchmarking, Reasoning Evaluation

HLE is a new, expert-curated multi-modal benchmark designed to be the final closed-ended academic exam for LLMs, containing problems that currently stump frontier models.

Core Problem

Existing academic benchmarks like MMLU are saturated, with LLMs achieving >90% accuracy, making it impossible to distinguish capabilities at the frontier of human knowledge.

Why it matters:

Saturation obscures the true gap between AI and expert human capabilities, hindering informed research and policy decisions.
Current benchmarks often rely on retrievable information rather than deep reasoning, allowing models to memorize answers rather than solve problems.

Concrete Example: Frontier models achieve near-perfect scores on MMLU, yet fail on HLE questions like translating Palmyrene script or solving specific chemical reaction cascades (see Figure 2), showing they lack true expert proficiency.

Key Novelty

Expert-Sourced Frontier Hardness

Filters questions via a 'negative check' against current LLMs: only questions that state-of-the-art models fail to answer are accepted.
Crowdsources content from ~1,000 subject-matter experts (mostly PhDs/graduates) rather than generic annotators to ensure depth and precision.

Architecture

The dataset creation pipeline from submission to release.

Evaluation Highlights

Current state-of-the-art models (including o3-mini and DeepSeek-R1) achieve <15% accuracy on HLE, compared to >90% on MMLU.
Models exhibit severe overconfidence, with RMS calibration errors >70%, indicating they do not know when they are hallucinating answers to hard questions.
Reasoning models like o1 require significantly more tokens to achieve marginal accuracy gains over non-reasoning models.

Breakthrough Assessment

9/10

Sets a new standard for difficulty in LLM evaluation, effectively uncapping the measurement scale just as MMLU becomes obsolete. It is likely to become the primary target for next-generation models.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal, closed-ended question answering across diverse academic subjects.

Inputs: Text question q, optional image I, optional context/constraints.

Outputs: Answer A (either selection from multiple choice or exact string match).
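The input/output setting above can be pictured as a small record type plus a grading function. The following is a minimal sketch, not the benchmark's actual schema: the class name `Question`, its field names, and the helpers `normalize` and `is_correct` are all hypothetical, and the naive string comparison is only a stand-in for the judged matching described under Limitations.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Question:
    """One closed-ended item. All field names here are hypothetical."""
    text: str                                 # question q
    answer: str                               # reference answer A
    image_path: Optional[str] = None          # optional image I
    choices: Optional[Sequence[str]] = None   # set only for multiple-choice items

    @property
    def is_multiple_choice(self) -> bool:
        return self.choices is not None


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(s.lower().split())


def is_correct(question: Question, prediction: str) -> bool:
    """Naive normalized string comparison (an assumption, not the paper's grader)."""
    return normalize(prediction) == normalize(question.answer)
```

A multiple-choice item would carry its options in `choices` and a letter in `answer`, while a short-answer item leaves `choices` unset.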

Pipeline Flow

Submission by Expert
Automated LLM Filtering (Difficulty Check)
Peer Review (Round 1)
Organizer/Expert Approval (Round 2)
Final Benchmark Assembly
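The stages listed above run in sequence, and a submission that fails any stage never reaches the final benchmark. A rough sketch of that control flow follows, assuming each stage can be reduced to a pass/fail check on a submission record. The stage names, the dictionary keys, and the functions `run_pipeline` and `assemble` are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; the check returns True if the submission passes.
Stage = Tuple[str, Callable[[dict], bool]]


def run_pipeline(submission: dict, stages: List[Stage]) -> Optional[str]:
    """Return the name of the first stage that rejects the submission, or None if all pass."""
    for name, passes in stages:
        if not passes(submission):
            return name
    return None


# Placeholder checks: each one reads a precomputed flag from the submission record.
# In practice these flags would come from model runs and human reviewers.
STAGES: List[Stage] = [
    ("llm_difficulty_check", lambda s: not s["solved_by_llm"]),
    ("peer_review_round_1", lambda s: s["peer_approved"]),
    ("organizer_approval_round_2", lambda s: s["organizer_approved"]),
]


def assemble(submissions: List[dict]) -> List[dict]:
    """Final assembly: keep only submissions that pass every stage."""
    return [s for s in submissions if run_pipeline(s, STAGES) is None]
```

Returning the rejecting stage's name, rather than a bare boolean, makes it easy to count how many submissions are dropped at each step.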

System Modules

Difficulty Check

Reject questions that current LLMs can already solve

Model or implementation: Ensemble of Frontier LLMs (GPT-4o, Claude 3.5 Sonnet, etc.)

Expert Review

Validate correctness, unambiguity, and non-searchability

Model or implementation: Human Experts (PhD/Graduate level)

Novel Architectural Elements

Adversarial filtering pipeline: The dataset construction explicitly integrates the models it intends to evaluate as a filter—if the model solves it, the data is discarded.
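To make the filter concrete, here is a small sketch under stated assumptions: each model is represented as a plain callable from question text to answer text, correctness is a case-insensitive string comparison, and a question is discarded as soon as any model in the ensemble answers it correctly. The `Model` interface and both function names are hypothetical, and the "any model solves it" rule is one reading of the summary rather than a confirmed detail of the paper.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a model maps question text to an answer string.
Model = Callable[[str], str]


def solving_models(question: str, reference: str, models: Dict[str, Model]) -> List[str]:
    """Return the names of the models whose answer matches the reference."""
    target = reference.strip().lower()
    return [
        name
        for name, ask in models.items()
        if ask(question).strip().lower() == target
    ]


def survives_difficulty_check(question: str, reference: str, models: Dict[str, Model]) -> bool:
    """Keep the question only if no model in the ensemble solves it (assumed rule)."""
    return not solving_models(question, reference, models)
```

Because the evaluated models also act as the filter, low scores from those same models on the final dataset are partly expected by construction.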

Modeling

Base Model: N/A (This is a dataset paper, but it evaluates models like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, o1, DeepSeek-R1)

Training Method: Not applicable (Dataset paper)

Comparison to Prior Work

vs. MMLU: HLE is explicitly adversarial to current models (frontier models score roughly 0-10% on HLE versus >90% on MMLU) and is multi-modal
vs. GPQA: HLE covers broader subjects (humanities, etc.) and includes image-based questions
vs. MATH: HLE includes advanced graduate-level math and non-math subjects

Limitations

Benchmark is designed to be temporary; as models improve, accuracy will rise and eventually saturate again.
Focuses on closed-ended academic knowledge, ignoring open-ended research or creative capabilities.
Difficulty filtering might artificially skew the distribution towards 'trick' questions or extremely niche trivia (though review attempts to mitigate this).
Evaluation relies on o3-mini as a judge for exact match verification, which may introduce minor grading noise.

Reproducibility

Code: https://lastexam.ai

The public portion of the dataset (2,500 questions) is released at lastexam.ai. A private test set is held back to prevent contamination. Evaluation prompts are detailed in the appendix. Exact model versions used for baselines are listed.

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot inference on 2,500 hard questions.

Benchmarks:

Humanity's Last Exam (HLE) (Multi-disciplinary academic QA) [New]

Metrics:

Accuracy (%)
RMS Calibration Error (%)
Statistical methodology: Not explicitly reported in the paper
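For readers unfamiliar with the second metric, the sketch below shows one common way to compute a binned RMS calibration error from per-question confidences and correctness flags. The equal-width binning, the default of 10 bins, and the function name are assumptions for illustration. The summary does not specify the paper's exact binning scheme.

```python
import math
from typing import List, Sequence


def rms_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned RMS calibration error on a 0-1 scale (multiply by 100 for percent)."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Assign each prediction to an equal-width confidence bin.
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append(i)

    total = len(confidences)
    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        # Weight each bin's squared confidence/accuracy gap by its share of predictions.
        weighted_sq_gap += (len(members) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(weighted_sq_gap)
```

A model that states 90% confidence on every question and gets all of them wrong scores about 0.9 (90%) on this metric, which is the regime the reported results describe.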

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance of state-of-the-art non-reasoning models on HLE is extremely low compared to saturated benchmarks like MMLU.
HLE	Accuracy	0.0	2.7	+2.7
HLE	Accuracy	0.0	4.1	+4.1
HLE	Accuracy	0.0	4.6	+4.6
Reasoning models perform better but still fail to achieve respectable scores.
HLE	Accuracy	2.7	8.0	+5.3
HLE (Text-Only Subset)	Accuracy	0.0	13.4	+13.4
Models are severely miscalibrated, showing high confidence despite low accuracy.
HLE	RMS Calibration Error	0	89	+89
HLE	RMS Calibration Error	0	83	+83

Experiment Figures

Comparison of model accuracy on HLE vs. existing benchmarks (MMLU, MATH, etc.).

Average completion token usage by reasoning models across different subjects.

Main Takeaways

No current model achieves >15% accuracy, confirming that the dataset's difficulty targets the expert frontier.
Reasoning models (o1, DeepSeek-R1) outperform standard models (GPT-4o, Claude 3.5), confirming that test-time compute helps, but they still fail nearly 90% of questions.
Calibration is a major failure mode: models rarely admit ignorance, instead confidently answering incorrectly.
Token usage correlates with performance: reasoning models use significantly more inference tokens to achieve their marginal gains.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM evaluation methodologies (benchmarks, saturation)
Familiarity with calibration metrics (expected calibration error, RMS calibration error)
Basic knowledge of token-based reasoning (Chain-of-Thought)

Key Terms

closed-ended question: A question with a single, verifiable, unambiguous answer (e.g., multiple choice or short exact match), as opposed to open-ended essays.

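exact-match: An evaluation method where the model's output must match a short ground-truth answer (e.g., a specific number or chemical formula). In HLE, the match is verified by an LLM judge (o3-mini) rather than by a strict character-for-character comparison.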

MMLU: Massive Multitask Language Understanding—a popular benchmark covering 57 subjects that current models have effectively 'solved' (>90% accuracy).

calibration error: A measure of how well a model's predicted confidence aligns with its actual accuracy (e.g., if it says 90% confident, is it right 90% of the time?).

RMS calibration error: Root Mean Square calibration error—a specific metric quantifying the deviation between confidence and accuracy; high values mean the model is poorly calibrated (over/under-confident).

multi-modal: Involving multiple types of data input; here, questions that combine text with images (e.g., diagrams, charts, inscriptions).

reasoning models: Models trained to generate internal 'chains of thought' (intermediate reasoning steps) before producing a final answer (e.g., OpenAI o1, DeepSeek-R1).

hallucination: When an LLM confidently generates incorrect or fabricated information.

saturation: When a benchmark becomes too easy for current models (scores near 100%), rendering it useless for distinguishing between top models.