closed-ended question: A question with a single, verifiable, unambiguous answer (e.g., multiple choice or short exact match), as opposed to open-ended essays.
exact-match: An evaluation method in which the model's output must match the ground truth character for character (e.g., a specific number or chemical formula).
MMLU: Massive Multitask Language Understanding, a popular benchmark covering 57 subjects on which current models exceed 90% accuracy, so it is effectively 'solved'.
calibration error: A measure of how well a model's predicted confidence aligns with its actual accuracy (e.g., if it reports 90% confidence, is it right 90% of the time?).
RMS calibration error: Root Mean Square calibration error, a specific metric quantifying the deviation between confidence and accuracy; high values mean the model is poorly calibrated (over-confident or under-confident).
multi-modal: Involving multiple types of data input; here, questions that combine text with images (e.g., diagrams, charts, inscriptions).
reasoning models: Models trained to generate internal 'chains of thought' (intermediate reasoning steps) before producing a final answer (e.g., OpenAI o1, DeepSeek-R1).
hallucination: When an LLM confidently generates incorrect or fabricated information.
saturation: When a benchmark becomes too easy for current models (scores near 100%), rendering it useless for distinguishing between top models.
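
The exact-match and RMS calibration error entries above can be made concrete with a minimal Python sketch. It is illustrative only: the function names, the equal-width binning, and the default of 10 bins are assumptions made here, not taken from any particular benchmark's evaluation code, which may bin or normalize differently.

```python
import math


def exact_match(prediction: str, truth: str) -> bool:
    """Strict character-for-character comparison, with no normalization."""
    return prediction == truth


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error.

    confidences: stated confidences in [0, 1], one per answer.
    correct: booleans, True where the answer was right.
    n_bins: number of equal-width confidence bins (an assumption of this sketch).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's squared confidence/accuracy gap by its share of the data.
        weighted_sq_gap += (len(members) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(weighted_sq_gap)
```

Under these assumptions, a model that reports 0.9 confidence on ten answers and gets nine right scores close to 0, while the same confidence with every answer wrong scores close to 0.9.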