Agentic RAG: Retrieval-Augmented Generation systems enhanced with autonomous agents that perform reflection, planning, and multi-step reasoning
MMLU: Massive Multitask Language Understanding, a multiple-choice benchmark testing zero-shot and few-shot performance across 57 subjects
MCP: Model Context Protocol, an open standard for connecting AI assistants to the systems where data lives, such as content repositories, business tools, and development environments
ACP: Agent Communication Protocol, a protocol that lets distinct agents exchange messages and coordinate
Humanity's Last Exam (HLE): A 2025 benchmark with 3,000 expert-level questions designed to resist simple retrieval, on which state-of-the-art models score poorly
Agent-as-a-Judge: An evaluation framework in which an AI agent evaluates the outputs of other agents, often providing more granular feedback at lower cost than human evaluation
ProcessBench: A benchmark measuring how well models identify erroneous steps in mathematical problem solving
Fact Grounding: The ability of an LLM to base its responses strictly on provided source documents, minimizing hallucination
Hallucination: When an LLM generates plausible-sounding but factually incorrect information