Pareto frontier: The set of solutions for which no criterion (e.g., cost or capability) can be improved without worsening at least one other
SoTA: State of the art, the best performance currently achieved on a given task or benchmark
Agentic workflows: Processes where an AI system autonomously plans, uses tools, and executes multiple steps to achieve a goal
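GPQA: Graduate-Level Google-Proof Q&A, a benchmark of expert-written multiple-choice science questions (biology, physics, chemistry) designed to be hard to answer even with web search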
SWE-bench: Software Engineering Benchmark, which evaluates LLMs on resolving real-world GitHub issues drawn from open-source Python repositories
Aider Polyglot: A benchmark evaluating coding performance across multiple programming languages
Humanity's Last Exam: A highly difficult, expert-constructed benchmark designed to be resistant to current AI capabilities
Long Context: The ability of a model to process very large amounts of input data (tokens) at once, such as entire books or long videos
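
The Pareto frontier entry above can be made concrete with a small sketch. The model names and numbers below are invented purely for illustration, and the two criteria (cost to minimize, capability to maximize) are an assumption taken from the example in the definition; this is one simple way to compute a frontier, not a reference implementation.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str          # hypothetical label, for illustration only
    cost: float        # lower is better
    capability: float  # higher is better


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is at least as good as `b` on both criteria
    and strictly better on at least one."""
    no_worse = a.cost <= b.cost and a.capability >= b.capability
    strictly_better = a.cost < b.cost or a.capability > b.capability
    return no_worse and strictly_better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep every model that no other model dominates (O(n^2) scan)."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Made-up data points, not real benchmark results.
candidates = [
    Model("A", cost=1.0, capability=50.0),
    Model("B", cost=2.0, capability=70.0),
    Model("C", cost=3.0, capability=65.0),  # dominated by B: pricier and weaker
    Model("D", cost=5.0, capability=90.0),
]

frontier = pareto_frontier(candidates)
print([m.name for m in frontier])  # ['A', 'B', 'D']
```

Here C drops out because B is both cheaper and more capable, while A, B, and D each represent a trade-off that cannot be improved on one axis without giving something up on the other.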