Pareto frontier: The set of solutions where no individual metric (e.g., accuracy) can be improved without degrading another (e.g., cost)
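To make the definition concrete, here is a minimal sketch that extracts the Pareto frontier from a list of agent results scored on two metrics, treating cost as lower-is-better and accuracy as higher-is-better. The names AgentResult, dominates, and pareto_frontier are hypothetical, and the numbers are made-up illustrative values, not results from any benchmark.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost: float      # lower is better (e.g., USD per run); illustrative only
    accuracy: float  # higher is better


def dominates(a: AgentResult, b: AgentResult) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    return (a.cost <= b.cost and a.accuracy >= b.accuracy
            and (a.cost < b.cost or a.accuracy > b.accuracy))


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the results that no other result dominates, sorted by increasing cost."""
    frontier = [r for r in results
                if not any(dominates(other, r) for other in results)]
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical agents: C costs more than B yet is less accurate, so B dominates it.
results = [
    AgentResult("A", cost=0.10, accuracy=0.60),
    AgentResult("B", cost=0.50, accuracy=0.75),
    AgentResult("C", cost=0.60, accuracy=0.70),
    AgentResult("D", cost=2.00, accuracy=0.80),
]
for r in pareto_frontier(results):
    print(r.name, r.cost, r.accuracy)
```

The quadratic pairwise check is the simplest correct version; for large result sets, sorting by cost and sweeping once while tracking the best accuracy seen so far gives the same frontier more efficiently.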
System 2: In this context, agent architectures that use deliberate planning, reflection, or debugging steps, as opposed to direct 'System 1' generation
DSPy: A framework for algorithmically optimizing LM prompts and pipelines
pass@k: A metric measuring the probability that at least one of k generated code samples passes all unit tests
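In practice, pass@k is usually estimated with the unbiased estimator introduced alongside the HumanEval benchmark (this formula is not stated in the glossary above): generate n ≥ k samples per problem, count the c samples that pass, compute 1 − C(n−c, k) / C(n, k), and average over problems. A minimal sketch for a single problem follows; the function name pass_at_k is our own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that pass all unit tests,
    k: sampling budget being evaluated (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Using combinations rather than simply computing 1 − (1 − c/n)^k avoids the bias of the plug-in estimate; a benchmark-level score is the mean of pass_at_k over all problems.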
variable cost: The cost incurred each time an agent is run (mainly input and output tokens), which grows linearly with usage
fixed cost: The one-time cost of optimizing an agent's design (e.g., searching for prompts or few-shot examples)
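The two cost terms combine as total = fixed + per-run × number of runs, so a design that spends more up front but is cheaper per run pays off only after enough runs. The sketch below assumes per-million-token pricing; the prices, token counts, and optimization budget are hypothetical values chosen for illustration, and the helper names are our own.

```python
def run_cost(input_tokens: int, output_tokens: int,
             usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Variable cost of one agent run, with prices quoted per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def total_cost(fixed_cost: float, per_run_cost: float, n_runs: int) -> float:
    """One-time optimization cost plus a variable cost that grows linearly in n_runs."""
    return fixed_cost + per_run_cost * n_runs


def break_even_runs(fixed_opt: float, per_run_opt: float,
                    fixed_base: float, per_run_base: float) -> float:
    """Runs after which an optimized design (higher fixed, lower per-run cost) is cheaper."""
    if per_run_opt >= per_run_base:
        return float("inf")  # the fixed cost is never recovered
    return (fixed_opt - fixed_base) / (per_run_base - per_run_opt)


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
baseline = run_cost(20_000, 2_000, 3.0, 15.0)   # unoptimized agent, longer prompts
optimized = run_cost(10_000, 1_330, 3.0, 15.0)  # shorter prompts after optimization
print(baseline, optimized)
print(break_even_runs(fixed_opt=50.0, per_run_opt=optimized,
                      fixed_base=0.0, per_run_base=baseline))
```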
stochasticity: The randomness in model outputs; agents often exploit this by sampling multiple times to find a correct answer
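The following simulation illustrates how repeated sampling exploits stochasticity. It assumes a hypothetical agent that answers correctly with a fixed, independent probability on each call, and it assumes the agent can tell when it has succeeded (e.g., via unit tests), so it retries until success or until its call budget runs out. Under these assumptions, accuracy approaches 1 − (1 − p)^k, while the expected number of calls, and therefore variable cost, grows with the budget.

```python
import random


def flaky_agent(rng: random.Random, p_correct: float) -> bool:
    """Hypothetical agent call: returns True (a correct answer) with probability p_correct."""
    return rng.random() < p_correct


def retry_until_success(rng: random.Random, p_correct: float,
                        budget: int) -> tuple[bool, int]:
    """Sample up to `budget` times, stopping at the first success; return (solved, calls)."""
    for calls in range(1, budget + 1):
        if flaky_agent(rng, p_correct):
            return True, calls
    return False, budget


def simulate(p_correct: float, budget: int, trials: int = 50_000,
             seed: int = 0) -> tuple[float, float]:
    """Estimate accuracy and mean number of calls over many independent tasks."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    solved = calls = 0
    for _ in range(trials):
        ok, used = retry_until_success(rng, p_correct, budget)
        solved += ok
        calls += used
    return solved / trials, calls / trials


for budget in (1, 3, 5):
    acc, mean_calls = simulate(p_correct=0.3, budget=budget)
    analytic = 1 - (1 - 0.3) ** budget
    print(f"budget={budget}: accuracy={acc:.3f} "
          f"(analytic {analytic:.3f}), mean calls={mean_calls:.2f}")
```

Because accuracy rises with the retry budget, comparing agents with different budgets on accuracy alone can be misleading; reporting the calls or cost spent alongside accuracy keeps the comparison fair.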
Optuna: An automatic hyperparameter optimization framework, used here to search for optimal agent configurations
overfitting: When an agent performs well on a specific benchmark due to memorization or shortcuts but fails to generalize to new, similar tasks