Evaluator tampering: When an agent modifies the code responsible for computing or reporting the score (e.g., editing `evaluate.py`) to inflate the metric
Train/test leakage: When the training process accesses held-out test data or labels, invalidating the generalization claim
Trust regimes: Policies defining which workspace actions are permitted (e.g., 'mutable' allows all edits, 'full_locked' restricts file access and uses external scorers)
True metric: A reference score computed by the benchmark runner using pristine, external code that the agent cannot modify
Reported metric: The score produced by the code inside the agent's workspace, which may have been altered by the agent
SST-2: Stanford Sentiment Treebank 2, a standard dataset for classifying text sentiment
CIFAR-10: A standard computer vision dataset for image classification
XGBoost: Extreme Gradient Boosting, a popular machine learning algorithm for tabular data
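
As an illustration of how the "evaluator tampering", "true metric", and "reported metric" entries relate, here is a minimal sketch of two checks a benchmark runner might perform. It is not taken from any specific benchmark: the function names (`file_sha256`, `evaluator_was_modified`, `metric_gap`), the hash comparison, and the tolerance value are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def evaluator_was_modified(workspace_copy: Path, pristine_copy: Path) -> bool:
    """Hypothetical tampering check: True if the evaluator inside the
    agent's workspace no longer matches the pristine reference copy."""
    return file_sha256(workspace_copy) != file_sha256(pristine_copy)


def metric_gap(reported: float, true: float, tolerance: float = 1e-6) -> float:
    """Return reported - true, or 0.0 when the two agree within tolerance.

    `reported` comes from code inside the workspace; `true` comes from the
    external scorer. The tolerance is an assumed value, not a standard one.
    """
    gap = reported - true
    return 0.0 if abs(gap) <= tolerance else gap
```

A positive gap together with a modified evaluator would suggest an inflated score. A hash mismatch alone only shows that the file changed, not that the change was malicious.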