Classical Test Theory: A psychometric framework used here to select benchmark questions that best discriminate between strong and weak models while reserving hard items to prevent saturation
Kuiper statistic: A statistic measuring the deviation between two cumulative distributions as the sum of their largest positive and largest negative differences; used here to quantify how well an agent's effort (step count) aligns with its probability of success
Sentinel Pool: A reserved 20% subset of the test set containing items no current model can solve, ensuring the benchmark remains relevant as models improve
Page F1: A metric measuring the overlap between the set of pages cited by the agent and the human-annotated minimal evidence set
RLM: Recursive Language Models, a system in which an LLM recursively writes code to process document collections, often unconstrained
BM25: A standard probabilistic information retrieval function that ranks documents by query term frequency, weighted by inverse document frequency and normalized for document length
Cold Start: The phenomenon where agents have very low accuracy on their initial attempt or query compared to humans, requiring many iterations to recover
Agentic property: The condition where no single retrieval query exists that can surface all necessary evidence, necessitating iterative planning
Multi-hop: Questions requiring information aggregation from multiple disjoint pages or documents
Closed-World: The constraint that the answer must be derived solely from the provided corpus, excluding the model's own parametric knowledge
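
To make the Kuiper statistic entry concrete, here is a minimal sketch of the generic two-sample form, V = D+ + D-, computed over empirical cumulative distributions. It is not the benchmark's own effort-calibration procedure, which the glossary does not spell out; the function name `kuiper_statistic` and the sample inputs are hypothetical.

```python
from bisect import bisect_right


def kuiper_statistic(sample_a, sample_b):
    """Two-sample Kuiper statistic V = D+ + D- between empirical CDFs.

    Generic textbook form only; how the benchmark maps step counts and
    success probabilities onto two distributions is not assumed here.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d_plus = d_minus = 0.0
    # Both ECDFs are right-continuous step functions, so the extreme
    # differences are attained at the pooled sample points.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d_plus = max(d_plus, cdf_a - cdf_b)
        d_minus = max(d_minus, cdf_b - cdf_a)
    return d_plus + d_minus
```

Unlike the Kolmogorov-Smirnov statistic, which keeps only the larger of the two deviations, this sum is sensitive to one distribution leading in one region and lagging in another.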
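
The Page F1 entry can be illustrated with a short sketch that treats the cited pages and the annotated evidence as sets and takes the harmonic mean of precision and recall. The function name `page_f1`, the page identifiers, and the handling of empty sets are assumptions made for illustration; the glossary does not specify them.

```python
def page_f1(cited_pages, gold_pages):
    """F1 overlap between cited pages and the minimal evidence set."""
    cited, gold = set(cited_pages), set(gold_pages)
    if not cited and not gold:
        # Assumption: nothing to cite and nothing cited counts as a match.
        return 1.0
    true_positives = len(cited & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent cites pages 3, 7, 9; annotators marked 7, 9, 12.
score = page_f1([3, 7, 9], [7, 9, 12])
```

Because the reference set is minimal, citing extra pages lowers precision even when every required page is present.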
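
For the BM25 entry, the following is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. The parameter values `k1=1.2` and `b=0.75` and the smoothed IDF variant are common defaults chosen here as assumptions, not settings taken from the benchmark; `bm25_scores` is a hypothetical helper name.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`."""
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 so an all-empty corpus does not divide by zero.
    avg_len = sum(len(doc) for doc in docs) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF variant (assumption); always positive.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical three-document corpus, already split into tokens.
corpus = [
    ["kuiper", "statistic", "calibration"],
    ["page", "citation", "overlap"],
    ["kuiper", "kuiper", "test"],
]
ranking = bm25_scores(["kuiper"], corpus)
```

Term frequency saturates through `k1`, so repeating a query term raises the score with diminishing returns rather than linearly.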