GRPO: Group Relative Policy Optimization, an RL algorithm that estimates the baseline from the average reward of a group of rollouts sampled for the same prompt rather than from a separate critic model
AnsF1: Answer-level F1 score, a metric computed as the harmonic mean of precision (valid predicted answers / total predicted answers) and recall (matched reference answers / total reference answers)
rollout: A complete sequence of actions (reasoning, tool calls, answer generation) generated by the model during RL training
multi-hop QA: Questions requiring reasoning across multiple documents or steps to answer
agentic search: A setup where the model actively issues search queries and processes results in a multi-turn loop
trajectory sampling: Generating multiple different solution paths (trajectories) for a single question to explore possible answers
reference answer: The original 'gold' answer provided in the benchmark dataset
alternative answer: A valid answer different from the reference, discovered via the pipeline and verified by evidence
Exact Match: Evaluation metric checking if the predicted string exactly matches a ground truth string (after normalization)
LMJudge: An evaluation method that uses a large language model to judge whether a predicted answer is semantically equivalent to the ground truth
entropy collapse: A failure mode in RL where the policy becomes deterministic too early, stopping exploration
tool-call: A specific action token sequence that triggers an external search engine
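
The GRPO entry above can be illustrated with a minimal sketch of the group-relative advantage computation. The function name `group_relative_advantages`, the `eps` constant, and the optional division by the group standard deviation are assumptions made for illustration. Standard-deviation scaling appears in common GRPO formulations, but the glossary itself only states that the baseline comes from group averages.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6, normalize_std=True):
    """Compute per-rollout advantages for one group of rollouts.

    `rewards` holds one scalar reward per rollout sampled for the same
    question. The group mean acts as the baseline, so no critic is needed.
    """
    baseline = mean(rewards)  # group average replaces a learned value model
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    # Assumption: scale by the population std of the group, as is common in
    # GRPO formulations; eps avoids division by zero when all rewards agree.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


if __name__ == "__main__":
    # Four rollouts for one question: two rewarded, two not.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.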
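
The AnsF1 and Exact Match entries above can be sketched together as follows. This is an illustration under stated assumptions, not the benchmark's actual scorer: the helper names (`normalize`, `exact_match`, `ans_f1`) are hypothetical, the normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) is a commonly used QA convention that the glossary does not specify, and normalized Exact Match stands in for whatever matcher decides that an answer is valid (an LMJudge could fill the same role).

```python
import re
import string


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def exact_match(predicted, gold):
    """True if the two strings are identical after normalization."""
    return normalize(predicted) == normalize(gold)


def ans_f1(predicted, references):
    """Answer-level F1 over a list of predictions and a list of references."""
    if not predicted or not references:
        return 0.0
    # Precision: fraction of predicted answers that match some reference.
    valid = sum(any(exact_match(p, r) for r in references) for p in predicted)
    # Recall: fraction of references matched by some predicted answer.
    matched = sum(any(exact_match(p, r) for p in predicted) for r in references)
    precision = valid / len(predicted)
    recall = matched / len(references)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One of two predictions is valid and one of two references is matched.
    print(ans_f1(["The Paris", "Lyon"], ["paris", "Marseille"]))
```

Counting precision over predictions and recall over references separately means that a model listing many speculative answers is penalized on precision, while a model that returns only one of several accepted answers is penalized on recall.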