RLVR: Reinforcement Learning with Verifiable Rewards—optimizing a model using only the correctness of the final answer (e.g., math or code execution) without human preference labels
Cold-start SFT: The initial supervised fine-tuning phase used to teach a model basic instruction-following and formatting before reinforcement learning begins
ReAct: Reasoning and Acting—a prompting framework where models generate a 'Thought' (reasoning trace) before emitting an 'Action' (tool call)
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same input, eliminating the need for a separate value network
Trajectory: The sequence of thoughts, tool actions, and observations generated by the agent to solve a single problem
WebWalkerQA: A benchmark for evaluating web agents that navigate and extract information from websites
GPQA: A challenging QA dataset written by domain experts in biology, physics, and chemistry, designed to be difficult for non-experts to answer
GAIA: A benchmark for General AI Assistants evaluating reasoning, tool use, and multi-modality
Policy Entropy: A measure of the randomness in the agent's action distribution; higher entropy implies more exploration and less certainty, while low entropy signals collapse onto a single behavior
On-policy: RL methods where the data used for training is generated by the current version of the policy itself
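
Two of the entries above, GRPO and Policy Entropy, describe quantities that are easy to compute by hand. The following is a minimal sketch, not the implementation of any particular system: the function names are hypothetical, the rewards are made-up 0/1 correctness scores in the RLVR style, and dividing by the population standard deviation is one common choice (some variants normalize differently or skip the division).

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to the other outputs for the same input.

    No value network is involved: the group's own mean acts as the baseline.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Four sampled answers to one prompt, each scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A near-uniform policy has higher entropy than a nearly collapsed one.
exploring = entropy([0.25, 0.25, 0.25, 0.25])
collapsed = entropy([0.97, 0.01, 0.01, 0.01])
```

Correct answers receive positive advantages and incorrect ones negative, so the policy is pushed toward outputs that beat their own group's average. If every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.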