GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs for the same input, eliminating the need for a critic model
ISDD: Importance Sampling Distribution Drift—a phenomenon where the current policy deviates so far from the old policy that importance sampling weights vanish, zeroing out gradients
SAPO: Search Agent Policy Optimization—the proposed method adding a conditional KL penalty to GRPO to fix ISDD
Positive tokens: Tokens that belong to a trajectory with a positive advantage value (i.e., better than the group average)
Hard clipping: The standard PPO mechanism that clips the importance ratio to [1-ε, 1+ε] to limit update size, which the authors argue is insufficient to prevent ISDD
External retrieval tokens: Tokens representing the content returned by the search tool, which are masked during training so the agent isn't penalized for tool outputs
Exact Match (EM): Evaluation metric checking if the generated answer string exactly matches the ground truth
F1 score: Reward metric measuring overlap between prediction and ground truth, used here as the outcome-based reward signal
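
As a rough illustration of how several of the terms above fit together, the sketch below scores a group of answers with Exact Match and F1, then turns the F1 rewards into group-relative advantages in the GRPO style. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the lowercase and whitespace tokenization, token-level overlap for F1, the use of the population standard deviation, and the eps constant are all choices made here for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def exact_match(prediction: str, truth: str) -> bool:
    """EM: True when the two strings match after light normalization.

    Assumption: lowercasing and stripping surrounding whitespace.
    """
    return prediction.strip().lower() == truth.strip().lower()


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between prediction and ground truth.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of outputs for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so no
    critic model is needed. eps is a hypothetical guard against a zero
    standard deviation when every output earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    truth = "Paris"
    group = ["Paris", "paris france", "London", "Berlin"]
    rewards = [f1_score(answer, truth) for answer in group]
    advantages = group_relative_advantages(rewards)
    for answer, r, a in zip(group, rewards, advantages):
        # Tokens of trajectories with a > 0 are the "positive tokens".
        print(f"{answer!r}: EM={exact_match(answer, truth)} F1={r:.3f} adv={a:+.3f}")
```

Under this normalization, a group in which every output receives the same reward yields all-zero advantages, so that group contributes no learning signal.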