GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies using group-based relative advantages without a separate value network
Exact Match (EM): A metric and reward signal that checks whether the generated answer string exactly matches the ground truth (after normalization)
Recall Reward: A reward signal based on whether the retrieved documents contain the answer or the information needed to produce it
Deficient Search: Problematic behaviors defined by the authors: No Search (skipping retrieval), Duplicate Queries, or Invalid Searches (malformed syntax)
Credit Assignment: The problem in RL of determining which earlier actions are responsible for a reward received later
DeSA: Decoupling Search and Answering—the proposed two-stage training framework
SFT: Supervised Fine-Tuning—training on labeled data, often used as a starting point before RL
E5: A dense retrieval model used to fetch relevant passages based on semantic similarity
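
To make three of the terms above concrete (Exact Match, Recall Reward, and the group-relative advantage used in GRPO), here is a minimal Python sketch. It is illustrative only. The function names, the SQuAD-style normalization, the substring test for recall, and the use of the population standard deviation are all assumptions. The source does not specify these details, and the authors' implementation may differ.

```python
import re
import string
from statistics import mean, pstdev


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, strip punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """EM: 1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))


def recall_reward(retrieved_docs: list[str], gold_answers: list[str]) -> float:
    """Recall (assumed substring variant): 1.0 if any retrieved document contains a gold answer."""
    docs = [normalize(doc) for doc in retrieved_docs]
    golds = [normalize(gold) for gold in gold_answers]
    return float(any(gold in doc for gold in golds for doc in docs))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: each reward standardized against its own sampling group.

    No value network is involved; the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
    print(recall_reward(["Paris is home to the Eiffel Tower."], ["Eiffel Tower"]))
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch, a group whose responses all receive the same reward yields zero advantage for every member, so that group contributes no learning signal.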