GRPO: Group Relative Policy Optimization, an RL algorithm that computes each output's advantage by normalizing its reward against the other outputs sampled for the same input, removing the need for a separate value network (critic).
Refinement Step: A specific action where the model extracts and summarizes relevant information from retrieved documents into a concise format before reasoning.
Retrieval-Specific Reward: A reward signal given if the ground-truth answer strings appear within the model's refinement block, encouraging accurate information extraction.
Outcome-Based Reward: A reward signal based on the F1 score between the final generated answer and the ground truth.
Search-during-think: A paradigm where models generate 'thought' tokens that can include calls to external search tools.
Multi-hop QA: Question answering tasks that require finding and connecting multiple pieces of evidence (e.g., distinct facts) to derive the final answer.
SFT: Supervised Fine-Tuning—training the model on labeled examples before applying RL.
Cover Exact Match: A metric measuring whether the generated text (document, refinement, or answer) contains the ground truth answer string.
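
As a rough illustration of how several of the terms above relate, the following Python sketch computes a cover-exact-match check, an F1 score, and group-normalized advantages. The function names, the SQuAD-style text normalization, the token-level F1, and the small epsilon in the denominator are all assumptions made for illustration; the definitions above do not specify them, and this is not the original implementation.

```python
import re
import string
from collections import Counter
from statistics import mean, pstdev


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the glossary does not specify one.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def cover_exact_match(text, answers):
    """Return 1.0 if any non-empty normalized answer string occurs in the text.

    This is a plain substring test, so it can also match inside longer words.
    """
    norm_text = normalize(text)
    norm_answers = [normalize(a) for a in answers]
    return float(any(a and a in norm_text for a in norm_answers))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted answer and one ground-truth answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical refinement block and sampled answers for one question.
    refinement = "The Eiffel Tower is located in Paris."
    sampled_answers = ["Paris", "Paris, France", "London", "Lyon"]
    truth = "Paris"

    retrieval_reward = cover_exact_match(refinement, [truth])
    outcome_rewards = [f1_score(ans, truth) for ans in sampled_answers]
    advantages = group_relative_advantages(outcome_rewards)

    print("retrieval-specific reward:", retrieval_reward)
    print("outcome-based rewards:", outcome_rewards)
    print("group-relative advantages:", advantages)
```

In this sketch the retrieval-specific and outcome-based rewards are computed separately; how they are weighted or combined before advantage normalization is not stated in the definitions above and is left open here.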