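GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of responses per prompt and scores each response's advantage relative to the rewards of the other responses in that group, rather than relying on a separate learned value model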
FictionalHot: A new benchmark created by replacing real entities in multi-hop questions with fictional ones and inserting synthetic documents into the corpus to test reasoning without data contamination
JUDGE action: A special agent action introduced in ReSeek that triggers a self-evaluation step to assess the utility of retrieved information
Process Reward: A dense reward signal given at intermediate steps of reasoning, rather than just at the final outcome
Reranker: A model component that scores the relevance of retrieved documents to the query; used here to calculate the utility reward
Closed-world evaluation: An evaluation setting in which the agent may draw only on the provided external knowledge sources (e.g., a fixed Wikipedia dump), not on knowledge memorized during pre-training
Exact Match (EM): A metric that counts a prediction as correct only if, after normalization (typically lowercasing and stripping punctuation, articles, and extra whitespace), it is string-identical to a ground-truth answer
SFT: Supervised Fine-Tuning, i.e., training the model on labeled demonstrations before applying RL
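
Two of the terms above, Exact Match and GRPO's group-relative scoring, are easier to pin down with a small example. The following is a minimal Python sketch, not the implementation used in ReSeek. It assumes the common SQuAD-style normalization for EM (lowercase, drop punctuation and articles, collapse whitespace) and the usual mean/standard-deviation normalization for group-relative advantages. The function names, the `eps` constant, and the choice of population standard deviation are illustrative assumptions.

```python
import re
import string
from statistics import mean, pstdev


def normalize_answer(text: str) -> str:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized
    gold answer, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each sampled response relative to its group:
    (reward - group mean) / (group std + eps).
    Population std and eps are assumptions made for this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # EM ignores case, punctuation, and articles, but not extra words.
    print(exact_match("The Eiffel Tower!", ["eiffel tower"]))
    print(exact_match("Paris, France", ["Paris"]))

    # Rewards for four responses sampled for the same question.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this sketch, a group in which every response earns the same reward yields advantages of zero for all of them, so such a group contributes no learning signal under this formulation.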