Agentic RAG: A RAG system where the LLM actively decides when to search, what to query, and when to answer, often in multiple rounds
Re2Search: A novel agent design proposed here that uses 'Reasoning, Reflection, and Search' to identify unverified claims before querying
Process Reward: Feedback given on intermediate steps (e.g., the quality of a search query) rather than just the final answer correctness
DPO: Direct Preference Optimization—an algorithm optimizing language models to prefer certain outputs over others using a contrastive loss, without a separate reward model
PPO: Proximal Policy Optimization—an RL algorithm that updates a policy using a clipped objective function to ensure stability
Critic: A model trained to estimate the value or quality of a state-action pair, used here to select the best intermediate reasoning/retrieval steps during inference
High-level MDP: A formulation where 'actions' are macro-steps like 'generate query' or 'give answer', rather than token-level generation
SFT: Supervised Fine-Tuning—training the model on high-quality demonstrations
F1 score: A metric measuring the overlap between the predicted answer and the ground truth answer, typically computed as the harmonic mean of token-level precision and recall
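
To make the F1 score entry concrete, here is a minimal sketch of a token-overlap F1 between a predicted and a ground-truth answer. It assumes lowercasing and whitespace tokenization, a common convention in question-answering evaluation; the source does not specify its exact normalization, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between two answer strings (illustrative sketch)."""
    # Assumption: lowercase + whitespace split; real evaluators may also
    # strip punctuation and articles before comparing.
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(gold_tokens)     # share of gold tokens that were recovered
    return 2 * precision * recall / (precision + recall)
```

For instance, a verbose prediction such as "the city of Paris" against the gold answer "Paris" has full recall but low precision, so it scores well below an exact match.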