RAG: Retrieval-Augmented Generation—systems that retrieve external documents to ground LLM answers.
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs for the same input, removing the need for a separate critic model.
ReSearch: The proposed framework for learning to reason with search.
Rollout: A single complete generation sequence produced by the model during RL training, including its reasoning text, its search queries, and the retrieved search results.
Exact Match: A metric checking if the generated answer string exactly matches the ground truth string.
LLM-as-a-judge: Using a strong LLM (such as GPT-4) to judge whether an answer is correct; often used when answers are open-ended and exact string matching is too strict.
MuSiQue: A multi-hop QA dataset requiring complex reasoning chains to answer.
HotpotQA: A dataset with questions requiring reasoning over multiple supporting documents.
2WikiMultiHopQA: A multi-hop QA dataset constructed from Wikipedia.
Bamboogle: A manually constructed dataset of 2-hop questions designed to be difficult for search engines.
IRCoT: Interleaving Retrieval and Chain-of-Thought—a prompt-based baseline method.
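
The GRPO entry above describes normalizing advantages within a group of outputs sampled for the same input. A minimal sketch of that step follows. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for illustration and are not taken from the source.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within one group of rollouts for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate critic model is needed to provide a baseline.
    """
    mean = statistics.fmean(rewards)
    # Assumption: population standard deviation; implementations may differ.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of a single question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones. When every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal.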
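
The Exact Match entry above can be illustrated with a short sketch. The normalization shown here (lowercasing, then stripping punctuation, English articles, and extra whitespace) is a common convention in QA evaluation and is an assumption, not something the source specifies. The names `normalize_answer` and `exact_match` are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truths):
    """Return True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gt) for gt in ground_truths)


# Hypothetical usage with made-up answers.
is_correct = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
```

Even with normalization, a prediction containing extra words (for example "Paris, France" against the reference "Paris") still counts as wrong, which is one reason an LLM-as-a-judge score is sometimes reported alongside Exact Match.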