SGO: Student-Generated Outputs—sequences generated by the model currently being trained
TGO: Teacher-Generated Outputs—sequences generated by a larger, more capable frozen model
Agentic RAG: A framework where LLMs autonomously coordinate retrieval, query reformulation, and evidence integration using special action tokens
Cold-start problem: In RL, when a model is too weak to ever generate a correct solution, it never receives a positive reward signal and thus cannot learn
ARC: Agentic RAG Capabilities—a metric proposed in this paper that analyzes reasoning, search coordination, and response synthesis separately
Exact Match (EM): A metric checking if the generated answer string exactly matches the ground truth
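PPO: Proximal Policy Optimization—an RL algorithm that keeps training stable by limiting how far each update can move the policy, typically by clipping the probability ratio between the new and old policies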
KL divergence: Kullback-Leibler divergence, an asymmetric statistical distance measuring how one probability distribution differs from a reference distribution
Exposure bias: A mismatch between training and inference in which a model conditions on ground-truth history during training but on its own generated history during inference, leading to error accumulation
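
The two quantitative entries above, Exact Match and KL divergence, can be made concrete with a minimal Python sketch. It is illustrative only and not taken from the paper: the function names exact_match and kl_divergence are hypothetical, the EM check is a strict string comparison (many QA benchmarks normalize case and punctuation first, which this glossary does not specify), and the KL computation assumes discrete distributions given as equal-length lists of probabilities.

```python
import math


def exact_match(prediction: str, reference: str) -> bool:
    """Strict EM: True only if the two strings are identical.

    Assumption: no normalization (case, whitespace, punctuation) is applied.
    """
    return prediction == reference


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    p is the distribution being measured, q is the reference distribution.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a term with p_i = 0 contributes 0 by convention
        if q_i == 0.0:
            return math.inf  # p puts mass where the reference has none
        total += p_i * math.log(p_i / q_i)
    return total
```

Swapping the arguments of kl_divergence generally gives a different value, which is why the definition distinguishes a distribution from its reference.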