GRPO: Group Relative Policy Optimization, an RL algorithm that scores each output relative to a group of outputs sampled for the same input, using the group's rewards as the baseline instead of a separate value network
RoC: Resample-on-Correct—a strategy that oversamples trajectories and filters the positive ones to retain only high-quality traces (few errors/formatting issues) for training
outcome-only reward: A reward signal based solely on whether the final answer is correct, ignoring the quality of intermediate steps
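KV cache: Key-Value cache, the stored key and value tensors from an LLM's attention layers for tokens already processed, so they are not recomputed at each decoding step; this speeds up generation but consumes GPU memory that grows with sequence length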
SFT: Supervised Fine-Tuning—training a model on labeled examples to establish basic capabilities before reinforcement learning
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
reward hacking: A failure mode in which an RL agent maximizes the reward signal (e.g., getting the right answer) through undesirable behaviors (e.g., guessing or writing messy code) rather than the intended ones
rollout: The process of generating a complete sequence of actions (tokens and tool calls) from the policy during RL training
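
To make the GRPO and outcome-only reward entries concrete, here is a minimal sketch of how group-relative advantages could be computed from a group of rollouts for one prompt. It assumes the commonly described formulation (each reward minus the group mean, divided by the group standard deviation); the function name `group_relative_advantages` and the `eps` parameter are illustrative, not taken from any specific library.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The group statistics act as the baseline, so no value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Outcome-only rewards: 1.0 if the final answer is correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct rollouts receive positive advantages and incorrect ones negative. If every rollout in the group gets the same reward, all advantages are zero and the group contributes no learning signal.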