ARL: Agentic Reinforcement Learning—training LLM agents to solve multi-step interactive tasks via reinforcement learning
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same input, avoiding a separate value network
Importance Sampling (IS): A technique to estimate properties of a target distribution using samples from a different proposal distribution, weighting each sample by the ratio of its probability under the target to its probability under the proposal
Clipping: Constraining the probability ratio between the new and old policy (new policy / old policy) to a small range around 1 (e.g., 0.9 to 1.1) to prevent destructively large updates
PPO: Proximal Policy Optimization—a standard RL algorithm that uses clipping to ensure stable policy updates
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
Behavior Cloning (BC): Supervised learning where the agent learns to mimic expert demonstrations or high-quality self-generated trajectories
SFT: Supervised Fine-Tuning—training the model on labeled data before RL
Off-policy staleness: The discrepancy that arises when the policy being updated has drifted significantly from the policy that generated the training data
Pass@k: A metric measuring the probability that at least one correct solution is found in k generated samples
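
Several of the terms above (GRPO advantages, the clipped importance ratio, and Pass@k) can be made concrete with a minimal Python sketch. This is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the advantage normalization (subtract the group mean, divide by the group standard deviation) is one common formulation of GRPO, the clipping bounds reuse the 0.9 to 1.1 example from the Clipping entry, and the Pass@k estimator is the widely used unbiased form 1 - C(n-c, k) / C(n, k), which may differ from how the metric is computed here.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages (assumed formulation): normalize each reward
    against the group of outputs sampled for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, low=0.9, high=1.1):
    """PPO-style clipped objective for a single sample.

    The importance ratio new/old is computed from log-probabilities,
    clipped to [low, high], and the more pessimistic term is kept.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, low), high)
    return min(ratio * advantage, clipped * advantage)


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Four outputs for one prompt, two of which received reward 1.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # A ratio of 1.5 with positive advantage is capped at the upper bound.
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
    print(pass_at_k(n=10, c=3, k=1))
```

Because the advantages are computed purely from rewards within a group, no separate value network is needed, which is the point made in the GRPO entry. When every reward in a group is identical, the advantages are all zero and that group contributes no learning signal.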