RLVR: Reinforcement Learning with Verifiable Rewards—training models on tasks where correctness can be programmatically checked (e.g., math, code)
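The "programmatically checked" part can be made concrete with a minimal sketch of a verifiable reward function for a math-style task; the `Answer:` marker and function name are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a verifiable reward: correctness is checked by a
# program, so no learned reward model is needed.
# The "Answer:" convention and names here are illustrative assumptions.

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    # Extract the text after the last "Answer:" marker as the final answer.
    answer = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

For code tasks the checker would instead run the generated program against unit tests, but the interface (output in, scalar reward out) is the same.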
Self-Play: A training technique where an agent learns by playing against copies of itself, serving as its own curriculum
Zero-Sum Game: A competitive situation where one agent's gain is exactly the other's loss, ensuring no cooperative shortcuts
RAE: Role-conditioned Advantage Estimation—a method proposed in this paper that maintains a separate reward baseline for each player role and subtracts it when computing advantages, correcting systematic asymmetries between roles (e.g., a first-player advantage) and reducing gradient variance
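One way to realize a per-role baseline is an exponential moving average of each role's rewards; this is a sketch under that assumption, and the class name, decay constant, and update rule are illustrative rather than the paper's exact formulation.

```python
from collections import defaultdict

# Illustrative sketch of a role-conditioned baseline: a separate running
# average of reward per player role, so systematic asymmetries (e.g., a
# first-player advantage) are subtracted out of the advantage estimate.
# Names and the EMA update are assumptions, not the paper's exact form.

class RoleBaseline:
    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha                  # EMA decay for the baseline
        self.baseline = defaultdict(float)  # one running baseline per role

    def advantage(self, role: str, reward: float) -> float:
        # Advantage = reward minus this role's baseline.
        adv = reward - self.baseline[role]
        # Update the role's baseline toward the observed reward.
        self.baseline[role] = (
            self.alpha * self.baseline[role] + (1 - self.alpha) * reward
        )
        return adv
```

If first players win more often, their baseline rises, so a win as first player yields a smaller advantage than the same win as second player.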
Thinking Collapse: A failure mode where models progressively shorten and abandon reasoning traces (Chain-of-Thought) due to unstable training dynamics
REINFORCE: A foundational policy gradient algorithm that increases the log-probability of the actions taken in a trajectory in proportion to that trajectory's return (total reward)
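A toy instance of the REINFORCE update, on a 2-armed bandit with a softmax policy, shows the mechanics: scale the gradient of log-probability by the return. The bandit setup, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Toy REINFORCE on a 2-armed bandit: arm 1 always pays 1.0, arm 0 pays 0.0.
# The update scales the gradient of log pi(a) by the trajectory's return,
# pushing probability mass toward the high-return action.
# All hyperparameters here are illustrative assumptions.

rng = np.random.default_rng(0)
logits = np.zeros(2)   # softmax policy parameters
lr = 0.5               # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)        # sample an action from the policy
    ret = 1.0 if a == 1 else 0.0      # return of this one-step trajectory
    grad_logp = -probs                # gradient of log pi(a) w.r.t. logits
    grad_logp[a] += 1.0
    logits += lr * ret * grad_logp    # REINFORCE update

# softmax(logits)[1] should now be close to 1.0
```

Note that trajectories with zero return contribute no update at all, which is why baselines (as in RAE above the raw algorithm) are commonly subtracted to reduce variance.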
MARL: Multi-Agent Reinforcement Learning—RL settings involving multiple interacting agents
SFT: Supervised Fine-Tuning—training on labeled examples (expert trajectories) rather than via trial-and-error RL