RLVR: Reinforcement Learning with Verifiable Rewards—using automated signals (math/code correctness) instead of a learned reward model
CoT: Chain-of-Thought—a prompting strategy where models generate intermediate reasoning steps before the final answer
PPO: Proximal Policy Optimization—a standard RL algorithm used to fine-tune LLMs; it clips policy updates to keep the new policy close to the old one, approximating a trust region
vLLM: A high-throughput LLM inference and serving library known for PagedAttention
PagedAttention: A memory management technique in vLLM that reduces memory waste by partitioning K/V cache into non-contiguous blocks
DeepSpeed ZeRO: Zero Redundancy Optimizer—a memory optimization strategy that partitions model states across data-parallel processes
AutoTP: Automatic Tensor Parallelism—DeepSpeed feature that automatically splits tensor operations across GPUs without manual layer injection policies
Ring Attention: A sequence parallelism technique using ring-based communication to distribute attention computation for very long sequences
Ray: A unified framework for scaling AI and Python applications, used here for orchestrating distributed actors
GRPO: Group Relative Policy Optimization—a PPO variant that estimates advantages by normalizing each response's reward against a group of responses sampled for the same prompt, removing the need for a learned value model; widely used for LLM reasoning tasks
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a GRPO-derived RL algorithm that decouples the upper and lower clipping ranges and dynamically filters prompt groups with no reward variance; used in the paper's experiments
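To make the "group relative" idea behind GRPO concrete, the sketch below (an illustrative simplification, not the paper's implementation; function name and reward encoding are assumptions) normalizes each sampled response's verifiable reward against the mean and standard deviation of its prompt group:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate for one prompt group.

    `rewards` holds verifiable rewards for G responses sampled from the
    same prompt (e.g. 1.0 if the math/code answer checks out, else 0.0).
    Each advantage is the reward's z-score within the group, so no
    learned value model is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    if std == 0:
        # All responses scored identically: the group carries no
        # learning signal (the case DAPO's dynamic sampling filters out).
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

For example, a group scored `[1.0, 0.0, 1.0, 0.0]` yields advantages `[1.0, -1.0, 1.0, -1.0]`: correct responses are reinforced, incorrect ones suppressed, and an all-correct or all-wrong group yields zero advantage everywhere.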