RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness (e.g., code compiles, math answer matches) as the reward signal instead of a neural reward model
PPO: Proximal Policy Optimization—an RL algorithm that updates the model policy while limiting how much it changes at each step to ensure stability
MoE: Mixture-of-Experts—a model architecture where different sub-modules (experts) activate for different inputs, allowing massive parameter counts with lower inference cost
Rollout: The process in which the model generates text (takes actions) and interacts with an environment, producing a trajectory used for training
vLLM: A high-throughput library for LLM inference and serving
Ray: A unified framework for scaling AI and Python applications, used here to orchestrate distributed workers
ZeRO: Zero Redundancy Optimizer—a memory optimization technique that partitions model states (optimizer states, gradients, and parameters) across data-parallel processes to reduce the per-GPU memory footprint
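To make the RLVR entry above concrete, here is a minimal sketch of a verifiable reward function for math answers. The `####` answer delimiter is an assumption (GSM8K-style formatting), not a universal convention; the point is that correctness is checked programmatically, with no learned reward model involved.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """RLVR-style reward: 1.0 if the parsed final answer matches, else 0.0.

    Assumes the model emits its final numeric answer after '####'.
    """
    match = re.search(r"####\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0  # unparseable completions earn zero reward
    return 1.0 if match.group(1) == ground_truth else 0.0
```

The same pattern extends to code rewards, where the check is "does the program compile and pass the unit tests" rather than a string match.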
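The "limiting how much it changes" in the PPO entry above is implemented by clipping the probability ratio between the new and old policies. A minimal per-token version of the clipped surrogate loss (eps=0.2 is the commonly used default, but it is a tunable hyperparameter):

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float, advantage: float,
                  eps: float = 0.2) -> float:
    """Per-token PPO clipped surrogate loss (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)                 # pessimistic bound, negated
```

Taking the minimum of the clipped and unclipped terms removes any incentive to push the ratio outside [1 - eps, 1 + eps], which is what keeps each update step small and stable.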
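The MoE entry above hinges on routing: a gating network scores the experts and only the top-k run for each token, so compute scales with k rather than with the total expert count. A simplified sketch of top-k routing for a single token (real routers also add load-balancing losses, omitted here):

```python
import math

def route_top_k(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Select the top-k experts for one token and softmax-normalize their weights.

    Returns (expert_index, mixing_weight) pairs; only these k experts
    are evaluated, and their outputs are combined with these weights.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, w / total) for i, w in zip(top, exps)]
```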
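As a rough illustration of the ZeRO entry above, here is a DeepSpeed-style configuration fragment written as a Python dict. The keys follow DeepSpeed's JSON config schema as commonly used; treat the exact values as illustrative rather than a recommended setup.

```python
# Illustrative DeepSpeed-style config enabling ZeRO stage 2:
# optimizer states and gradients are partitioned across
# data-parallel ranks; stage 3 would partition parameters too.
zero_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,  # 1: optimizer states; 2: + gradients; 3: + parameters
        "offload_optimizer": {"device": "cpu"},  # optionally spill to CPU RAM
    },
}
```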