RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using ground-truth checkers (e.g., math answers, unit tests)
RLMT: Reinforcement Learning with Model-rewarded Thinking—the proposed method using preference models to reward reasoning traces in open domains
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance
CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training models on labeled (prompt, response) pairs before RL
PPO: Proximal Policy Optimization—a standard on-policy RL algorithm
DPO: Direct Preference Optimization—an offline preference-learning algorithm that is typically applied without an explicit reward model; adapted here for on-policy learning
RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences
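The group-normalization step in the GRPO entry above can be sketched in a few lines. This is a minimal illustration, not the full algorithm: it shows only how rewards for a group of sampled outputs to the same prompt are converted to relative advantages (the helper name and the use of population statistics are assumptions for illustration).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of outputs for one prompt.

    Each output's advantage is its reward minus the group mean,
    divided by the group standard deviation, which centers and
    rescales rewards to reduce variance across prompts.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one prompt, scored by a reward model.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

After normalization the advantages are centered at zero, so outputs scoring above the group mean are reinforced and those below it are penalized, regardless of the prompt's absolute reward scale.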