Evol-Instruct: A method that uses LLMs to automatically generate complex, diverse instructions by iteratively rewriting existing ones (e.g., adding constraints or deepening the required reasoning)
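The iterative-rewriting loop can be sketched as below. The evolution prompt templates and the `stub_llm` function are illustrative placeholders, not the paper's actual prompts; a real pipeline would call an actual model.

```python
import random

# Hypothetical evolution prompt templates (illustrative wording, not the paper's).
EVOLUTION_OPS = {
    "add_constraint": "Rewrite the instruction, adding one new constraint:\n{instruction}",
    "deepen": "Rewrite the instruction so it needs deeper multi-step reasoning:\n{instruction}",
    "concretize": "Rewrite the instruction, replacing abstract terms with concrete ones:\n{instruction}",
}

def evolve(instruction: str, llm, rounds: int = 3, seed: int = 0) -> list[str]:
    """Iteratively rewrite an instruction, keeping every generation."""
    rng = random.Random(seed)
    generations = [instruction]
    for _ in range(rounds):
        op = rng.choice(sorted(EVOLUTION_OPS))      # pick one evolution operation
        prompt = EVOLUTION_OPS[op].format(instruction=generations[-1])
        generations.append(llm(prompt))             # the rewrite becomes the next seed
    return generations

# Stub LLM so the sketch runs; it just tags the instruction line.
def stub_llm(prompt: str) -> str:
    return prompt.splitlines()[-1] + " (evolved)"

history = evolve("Solve 3x + 5 = 20 for x.", stub_llm, rounds=2)
```

Each round feeds the previous generation back in, so complexity compounds across rounds rather than being added all at once.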
RLEIF: Reinforcement Learning from Evol-Instruct Feedback—the paper's proposed method, which combines Evol-Instruct data generation with RL optimization guided by two reward models (IRM and PRM)
PRM: Process-supervised Reward Model—a reward model that scores the correctness of each individual step in a reasoning chain, rather than just the final answer
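The contrast between step-level and answer-only reward can be sketched as follows. The per-step scores are hand-set to mimic a trained PRM's output, and min-aggregation is one common choice, not necessarily the paper's.

```python
# Illustrative step-level scoring with a PRM vs. an answer-only reward.

def prm_reward(step_scores: list[float]) -> float:
    """Aggregate per-step correctness scores into one scalar reward.
    Taking the minimum penalizes a chain for its weakest step;
    product or mean aggregation are also common choices."""
    return min(step_scores)

def answer_only_reward(predicted: str, gold: str) -> float:
    """Reward that only checks the final answer."""
    return 1.0 if predicted == gold else 0.0

# A chain whose second step a PRM scores as likely wrong (0.10),
# even though the final answer happens to match the gold answer.
step_scores = [0.95, 0.10, 0.90]
print(answer_only_reward("42", "42"))  # 1.0 — the flawed step is invisible
print(prm_reward(step_scores))         # 0.1 — the flawed step is penalized
```

This is exactly the failure mode the False-Positive entry below describes: answer-only checking cannot distinguish a sound derivation from a lucky one.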
IRM: Instruction Reward Model—a reward model trained to predict the quality (difficulty and clarity) of the mathematical instructions themselves
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to update the language model policy
GSM8k: A benchmark dataset of 8.5K high-quality, linguistically diverse grade school math word problems requiring multi-step reasoning
MATH: A benchmark dataset of 12,500 challenging competition mathematics problems (algebra, geometry, number theory, precalculus, etc.)
SFT: Supervised Fine-Tuning—training the model on instruction-response pairs with standard next-token prediction, typically as the stage preceding RL
False-Positive: A reasoning outcome in which the final answer is correct but one or more intermediate steps contain errors—a flaw that answer-only rewards cannot detect
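Tying the glossary together, a minimal sketch of how the two reward models might be combined into the scalar PPO reward. Multiplying the IRM and PRM scores is an assumption here; the paper may combine them differently (e.g., weighted sum).

```python
# Hypothetical combined RLEIF reward: instruction quality (IRM) gates
# the process-level answer score (PRM). Aggregation choices are illustrative.

def combined_reward(irm_score: float, prm_step_scores: list[float]) -> float:
    """Scalar reward for PPO from both reward models.

    irm_score       -- IRM's quality score for the instruction, in [0, 1]
    prm_step_scores -- PRM's per-step correctness scores, each in [0, 1]
    """
    prm_score = min(prm_step_scores)   # weakest-step aggregation (illustrative)
    return irm_score * prm_score       # product combination (an assumption)

# A well-posed instruction (0.8) with one weak reasoning step (0.5):
reward = combined_reward(0.8, [0.9, 0.5, 0.95])
```

Gating by the IRM score means a chain only earns high reward when the evolved instruction itself is judged well-formed, not just when the reasoning passes the PRM.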