PRM: Process Reward Model—a model that evaluates the correctness of intermediate reasoning steps, not just the final answer
Implicit PRM: A method to derive process rewards from a model trained only on outcome labels, without step-level annotations—each step is scored by the log-likelihood ratio of the generated text under that model versus a reference model
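A minimal sketch of the log-ratio scoring above. It assumes you already have per-token log-probabilities from both models and a list of step-end token indices; the function name, the token-level inputs, and the `beta` scale are illustrative assumptions, not a fixed API.

```python
def implicit_process_rewards(policy_logprobs, ref_logprobs,
                             step_boundaries, beta=0.05):
    """Score each reasoning step with the beta-scaled log-likelihood
    ratio between the outcome-trained model and the reference model,
    summed over that step's tokens. (Inputs/beta are assumptions.)"""
    token_rewards = [beta * (p - q)
                     for p, q in zip(policy_logprobs, ref_logprobs)]
    rewards, start = [], 0
    for end in step_boundaries:  # each entry is an exclusive end index
        rewards.append(sum(token_rewards[start:end]))
        start = end
    return rewards
```

A step whose tokens the outcome-trained model finds more likely than the reference does receives a positive process reward.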
Outcome Verifier: A rule-based function that checks if the final answer matches the ground truth (e.g., exact string match)
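A rule-based verifier can be as simple as the following sketch. The `####` answer-marker convention is an assumption borrowed from common math-benchmark formatting, not something the definition above prescribes.

```python
def outcome_verifier(response: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the ground truth exactly,
    else 0.0. Assumes the answer follows a '####' marker (a common
    convention; the extraction rule here is an assumption)."""
    answer = response.split("####")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

Because it checks only the final answer, this verifier is cheap and unhackable by stylistic tricks, but it provides no signal about intermediate steps.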
Reward Hacking: When an RL agent learns to exploit flaws in the reward model to get high scores without actually performing the task correctly
RLOO: REINFORCE Leave-One-Out—an algorithm that estimates each sample's advantage by subtracting the mean reward of the other samples drawn for the same prompt
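The leave-one-out baseline described above can be computed in a few lines; this sketch assumes a flat list of scalar rewards for k samples of one prompt.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the other k-1 samples for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With binary outcome rewards [1, 0, 0, 1], the successful samples get advantage +2/3 and the failed ones -2/3: the baseline is unbiased because each sample is excluded from its own baseline.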
PPO: Proximal Policy Optimization—an RL algorithm that updates policies conservatively to prevent performance collapse
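The conservative update comes from PPO's clipped surrogate objective; here is a single-sample sketch (the helper name and `eps` default are illustrative).

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample.
    ratio = pi_new(a|s) / pi_old(a|s). Taking the min of the clipped
    and unclipped objectives removes the incentive to move the ratio
    outside [1 - eps, 1 + eps]."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)  # negated: we minimize the loss
```

When the ratio drifts past 1 + eps on a positive-advantage sample, the gradient through the clipped term is zero, so the policy stops being pushed further from the old one.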
SFT: Supervised Fine-Tuning—training the model on labeled demonstrations before RL
Credit Assignment: Determining which specific steps in a sequence contributed to the final success or failure
Reward Sparsity: The challenge where the agent only receives feedback at the very end of a long task, making it hard to learn intermediate steps