PRM: Process Reward Model—a model that evaluates the correctness of each intermediate step in a reasoning chain, providing dense feedback
Credit Assignment: The problem of determining which past actions are responsible for a final outcome or reward
RFT: Reinforcement Fine-Tuning—improving a pre-trained model using reinforcement learning algorithms like PPO or GRPO
Verifiable Reward: A sparse reward signal (usually 0 or 1) given only at the end of generation based on whether the final answer matches the ground truth
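As a minimal sketch of the idea, a verifiable reward can be as simple as an exact-match check against the ground truth; the function name and string-normalization choice here are illustrative, not from the source:

```python
def verifiable_reward(final_answer: str, ground_truth: str) -> float:
    """Sparse terminal reward: 1.0 if the final answer matches the
    ground truth, 0.0 otherwise. No credit for intermediate steps."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0
```

Real verifiers are usually more forgiving (numeric tolerance, answer extraction from a longer completion), but the signal stays binary and arrives only at the end of generation.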
RLOO: REINFORCE Leave-One-Out—a policy gradient estimator that uses the average reward of other samples as a baseline to reduce variance
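The leave-one-out baseline is easy to show concretely. In this sketch (function name assumed), each sample's advantage is its reward minus the mean reward of the other n-1 samples in the same group:

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """For sample i, baseline = mean of the OTHER n-1 rewards.
    Advantage = r_i - baseline. Requires at least 2 samples."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

A useful property falls out of the algebra: the advantages in each group always sum to zero, so a correct sample is pushed up exactly as hard as the incorrect ones are pushed down.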
DPO: Direct Preference Optimization—a method to align models directly on preference pairs, without training a separate reward model or running an RL loop
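The per-pair DPO loss can be sketched in a few lines. Assuming we already have summed log-probabilities of the chosen and rejected completions under the policy and the frozen reference model, the loss is the negative log-sigmoid of a beta-scaled margin of log-ratios (the function name and default beta here are illustrative):

```python
import math

def dpo_pair_loss(pi_chosen: float, pi_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one preference pair.
    margin = (log pi(y_w) - log ref(y_w)) - (log pi(y_l) - log ref(y_l))
    loss   = -log sigmoid(beta * margin)"""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; training drives the margin positive, widening the policy's preference for the chosen response relative to the reference.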
Reward Hacking: When an agent exploits flaws in the reward function to maximize reward without achieving the intended goal (e.g., generating gibberish that the reward model nonetheless scores highly)
Best-of-N: An inference strategy where the model generates N solutions and a reward model selects the best one
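Best-of-N reduces to generating N candidates and taking the argmax under the reward model. This sketch treats `generate` and `score` as opaque callables (both are placeholders, not a specific API):

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions for the prompt, then return the
    candidate the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Note the connection to Reward Hacking above: the quality of Best-of-N is capped by the reward model, since it selects whatever the scorer likes best, flaws included.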