Soft Bellman Equation: A consistency condition in maximum entropy RL relating the optimal value function to the immediate reward and the entropy-regularized value of the next state
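As a concrete illustration, here is a minimal tabular sketch of the soft Bellman backup, V(s) = τ log Σ_a exp((r(s,a) + γ V(s'))/τ), on a hypothetical two-state MDP (the states, actions, and rewards below are invented for illustration, not from any specific benchmark):

```python
import math

# Hypothetical deterministic 2-state, 2-action MDP (illustration only).
rewards = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
           ("s1", "a0"): 0.0, ("s1", "a1"): 2.0}
next_state = {("s0", "a0"): "s1", ("s0", "a1"): "s0",
              ("s1", "a0"): "s0", ("s1", "a1"): "s1"}
gamma, tau = 0.9, 1.0  # discount factor and entropy temperature

def soft_backup(V):
    # One application of the soft Bellman operator:
    # V(s) = tau * log sum_a exp((r(s,a) + gamma * V(s')) / tau)
    return {s: tau * math.log(sum(
        math.exp((rewards[s, a] + gamma * V[next_state[s, a]]) / tau)
        for a in ("a0", "a1"))) for s in ("s0", "s1")}

# Iterate the operator; its fixed point is the soft-optimal value function.
V = {"s0": 0.0, "s1": 0.0}
for _ in range(500):
    V = soft_backup(V)
```

At convergence, applying `soft_backup` leaves `V` unchanged, which is exactly the consistency condition the entry describes; as τ → 0 the log-sum-exp collapses to the ordinary hard max of the standard Bellman equation.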
DPO: Direct Preference Optimization—an offline method aligning models to preferences by optimizing the policy directly without a separate reward model, typically requiring paired preference data (a chosen and a rejected response for the same prompt)
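The DPO objective for one pair is -log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]). A minimal scalar sketch, assuming the caller supplies summed token log-probabilities from the policy and a frozen reference model (function and argument names here are illustrative, not from a specific library):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    logp_* are sequence log-probs under the policy being trained;
    ref_* are the same quantities under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log sigmoid(margin): small when the margin is large and positive.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one.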
Rejection Sampling: A simple baseline where the model generates multiple samples per problem, filters for correct ones, and fine-tunes on those correct trajectories; this generate-filter-finetune loop is the core of methods such as STaR
Sparse Reward: A setting where feedback (reward) is only received at the end of a task (e.g., correct answer), not at every intermediate step
Credit Assignment: The problem of determining which past actions contributed to a final outcome; difficult in reasoning tasks, where a long chain of steps yields only a single final score
PCL: Path Consistency Learning—an algorithm that unifies value and policy learning by enforcing a path-wise consistency relation between the value function and the policy's log-probabilities along trajectories
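The single-step path consistency relation is V(s) − γV(s') = r(s,a) − τ log π(a|s), which holds for every action when V and π are soft-optimal. A self-contained sketch on a hypothetical two-state MDP (all names and numbers below are invented for illustration; PCL itself trains V and π jointly by minimizing the squared residual, whereas here π is derived from a converged V to show the relation holding):

```python
import math

tau, gamma = 0.5, 0.9
# Hypothetical deterministic 2-state MDP (illustration only).
R = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 0.0, ("s1", "a1"): 2.0}
T = {("s0", "a0"): "s1", ("s0", "a1"): "s0",
     ("s1", "a0"): "s0", ("s1", "a1"): "s1"}
states, actions = ("s0", "s1"), ("a0", "a1")

# Soft value iteration to (approximately) reach the soft-optimal V*.
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: tau * math.log(sum(math.exp((R[s, a] + gamma * V[T[s, a]]) / tau)
                               for a in actions)) for s in states}

def log_pi(s, a):
    # Soft-optimal policy: log pi(a|s) = (r + gamma*V(s') - V(s)) / tau.
    return (R[s, a] + gamma * V[T[s, a]] - V[s]) / tau

def residual(s, a):
    # Path consistency residual; PCL minimizes the square of this quantity.
    return V[s] - gamma * V[T[s, a]] - (R[s, a] - tau * log_pi(s, a))
```

Only at the soft-optimal fixed point does `log_pi` define a properly normalized policy (Σ_a π(a|s) = 1), which is what makes the consistency condition a useful joint training signal rather than a tautology.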
Value Function: A model that predicts the expected future cumulative reward from a given state (partial reasoning chain)
Beam Search: A search algorithm that, at each step, keeps only a fixed number of the highest-scoring partial sequences (the beam width) and expands those, pruning the rest
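A generic sketch, assuming caller-supplied `expand` (candidate continuations of a partial sequence) and `score` (ranking function) callbacks; both names are hypothetical, not from a specific library:

```python
from heapq import nlargest

def beam_search(start, expand, score, beam_width=3, steps=4):
    """Keep only the beam_width best partial sequences at each step."""
    beam = [start]
    for _ in range(steps):
        # Expand every sequence in the beam by every candidate continuation.
        candidates = [seq + [tok] for seq in beam for tok in expand(seq)]
        if not candidates:
            break
        # Prune: retain only the top beam_width candidates by score.
        beam = nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)
```

With beam_width=1 this degenerates to greedy decoding; a larger beam trades compute for a better chance of finding high-scoring sequences, though unlike exhaustive search it can still prune the true optimum.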
SFT: Supervised Fine-Tuning—training on labeled target outputs