GFlowNets: Generative Flow Networks—a probabilistic framework for training policies to sample objects in proportion to their reward, rather than just maximizing reward
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a sampled group of outputs to reduce variance without a value function
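The group-relative normalization at the heart of GRPO can be sketched in a few lines. This is a simplified illustration (the function name is made up, and the full algorithm also includes a clipped policy-gradient objective), showing only how per-output advantages are formed without a learned value function:

```python
def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group of outputs:
    subtract the group mean and divide by the group standard
    deviation, yielding a baseline-free advantage per output."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group's own mean reward, no separate value network is needed to reduce variance.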
CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps before the final answer
Trajectory Balance: A loss function from GFlowNets that enforces conservation of probability flow along a trajectory
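The trajectory balance condition states that Z times the forward-policy probability of a trajectory should equal the reward of its terminal object times the backward-policy probability. A minimal sketch of the resulting squared-error loss (function and argument names are illustrative; rewards are assumed positive):

```python
import math

def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, reward):
    """Squared log-space discrepancy of the trajectory balance
    condition Z * prod P_F = R(x) * prod P_B for one trajectory.
    `log_pf_steps` / `log_pb_steps` hold the per-step log-probs
    of the forward and backward policies; `reward` must be > 0."""
    lhs = log_Z + sum(log_pf_steps)
    rhs = math.log(reward) + sum(log_pb_steps)
    return (lhs - rhs) ** 2
```

When the flows are perfectly balanced the two sides coincide and the loss is zero; training drives every sampled trajectory toward that condition.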
Reverse KL Divergence: A measure of difference between two probability distributions; minimizing it is mode-seeking, encouraging the model to concentrate its mass on a few high-probability modes of the target rather than covering all of them
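A toy discrete example (constructed here for illustration, not taken from any source) makes the mode-seeking behavior concrete: against a bimodal target, reverse KL scores a distribution that commits to one mode better than one that spreads mass into the target's low-probability region.

```python
import math

def reverse_kl(q, p):
    """Reverse KL divergence KL(q || p) for discrete distributions,
    using the convention 0 * log(0 / p) = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Bimodal target: two strong modes, a low-probability middle state.
p = [0.495, 0.01, 0.495]
mode_seeking = [0.98, 0.01, 0.01]  # commits to a single mode of p
spread = [1 / 3, 1 / 3, 1 / 3]     # puts mass where p is tiny
```

Here `reverse_kl(mode_seeking, p)` is smaller than `reverse_kl(spread, p)`: placing mass where the target is near zero is heavily penalized, which is exactly why reverse-KL training can sacrifice diversity.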
Partition Function: A normalization constant Z(x), equal to the sum of all unnormalized energy or reward values, by which those values are divided to form a valid probability distribution
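For a finite set of objects the role of Z is easy to see in code. A minimal sketch (the function name is invented for this example):

```python
def normalize(rewards):
    """Convert unnormalized, non-negative reward values into a valid
    probability distribution by dividing each by the partition
    function Z, the sum of all the values."""
    Z = sum(rewards)
    return [r / Z for r in rewards], Z
```

This is also exactly the distribution a GFlowNet is trained to sample from: each object's probability is its reward divided by Z.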
Importance Sampling: A technique to estimate properties of a distribution using samples from a different distribution, used here to train on 'stale' data from an old policy
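The core of importance sampling is reweighting each stale sample by the likelihood ratio between the current (target) policy and the old behavior policy that generated it. A minimal sketch with invented names, using log-probability callables to mirror how policies typically expose their densities:

```python
import math

def importance_weighted_mean(samples, logp_target, logp_behavior, f):
    """Estimate E_target[f(x)] from `samples` drawn under a stale
    behavior policy: weight each sample by the likelihood ratio
    p_target(x) / p_behavior(x), then average."""
    total = 0.0
    for x in samples:
        w = math.exp(logp_target(x) - logp_behavior(x))
        total += w * f(x)
    return total / len(samples)
```

The estimate is unbiased as long as the behavior policy assigns nonzero probability everywhere the target does, which is what licenses training on data from an old policy.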
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to ensure stability
Mode Collapse: A failure mode in generative models where the output diversity drops and the model produces only a limited set of similar samples