endogenous reward: A reward signal extracted directly from the output logits of a pre-trained or SFT model itself, rather than from a separately trained external reward model
IRL: Inverse Reinforcement Learning—the problem of deriving a reward function from observed optimal behavior (demonstrations)
soft Q-function: A value function in entropy-regularized RL that estimates the expected return plus entropy; the paper links LLM logits directly to this function
inverse soft Bellman operator: The mathematical operator used to recover the reward function from the soft Q-function (logits)
SFT: Supervised Fine-Tuning—training a model on high-quality demonstrations using next-token prediction
RLHF: Reinforcement Learning from Human Feedback—a method to align models using a reward model trained on human preferences
compounding errors: The accumulation of small prediction errors over a sequence, causing the model to drift far from the optimal trajectory; RL corrects this better than imitation learning
LLM-as-a-judge: Using a powerful LLM to evaluate and score the outputs of other models, often used as a proxy for human evaluation
Bradley-Terry model: A statistical model used in RLHF to estimate the probability that one response is preferred over another based on reward scores
MaxEnt IRL: Maximum Entropy Inverse Reinforcement Learning—a framework that recovers a reward function consistent with expert behavior while keeping the policy as high-entropy as possible, committing to nothing beyond what the demonstrations support and thereby avoiding overfitting
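The entries above fit together as a small pipeline: treat the model's logits as a soft Q-function, apply the inverse soft Bellman operator to recover a per-token reward, and compare sequence-level rewards with the Bradley-Terry model. A minimal sketch of that pipeline follows; the function names (`soft_value`, `endogenous_rewards`, `bradley_terry`) and the convention that the terminal state's soft value is zero are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def soft_value(q):
    # Soft (entropy-regularized) state value: V(s) = log sum_a exp Q(s, a),
    # computed with the max-shift trick for numerical stability.
    m = q.max()
    return m + np.log(np.exp(q - m).sum())

def endogenous_rewards(logits, tokens, gamma=1.0):
    # Inverse soft Bellman operator, assuming logits[t] is the soft
    # Q-vector over the vocabulary at step t and tokens[t] is the action
    # (token) actually taken:
    #   r_t = Q(s_t, a_t) - gamma * V(s_{t+1})
    # For the final token we take V(terminal) = 0 (an assumption here).
    rewards = []
    T = len(tokens)
    for t in range(T):
        q_sa = logits[t][tokens[t]]
        if t + 1 < T:
            rewards.append(q_sa - gamma * soft_value(logits[t + 1]))
        else:
            rewards.append(q_sa)
    return rewards

def bradley_terry(r_a, r_b):
    # Bradley-Terry preference probability from scalar reward scores:
    # P(a preferred over b) = sigmoid(r_a - r_b).
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))
```

For example, summing `endogenous_rewards` over two candidate responses and feeding the totals to `bradley_terry` yields a preference probability without any externally trained reward model, which is the core of the endogenous-reward idea.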