TPO: Test-time Preference Optimization—the proposed method of aligning model outputs with human preferences at inference time via iterative textual feedback, without updating model weights
RLHF: Reinforcement Learning from Human Feedback—a standard training method to align models using a reward model trained on human preferences
DPO: Direct Preference Optimization—a training method that aligns models directly on preference pairs, bypassing RLHF's explicit reward model and RL loop
Best-of-N: An inference strategy where N responses are generated and the one with the highest reward model score is selected
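The Best-of-N strategy above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` and `reward_model` are hypothetical stand-ins for sampling from the policy model and scoring with a trained reward model (here the scorer simply favors longer responses, purely so the example runs).

```python
import random

def reward_model(response: str) -> float:
    # Stand-in scorer: a real reward model would estimate human preference.
    # Here we just prefer longer responses, purely for illustration.
    return float(len(response))

def generate(prompt: str) -> str:
    # Stand-in for sampling one response from the policy model.
    filler = " ".join(random.choices(["a", "b", "c"], k=random.randint(1, 5)))
    return prompt + " " + filler

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample N candidates, return the one the reward model scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)
```

Because only the single highest-scoring sample is kept, quality improves with N at the cost of N times the generation compute.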
Textual Gradient: Natural language revision instructions that the model derives from the textual loss (critique), specifying how to refine the response
Textual Loss: A natural language critique generated by comparing a chosen (high-reward) and rejected (low-reward) response
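The two entries above form TPO's inner loop: compare responses to produce a textual loss, convert it into a textual gradient, then apply that gradient to produce a revised response. A minimal sketch follows, with `llm` as a hypothetical stub for a call to the policy model; the prompt wordings are illustrative, not the paper's exact templates.

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for querying the policy model;
    # a real implementation would call an actual LLM here.
    return "stub: " + prompt[:40]

def tpo_step(query: str, chosen: str, rejected: str) -> str:
    # Textual loss: a critique comparing the high- and low-reward responses.
    textual_loss = llm(
        f"Explain why response A is better than B.\nA: {chosen}\nB: {rejected}"
    )
    # Textual gradient: concrete revision instructions derived from the critique.
    textual_gradient = llm(
        f"Turn this critique into revision instructions: {textual_loss}"
    )
    # Apply the gradient: refine the response according to the instructions.
    return llm(
        f"Revise the answer to '{query}' following: {textual_gradient}"
    )
```

Iterating this step at inference time plays the role that gradient updates play in training-based alignment, but operates entirely in natural language.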
Policy Model: The language model generating the responses (the 'actor' in RL terms)
SFT: Supervised Fine-Tuning—the initial training phase on high-quality instruction data before preference alignment
LC score: Length-Controlled win rate—an AlpacaEval metric that corrects for the bias that longer responses tend to receive higher scores
WR score: Win Rate—the percentage of times a model's output is preferred over a baseline