DPO: Direct Preference Optimization—an algorithm that aligns language models to human preferences by directly optimizing the policy on pairwise preference data without a separate reward model
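The DPO objective can be made concrete with a small sketch. This is a minimal, self-contained illustration of the standard DPO loss for one preference pair (not code from the source); the log-probability values and `beta=0.1` are made-up inputs for demonstration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the trainable policy (logp_*) or the frozen
    reference model (ref_logp_*); beta controls how far the policy may
    drift from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss decreases as the
    # policy increasingly agrees with the human preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Zero margin gives log(2); a positive margin (policy already agrees
# with the preference more than the reference does) gives less.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Note that no reward model appears anywhere: the pairwise log-probability ratios play the role of an implicit reward, which is the point of the "direct" in DPO.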
RMAB: Restless Multi-Armed Bandit—a sequential resource allocation problem in which every arm's state evolves at each step whether or not it is acted upon (hence "restless"), and a budget limits how many arms can be acted on at once
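A toy simulation can make the RMAB structure concrete. The following sketch is purely illustrative (the two-state transition probabilities, myopic policy, and arm/budget counts are all invented for the example, not taken from the source):

```python
import random

def simulate_rmab(n_arms=4, budget=1, horizon=20, seed=0):
    """Minimal restless bandit: each arm is a 2-state Markov chain
    (0 = bad, 1 = good). Acting on an arm raises its chance of reaching
    or staying in the good state, but every arm transitions each step
    regardless (the 'restless' property). Reward per step = number of
    arms in the good state. The policy here is myopic: spend the budget
    on arms currently in the bad state."""
    rng = random.Random(seed)
    # p_good_next[action][state]: probability the arm is in the good
    # state next step (illustrative numbers, not from the source).
    p_good_next = {0: {0: 0.1, 1: 0.7},   # passive
                   1: {0: 0.6, 1: 0.95}}  # active
    states = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        total_reward += sum(states)
        bad = [i for i, s in enumerate(states) if s == 0]
        acted = set(bad[:budget])  # budget constraint: act on <= budget arms
        states = [1 if rng.random() < p_good_next[i in acted][s] else 0
                  for i, s in enumerate(states)]
    return total_reward

print(simulate_rmab())
```

The budget coupling across otherwise independent chains is what makes RMABs hard; practical solvers use index policies (e.g. Whittle indices) rather than the myopic rule sketched here.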
DRO: Distributionally Robust Optimization—an optimization framework that minimizes the worst-case loss over a set of possible distributions (ambiguity set) rather than just the empirical average
Ambiguity Set: A set of probability distributions considered possible around the observed data distribution; the model optimizes against the worst distribution in this set
Self-reflection: An inference-time technique where an LLM critiques and refines its own outputs, often using feedback from a simulator
Chi-squared divergence: A statistical measure of the difference between two probability distributions, used here to define the size of the ambiguity set
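The chi-squared ambiguity set and the DRO worst-case loss from the entries above can be sketched together for discrete distributions. This is an illustrative brute-force version (grid search over the simplex, with made-up losses and radius); real DRO solvers use convex duality instead:

```python
import itertools

def chi2_divergence(p, q):
    """Chi-squared divergence chi^2(P || Q) = sum_i (p_i - q_i)^2 / q_i
    for two discrete distributions with matching support (q_i > 0)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def worst_case_loss(losses, q, radius, step=0.01):
    """DRO inner maximization by grid search: the largest expected loss
    over distributions p in the ambiguity set
    {p : chi^2(p || q) <= radius} around the empirical distribution q."""
    n = len(q)
    # Start from the empirical expectation (p = q is always feasible).
    best = sum(li * qi for li, qi in zip(losses, q))
    ticks = [i * step for i in range(int(1 / step) + 1)]
    for combo in itertools.product(ticks, repeat=n - 1):
        last = 1.0 - sum(combo)
        if last < -1e-9:
            continue  # not a valid distribution
        p = list(combo) + [max(last, 0.0)]
        if chi2_divergence(p, q) <= radius:
            best = max(best, sum(li * pi for li, pi in zip(losses, p)))
    return best

q = [0.5, 0.3, 0.2]        # empirical distribution
losses = [1.0, 2.0, 5.0]   # per-outcome loss
print(worst_case_loss(losses, q, radius=0.0))  # = empirical expectation 2.1
print(worst_case_loss(losses, q, radius=0.5))  # larger: mass shifts toward loss 5
```

With radius 0 the ambiguity set collapses to the empirical distribution and DRO reduces to ordinary empirical risk; a larger radius lets adversarial mass move onto high-loss outcomes, which is exactly the conservatism DRO buys.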
DLM: Decision Language Model—a baseline method that uses an LLM with self-reflection and iterative feedback to design reward functions
Reward Hacking: When an agent exploits flaws in the reward function to get a high score without actually achieving the intended goal