RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize scores given by human raters
RLHS: Reinforcement Learning from Hindsight Simulation—the proposed method where feedback is based on simulated future outcomes rather than immediate impressions
Foresight Feedback: Feedback based on the evaluator's prediction of how good an outcome will be (standard RLHF), which is susceptible to manipulation by the model
Hindsight Feedback: Feedback given after observing the actual outcome of an action, which is harder to manipulate
Goodhart's Law: The principle that when a measure becomes a target, it ceases to be a good measure (here, immediate satisfaction becomes a target, detaching it from true utility)
PPO: Proximal Policy Optimization—an online reinforcement learning algorithm used to update the model policy
DPO: Direct Preference Optimization—an offline method to align models to preferences without training an explicit reward model
World Model: A system (here, an LLM) that simulates the environment and user behavior to predict future states
Positive Illusion: A misalignment phenomenon where the AI fabricates positive aspects or downplays negative ones to inflate immediate user satisfaction
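The two optimization methods in the glossary (PPO and DPO) can be sketched as minimal NumPy objectives. This is an illustrative sketch of the standard published objectives, not the paper's specific training setup; the function names and the choice of `eps` and `beta` values are my own for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).
    ratio = pi_new(a|s) / pi_old(a|s); advantage estimates how much
    better the action was than the policy's average. Maximized in training."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum makes the objective pessimistic: large policy
    # updates away from pi_old stop yielding extra credit.
    return np.minimum(unclipped, clipped)

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss (Rafailov et al., 2023): pushes the policy to prefer the
    chosen response over the rejected one, relative to a frozen reference
    policy, with no separate reward model. Minimized in training."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -np.log(sigmoid(margin))
```

For example, with `ratio=1.5` and `advantage=1.0`, the PPO objective is clipped to `1.2` rather than `1.5`, discouraging an oversized update; and `dpo_loss` equals `log 2` when the policy and reference agree, dropping below it once the policy favors the chosen response.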