Activation Steering: Modifying a model's behavior by directly intervening on its internal representations (activations) during inference rather than changing its weights
Reward Hacking: When a model exploits flaws or biases in a reward function (e.g., length bias) to get a high score without actually satisfying the intended objective
REINFORCE: A policy-gradient algorithm from reinforcement learning that estimates gradients from sampled actions and their rewards, which lets it optimize non-differentiable objectives (used here to select discrete attention heads)
LLM-as-a-Judge: Using a large language model to evaluate the quality of text by prompting it to act as a judge/scorer
PreferenceHack: A new benchmark proposed in this paper to test reward models' robustness against specific biases (length, format, positivity) in a paired preference setting
Task Vector: A vector derived from model activations that captures a specific task or behavior, often added to the model's activations to steer it
Attention Head: One of several parallel attention units in a Transformer layer, each computing its own weighting over the input sequence; this paper steers specific heads to encode preferences
LMM: Large Multimodal Model, an AI model capable of processing and generating multiple modalities (e.g., text and images)
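
As a rough illustration of the REINFORCE entry above, the sketch below learns which attention heads to switch on by sampling binary masks and applying the score-function gradient. It is a toy under stated assumptions, not the paper's implementation: the independent Bernoulli policy per head, the mean-reward baseline, the hyperparameters, and the names `reinforce_select_heads` and `toy_reward` are all hypothetical. In a real setting the reward would presumably come from evaluating the steered model rather than from a hand-written function.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_select_heads(reward_fn, num_heads, steps=500, samples=16,
                           lr=0.5, seed=0):
    """Learn per-head selection probabilities with REINFORCE.

    Assumption: each head is switched on independently with probability
    sigmoid(logit), and reward_fn scores a sampled 0/1 mask.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_heads
    for _ in range(steps):
        probs = [sigmoid(l) for l in logits]
        # Sample discrete masks; this sampling step is not differentiable.
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(samples)]
        rewards = [reward_fn(m) for m in masks]
        # Mean reward as a baseline to reduce gradient variance.
        baseline = sum(rewards) / samples
        for i in range(num_heads):
            # For a Bernoulli with p = sigmoid(logit):
            # d/dlogit log P(mask_i) = mask_i - p_i
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for r, m in zip(rewards, masks)) / samples
            logits[i] += lr * grad  # gradient ascent on expected reward
    return [sigmoid(l) for l in logits]


def toy_reward(mask):
    # Hypothetical reward: heads 1 and 4 help, every other head hurts.
    useful = {1, 4}
    return sum(1.0 if i in useful else -0.5
               for i, on in enumerate(mask) if on)


probs = reinforce_select_heads(toy_reward, num_heads=8)
selected = [i for i, p in enumerate(probs) if p > 0.5]
print(selected)
```

The reward is only ever evaluated on sampled masks, never differentiated, which is why this estimator suits a discrete choice such as picking heads.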