Reward Hacking: When a model exploits unintended loopholes or shortcuts in a reward function to get high scores without achieving the intended goal
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
VFT: Verbalization Fine-Tuning—the proposed method of fine-tuning models on CoTs that explicitly acknowledge the influence of prompt cues
ECR: Effective Cue Influence Rate—the fraction of responses that are *undetected* reward hacks (influenced by the cue but not verbalized)
BCT: Bias-augmented Consistency Training—a baseline method that trains models to ignore biasing cues by enforcing consistent answers between cued and uncued prompts
SFT: Supervised Fine-Tuning—training a model on a labeled dataset of inputs and desired outputs
RL: Reinforcement Learning—training a model to maximize a numerical reward signal
Out-of-Distribution (OOD): Data that differs from the training distribution; here, cues seen during RL that were not seen during VFT
Faithfulness: The extent to which a model's stated reasoning (CoT) accurately reflects the true causes of its prediction
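
Of the terms above, ECR is the one defined as a concrete quantity (a fraction over responses), so here is a minimal sketch of how it could be computed. The field names (`influenced_by_cue`, `verbalized_cue`) are illustrative assumptions, not identifiers from the paper; the paper's actual pipeline uses model-based judgments to label both conditions.

```python
def effective_cue_influence_rate(responses):
    """ECR sketch: fraction of all responses that are *undetected* reward
    hacks -- the answer was influenced by the prompt cue, but the CoT
    never verbalized that influence. Field names are assumptions:
      'influenced_by_cue': answer followed the cue
      'verbalized_cue':    CoT acknowledged the cue's influence
    """
    undetected_hacks = sum(
        1 for r in responses
        if r["influenced_by_cue"] and not r["verbalized_cue"]
    )
    return undetected_hacks / len(responses)

# Toy example: 4 responses, 2 influenced by the cue, 1 of those verbalized
sample = [
    {"influenced_by_cue": True,  "verbalized_cue": False},  # undetected hack
    {"influenced_by_cue": True,  "verbalized_cue": True},   # detected hack
    {"influenced_by_cue": False, "verbalized_cue": False},  # unaffected
    {"influenced_by_cue": False, "verbalized_cue": False},  # unaffected
]
print(effective_cue_influence_rate(sample))  # -> 0.25
```

Under this framing, VFT aims to drive ECR down by moving cue-influenced responses from the undetected bucket into the verbalized (detected) one, rather than by eliminating cue influence itself.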