
Aligning to Illusions: Choice Blindness in Human and AI Feedback

Wenbin Wu
Cambridge Judge Business School, University of Cambridge, UK
arXiv (2026)
RL · Factuality · P13N · Benchmark

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · AI Psychology / Cognitive Science · Data Quality & Label Noise
Experiments reveal that both human annotators and AI judges frequently fail to detect swapped preferences, creating corrupted training signals that standard reward modeling metrics cannot identify.
Core Problem
RLHF relies on the assumption that preference judgments are stable and interchangeable, but humans and models often accept manipulated choices they never made (choice blindness).
Why it matters:
  • Current alignment pipelines treat labels as ground truth, but if preferences are constructed in the moment, the training signal is unstable and context-dependent
  • Standard evaluation metrics like pairwise accuracy remain high even when the reward signal is significantly corrupted, masking model degradation
  • Safeguards against random noise fail to address 'preference construction,' where the elicitation context itself shapes the label
Concrete Example: A participant selects an explanation of why 'Lucifer' is the 'Morning Star' (Venus). The system surreptitiously swaps this for the rejected response containing factual errors. Instead of objecting, the participant accepts the swap and confabulates a justification: 'It provides specific and accurate information... [about the] night sky,' defending the choice they explicitly rejected moments ago.
Key Novelty
Choice Blindness applied to RLHF pipelines
  • Adapts the psychological 'choice blindness' paradigm (swapping a subject's decision and asking for justification) to text-based RLHF annotation for both humans and LLMs
  • Demonstrates a 'detection gap': reward models trained on systematically corrupted labels maintain high test accuracy while the underlying reward signal degrades
  • Identifies a 'zero-pressure misattribution' vulnerability in LLMs, where models reverse their reasoning simply because a user asserts 'So you preferred X,' even when they had actually preferred Y
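The swap paradigm described above can be sketched as a minimal trial harness. Everything here is illustrative, not the paper's actual implementation: the `judge` interface, the `claimed_choice` parameter, and the string-matching detection probe are all hypothetical stand-ins.

```python
class DummyJudge:
    """Stand-in judge that always prefers response A and never objects;
    a real harness would wrap an LLM or a human annotator here."""

    def prefer(self, prompt, resp_a, resp_b):
        return "A"

    def justify(self, prompt, resp_a, resp_b, claimed_choice):
        # A choice-blind judge confabulates support for whatever
        # it is told it chose, as in the 'Morning Star' example.
        return f"I chose {claimed_choice} because it is more accurate."


def choice_blindness_trial(judge, prompt, resp_a, resp_b):
    """Elicit a preference, then misattribute the REJECTED response
    as the judge's own choice and ask for a justification."""
    chosen = judge.prefer(prompt, resp_a, resp_b)        # 'A' or 'B'
    swapped = "B" if chosen == "A" else "A"
    reply = judge.justify(prompt, resp_a, resp_b, claimed_choice=swapped)
    # Hypothetical detection probe: did the judge object to the swap?
    detected = "did not choose" in reply.lower()
    return {"chosen": chosen, "claimed": swapped, "detected": detected}


result = choice_blindness_trial(
    DummyJudge(),
    prompt="Why is Lucifer called the Morning Star?",
    resp_a="It refers to Venus, visible before sunrise.",
    resp_b="It refers to Mars, visible at midnight.",  # factually wrong
)
print(result)  # the dummy judge accepts the swap without objecting
```

A blindness rate is then just the fraction of trials where `detected` is false; the paper's finding is that this fraction is high for both humans and (context permitting) LLM judges.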
Evaluation Highlights
  • 91.0% of surreptitiously swapped preference trials went undetected by human annotators (N=50)
  • Removing prior reasoning from context caused LLM judge blindness to surge from <2% to over 50% for models like DeepSeek-R1
  • Reward models retained >61% pairwise accuracy even at 30% label corruption, despite the true reward signal effectively halving (ED50 ≈ 16-33% corruption)
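The accuracy/margin dissociation in these highlights can be reproduced in a toy simulation (the linear reward model, Bradley-Terry labels, and all constants below are illustrative assumptions, not the paper's setup): flipping 30% of preference labels barely moves pairwise accuracy, because the learned reward direction is preserved, but it visibly shrinks the learned reward margin.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20_000
w_true = rng.normal(size=d)
w_true *= 2.0 / np.linalg.norm(w_true)           # fix the true reward scale

def make_pairs(n):
    """Sample pair differences and Bradley-Terry preference labels."""
    delta = rng.normal(size=(n, d))               # features of (a - b)
    p_a = 1.0 / (1.0 + np.exp(-delta @ w_true))   # P(a preferred over b)
    return delta, (rng.random(n) < p_a).astype(float)

def fit_reward(delta, y, steps=500, lr=0.5):
    """Fit a linear reward model by logistic regression on differences."""
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-delta @ w))
        w -= lr * delta.T @ (p - y) / len(y)      # logistic-loss gradient
    return w

delta_tr, y_tr = make_pairs(n)
delta_te, y_te = make_pairs(n)

flip = rng.random(n) < 0.30                       # corrupt 30% of labels
y_bad = np.where(flip, 1.0 - y_tr, y_tr)

w_clean = fit_reward(delta_tr, y_tr)
w_bad = fit_reward(delta_tr, y_bad)

def accuracy(w):
    return np.mean((delta_te @ w > 0) == (y_te > 0.5))

print(f"clean:   acc={accuracy(w_clean):.3f}  margin={np.linalg.norm(w_clean):.2f}")
print(f"corrupt: acc={accuracy(w_bad):.3f}  margin={np.linalg.norm(w_bad):.2f}")
```

Pairwise accuracy only checks the sign of the predicted reward difference, which symmetric label noise leaves intact; the margin (here proxied by the weight norm) is what collapses, which is the metric/signal dissociation the paper highlights.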
Breakthrough Assessment
9/10
Fundamentally challenges the stability assumption of RLHF data. The dissociation between metric stability (accuracy) and signal degradation (reward margin) is a critical insight for alignment safety.