Evaluation Setup
Three-part study: (1) Human annotation on Prolific, (2) Multi-turn LLM judge evaluation, (3) Reward model training under label corruption.
Benchmarks:
- Anthropic HH-RLHF (Pairwise preference modeling)
Metrics:
- Non-detection rate (Human/LLM)
- Pairwise Accuracy (Reward Model)
- Mean Reward Margin (chosen - rejected)
- Best-of-N Gold Score
- Statistical methodology: 95% Wilson confidence intervals, Fisher's exact test, paired t-tests, nonlinear least squares for fitting sigmoid decay curves
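To make the interval computation concrete, here is a minimal sketch of the Wilson score interval for a binomial proportion (the counts below are hypothetical illustrations, not the paper's data):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion.

    Unlike the normal approximation, the Wilson interval stays inside
    [0, 1] and behaves well for proportions near 0 or 1.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical example: 91 undetected swaps out of 100 trials.
lo, hi = wilson_ci(91, 100)
```

Note that the interval's center is pulled slightly toward 0.5 relative to the raw proportion, which is why Wilson intervals are preferred for extreme rates like a 91% non-detection rate.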
Key Results
Human annotation experiments reveal extremely high rates of choice blindness.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HH-RLHF Annotation | Non-detection rate | 0.0 | 91.0 | +91.0 |

LLM experiments show that detection relies on shallow context matching rather than deep self-monitoring.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Multi-turn Evaluation | Blindness Acceptance (DeepSeek-R1) | 1.5 | 51.7 | +50.2 |
| Multi-turn Evaluation | Acceptance Rate | Not reported | 91.4 | Not reported |

Reward model experiments demonstrate that standard accuracy metrics fail to capture signal degradation from corruption.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HH-RLHF (DeBERTa) | Pairwise Accuracy | Not reported | 61.0 | Not reported |
| HH-RLHF (DeBERTa) | ED50 (Reward Margin) | 0.0 | 16.3 | +16.3 |
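ED50 here denotes the corruption rate at which the mean reward margin falls to half its clean value. The paper estimates it by fitting a sigmoid decay with nonlinear least squares; the sketch below instead recovers the same quantity by simple linear interpolation over measurements, which makes the definition transparent (all numbers are synthetic, not the paper's):

```python
# Hypothetical mean reward margins measured at increasing corruption rates (%).
rates = [0, 5, 10, 15, 20, 25, 30]
margins = [1.00, 0.95, 0.80, 0.55, 0.35, 0.20, 0.10]  # synthetic, sigmoid-like

def ed50(rates, margins):
    """Corruption rate at which the margin falls to half its clean value,
    found by linear interpolation between the two bracketing measurements."""
    half = margins[0] / 2
    for (r0, m0), (r1, m1) in zip(zip(rates, margins),
                                  zip(rates[1:], margins[1:])):
        if m0 >= half >= m1:  # half-value crossed between these two points
            return r0 + (m0 - half) / (m0 - m1) * (r1 - r0)
    return None  # margin never dropped to half within the measured range

estimate = ed50(rates, margins)
```

A sigmoid fit uses all points and is more robust to noise; the interpolation above only uses the two points bracketing the half-value crossing.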
Main Takeaways
- Preference Construction Problem: Labels are shaped by the elicitation context; they are not stable internal states retrieved by annotators.
- Detection Gap: A reward model can be trained on up to 30% corrupted data without showing significant drops in standard pairwise accuracy, despite the reward signal (margin) degrading linearly.
- Targeted Corruption: Corrupting 'hard' (low margin) pairs is far more damaging than random corruption, destroying the signal while barely affecting accuracy.
- LLM 'Self-Monitoring' is an Illusion: Most models detect swaps by matching text in their context window; removing that context reveals they cannot genuinely recall or defend their original preferences.
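The detection gap can be seen in a toy model (my construction, not the paper's experiment): under symmetric label corruption at rate c, the expected chosen-minus-rejected margin learned on a pair shrinks by a factor of (1 − 2c), yet its sign, and therefore pairwise accuracy, is unchanged for any c below 0.5:

```python
# Toy model: symmetric label corruption shrinks the expected reward margin
# linearly in c, while leaving the preference ordering (and thus the
# pairwise accuracy metric) intact until c = 0.5.
clean_margin = 2.0  # hypothetical clean chosen-minus-rejected margin
for c in [0.0, 0.1, 0.2, 0.3]:
    m = clean_margin * (1 - 2 * c)  # expected margin under corruption rate c
    print(f"corruption={c:.0%}  expected margin={m:.2f}  ranks correctly={m > 0}")
```

This is why accuracy alone is a poor health check for a reward model: the ordering survives long after the margin, which drives downstream optimization pressure, has degraded.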