Evaluation Setup
Three-part study: (1) Human annotation on Prolific, (2) Multi-turn LLM judge evaluation, (3) Reward model training under label corruption.
Benchmarks:
- Anthropic HH-RLHF (Pairwise preference modeling)
Metrics:
- Non-detection rate (Human/LLM)
- Pairwise Accuracy (Reward Model)
- Mean Reward Margin (chosen - rejected)
- Best-of-N Gold Score
- Statistical methodology: 95% Wilson confidence intervals, Fisher's exact test, paired t-tests, nonlinear least squares for fitting sigmoid decay curves
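To make the interval computation concrete, here is a minimal sketch of the Wilson score interval for a binomial proportion (the counts below are hypothetical illustrations, not the paper's data):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion.

    Unlike the normal approximation, the Wilson interval stays inside
    [0, 1] and behaves well for proportions near 0 or 1.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical example: 91 undetected swaps out of 100 trials.
lo, hi = wilson_ci(91, 100)
```

Note that the interval's center is pulled slightly toward 0.5 relative to the raw proportion, which is why Wilson intervals are preferred for extreme rates like a 91% non-detection rate.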
Key Results
Human annotation experiments reveal extremely high rates of choice blindness.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HH-RLHF Annotation | Non-detection rate | 0.0 | 91.0 | +91.0 |

LLM experiments show that detection relies on shallow context matching rather than deep self-monitoring.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Multi-turn Evaluation | Blindness Acceptance (DeepSeek-R1) | 1.5 | 51.7 | +50.2 |
| Multi-turn Evaluation | Acceptance Rate | Not reported | 91.4 | Not reported |

Reward model experiments demonstrate that standard accuracy metrics fail to capture signal degradation from corruption.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HH-RLHF (DeBERTa) | Pairwise Accuracy | Not reported | 61.0 | Not reported |
| HH-RLHF (DeBERTa) | ED50 (Reward Margin) | 0.0 | 16.3 | +16.3 |
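ED50 here denotes the corruption rate at which the mean reward margin falls to half its clean value. The paper estimates it by fitting a sigmoid decay with nonlinear least squares; the sketch below instead recovers the same quantity by simple linear interpolation over measurements, which makes the definition transparent (all numbers are synthetic, not the paper's):

```python
# Hypothetical mean reward margins measured at increasing corruption rates (%).
rates = [0, 5, 10, 15, 20, 25, 30]
margins = [1.00, 0.95, 0.80, 0.55, 0.35, 0.20, 0.10]  # synthetic, sigmoid-like

def ed50(rates, margins):
    """Corruption rate at which the margin falls to half its clean value,
    found by linear interpolation between the two bracketing measurements."""
    half = margins[0] / 2
    for (r0, m0), (r1, m1) in zip(zip(rates, margins),
                                  zip(rates[1:], margins[1:])):
        if m0 >= half >= m1:  # half-value crossed between these two points
            return r0 + (m0 - half) / (m0 - m1) * (r1 - r0)
    return None  # margin never dropped to half within the measured range

estimate = ed50(rates, margins)
```

A sigmoid fit uses all points and is more robust to noise; the interpolation above only uses the two points bracketing the half-value crossing.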
Main Takeaways
- Preference Construction Problem: Labels are shaped by the elicitation context; they are not stable internal states retrieved by annotators.
- Detection Gap: A reward model can be trained on up to 30% corrupted data without showing significant drops in standard pairwise accuracy, despite the reward signal (margin) degrading linearly.
- Targeted Corruption: Corrupting 'hard' (low margin) pairs is far more damaging than random corruption, destroying the signal while barely affecting accuracy.
- LLM 'Self-Monitoring' is an Illusion: Most models detect swaps by matching text in their context window; removing that context reveals they cannot genuinely recall or defend their original preferences.
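The detection gap can be seen in a toy model (my construction, not the paper's experiment): under symmetric label corruption at rate c, the expected chosen-minus-rejected margin learned on a pair shrinks by a factor of (1 − 2c), yet its sign, and therefore pairwise accuracy, is unchanged for any c below 0.5:

```python
# Toy model: symmetric label corruption shrinks the expected reward margin
# linearly in c, while leaving the preference ordering (and thus the
# pairwise accuracy metric) intact until c = 0.5.
clean_margin = 2.0  # hypothetical clean chosen-minus-rejected margin
for c in [0.0, 0.1, 0.2, 0.3]:
    m = clean_margin * (1 - 2 * c)  # expected margin under corruption rate c
    print(f"corruption={c:.0%}  expected margin={m:.2f}  ranks correctly={m > 0}")
```

This is why accuracy alone is a poor health check for a reward model: the ordering survives long after the margin, which drives downstream optimization pressure, has degraded.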