The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) Reward Modeling

Language models trained via RLHF achieve better downstream performance when guided by moderately accurate reward models rather than highly accurate ones, challenging the assumption that better reward classifiers yield better generators.

Core Problem

The prevailing assumption in RLHF is that higher reward model accuracy leads to better language model alignment, but this relationship is not monotonic.

Why it matters:

Blindly maximizing reward model accuracy may waste computational resources without improving the final language model
Highly accurate reward models often lead to overfitting or reward hacking, where the policy exploits the reward function rather than learning the intended behavior
Understanding this dynamic is crucial for optimizing RLHF pipelines for complex tasks like long-form question answering

Concrete Example: In a completeness task, a highly accurate reward model might assign low, conservative scores to most outputs, failing to provide the gradient signal needed for learning. A moderately accurate model, by providing more variable and aggressive rewards, encourages the generator to explore and eventually produce better text.
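
The intuition in this example can be illustrated with a toy calculation. The sketch below assumes a simple mean-baseline policy-gradient estimator, which is not necessarily the advantage estimator used in the paper, and the reward values are invented for illustration. When rewards are nearly constant across sampled outputs, the baseline-subtracted advantages shrink toward zero, and the learning signal shrinks with them.

```python
from statistics import mean, pvariance

def mean_baseline_advantages(rewards):
    """Subtract the batch-mean reward (a simple baseline) from each reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical rewards for the same four sampled answers (illustrative only).
conservative_rm = [0.05, 0.06, 0.05, 0.04]  # low, nearly constant scores
moderate_rm = [0.10, 0.90, 0.40, 0.70]      # more variable scores

for name, rewards in [("conservative", conservative_rm), ("moderate", moderate_rm)]:
    advantages = mean_baseline_advantages(rewards)
    # Mean absolute advantage is a rough proxy for how strongly samples are
    # pushed up or down; it is close to zero when rewards barely differ.
    signal = mean(abs(a) for a in advantages)
    print(name, "variance:", pvariance(rewards), "mean |advantage|:", signal)
```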

Key Novelty

The Accuracy Paradox

Demonstrates a non-monotonic, inverted-U relationship in which reward models of intermediate strength yield the best language model performance
Shows that highly accurate reward models can be too rigid or conservative, while moderate ones provide 'noisier' but more shaping-friendly feedback that aids exploration
Identifies that moderate reward models maintain better KL divergence stability, preventing the policy from collapsing into narrow, over-optimized regions

Architecture

3D surface plots showing the relationship between Reward Model Accuracy, RM Trained Steps, and final LM Performance for the Relevance task (T5-small).

Evaluation Highlights

In the Relevance task, T5-small trained with a moderate reward model (acc ~0.60) outperformed the same model trained with the most accurate reward model (acc ~0.70) by roughly 0.10 in final LM performance score (about 0.65 vs. 0.55)
In the Factuality task, T5-small achieved peak performance (~0.95 score) with reward models of ~0.72 accuracy, dropping significantly as reward model accuracy approached 0.77
In the Completeness task, moderate reward models led to final LM scores near 0.8, whereas the highest accuracy reward models suppressed performance to near 0.0

Breakthrough Assessment

8/10

Identifies a counter-intuitive phenomenon that challenges a standard RLHF practice (simply scaling up reward models). The findings hold across the T5 model sizes and QA-FEEDBACK tasks tested.

⚙️ Technical Details

Problem Definition

Setting: RLHF fine-tuning of T5 models using PPO, guided by binary classification reward models of varying strengths

Inputs: Long-form questions from the QA-FEEDBACK dataset

Outputs: Generated long-form answers aligned with specific attributes (relevance, factuality, completeness)

Pipeline Flow

Supervised Fine-Tuning (SFT) of T5 Policy
Reward Model Training (Longformer) at varying steps/accuracies
RLHF Training (PPO) using selected Reward Models
Evaluation via Independent High-Accuracy Oracle Reward Models

System Modules

Policy Model

Generate answers to questions

Model or implementation: T5-small, T5-base, T5-large

Reward Model (Training)

Provide scalar feedback to the policy during training

Model or implementation: Longformer-base-4096

Oracle Evaluator (Evaluation)

Assess final performance of the trained policy

Model or implementation: Independent high-accuracy Longformer classifiers

Modeling

Base Model: T5-small, T5-base, T5-large

Training Method: PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward while staying close to the reference policy.

Formally: Standard PPO objective with KL penalty term.
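
Written out, the KL-regularized objective and PPO's clipped surrogate commonly take the following form. This is the textbook formulation rather than a transcription from the paper. The symbols ($\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the reference policy, $r_\phi$ for the reward model, $\hat{A}_t$ for the advantage estimate, $\rho_t$ for the probability ratio) are notation introduced here, with $\beta$ corresponding to kl_coefficient and $\epsilon$ to ppo_clip_range.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[ r_\phi(x, y) \big]
  - \beta\, \mathbb{E}_{x \sim \mathcal{D}}
  \Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big] \\
L^{\mathrm{CLIP}}(\theta) &= \mathbb{E}_t \Big[ \min\big( \rho_t(\theta)\, \hat{A}_t,\;
  \mathrm{clip}\big(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t \big) \Big],
  \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\end{align}
```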

Adaptation: Full fine-tuning

Trainable Parameters: All parameters of the T5 policy model

Training Data:

QA-FEEDBACK dataset (derived from ASQA)
Train/Val/Test splits: 3,853 / 500 / 948

Key Hyperparameters:

learning_rate: 1e-5
ppo_clip_range: 0.2
kl_coefficient: 0.3
total_episodes: 80,000
max_input_length: 1024
max_generated_length: 200
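
For readers who want these settings in one place, they can be collected in a small configuration object. This is a minimal sketch: the class name `RLHFHyperparams` is hypothetical and not taken from the paper's repository or from any library, and only the values listed above are included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RLHFHyperparams:
    """Hypothetical container; field names mirror the hyperparameter list."""
    learning_rate: float = 1e-5
    ppo_clip_range: float = 0.2
    kl_coefficient: float = 0.3
    total_episodes: int = 80_000
    max_input_length: int = 1024
    max_generated_length: int = 200

config = RLHFHyperparams()
print(config)
```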

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RLHF: This paper intentionally selects suboptimal (intermediate) reward models for training, whereas standard approaches select the highest-accuracy checkpoint.
vs. Weak-to-Strong Generalization [not cited in paper]: Similar conceptual finding that weaker supervisors can elicit strong capabilities, but applied specifically to RLHF reward dynamics.
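
The contrast with standard checkpoint selection can be sketched as follows. The two entries use the approximate Relevance-task values for T5-small quoted in this summary; the `Checkpoint` type and the checkpoint names are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple

class Checkpoint(NamedTuple):
    name: str
    rm_accuracy: float  # reward model accuracy
    lm_score: float     # oracle score of the policy trained with this RM

# Approximate Relevance-task values (T5-small) quoted in this summary.
checkpoints = [
    Checkpoint("moderate_rm", rm_accuracy=0.60, lm_score=0.65),
    Checkpoint("most_accurate_rm", rm_accuracy=0.70, lm_score=0.55),
]

# Standard practice: keep the reward model with the highest accuracy.
by_accuracy = max(checkpoints, key=lambda c: c.rm_accuracy)
# Alternative suggested by the findings: keep the reward model whose
# trained policy scores best under the oracle evaluator.
by_lm_score = max(checkpoints, key=lambda c: c.lm_score)

print("selected by RM accuracy:", by_accuracy.name)
print("selected by downstream LM score:", by_lm_score.name)
```

Selecting by downstream score requires an RLHF run per candidate reward model, so it is far more expensive than reading off accuracy. The sketch only shows that the two criteria can disagree.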

Limitations

Experiments are limited to the QA-FEEDBACK dataset (long-form answers) and may not generalize to other tasks.
Uses T5 models only; behavior with larger LLMs like Llama-2/3 not verified.
Relies on Longformer-based reward models; different RM architectures not explored.

Reproducibility

Code: https://github.com/EIT-NLP/AccuracyParadox-RLHF

The QA-FEEDBACK dataset is public, and hyperparameters are detailed in Appendix D.

📊 Experiments & Results

Evaluation Setup

Train T5 models using PPO with reward models of varying accuracy levels, then evaluate final outputs using a separate, high-quality 'Oracle' reward model.

Benchmarks:

QA-FEEDBACK (Relevance) (Long-form QA Relevance)
QA-FEEDBACK (Factuality) (Long-form QA Factuality)
QA-FEEDBACK (Completeness) (Long-form QA Completeness)

Metrics:

LM Performance (Score from Oracle Reward Models)
KL Divergence
Reward Mean and Variance

Statistical methodology: Not explicitly reported in the paper
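
The KL Divergence metric listed above is commonly estimated in RLHF from the log-probabilities that the policy and the reference model assign to the sampled tokens. The sketch below shows that Monte Carlo estimate and one common way of folding it into the reward. Whether the paper applies the penalty per token or per sequence is not stated in this summary, so the per-sequence form, the function names, and the example log-probabilities are all assumptions.

```python
def sequence_kl_estimate(policy_logprobs, ref_logprobs):
    """Monte Carlo KL estimate for one sampled sequence.

    Sums log pi_theta(token) - log pi_ref(token) over the tokens that the
    policy actually sampled.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, kl_coefficient=0.3):
    """Reward model score minus a KL penalty scaled by kl_coefficient."""
    kl = sequence_kl_estimate(policy_logprobs, ref_logprobs)
    return rm_score - kl_coefficient * kl

# Illustrative per-token log-probabilities for a three-token answer.
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.7, -1.5, -0.3]

print("KL estimate:", sequence_kl_estimate(policy_lp, ref_lp))
print("penalized reward:", kl_penalized_reward(1.0, policy_lp, ref_lp))
```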

Key Results

Results across all three tasks (Relevance, Factuality, Completeness) consistently show that peak Language Model (LM) performance is achieved with Reward Models (RM) of moderate accuracy, rather than the highest accuracy.

Benchmark	Metric	Most Accurate RM (Baseline)	Moderate RM (This Paper)	Δ
QA-FEEDBACK (Relevance)	LM Performance Score	0.55	0.65	+0.10
QA-FEEDBACK (Factuality)	LM Performance Score	0.80	0.95	+0.15
QA-FEEDBACK (Completeness)	LM Performance Score	0.05	0.80	+0.75
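
The Δ column reports absolute differences between the two score columns (policy trained with the most accurate RM vs. policy trained with a moderate RM). As a quick arithmetic check, the sketch below recomputes those differences and also derives the relative gains, which are not reported in the table. The scores are the approximate values from the table, so the derived percentages are approximate as well.

```python
# (score with most accurate RM, score with moderate RM), from the table above.
results = {
    "Relevance": (0.55, 0.65),
    "Factuality": (0.80, 0.95),
    "Completeness": (0.05, 0.80),
}

def gains(baseline, moderate):
    """Return (absolute gain, relative gain in percent) of moderate over baseline."""
    absolute = round(moderate - baseline, 2)
    relative_pct = round(100 * (moderate - baseline) / baseline, 1)
    return absolute, relative_pct

for task, (baseline, moderate) in results.items():
    absolute, relative_pct = gains(baseline, moderate)
    print(f"{task}: +{absolute:.2f} absolute, +{relative_pct}% relative")
```

An absolute gain of 0.10 on a baseline of 0.55 is a relative gain of roughly 18%, so the two ways of quoting the improvement should not be confused.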

Experiment Figures

Comparison of reward distributions (raw reward, mean, variance) between the 'Most Accurate RM' and the 'Best-Performing RM' for the Relevance task.

KL Divergence trends during training for Relevance task.

Main Takeaways

Moderate reward models provide higher reward variance and mean scores compared to highly accurate ones, encouraging exploration.
Highly accurate reward models tend to be 'conservative,' often giving lower rewards that discourage the model from learning effectively (especially in completeness tasks).
Models trained with moderate reward models exhibit more stable KL divergence profiles, suggesting a balanced training process that avoids mode collapse or over-optimization.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Reward Modeling / Preference Modeling
KL Divergence

Key Terms

RLHF: Reinforcement Learning from Human Feedback—aligning language models with human intent by training them to maximize a learned reward function

PPO: Proximal Policy Optimization—an RL algorithm that updates a policy in stable steps by clipping the objective function to prevent large, destructive updates

Reward Gaming (also called reward hacking): When a model learns to exploit flaws in the reward function to get high scores without actually improving quality

KL divergence: A measure of how much the trained policy's output distribution deviates from a reference distribution (usually the SFT model); penalizing it keeps training stable

Longformer: A Transformer architecture optimized for long sequences using sparse attention mechanisms, used here as the reward model backbone

Exposure Bias: A discrepancy in text generation where models are trained on ground truth but generate based on their own previous predictions

T5: Text-to-Text Transfer Transformer—an encoder-decoder language model architecture
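
The clipping described in the PPO entry above can be sketched for a single sample as follows. This is the standard clipped surrogate term, not code from the paper's repository; the function name is hypothetical, and the default `clip_range` of 0.2 matches the ppo_clip_range listed under Key Hyperparameters.

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, clip_range=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the term is capped at 1.2 * A,
# so there is no extra incentive to move further from the old policy.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))

# Ratio 0.5 with a negative advantage: the term is held at 0.8 * A.
print(ppo_clipped_term(math.log(0.5), 0.0, advantage=-1.0))
```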