Reward Model Ensembles Help Mitigate Overoptimization

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) Reward Modeling LLM Alignment

Using an ensemble of reward models with conservative optimization objectives (like worst-case or uncertainty-weighted scoring) prevents RLHF policies from exploiting flaws in proxy reward models.

Core Problem

In RLHF, policies are optimized against an imperfect proxy reward model; eventually, the policy learns to exploit errors in this proxy, increasing the proxy score while degrading true performance (overoptimization).

Why it matters:

Overoptimization causes language models to produce gibberish or harmful outputs that technically score high on the reward model but fail the user's intent.
Scaling up reward models to fix this is computationally expensive and requires significant pretraining, which isn't always feasible.
Existing methods rely on single reward models, which are fragile to label noise and inherent approximation errors.

Concrete Example: A policy might generate a nonsensical answer that happens to contain specific keywords the reward model overvalues. Optimizing against a single reward model would encourage this gibberish, whereas an ensemble would likely disagree on its quality, flagging it as uncertain.

Key Novelty

Ensemble-Based Conservative Optimization for RLHF

Instead of training one reward model, train multiple copies (ensemble) using different seeds to capture uncertainty.
During policy optimization, score responses using conservative aggregation: take the lowest score (Worst-Case) or penalize high variance (Uncertainty-Weighted) to discourage the policy from visiting areas where reward models disagree.

Architecture

The modified RLHF pipeline comparing standard approaches vs. the proposed ensemble approach.

Evaluation Highlights

Eliminates overoptimization for Best-of-N sampling, improving performance by up to ~75% in settings with 25% label noise.
Outperforms single reward model optimization in PPO, and when combined with a small KL penalty (0.01), successfully eliminates overoptimization without performance cost.
Gains are orthogonal to model scaling: ensembles improve performance even when added to larger reward models (up to 1.3B parameters).

Breakthrough Assessment

7/10

Provides a practical, effective solution to a major RLHF failure mode (overoptimization) using well-understood ensemble techniques. While conceptually simple, the systematic empirical validation makes it a strong contribution.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from Human Feedback (RLHF) where a policy is optimized against a learned proxy reward model trained on preference data.

Inputs: Prompt q

Outputs: Response a

Pipeline Flow

Prompt q → Policy Model → Response a
Response a → Reward Model Ensemble {R1...Rk} → {r1...rk}
Scores {r1...rk} → Aggregation (Mean/WCO/UWO) → Final Reward
Final Reward → PPO Update or BoN Selection

System Modules

Policy Model

Generates responses to prompts

Model or implementation: Pythia 1.4B

Reward Model Ensemble

Estimates the quality of the response

Model or implementation: Ensemble of 5 Pythia models (sizes 7M, 44M, or 1.3B) with scalar heads

Aggregator

Combines ensemble scores into a single reward signal for optimization

Model or implementation: Deterministic function (Mean, Min, or Mean - λ*Var)

Novel Architectural Elements

Integration of ensemble-based conservative aggregation (WCO/UWO) directly into the RLHF reward calculation loop

Modeling

Base Model: Pythia 1.4B (Policy), Pythia 14M/70M/1.4B (Reward Models)

Training Method: PPO and Best-of-N (BoN)

Objective Functions:

Purpose: Worst-Case Optimization (Conservative).

Formally: R_WCO(q, a) = min_{i=1..k} R_i(q, a)
Purpose: Uncertainty-Weighted Optimization (Conservative).

Formally: R_UWO(q, a) = mean(R(q, a)) - λ * Var(R(q, a))

Training Data:

AlpacaFarm dataset (52k instructions)
Preference data generated by sampling 2 responses from SFT model
Scored by Gold Reward Model (AlpacaFarm 7B)
Synthetic label noise added: 0% or 25% of labels flipped

Key Hyperparameters:

ensemble_size: 5
ppo_steps: 3000
bon_max_n: 12,500
+ 3 more
uwo_lambda: 0.5 (default)
label_noise: 25% or 0%
kl_penalty_coefficient: 0.01 (optimized PPO)

Compute: Generating N=12,500 BoN answers takes approx 700 A100 GPU hours

Comparison to Prior Work

vs. Standard RLHF: Uses ensembles (5 models) instead of 1, and conservative aggregation (min/variance-penalty) instead of raw score
vs. Gao et al. (2023): Extends setup to include 25% label noise (more realistic) and proposes ensembles as a mitigation strategy

Limitations

Relies on a synthetic 'Gold' reward model rather than real human evaluation.
Ensembles increase inference cost during training (k times more forward passes for reward calculation).
Experiments limited to relatively small models (up to 1.4B policy) compared to state-of-the-art LLMs.
Conservative optimization (WCO) can sometimes be too pessimistic, potentially reducing peak performance.

Reproducibility

Code: https://github.com/tlc4418/llm_optimization

Code is publicly available. Uses open-source Pythia models and AlpacaFarm dataset. Synthetic 'Gold' reward model setup ensures reproducibility without human annotators.

📊 Experiments & Results

Evaluation Setup

Synthetic RLHF setup using a 'Gold' reward model (AlpacaFarm 7B) as the ground truth evaluator.

Benchmarks:

AlpacaFarm (Instruction Following)

Metrics:

Gold Reward Model Score
Proxy Reward Model Score
KL Divergence from initial policy
Win-rate
Statistical methodology: PPO runs averaged over 3 seeds with standard deviation shown. Single RM baselines averaged over 5 individual RMs.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
BoN Sampling results showing impact of ensembles on overoptimization with label noise.
AlpacaFarm (BoN)	Gold Reward Score Improvement	See Note	See Note	+75%
PPO results comparing single models vs ensembles with KL penalties.
AlpacaFarm (PPO)	Overoptimization Mitigation	Regresses after ~20 KL	Stable monotonic increase	Eliminated
AlpacaFarm (PPO)	Gold Reward Score	Lower	Higher	Positive

Experiment Figures

BoN performance curves (Gold Score vs KL) for Noiseless and 25% Noise settings.

PPO learning curves (Gold/Proxy Score vs KL) with KL penalty = 0.01.

Intra-ensemble variance during PPO training.

Main Takeaways

Ensemble methods (WCO/UWO) effectively prevent the policy from exploiting specific weaknesses in individual reward models.
Label noise (25%) exacerbates overoptimization in single models and Mean ensembles, but UWO handles it robustly by penalizing high-variance predictions.
Benefits of ensembles stack with model scaling; even with larger 1.3B reward models, ensembles provide gains over single models.
PPO requires a small KL penalty combined with ensembles to fully eliminate overoptimization, whereas BoN achieves it with ensembles alone.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Best-of-N (BoN) / Rejection Sampling
Ensemble methods

Key Terms

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using a reward model trained on human preferences

Overoptimization: A phenomenon where optimizing a policy against an imperfect proxy reward model eventually leads to a regression in performance according to the true reward (also known as Goodhart's Law)

Proxy Reward Model: A neural network trained to approximate human preferences (or a 'gold' standard) based on limited data

Gold Reward Model: In this synthetic setup, a large, fixed model acting as the ground-truth 'human' evaluator

BoN: Best-of-N sampling—generating N responses and selecting the one with the highest reward score

PPO: Proximal Policy Optimization—an RL algorithm that updates the policy to maximize reward while limiting how much the policy changes at each step

WCO: Worst-Case Optimization—using the minimum score from an ensemble of reward models as the training signal

UWO: Uncertainty-Weighted Optimization—using the mean score minus a weighted variance term from an ensemble as the training signal

KL divergence: A measure of difference between two probability distributions, used here to measure how far the optimized policy has drifted from the initial model