WARM: On the Benefits of Weight Averaged Reward Models

Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
Google DeepMind
International Conference on Machine Learning (2024)

📝 Paper Summary

Reward Modeling · Reinforcement Learning from Human Feedback (RLHF) · Model Merging
WARM creates a single robust reward model by averaging the weights of multiple fine-tuned reward models, efficiently improving reliability and reducing reward hacking compared to standard ensembling.
Core Problem
Aligning LLMs via RLHF often suffers from reward hacking, where the policy exploits flaws in the proxy reward model to get high scores without meeting actual objectives.
Why it matters:
  • Reward hacking degrades output quality, producing artifacts such as verbosity or linguistic flaws that do not reflect true human preferences
  • Distribution shifts between training data and policy generations make single reward models unreliable during RL
  • Inconsistencies and noise in human preference labels (approx. 72.6% agreement) complicate the learning of robust reward signals
  • Standard ensembling (averaging predictions) is computationally expensive, requiring memory and inference for M distinct models
Concrete Example: A policy might learn to exploit a reward model by generating unnecessarily verbose outputs or specific formatting quirks (bullet points) that the reward model overvalues, achieving a high score while failing to provide a helpful summary.
Key Novelty
Weight Averaged Reward Models (WARM)
  • Fine-tune multiple reward models (RMs) from the same pre-trained initialization but with different hyperparameters or data orders
  • Linearly interpolate the weights of these diverse RMs into a single model (Weight Averaging), leveraging Linear Mode Connectivity
  • This process isolates invariant predictive mechanisms across runs, filtering out noise-specific features that lead to overfitting or memorization of corrupted labels
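The averaging step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: parameters are represented as plain floats rather than tensors, and the function names are hypothetical. The key assumption, as in WARM, is that all M reward models are fine-tuned from the same pre-trained initialization, so their weights lie in a linearly connected region and can be meaningfully interpolated.

```python
def average_weights(state_dicts):
    """Uniformly average parameters across M fine-tuned reward models.

    Assumes every state dict has identical keys (same architecture),
    which holds when all models start from a shared pre-trained
    initialization -- the precondition for linear mode connectivity.
    """
    m = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / m
            for key in state_dicts[0]}

# Example: averaging two (toy) reward models' parameters.
warm = average_weights([{"w": 1.0, "b": 0.0},
                        {"w": 3.0, "b": 2.0}])
```

Unlike prediction ensembling, the result is a single model, so inference cost and memory stay constant as M grows.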
Evaluation Highlights
  • A policy RL fine-tuned with WARM achieves a 79.4% win rate against a policy fine-tuned with the best single reward model
  • WARM (M=6) reaches a 92.5% win rate in Best-of-N sampling against the random selection baseline (SFT)
  • Under 25% label corruption, WARM significantly reduces memorization of noisy labels compared to prediction ensembling
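The Best-of-N evaluation mentioned above follows a standard recipe: sample N candidates from the policy and keep the one the reward model scores highest. A minimal sketch, where `generate` and `reward` are placeholders for a policy sampler and a (possibly weight-averaged) reward model:

```python
def best_of_n(prompt, generate, reward, n=4):
    """Best-of-N sampling: draw n candidates for a prompt and return
    the one preferred by the reward model.

    `generate(prompt)` and `reward(prompt, candidate)` are hypothetical
    stand-ins for the policy and reward model, respectively.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```

Because WARM collapses M reward models into one set of weights, scoring the N candidates costs the same as with a single reward model, whereas prediction ensembling would multiply that cost by M.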
Breakthrough Assessment
8/10
Offers a highly practical, efficiency-improving solution to a critical RLHF problem (reward hacking). Theoretical insights on invariance vs. memorization in weight averaging are valuable.