
Generative Reward Models

Dakota Mahan, Duy Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Franken, Chelsea Finn, Alon Albalak
SynthLabs, Stanford University
arXiv.org (2024)
RL Reasoning Benchmark

📝 Paper Summary

Reward Modeling RLHF / RLAIF
GenRM trains a reward model to generate a reasoning trace before rendering a verdict, using an iterative self-teaching loop, and achieves better out-of-distribution generalization than standard discriminative classifiers.
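To make the formulation concrete, here is a minimal sketch of generative reward modeling at inference time: the judge emits a chain-of-thought rationale and then a final verdict token, which is parsed into a preference. The prompt template, the `Verdict: A/B` format, and the `generate` stub (a stand-in for a real LLM call) are illustrative assumptions, not the paper's exact implementation.

```python
import re

# Hypothetical judge prompt; the verdict-token format is an assumption for this sketch.
JUDGE_TEMPLATE = (
    "Compare the two responses to the prompt.\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}\n"
    "Reason step by step, then end with 'Verdict: A' or 'Verdict: B'."
)

def generate(judge_prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned rationale for illustration.
    return ("Response A ignores the length constraint, while Response B "
            "follows the instruction.\nVerdict: B")

def genrm_preference(prompt: str, a: str, b: str) -> str:
    """Return 'A' or 'B' by parsing the verdict token emitted after the rationale."""
    output = generate(JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b))
    match = re.search(r"Verdict:\s*([AB])", output)
    if match is None:
        raise ValueError("no verdict token in judge output")
    return match.group(1)
```

Unlike a discriminative reward model, nothing here produces a scalar score; the preference is read off the generated text itself.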
Core Problem
Standard reward models (discriminative classifiers) perform well on in-distribution data but generalize poorly to new distributions, while LLM-as-a-judge approaches are more robust but are not aligned with specific human preferences.
Why it matters:
  • RLHF relies on accurate reward models to guide policy optimization; if the reward model fails on out-of-distribution data, the resulting LLM may be misaligned
  • Collecting human preference data is resource-intensive, making high-quality synthetic preferences (RLAIF) crucial for scaling
  • Current hybrid methods struggle to combine the in-distribution accuracy of trained reward models with the reasoning capabilities of large language models
Concrete Example: In a case study (Figure 3), a standard LLM judge incorrectly prefers a detailed response about '2 animals' that ignores the length constraint. The proposed STaR-DPO model generates a rationale explicitly noting the failure to follow instructions ('lacks depth... does not follow instruction') and correctly prefers the compliant response.
Key Novelty
Generative Reward Models (GenRM) with Self-Taught Reasoning
  • Reformulates reward modeling as a generative task where the model produces a Chain-of-Thought rationale followed by a preference token, rather than outputting a scalar score
  • Uses a STaR (Self-Taught Reasoner) loop to bootstrap training data: the model generates its own rationales, filters for those leading to correct ground-truth labels, and trains on them
  • Applies DPO (Direct Preference Optimization) to the reasoning traces themselves, optimizing the model to prefer rationales that result in correct judgments over those that do not
Evaluation Highlights
  • STaR-DPO achieves 91.0% accuracy on RewardBench Safety, significantly outperforming the best baseline PairRM (81.8%)
  • On RewardBench Reasoning tasks, STaR-DPO scores 87.2%, surpassing the standard generative reward model (GenRM) which scores only 70.8%
  • Maintains in-distribution performance parity with Bradley-Terry models (73.9% vs ~74%) while outperforming them on out-of-distribution tasks (81.9% vs <60% for BT)
Breakthrough Assessment
8/10
Significantly improves reward model robustness and OOD generalization by applying reasoning-based self-training (STaR) and preference optimization (DPO) to the evaluation process itself.