Generalist Reward Models: Found Inside Large Language Models

Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
arXiv (2025)
RL Pretraining

📝 Paper Summary

LLM Alignment · Reward Modeling · Inverse Reinforcement Learning
The paper proves that a high-quality reward function is already implicitly encoded within any LLM trained via next-token prediction and can be extracted to self-improve the model without external preference data.
Core Problem
Aligning LLMs typically relies on training separate reward models (RMs) from expensive human preference data or heuristic AI feedback, approaches that lack theoretical grounding and scale poorly.
Why it matters:
  • Building massive, high-quality human preference datasets is slow, expensive, and difficult to scale
  • Current AI feedback methods (LLM-as-a-judge) are often heuristic and inherit biases from the judge model
  • Existing methods lack a rigorous theoretical foundation connecting the base model's pre-training objective to alignment goals
Concrete Example: In standard RLHF, to align a model like Llama-3, developers must first collect thousands of human rankings (A > B) to train a separate reward model. This paper argues this external step is redundant because the Llama-3 base model itself already contains the necessary reward signal in its logits.
Key Novelty
Endogenous Reward Extraction via Inverse Soft Bellman Operator
  • Demonstrates that standard next-token prediction (pre-training/SFT) is theoretically equivalent to a specific form of offline Inverse Reinforcement Learning (IRL)
  • Derives a closed-form solution to extract an 'endogenous reward' directly from the language model's logits (interpreted as soft Q-values) without training a separate reward model
  • Proves that fine-tuning the model using this extracted reward reduces the error bound from quadratic O(H^2) (imitation learning) to linear O(H) (reinforcement learning), where H is the generation horizon
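The closed-form extraction above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the model's logits play the role of soft Q-values, applies the inverse soft Bellman operator r(s_t, a_t) = Q(s_t, a_t) - γ·V(s_{t+1}) with soft value V(s) = log Σ_a exp Q(s, a), and takes the terminal state's value to be zero. The function name and array shapes are hypothetical.

```python
import numpy as np

def endogenous_reward(logits, token_ids, gamma=1.0):
    """Extract a per-token reward from language-model logits (sketch).

    logits:    (T, V) array of per-step logits over a vocabulary of size V,
               interpreted here as soft Q-values Q(s_t, ·).
    token_ids: (T,) array of the tokens actually generated (the actions a_t).
    Returns a (T,) array of rewards via the inverse soft Bellman operator:
        r(s_t, a_t) = Q(s_t, a_t) - gamma * V(s_{t+1}),
    where V(s) = logsumexp over the vocabulary of Q(s, ·).
    """
    T, V = logits.shape
    # Soft value of each state: numerically stable logsumexp over vocab axis.
    soft_values = np.logaddexp.reduce(logits, axis=1)
    # Q-value of each chosen token.
    q_chosen = logits[np.arange(T), token_ids]
    # Soft value of the successor state; terminal value set to 0 (assumption).
    next_values = np.append(soft_values[1:], 0.0)
    return q_chosen - gamma * next_values
```

Note that log π(a|s) = Q(s, a) - V(s) under the softmax policy, which is what ties next-token prediction to this inverse-RL reading: the trained logits already determine the reward up to the usual shaping terms, so no separate reward model needs to be fit.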
Evaluation Highlights
  • The Endogenous Reward method outperforms standard LLM-as-a-judge approaches (like Prometheus) on alignment benchmarks.
  • Reinforcement learning using the endogenous reward surpasses explicit reward models trained on human-labeled data in specific settings.
  • Theoretical proof establishes that RL with endogenous rewards achieves a linear error bound O(H) compared to the quadratic O(H^2) of the base SFT model.
Breakthrough Assessment
9/10
Offers a fundamental theoretical shift by proving reward models are latent in base LLMs, potentially eliminating the need for separate reward modeling stages and expensive preference data.