
Reward Models are Metrics in a Trench Coat

Sebastian Gehrmann
arXiv (2025)
RL · Benchmark · Factuality

📝 Paper Summary

Reward Modeling · Evaluation Metrics · RLHF (Reinforcement Learning from Human Feedback)
Reward models and evaluation metrics are functionally the same tool separated by arbitrary research boundaries; integrating the two lines of work would address shared failure modes such as reward hacking and poor data quality.
Core Problem
Research on Reward Models (used for training) and Evaluation Metrics (used for testing) operates in isolation despite identical goals, leading to redundant terminology and missed opportunities for cross-pollination.
Why it matters:
  • Reinforcement learning models suffer from reward hacking (optimizing spurious correlations), a problem already studied in metric evaluation as 'gaming the metric'
  • Current reward models often fail on difficult tasks (e.g., translation nuances) where established evaluation metrics already excel
  • Lack of collaboration leads to inconsistent definitions for concepts like 'hallucination' and 'attribution' across the two fields
Concrete Example: In the RewardBench-M benchmark, reward models perform perfectly on easy translation pairs but struggle on hard ones. An older, smaller metric (CometKiwi, 550M params) outperforms these specialized reward models on Chinese translation tasks, yet was previously ignored by the RM community.
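The reuse the example describes is mechanically simple: any scoring function that maps a (reference, candidate) pair to a scalar can be dropped in as a reward model for ranking outputs. A minimal sketch of that interchangeability, using a toy character-trigram metric as a stand-in for a learned metric like CometKiwi (the function names and the toy metric are illustrative, not from the paper):

```python
from collections import Counter

def char_ngram_f1(reference: str, candidate: str, n: int = 3) -> float:
    """Toy lexical metric: character-trigram F1. A stand-in for a
    learned metric (e.g., CometKiwi), which would be called instead."""
    def ngrams(s: str) -> Counter:
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref or not cand:
        return 0.0
    overlap = sum((ref & cand).values())
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r) if p + r else 0.0

def metric_as_reward(reference: str, candidates: list[str]) -> str:
    """Use the metric exactly like a reward model: score each
    candidate and return the preferred (highest-scoring) one."""
    return max(candidates, key=lambda c: char_ngram_f1(reference, c))

best = metric_as_reward(
    "the cat sat on the mat",
    ["a cat sat on a mat",
     "the dog ran away",
     "the cat sat on the mat quickly"],
)
```

The point is the interface, not the metric: swapping `char_ngram_f1` for a neural quality-estimation model changes nothing downstream, which is why a 2022 metric can compete with purpose-built reward models.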
Key Novelty
Unifying the fields of Reward Modeling and Evaluation Metrics
  • Empirically demonstrates the scientific disconnect via citation analysis (showing <10% overlap) and terminology mapping (e.g., 'segment-level meta-evaluation' matches 'DPO training signal')
  • Shows that 'traditional' evaluation metrics can serve as superior reward models in specific domains (such as translation)
  • Demonstrates that current Reward Model techniques (LLM-as-a-judge) lag behind dedicated metrics in areas like factuality/attribution
Evaluation Highlights
  • Inter-field citations between Reward Models and Evaluation Metrics account for fewer than 10% of total citations, quantifying the severe isolation
  • CometKiwi (a 2022 metric) outperforms state-of-the-art Reward Models on the Chinese subset of the RewardBench-M benchmark despite being significantly smaller
  • Dedicated factuality metrics outperform LLM-as-a-judge approaches (including GPT-5 and Gemini 2.5 Pro) on the SEAHORSE summarization attribution dataset
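The overlap in evaluation protocols behind these results is concrete: reward-model benchmarks (RewardBench-style accuracy on chosen/rejected pairs) and segment-level metric meta-evaluation both reduce to counting how often a scorer agrees with human preferences. A minimal sketch of that shared protocol (the scorer and data below are hypothetical toys, not the paper's):

```python
from typing import Callable, Sequence, Tuple

def preference_accuracy(
    scorer: Callable[[str], float],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Shared protocol of RM benchmarks and metric meta-evaluation:
    the fraction of pairs where the scorer assigns a strictly higher
    score to the human-preferred ('chosen') output."""
    wins = sum(scorer(chosen) > scorer(rejected) for chosen, rejected in pairs)
    return wins / len(pairs)

# Hypothetical preference pairs; response length is a (deliberately
# weak) toy scorer, showing how any scalar scorer plugs in.
pairs = [
    ("a detailed, correct answer", "ok"),
    ("thorough explanation", "wrong"),
    ("short", "a very long but fabricated response"),
]
acc = preference_accuracy(len, pairs)
```

Because both fields already compute this same statistic under different names, comparing a 2022 metric against a 2025 reward model on the same pairs is straightforward once the terminology barrier is removed.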
Breakthrough Assessment
8/10
A strong position paper that empirically validates a critical inefficiency in the field. It challenges the distinct identity of 'Reward Models' and provides concrete evidence that cross-pollination yields immediate SOTA improvements.