Reward Hacking: When a model exploits a flaw in the reward function (e.g., inserting specific words that earn a high score) rather than learning the intended task
Goodhart's Law: The principle that when a measure becomes a target, it ceases to be a good measure
LLM-as-a-judge: Using a Large Language Model to evaluate the output of another model, often via prompting
DPO: Direct Preference Optimization—an algorithm that optimizes language models to match preferences without an explicit reward model
PPO: Proximal Policy Optimization—an RL algorithm commonly used to train language models using a reward model
Spurious Correlations: Patterns a model learns that correlate with labels but are not causally related to the task (e.g., length bias, where longer answers are scored as better)
Meta-evaluation: The process of evaluating the evaluators (metrics or reward models) by correlating their scores with human judgments
Exposure Bias: The mismatch between training (where models see ground truth) and testing (where models generate their own history)
CometKiwi: A reference-free learned evaluation metric for machine translation, built on the InfoXLM encoder
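The meta-evaluation entry above describes correlating a metric's scores with human judgments. As a minimal sketch (with made-up scores, not real evaluation data), the following computes Spearman's rank correlation between a metric and human ratings from scratch, assuming no tied values:

```python
def ranks(xs):
    """Return 1-based ranks of xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical data: a metric's scores vs. human ratings for five outputs
metric_scores = [0.2, 0.5, 0.7, 0.4, 0.9]
human_scores = [1, 2, 4, 3, 5]
print(spearman(metric_scores, human_scores))  # high (~0.9): metric tracks human judgments
```

In practice one would use `scipy.stats.spearmanr` (which also handles ties) over a large set of human-annotated outputs; a metric with low correlation here is a poor target for optimization, per Goodhart's Law above.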