
Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

Elle
University of Oxford, Department of Computer Science
arXiv (2025)
RL P13N Benchmark

📝 Paper Summary

AI Safety and Alignment · Reward Modeling · Sociodemographic Bias
The paper formalizes a framework to quantify sociodemographic biases in Reward Models, finding they consistently favor specific demographics and cannot be reliably steered via in-context prompting.
Core Problem
Reward Models (RMs) act as proxies for human values in alignment training, but their inherent sociodemographic biases and opinion distributions remain opaque and unmeasured.
Why it matters:
  • RMs define the optimization landscape for Language Models (LMs); if the RM is biased, the resulting LM will inevitably propagate those social biases
  • Current evaluations focus on LMs (which suffer from refusals and instability), neglecting the upstream RMs that actually drive preference learning
  • Blindly relying on RMs for safety and alignment risks reinforcing stereotypes if the models implicitly favor harmful or non-representative perspectives
Concrete Example: When evaluated on the OpinionQA dataset, RMs consistently assigned higher rewards to opinions held by people from the American South with lower formal education, rather than representing a balanced global perspective. Furthermore, BeaverRM consistently preferred 'Stereotyped' choices in the BBQ dataset, while Pythia1B preferred 'Unknown' options.
Key Novelty
Reward Model Perspectives (RMPs) Framework
  • Audits alignment by treating the Reward Model (RM) as a classifier: feeding it multiple-choice survey questions and interpreting the output rewards as an opinion probability distribution
  • Decouples alignment measurement from text generation, bypassing issues like model refusals or formatting errors common in Language Model evaluations
  • Introduces metrics to compare RM opinion distributions against specific human demographic groups using distance functions like Jensen-Shannon and Wasserstein
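The measurement step above can be sketched in a few lines: turn the RM's per-option rewards into a probability distribution (e.g., via a softmax) and compare it to a human group's response distribution with the paper's distance functions. This is a minimal illustration, not the authors' code; the specific rewards, the group distribution, and the softmax choice are all assumptions for demonstration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def rm_opinion_distribution(option_rewards):
    """Convert per-option scalar rewards into a probability distribution
    via a numerically stable softmax (one illustrative choice of mapping)."""
    r = np.asarray(option_rewards, dtype=float)
    e = np.exp(r - r.max())
    return e / e.sum()

# Hypothetical rewards the RM assigns to four survey answer options
rm_dist = rm_opinion_distribution([2.1, 0.3, -0.5, 1.0])

# Hypothetical response distribution for one demographic group
group_dist = np.array([0.40, 0.25, 0.10, 0.25])

# Jensen-Shannon distance (bounded in [0, 1] with base-2 logs)
js = jensenshannon(rm_dist, group_dist, base=2)

# Wasserstein distance over option indices (suited to ordered Likert scales)
positions = np.arange(len(group_dist))
wd = wasserstein_distance(positions, positions,
                          u_weights=rm_dist, v_weights=group_dist)
```

Because both metrics operate on distributions rather than generated text, this comparison sidesteps refusals and formatting errors entirely.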
Evaluation Highlights
  • Steering via prompting (Bio, Portray, QA) failed to meaningfully shift RM opinions, with negligible effect sizes (e.g., 0.086 for Bio steering) compared to un-steered baselines
  • High consistency in relative bias: RMs showed a Spearman's rank correlation of 0.67 across demographic preferences, indicating they favor the same groups regardless of architecture
  • Absolute alignment varies by model capability: Pythia7BRM achieved 0.930 alignment with overall human respondents on PRISM, whereas BeaverRM reached only 0.732
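The rank-correlation finding above can be illustrated concretely: rank each demographic group by how strongly two RMs prefer it, then correlate the rankings. A sketch with made-up alignment scores (the group scores below are illustrative assumptions, not the paper's data):

```python
from scipy.stats import spearmanr

# Hypothetical alignment scores two different RMs assign to the same
# five demographic groups (higher = more aligned with that group's opinions)
rm_a_scores = [0.93, 0.81, 0.75, 0.60, 0.55]
rm_b_scores = [0.88, 0.70, 0.79, 0.66, 0.50]

# Spearman's rho correlates the *rankings* of groups, so a high value
# means the two RMs favor the same groups even if absolute scores differ
rho, pvalue = spearmanr(rm_a_scores, rm_b_scores)
```

A high rho across model pairs, as the paper reports (0.67), indicates the relative bias pattern is shared across architectures even while absolute alignment varies.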
Breakthrough Assessment
8/10
Significant methodological contribution by shifting scrutiny from LMs to RMs. The finding that RMs are essentially 'unsteerable' via prompting is a critical negative result for safety alignment.