
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang
University of Illinois Urbana-Champaign, University of Wisconsin–Madison
arXiv (2024)
RL Benchmark

📝 Paper Summary

Interpretable Reward Modeling for RLHF in LLM Alignment
ArmoRM separates reward modeling into interpretable multi-objective regression followed by a context-aware Mixture-of-Experts gating layer that dynamically weights these objectives to produce a final preference score.
Core Problem
Standard reward models output a single opaque scalar score, so humans cannot see which factors drove a preference decision; this lack of interpretability makes them prone to reward hacking (e.g., verbosity bias).
Why it matters:
  • Black-box reward models obscure why an LLM response is preferred, making it hard to diagnose alignment failures like safety violations or hallucination
  • Reward hacking often occurs when models exploit specific biases (like length) that the reward model over-weights, leading to degraded generation quality
  • Existing multi-objective approaches typically use rigid linear combinations, failing to adapt to different contexts (e.g., safety matters more for bomb-making prompts than for math problems)
Concrete Example: A standard reward model might rate a long, incorrect answer higher than a short, correct one due to verbosity bias. Without interpretability, developers cannot see that the model assigned 60% weight to length and 40% to helpfulness. ArmoRM explicitly exposes these weights.
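The diagnosis in the example above can be made concrete with a toy sketch (not the actual ArmoRM implementation): when the final score is an explicit weighted sum of named objective scores, the per-objective contributions are visible, so an over-weighted "length" objective can be spotted directly. All numbers and names here are hypothetical, echoing the 60%/40% example.

```python
def score_interpretable(objective_scores, weights):
    """Combine per-objective scores under exposed weights.

    Returns the final score plus each objective's contribution,
    so a developer can see exactly which factor drove the decision.
    """
    contributions = {k: weights[k] * objective_scores[k] for k in weights}
    return sum(contributions.values()), contributions

# Hypothetical ratings: a long but incorrect answer vs. a short correct one.
long_wrong = {"length": 0.9, "helpfulness": 0.2}
short_right = {"length": 0.3, "helpfulness": 0.9}

# A biased weighting (60% length, 40% helpfulness), as in the example above.
biased_weights = {"length": 0.6, "helpfulness": 0.4}

s_lw, c_lw = score_interpretable(long_wrong, biased_weights)
s_sr, c_sr = score_interpretable(short_right, biased_weights)
# The long, wrong answer wins (0.62 vs 0.54) -- but unlike a black-box
# scalar, the contribution breakdown in c_lw/c_sr exposes why.
```

With a single opaque scalar, the only observable is that the wrong answer scored higher; with exposed contributions, the 0.6 weight on length is immediately visible and can be corrected.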
Key Novelty
Absolute-Rating Multi-Objective Reward Model (ArmoRM) with MoE Gating
  • Decomposes the reward signal into specific semantic objectives (e.g., helpfulness, safety, honesty) learned from absolute ratings rather than just binary preferences
  • Uses a learnable Mixture-of-Experts gating network that looks at the prompt and decides how much weight to give each objective (e.g., upweighting 'safety' for dangerous prompts)
  • Applies a penalty adjustment to decouple verbosity from other objectives, explicitly reducing the reward model's bias toward longer responses
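The two-stage design described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's implementation: the random matrices stand in for trained parameters, the embeddings are placeholders for features from a language-model backbone, and names like `armorm_style_score` are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 3  # embedding size; number of objectives (e.g., helpfulness, safety, honesty)

# Stage 1: a regression head maps a (prompt, response) embedding to K
# absolute objective ratings. Random weights stand in for trained ones.
W_reg = rng.normal(size=(K, D))

# Stage 2: a gating network looks at the prompt embedding and outputs a
# softmax distribution over objectives, so the mixing adapts to context.
W_gate = rng.normal(size=(K, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def armorm_style_score(prompt_emb, pair_emb):
    objective_scores = W_reg @ pair_emb          # K interpretable ratings
    gate_weights = softmax(W_gate @ prompt_emb)  # context-dependent weights
    final = float(gate_weights @ objective_scores)
    return final, objective_scores, gate_weights

prompt_emb = rng.normal(size=D)   # placeholder prompt features
pair_emb = rng.normal(size=D)     # placeholder (prompt, response) features

score, obj_scores, weights = armorm_style_score(prompt_emb, pair_emb)
# weights sums to 1, so the final score is an interpretable convex
# combination of the K objective ratings -- e.g., a dangerous prompt
# would push mass onto the 'safety' objective.
```

Because the gate's output is a probability distribution over named objectives, inspecting `weights` directly answers "which factors drove this score" for any given prompt, which is the interpretability property the section describes.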
Evaluation Highlights
  • Achieves state-of-the-art performance on RewardBench with 8B parameters, outperforming the much larger Nemotron-4 340B reward model
  • Surpasses LLM-as-a-judge (GPT-4) on RewardBench by a considerable margin
  • Significantly outperforms the Llama-3 8B Bradley-Terry baseline (from which it was initialized) across Chat, Safety, and Reasoning categories
Breakthrough Assessment
9/10
Provides a highly effective, interpretable alternative to black-box reward models. It beats GPT-4-as-a-judge and a 340B-parameter reward model using only 8B parameters, representing a significant efficiency and performance jump in RLHF.