
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang
University of Illinois Urbana-Champaign, University of Wisconsin–Madison
arXiv (2024)
RL Benchmark

📝 Paper Summary

Interpretable Reward Modeling for RLHF in LLM Alignment
ArmoRM separates reward modeling into interpretable multi-objective regression followed by a context-aware Mixture-of-Experts gating layer that dynamically weights these objectives to produce a final preference score.
Core Problem
Standard reward models output a single opaque scalar score, so humans cannot see which factors drove a preference decision; this lack of interpretability makes them prone to reward hacking (e.g., verbosity bias).
Why it matters:
  • Black-box reward models obscure why an LLM response is preferred, making it hard to diagnose alignment failures like safety violations or hallucination
  • Reward hacking often occurs when models exploit specific biases (like length) that the reward model over-weights, leading to degraded generation quality
  • Existing multi-objective approaches typically use rigid linear combinations, failing to adapt to different contexts (e.g., safety matters more for bomb-making prompts than for math problems)
Concrete Example: A standard reward model might rate a long, incorrect answer higher than a short, correct one due to verbosity bias. Without interpretability, developers cannot see that the model assigned 60% weight to length and 40% to helpfulness. ArmoRM explicitly exposes these weights.
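The diagnosis in the example above can be made concrete with a toy sketch (not the actual ArmoRM implementation): when the final score is an explicit weighted sum of named objective scores, the per-objective contributions are visible, so an over-weighted "length" objective can be spotted directly. All numbers and names here are hypothetical, echoing the 60%/40% example.

```python
def score_interpretable(objective_scores, weights):
    """Combine per-objective scores under exposed weights.

    Returns the final score plus each objective's contribution,
    so a developer can see exactly which factor drove the decision.
    """
    contributions = {k: weights[k] * objective_scores[k] for k in weights}
    return sum(contributions.values()), contributions

# Hypothetical ratings: a long but incorrect answer vs. a short correct one.
long_wrong = {"length": 0.9, "helpfulness": 0.2}
short_right = {"length": 0.3, "helpfulness": 0.9}

# A biased weighting (60% length, 40% helpfulness), as in the example above.
biased_weights = {"length": 0.6, "helpfulness": 0.4}

s_lw, c_lw = score_interpretable(long_wrong, biased_weights)
s_sr, c_sr = score_interpretable(short_right, biased_weights)
# The long, wrong answer wins (0.62 vs 0.54) -- but unlike a black-box
# scalar, the contribution breakdown in c_lw/c_sr exposes why.
```

With a single opaque scalar, the only observable is that the wrong answer scored higher; with exposed contributions, the 0.6 weight on length is immediately visible and can be corrected.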
Key Novelty
Absolute-Rating Multi-Objective Reward Model (ArmoRM) with MoE Gating
  • Decomposes the reward signal into specific semantic objectives (e.g., helpfulness, safety, honesty) learned from absolute ratings rather than just binary preferences
  • Uses a learnable Mixture-of-Experts gating network that looks at the prompt and decides how much weight to give each objective (e.g., upweighting 'safety' for dangerous prompts)
  • Applies a penalty adjustment to decouple verbosity from other objectives, explicitly reducing the reward model's bias toward longer responses
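The two-stage design described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's implementation: the random matrices stand in for trained parameters, the embeddings are placeholders for features from a language-model backbone, and names like `armorm_style_score` are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 3  # embedding size; number of objectives (e.g., helpfulness, safety, honesty)

# Stage 1: a regression head maps a (prompt, response) embedding to K
# absolute objective ratings. Random weights stand in for trained ones.
W_reg = rng.normal(size=(K, D))

# Stage 2: a gating network looks at the prompt embedding and outputs a
# softmax distribution over objectives, so the mixing adapts to context.
W_gate = rng.normal(size=(K, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def armorm_style_score(prompt_emb, pair_emb):
    objective_scores = W_reg @ pair_emb          # K interpretable ratings
    gate_weights = softmax(W_gate @ prompt_emb)  # context-dependent weights
    final = float(gate_weights @ objective_scores)
    return final, objective_scores, gate_weights

prompt_emb = rng.normal(size=D)   # placeholder prompt features
pair_emb = rng.normal(size=D)     # placeholder (prompt, response) features

score, obj_scores, weights = armorm_style_score(prompt_emb, pair_emb)
# weights sums to 1, so the final score is an interpretable convex
# combination of the K objective ratings -- e.g., a dangerous prompt
# would push mass onto the 'safety' objective.
```

Because the gate's output is a probability distribution over named objectives, inspecting `weights` directly answers "which factors drove this score" for any given prompt, which is the interpretability property the section describes.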
Evaluation Highlights
  • Achieves state-of-the-art performance on RewardBench with 8B parameters, outperforming the much larger Nemotron-4 340B reward model
  • Surpasses LLM-as-a-judge (GPT-4) on RewardBench by a considerable margin
  • Significantly outperforms the Llama-3 8B Bradley-Terry baseline (from which it was initialized) across Chat, Safety, and Reasoning categories
Breakthrough Assessment
9/10
Provides a highly effective, interpretable alternative to black-box reward models. It beats GPT-4-as-a-judge and a 340B-parameter reward model using only 8B parameters, representing a significant efficiency and performance jump in RLHF.