DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

📝 Paper Summary

Reward Modeling, Reinforcement Learning from Human Feedback (RLHF), Model Merging

DogeRM improves reward model performance in specialized domains like math and coding by merging a general-purpose reward model with domain-specific supervised fine-tuned models, avoiding the need for costly domain-specific preference data.

Core Problem

Training reward models (RMs) for specific domains like math or coding typically requires expensive, expert-annotated paired preference data, which is scarce and costly to collect.

Why it matters:

Standard RMs trained on general preference data often lack the deep domain knowledge needed to accurately evaluate specialized outputs (e.g., complex code or math proofs)
Collecting domain-specific preference pairs (chosen vs. rejected) is significantly harder and more expensive than collecting standard supervised fine-tuning (SFT) data
Current methods rely on heavy fine-tuning, whereas high-quality domain SFT models are readily available and underutilized for reward signal improvement

Concrete Example: A general reward model might fail to tell a subtly buggy code solution apart from a correct one because it was not trained on enough coding examples. DogeRM merges this RM with a model fine-tuned specifically on code, injecting the expertise needed to identify the correct solution without any new preference pairs.

Key Novelty

Domain knowledge merged Reward Model (DogeRM)

Merges the weights of a general-purpose Reward Model (trained on standard preference data) with a Domain-Specific SFT Model (trained on math/code) initialized from the same base
Uses a disjoint merging strategy: separates the model into embedding, transformer, and head layers, applying weighted averaging to shared components while preserving the RM's regression head
Adjusts the interpolation weight (lambda) based on a small validation set to balance general alignment capability with specific domain expertise

Architecture

Illustration of the DogeRM framework. It shows the merging of a 'General Preference RM' and a 'Domain-Specific SFT Model' (e.g., Math/Code) into a single 'Domain knowledge merged RM'.

Evaluation Highlights

+17.0 percentage points of accuracy on the RewardBench Math subset (0.5847 to 0.7547) when merging the LLaMA-2 RM with MAmmoTH-7B
+11.4 percentage points of accuracy on the RewardBench Math subset (0.5847 to 0.6987) when merging with MetaMath-7B
+6.0% accuracy improvement on Auto-J Eval Code subset when merging with a custom Code Model

Breakthrough Assessment

7/10

A simple yet effective method that addresses a major bottleneck in RLHF (scarcity of domain preference data) by leveraging abundant SFT models. The gains are significant, though the technique itself (linear merging) is standard.

⚙️ Technical Details

Problem Definition

Setting: Reward Modeling for RLHF, specifically enhancing RMs for domain-specific tasks

Inputs: Input prompt x, chosen response y_c, rejected response y_r

Outputs: Scalar reward score r(x, y)

Pipeline Flow

Group: Merging Phase → Merged Model Inference
Input: General RM weights + Domain SFT weights
Step 1: Embedding Layer Merge (weighted average of the embeddings of tokens common to both models)
Step 2: Transformer Layer Merge (weighted average of all layers)
Step 3: Head Composition (Keep RM regression head)
Output: Merged DogeRM used for scoring
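
The three merging steps above can be sketched in a few lines of framework-free Python. This is a minimal illustration, not the authors' implementation: the dict layout and the key names ("embed", "layers", "head") are hypothetical, lambda is assumed to weight the domain SFT model, and keeping the RM embedding for tokens that only the RM knows is an assumption of this sketch. Tokens that exist only in the SFT vocabulary are simply ignored here.

```python
def interpolate(rm_vec, sft_vec, lam):
    """Element-wise (1 - lam) * rm + lam * sft for two equal-length vectors."""
    return [(1.0 - lam) * r + lam * s for r, s in zip(rm_vec, sft_vec)]


def merge_doge_rm(rm, sft, lam):
    """Merge a toy reward model with a toy SFT model.

    Each model is a dict with hypothetical keys:
      "embed":  {token: vector}  token embeddings
      "layers": [vector, ...]    one flat vector per transformer layer
      "head":   vector           regression head (RM) or LM head (SFT)
    """
    merged_embed = {}
    for token, rm_vec in rm["embed"].items():
        if token in sft["embed"]:
            # Step 1: average the embeddings of tokens both models share.
            merged_embed[token] = interpolate(rm_vec, sft["embed"][token], lam)
        else:
            # Assumption: tokens known only to the RM keep the RM embedding.
            merged_embed[token] = list(rm_vec)

    # Step 2: average every transformer layer.
    merged_layers = [
        interpolate(r, s, lam) for r, s in zip(rm["layers"], sft["layers"])
    ]

    # Step 3: keep the RM regression head; the SFT head is not used.
    return {"embed": merged_embed, "layers": merged_layers, "head": list(rm["head"])}


# Toy weights, chosen only to make the arithmetic easy to follow.
rm = {
    "embed": {"a": [1.0, 0.0], "<pad>": [0.5, 0.5]},
    "layers": [[2.0, 4.0]],
    "head": [0.3, -0.7],
}
sft = {"embed": {"a": [0.0, 1.0]}, "layers": [[4.0, 0.0]], "head": [9.0, 9.0]}
merged = merge_doge_rm(rm, sft, lam=0.5)
```

With real checkpoints the same loop would run over tensors in two state dicts that share a base model, which is why the method requires both models to start from the same initialization.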

System Modules

Merging Mechanism

Combine weights of the SFT model and RM

Model or implementation: Weighted averaging (Linear interpolation)

Reward Inference

Score generated responses

Model or implementation: DogeRM (Merged LLaMA-2 or Mistral backbone)

Novel Architectural Elements

Disjoint parameter merging strategy: explicitly treating embedding, transformer, and head layers differently (Head is strictly from RM, others are interpolated)

Modeling

Base Model: LLaMA-2-7B (primary experiments), Mistral-7B (generalizability experiments)

Training Method: Standard Reward Modeling (Bradley-Terry model) for the base RM; Standard SFT for the domain models

Objective Functions:

Purpose: Train the base Reward Model to rank chosen responses higher than rejected ones.

Formally: Loss = -log(sigmoid(r(x, y_c) - r(x, y_r)))
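
As a small worked illustration of this objective, the sketch below evaluates the loss for a single preference pair given two scalar rewards. The function name is hypothetical, and the branching is only a standard trick to keep the exponential from overflowing; it computes the same quantity as the formula above.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Return -log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); pick the branch whose
    # exponent is non-positive so exp() cannot overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


loss_tied = bradley_terry_loss(0.0, 0.0)     # equal rewards
loss_correct = bradley_terry_loss(2.0, 0.0)  # chosen scored higher
loss_wrong = bradley_terry_loss(0.0, 2.0)    # rejected scored higher
```

The loss shrinks toward zero as the chosen response is scored further above the rejected one, and grows roughly linearly with the margin when the ranking is wrong.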

Adaptation: Full fine-tuning for SFT and RM training before merging

Trainable Parameters: All parameters (during pre-merge training phases)

Training Data:

RM Training: UltraFeedback dataset
RM Backbone SFT: AlpacaFarm 10k split
Domain SFT: MetaMath-7B and MAmmoTH-7B (existing math models); OSS-Instruct and Magicoder-Evol-Instruct (code SFT datasets)

Key Hyperparameters:

merging_weight_lambda: 0.35 (default for main results), range 0.2-0.5 recommended
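
One simple way to pick lambda on a small validation set is a grid sweep, sketched below. This is an assumption about the procedure, not a description of the authors' code: `validation_accuracy` is a hypothetical callable that would merge the two models with the given lambda and return pairwise accuracy on the validation pairs, and the accuracy curve used here is synthetic, not a result from the paper.

```python
def select_lambda(candidates, validation_accuracy):
    """Return (best_lambda, best_accuracy) over a list of candidate lambdas.

    validation_accuracy is a hypothetical callable: lambda -> accuracy.
    Ties are resolved in favour of the earliest candidate.
    """
    best_lam, best_acc = None, float("-inf")
    for lam in candidates:
        acc = validation_accuracy(lam)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc


def toy_accuracy(lam):
    """Synthetic stand-in curve peaking at 0.4; for demonstration only."""
    return 0.75 - (lam - 0.4) ** 2


grid = [round(0.05 * i, 2) for i in range(21)]  # 0.0, 0.05, ..., 1.0
best_lam, best_acc = select_lambda(grid, toy_accuracy)
```

Because merging needs no training, each grid point costs only one merge plus one pass over the validation pairs.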

Compute: Not reported in the paper

Comparison to Prior Work

vs. Rame et al.: DogeRM merges a Reward Model with an SFT Generative Model, not two Reward Models
vs. Standard RLHF: Does not require domain-specific preference data, only domain SFT data/models
vs. Fine-tuning RM on domain data: DogeRM is a training-free merge (once models exist) and avoids the catastrophic forgetting or overfitting often seen when fine-tuning RMs on small proxy datasets (demonstrated in experiments)

Limitations

Depends on the availability of high-quality SFT models initialized from the same base model as the RM
Performance is sensitive to the merging weight lambda, requiring a validation set to tune
Improvements in reranking (Best-of-N) on MBPP were modest due to the low upper bound of the base model's capabilities
Only explores linear merging; more complex merging techniques (e.g., TIES, DARE) were not extensively compared in the main results

Reproducibility

Code: https://github.com/MiuLab/DogeRM

The repository hosts both the code and the trained models. The paper details the source models (MetaMath, MAmmoTH) and datasets (UltraFeedback, AlpacaFarm) used. The merge itself is deterministic for a given lambda.

📊 Experiments & Results

Evaluation Setup

Evaluate reward model accuracy on preference pairs and downstream reranking performance

Benchmarks:

RewardBench: reward model evaluation (Chat, Chat Hard, Safety, Reasoning)
Auto-J Eval: pairwise preference evaluation (Code, Math, Others)
GSM8K: math word problems, evaluated via Best-of-N reranking
MBPP: code generation, evaluated via Best-of-N reranking

Metrics:

Accuracy (on preference pairs)
Pass@1 (for Best-of-N reranking)
Statistical methodology: Not explicitly reported in the paper
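
The accuracy metric on preference pairs can be sketched as follows. The function and the toy reward are hypothetical; counting a tie as an error is an assumption of this sketch rather than something the summary specifies.

```python
def pairwise_accuracy(pairs, reward_fn):
    """Fraction of (prompt, chosen, rejected) triples ranked correctly.

    A pair counts as correct only if the chosen response gets a strictly
    higher reward than the rejected one (ties count as errors here).
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(pairs)


def toy_reward(prompt, response):
    """Stand-in reward that simply prefers longer responses."""
    return float(len(response))


pairs = [
    ("q1", "a detailed answer", "short"),  # ranked correctly by toy_reward
    ("q2", "ok", "a long wrong answer"),   # ranked incorrectly
]
accuracy = pairwise_accuracy(pairs, toy_reward)
```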

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on RewardBench show significant gains in the Reasoning (Math/Code) category when merging with domain-specific models.
RewardBench (Math Subset, merged with MetaMath-7B)	Accuracy	0.5847	0.6987	+0.114
RewardBench (Math Subset, merged with MAmmoTH-7B)	Accuracy	0.5847	0.7547	+0.170
RewardBench (Code Subset)	Accuracy	0.7259	0.7799	+0.054
Downstream task performance using Best-of-16 sampling on GSM8K demonstrates that the improved reward accuracy translates to better generation selection.
GSM8K (Best-of-16)	Accuracy	0.490	0.540	+0.050
Generalizability results using Mistral-based architecture.
RewardBench (Math Subset)	Accuracy	0.4468	0.7568	+0.310

Experiment Figures

Best-of-N (N=1 to 16) performance curves for GSM8K and MBPP.

Impact of the weighting factor lambda on RewardBench performance.

Main Takeaways

Merging general reward models with domain-specific SFT models significantly boosts performance in those domains (Math/Code) without needing domain preference data.
The method generalizes across architectures (LLaMA-2 and Mistral) and domains.
Fine-tuning the RM on a small validation set (instead of merging) improves performance on that specific set but fails to generalize to other benchmarks, highlighting the robustness of the merging approach.
Merging multiple domain models (e.g., Math + Code + RM) simultaneously improves performance in both domains, though tuning the weights becomes more complex.
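
Merging several models at once can be pictured as a weighted sum over more than two parameter vectors. The sketch below is illustrative only: the function name is hypothetical, and requiring the weights to sum to 1 is an assumption made here so that the result stays a convex combination; the summary does not state how the weights are constrained.

```python
def merge_many(vectors, weights):
    """Weighted sum of equal-length parameter vectors.

    Assumption: weights are required to sum to 1 (a convex combination).
    """
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    size = len(vectors[0])
    if any(len(v) != size for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(size)]


# Toy parameters for one shared layer of three models.
rm_layer = [1.0, 0.0]
math_layer = [0.0, 1.0]
code_layer = [1.0, 1.0]
merged_layer = merge_many([rm_layer, math_layer, code_layer], [0.5, 0.25, 0.25])
```

With three models there are two free weights instead of one, which is why tuning becomes more involved than the single-lambda case.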

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Transformer architecture (embeddings, layers, heads)
Model Merging / Weight Interpolation

Key Terms

RM: Reward Model—a model trained to predict a scalar score representing human preference for a given text response

SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs (without preference ranking)

Model Merging: Combining the weights of two or more trained neural networks into a single model, typically via weighted averaging, without additional training

Linear Merging: A specific type of model merging where weights are combined as w_new = (1-λ) * w_A + λ * w_B, with λ in [0, 1] controlling how much of model B is mixed in

Best-of-N sampling: An inference strategy where N responses are generated, scored by a reward model, and the highest-scoring response is selected
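
A minimal sketch of Best-of-N selection, assuming the N candidate responses have already been generated. The function name and the toy reward are hypothetical; in the setting described here the reward function would be the merged reward model.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return the candidate with the highest reward (earliest one on ties)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt, response):
    """Stand-in reward: closeness of a numeric answer to the true value 12."""
    return -abs(int(response) - 12)


candidates = ["7", "12", "11", "12"]
selected = best_of_n("What is 5 + 7?", candidates, toy_reward)
```

The quality of the selected response is bounded by the best candidate the generator produced, which is consistent with the modest MBPP gains noted under Limitations.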