Bradley-Terry (BT) model: A statistical model that estimates the probability of one item being preferred over another as a logistic (sigmoid) function of the difference in their underlying reward scores
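As a minimal sketch of this definition (the function name is illustrative, not from the paper), the BT preference probability is just a sigmoid applied to the score gap:

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B:
    sigmoid of the difference in their reward scores."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
```

Equal rewards yield a 50/50 preference, and the two orderings' probabilities always sum to one.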
RLHF: Reinforcement Learning from Human Feedback—a technique to align LLMs with human intent using a reward model to guide generation
DPO: Direct Preference Optimization—a method to optimize policies directly from preferences without an explicit reward model
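For intuition, a per-pair sketch of the DPO objective (variable names are illustrative): the loss is the negative log-sigmoid of the scaled difference between the policy-vs-reference log-ratios of the chosen and rejected responses.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair: -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratios."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the chosen response's log-ratio relative to the rejected one drives the loss down.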
ArmoRM: A specific open-source reward model used in this paper to score and filter synthetic data pairs
Magpie: A synthetic data generation method that prompts an aligned LLM with only the user-turn prefix of its chat template, exploiting the model's autoregressive nature to make it generate both user queries and responses
Adversarial examples: Inputs designed to trick a model; in this context, harmful prompts paired with compliance (bad) vs. refusal (good) responses
Focal loss: A loss function that down-weights easy examples to focus training on hard-to-classify examples
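A minimal single-example sketch of focal loss (the standard formulation, not code from the paper): cross-entropy is down-weighted by a factor that shrinks as the model's confidence in the correct class grows.

```python
import math

def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Focal loss for one example: cross-entropy -log(p) scaled by
    (1 - p)^gamma, so easy (high-confidence) examples contribute
    little and hard examples dominate the gradient."""
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)
```

With gamma = 0 it reduces to plain cross-entropy; larger gamma suppresses easy examples more aggressively.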
Hinge loss: A loss function that penalizes the model only if the margin between correct and incorrect class scores is less than a threshold
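A sketch of the pairwise form of hinge loss described above (names illustrative): the loss is zero once the correct score beats the incorrect one by the margin threshold, and grows linearly otherwise.

```python
def hinge_loss(score_correct: float, score_other: float,
               margin: float = 1.0) -> float:
    """Pairwise hinge loss: zero when score_correct exceeds
    score_other by at least `margin`; linear penalty otherwise."""
    return max(0.0, margin - (score_correct - score_other))
```

Because examples already beyond the margin incur zero loss, training effort concentrates on pairs the model has not yet separated.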
HelpSteer2: A compact, high-quality preference dataset with multi-attribute annotations like helpfulness and correctness