Evaluation Setup
Dataset Quality Analysis and Reward Model Performance
Benchmarks:
- Reward-Bench (Reward Model Evaluation)
Metrics:
- Weighted Cohen's Kappa (Inter-annotator agreement)
- Pearson's R (Attribute correlation)
- Reward-Bench Score (primary benchmark)
- Statistical methodology: quadratic-weighted Cohen's Kappa is used for agreement on ordinal attributes.
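The quadratic-weighted kappa above can be sketched in a few lines. This is a minimal illustration with invented ratings on a 0-4 Likert scale, not the paper's annotation data:

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, num_classes=5):
    """Quadratic-weighted Cohen's Kappa for two ordinal raters."""
    n = len(a)
    # Observed disagreement: squared category distance, averaged over items,
    # so a 4-vs-0 split is penalised far more heavily than a 4-vs-3 split.
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / n
    # Expected disagreement if the two raters were independent,
    # computed from their marginal rating distributions.
    pa, pb = Counter(a), Counter(b)
    expected = sum(
        (i - j) ** 2 * (pa[i] / n) * (pb[j] / n)
        for i in range(num_classes)
        for j in range(num_classes)
    )
    return 1.0 - observed / expected

# Illustrative ratings from two hypothetical annotators (0-4 scale).
annotator_a = [4, 3, 3, 2, 4, 1, 0, 2, 3, 4]
annotator_b = [4, 3, 2, 2, 4, 1, 1, 2, 4, 4]

print(f"{quadratic_weighted_kappa(annotator_a, annotator_b):.3f}")
```

The same value is available as `sklearn.metrics.cohen_kappa_score(a, b, weights="quadratic")`; the explicit version makes the quadratic penalty visible.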
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Reward-Bench | Score | Not reported in the paper | 92.0 | Not reported in the paper |
| Internal Annotation | Cohen's Kappa (Helpfulness) | 0.465 | 0.706 | +0.241 |
| Internal Annotation | Cohen's Kappa (Correctness) | 0.472 | 0.715 | +0.243 |
| HelpSteer2 vs HelpSteer | Pearson's R (Coherence vs Helpfulness) | 0.6348 | 0.4979 | -0.1369 |
| HelpSteer2 vs HelpSteer | Pearson's R (Correctness vs Helpfulness) | 0.8525 | 0.9430 | +0.0905 |

Notes:
- Annotation quality analysis shows improved inter-annotator agreement after applying strict guidelines and annotator filtering.
- Correlation analysis reveals the shifting importance of attributes between HelpSteer (baseline) and HelpSteer2 (this paper).
Main Takeaways
- Strict filtering of annotators (retaining only those with high agreement) is crucial for creating high-signal reward datasets.
- As base models improve, 'Coherence' becomes a solved problem and correlates less with overall quality, while 'Correctness' becomes the primary differentiator.
- Complexity and Verbosity have low correlation with Helpfulness in HelpSteer2 (0.18 and 0.06), indicating the dataset successfully disentangles style from quality.
- High-quality data (HelpSteer2) allows for SOTA reward modeling with significantly fewer samples (10k) than noisy large-scale datasets.
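The disentanglement claim rests on attribute correlations like those in the table, which are plain Pearson's R between per-response rating columns. A minimal sketch with invented ratings (not the paper's data), where correctness is constructed to track helpfulness and verbosity is not:

```python
import statistics

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Illustrative per-response attribute ratings on a 0-4 scale.
helpfulness = [4, 3, 4, 2, 1, 3, 4, 2]
correctness = [4, 3, 4, 2, 2, 3, 4, 1]
verbosity   = [2, 4, 1, 3, 2, 4, 2, 3]

print(f"correctness vs helpfulness: {pearson_r(helpfulness, correctness):.2f}")
print(f"verbosity   vs helpfulness: {pearson_r(helpfulness, verbosity):.2f}")
```

A high correctness-helpfulness R with a near-zero verbosity-helpfulness R is the signature of a dataset that separates substance from style; `scipy.stats.pearsonr` gives the same statistic plus a p-value.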