Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

Judy Hanwen Shen, Archit Sharma, Jun Qin
Stanford University
arXiv (2024)
RL Benchmark

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Data-Centric AI
This paper establishes three data-centric metrics—effective sample size, noise invariance, and information content—to characterize RLHF preference datasets, demonstrating that dataset composition often outweighs raw scale for reward model performance.
Core Problem
While RLHF relies heavily on pairwise preference datasets, there is no rigorous methodology to measure or compare these datasets beyond basic summary statistics like token counts.
Why it matters:
  • Ideally, custom preference data is collected for every application, but in practice, practitioners indiscriminately use a few public datasets (e.g., HH-RLHF) without understanding their quality
  • Technical reports for SOTA models highlight data quality importance but provide no details, leaving the community with 'folk wisdom' rather than empirical evidence on what makes a dataset 'good'
  • Indiscriminate scaling of preference data may be inefficient or harmful if the data composition (e.g., domain relevance, noise level) is poor
Concrete Example: A practitioner might assume training on 140k examples from HH-RLHF is better than 10k examples from UltraFeedback due to size. However, the paper shows the 10k UltraFeedback subset actually outperforms the larger dataset on Chat benchmarks, indicating that raw scale is a poor proxy for quality.
Key Novelty
Data-Centric Metrics for Preference Datasets
  • Proposes measuring 'Effective Sample Size' by analyzing performance scaling laws specific to preference datasets, rather than assuming pre-training scaling laws apply
  • Introduces 'Noise Invariance' testing by deliberately flipping preference labels to measure how robust reward models are to annotator disagreement
  • Defines 'Information Content' using cosine similarity between chosen and rejected responses, positing that pairs with highly similar embeddings offer less learning signal
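The information-content metric above is straightforward to operationalize. Below is a minimal sketch, assuming each chosen/rejected response has already been embedded as a row vector (the embedding model and the `keep_frac` threshold are illustrative choices, not specified here from the paper):

```python
import numpy as np

def pairwise_information(chosen_emb: np.ndarray, rejected_emb: np.ndarray) -> np.ndarray:
    """Score each preference pair as 1 - cosine(chosen, rejected).

    Higher scores mean the two responses are more dissimilar and, per the
    paper's hypothesis, carry more learning signal for the reward model.
    """
    # Normalize rows so the row-wise dot product equals cosine similarity.
    c = chosen_emb / np.linalg.norm(chosen_emb, axis=1, keepdims=True)
    r = rejected_emb / np.linalg.norm(rejected_emb, axis=1, keepdims=True)
    cos_sim = np.sum(c * r, axis=1)
    return 1.0 - cos_sim

def filter_high_information(chosen_emb, rejected_emb, keep_frac=0.5):
    """Keep indices of the keep_frac most dissimilar (highest-information) pairs."""
    scores = pairwise_information(chosen_emb, rejected_emb)
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(scores)[::-1][:k]  # indices sorted by descending score
```

Pairs whose score is near zero are near-duplicates and would be the first candidates to drop under this filtering scheme.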
Evaluation Highlights
  • 10k examples from UltraFeedback outperform 140k examples from HH-RLHF on the RewardBench Chat task, showing composition beats scale
  • Reward models are surprisingly robust to label noise, maintaining performance until 30-40% of preference labels are flipped
  • Filtering for 'high information' examples (low cosine similarity between responses) improves evaluation accuracy for smaller models (350M parameters) compared to random sampling
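The noise-invariance experiment behind the second highlight can be sketched simply: corrupt a controlled fraction of labels, retrain, and observe where accuracy degrades. A minimal illustration (the representation of pairs as `(chosen, rejected)` tuples is an assumption for this sketch):

```python
import random

def flip_labels(pairs, flip_frac, seed=0):
    """Return a copy of the dataset with flip_frac of preference labels flipped.

    Each pair is a (chosen, rejected) tuple; flipping swaps the two responses,
    mimicking annotator disagreement or labeling error.
    """
    rng = random.Random(seed)
    n_flip = int(len(pairs) * flip_frac)
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [(rej, cho) if i in flip_idx else (cho, rej)
            for i, (cho, rej) in enumerate(pairs)]
```

Sweeping `flip_frac` from 0.0 toward 0.5 and retraining the reward model at each setting traces out the noise-invariance curve; per the paper's finding, accuracy holds roughly steady until 30-40% of labels are flipped.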
Breakthrough Assessment
7/10
A solid data-centric study that challenges the 'more is better' assumption in RLHF. While it doesn't propose a new architecture, its empirical analysis of dataset properties provides valuable practical guidance for the community.