Evaluation Setup
Dataset Quality Analysis and Reward Model Performance
Benchmarks:
- Reward-Bench (Reward Model Evaluation)
Metrics:
- Weighted Cohen's Kappa (Inter-annotator agreement)
- Pearson's R (Attribute correlation)
- Reward-Bench Score (primary benchmark)
- Statistical methodology: quadratic-weighted Cohen's Kappa is used for agreement on ordinal attributes.
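The quadratic-weighted kappa above can be sketched in a few lines. This is a minimal illustration with invented ratings on a 0-4 Likert scale, not the paper's annotation data:

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, num_classes=5):
    """Quadratic-weighted Cohen's Kappa for two ordinal raters."""
    n = len(a)
    # Observed disagreement: squared category distance, averaged over items,
    # so a 4-vs-0 split is penalised far more heavily than a 4-vs-3 split.
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / n
    # Expected disagreement if the two raters were independent,
    # computed from their marginal rating distributions.
    pa, pb = Counter(a), Counter(b)
    expected = sum(
        (i - j) ** 2 * (pa[i] / n) * (pb[j] / n)
        for i in range(num_classes)
        for j in range(num_classes)
    )
    return 1.0 - observed / expected

# Illustrative ratings from two hypothetical annotators (0-4 scale).
annotator_a = [4, 3, 3, 2, 4, 1, 0, 2, 3, 4]
annotator_b = [4, 3, 2, 2, 4, 1, 1, 2, 4, 4]

print(f"{quadratic_weighted_kappa(annotator_a, annotator_b):.3f}")
```

The same value is available as `sklearn.metrics.cohen_kappa_score(a, b, weights="quadratic")`; the explicit version makes the quadratic penalty visible.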
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Reward-Bench | Score | Not reported in the paper | 92.0 | Not reported in the paper |
| Internal Annotation | Cohen's Kappa (Helpfulness) | 0.465 | 0.706 | +0.241 |
| Internal Annotation | Cohen's Kappa (Correctness) | 0.472 | 0.715 | +0.243 |
| HelpSteer2 vs HelpSteer | Pearson's R (Coherence vs Helpfulness) | 0.6348 | 0.4979 | -0.1369 |
| HelpSteer2 vs HelpSteer | Pearson's R (Correctness vs Helpfulness) | 0.8525 | 0.9430 | +0.0905 |

Notes:
- Annotation quality analysis shows improved inter-annotator agreement after applying strict guidelines and annotator filtering.
- Correlation analysis reveals the shifting importance of attributes between HelpSteer (baseline) and HelpSteer2 (this paper).
Main Takeaways
- Strict filtering of annotators (retaining only those with high agreement) is crucial for creating high-signal reward datasets.
- As base models improve, 'Coherence' becomes a solved problem and correlates less with overall quality, while 'Correctness' becomes the primary differentiator.
- Complexity and Verbosity have low correlation with Helpfulness in HelpSteer2 (0.18 and 0.06), indicating the dataset successfully disentangles style from quality.
- High-quality data (HelpSteer2) allows for SOTA reward modeling with significantly fewer samples (10k) than noisy large-scale datasets.
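The disentanglement claim rests on attribute correlations like those in the table, which are plain Pearson's R between per-response rating columns. A minimal sketch with invented ratings (not the paper's data), where correctness is constructed to track helpfulness and verbosity is not:

```python
import statistics

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Illustrative per-response attribute ratings on a 0-4 scale.
helpfulness = [4, 3, 4, 2, 1, 3, 4, 2]
correctness = [4, 3, 4, 2, 2, 3, 4, 1]
verbosity   = [2, 4, 1, 3, 2, 4, 2, 3]

print(f"correctness vs helpfulness: {pearson_r(helpfulness, correctness):.2f}")
print(f"verbosity   vs helpfulness: {pearson_r(helpfulness, verbosity):.2f}")
```

A high correctness-helpfulness R with a near-zero verbosity-helpfulness R is the signature of a dataset that separates substance from style; `scipy.stats.pearsonr` gives the same statistic plus a p-value.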