Evaluation Setup
RLHF fine-tuning on OpenAssistant data, with the resulting policies evaluated on the LIMA test set
Benchmarks:
- LIMA Test Set (Open-ended instruction following)
- TruthfulQA (Factuality/Truthfulness)
- MMLU (Multi-task knowledge)
Metrics:
- Win Score (GPT-4 evaluation against SFT baseline)
- Average Response Length
- Pearson/Kendall/Spearman correlation of Reward with Length
- Statistical methodology: Pareto front analysis (Win Score vs Length trade-off)
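The reward-vs-length correlation metrics above can be sketched as follows. This is a minimal illustration on synthetic data (the lengths and rewards here are made up, not the paper's); a real evaluation would score the reward model on OpenAssistant test responses and correlate those scores with response length.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(0)
# Synthetic response lengths and a deliberately length-hacked reward:
# the reward mostly tracks length plus noise, as a length-biased RM would.
lengths = rng.integers(20, 400, size=200).astype(float)
hacked_reward = 0.005 * lengths + rng.normal(0.0, 0.2, size=200)

# Linear and rank correlations of reward with length (the reported metrics).
for name, fn in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat = fn(hacked_reward, lengths)[0]
    print(f"{name} correlation (reward vs length): {stat:+.3f}")
```

A length-hacked reward yields correlations near 1; a disentangled reward should bring all three close to 0.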
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| OpenAssistant Test Set | Pearson Correlation (Reward vs Length) | 0.451 | -0.03 | -0.481 |
| OpenAssistant Test Set | Validation Accuracy | 70.1 | 69.2 | -0.9 |
| TruthfulQA (mc1) | Accuracy | 33.90 | 34.64 | +0.74 |
| MMLU | Accuracy | 49.87 | 49.74 | -0.13 |
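The Pareto front analysis named in the evaluation setup can be sketched as follows: given (average length, win score) pairs for a sweep of trained policies, keep only the points not dominated by another policy that is shorter-or-equal and at least as good. The sample points below are illustrative, not results from the paper.

```python
def pareto_front(points):
    """Return the (length, win_score) points not dominated by any other point.

    A point is dominated if some other point has length <= and win score >=,
    with at least one of the two strictly better.
    """
    front = []
    for i, (li, si) in enumerate(points):
        dominated = any(
            (lj <= li and sj >= si) and (lj < li or sj > si)
            for j, (lj, sj) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((li, si))
    return sorted(front)

# Hypothetical policies: (average response length, win score vs SFT baseline).
policies = [(180, 0.48), (220, 0.55), (220, 0.50), (300, 0.52), (350, 0.60)]
print(pareto_front(policies))  # → [(180, 0.48), (220, 0.55), (350, 0.6)]
```

A method with a higher Pareto front offers a better win score at every length budget, which is the sense in which ODIN is compared against baselines below.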
Main Takeaways
- ODIN consistently achieves a higher Pareto front than baselines: for any given response length, ODIN-trained policies achieve higher quality scores.
- Standard RL tricks like reward clipping and length penalty require extensive tuning and are less effective than disentangling the reward signal at the source.
- The rank correlations (Spearman/Kendall) with length are also eliminated (-0.05 to 0.00), even though the disentanglement loss penalizes only the linear Pearson correlation.
- Human evaluation confirms GPT-4 findings: ODIN policies are preferred over vanilla policies at matching length scales.
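The "disentangling at the source" idea can be sketched as a training penalty: alongside the usual preference loss, the quality head of the reward model is penalized for any linear (Pearson) correlation with response length. A minimal NumPy sketch of that penalty term, with illustrative names that are not the paper's code:

```python
import numpy as np

def pearson_corr(x, y, eps=1e-8):
    # Linear (Pearson) correlation between two 1-D arrays.
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (np.linalg.norm(xc) * np.linalg.norm(yc) + eps))

def length_decorrelation_penalty(quality_rewards, lengths):
    # Squared Pearson correlation of the quality head's rewards with length.
    # Driving this toward 0 during reward-model training is what removes the
    # length correlation (0.451 -> -0.03 in the results table above).
    return pearson_corr(quality_rewards, lengths) ** 2

lengths = np.array([50.0, 120.0, 200.0, 310.0, 400.0])
length_tracking = 0.01 * lengths  # a head that just tracks length
print(length_decorrelation_penalty(length_tracking, lengths))  # close to 1.0
```

A head that merely tracks length incurs a penalty near 1, while a length-independent head incurs nearly none, so minimizing this term alongside the ranking loss pushes the quality head to explain preferences without leaning on length.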