Evaluation Setup
T5 models are trained with PPO against reward models of varying accuracy; final outputs are then scored by a separate, high-quality 'oracle' reward model.
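To make the protocol concrete, here is a toy sketch (not the paper's actual setup): a softmax policy over five hypothetical candidate answers is trained with a simple REINFORCE loop standing in for PPO, using a proxy reward model whose accuracy controls how noisily it tracks the oracle; the trained policy is then scored by the noise-free oracle. All scores, the noise model, and the names `ORACLE` and `proxy_reward` are illustrative assumptions.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins (not from the paper): five candidate answers with fixed
# oracle quality scores. A proxy reward model of accuracy `acc` sees the
# oracle score corrupted by Gaussian noise of scale (1 - acc).
ORACLE = [0.1, 0.3, 0.5, 0.7, 0.9]

def proxy_reward(i, acc):
    return ORACLE[i] + random.gauss(0.0, 1.0 - acc)

def train_policy(acc, steps=3000, lr=0.1):
    """REINFORCE with a running baseline, standing in for PPO."""
    logits = [0.0] * len(ORACLE)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        i = random.choices(range(len(ORACLE)), weights=probs)[0]
        r = proxy_reward(i, acc)
        baseline += 0.05 * (r - baseline)  # running average of rewards
        adv = r - baseline                 # advantage estimate
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * adv * (indicator - probs[j])
    return logits

def oracle_score(logits):
    """Expected oracle score of the trained policy (the paper's LM metric)."""
    probs = softmax(logits)
    return sum(p * o for p, o in zip(probs, ORACLE))

score = oracle_score(train_policy(acc=0.9))
print(round(score, 3))
```

The key structural point is the separation of roles: the proxy reward model drives training, while only the oracle is used for the reported LM Performance score.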
Benchmarks:
- QA-FEEDBACK (Relevance): long-form QA, relevance reward
- QA-FEEDBACK (Factuality): long-form QA, factuality reward
- QA-FEEDBACK (Completeness): long-form QA, completeness reward
Metrics:
- LM Performance (Score from Oracle Reward Models)
- KL Divergence
- Reward Mean and Variance
- Statistical methodology: Not explicitly reported in the paper
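The first three metrics are straightforward to compute; as a minimal sketch (with hypothetical distributions and reward values), KL divergence is taken between the fine-tuned policy's token distribution and the frozen reference model's, and reward mean/variance are batch statistics over sampled outputs:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reward_stats(rewards):
    """Mean and (population) variance of a batch of reward scores."""
    m = sum(rewards) / len(rewards)
    v = sum((r - m) ** 2 for r in rewards) / len(rewards)
    return m, v

# Hypothetical next-token distributions: a fine-tuned policy vs. the
# frozen reference model it was initialized from.
policy    = [0.70, 0.20, 0.10]
reference = [0.40, 0.40, 0.20]
kl = kl_divergence(policy, reference)

# Hypothetical batch of reward-model scores for four sampled outputs.
mean, var = reward_stats([0.2, 0.8, 0.5, 0.1])
print(round(kl, 4), round(mean, 3), round(var, 4))  # → 0.1838 0.4 0.075
```

A small KL that stays stable over training indicates the policy has not drifted far from the reference model, which is how the KL profiles below are read.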
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| QA-FEEDBACK (Relevance) | LM Performance Score | 0.55 | 0.65 | +0.10 |
| QA-FEEDBACK (Factuality) | LM Performance Score | 0.80 | 0.95 | +0.15 |
| QA-FEEDBACK (Completeness) | LM Performance Score | 0.05 | 0.80 | +0.75 |

Across all three tasks (Relevance, Factuality, Completeness), results consistently show that peak LM performance is achieved with reward models of moderate accuracy rather than the highest accuracy.
Main Takeaways
- Moderate-accuracy reward models yield higher reward mean and variance than highly accurate ones, which encourages exploration.
- Highly accurate reward models tend to be 'conservative': they often assign low rewards that give the policy too weak a learning signal (especially on the completeness task).
- Models trained with moderate-accuracy reward models show more stable KL-divergence profiles, suggesting a balanced training process that avoids mode collapse and over-optimization.
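The first two takeaways have a simple mechanical reading: the policy-gradient signal scales with how much rewards deviate from their mean. A minimal sketch (with hypothetical reward profiles, not the paper's data) compares the exact gradient of expected reward under a uniform policy for a 'conservative' low-spread reward profile versus a moderate, higher-spread one:

```python
import math

def exact_policy_gradient(probs, rewards):
    """Exact gradient of expected reward w.r.t. softmax logits:
    d/dtheta_k E[r] = p_k * (r_k - E[r])."""
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

uniform = [0.25] * 4

# Hypothetical reward profiles for the same four sampled outputs:
# a 'conservative' highly accurate RM gives uniformly low scores,
# while a moderate RM spreads its scores out.
conservative = [0.05, 0.06, 0.05, 0.04]
moderate     = [0.20, 0.80, 0.50, 0.10]

g_cons = norm(exact_policy_gradient(uniform, conservative))
g_mod  = norm(exact_policy_gradient(uniform, moderate))
print(round(g_cons, 4), round(g_mod, 4))  # → 0.0035 0.1369
```

Under these assumed numbers the moderate profile produces a gradient roughly 40x larger, illustrating why low-variance, uniformly low rewards can stall learning.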