| Benchmark | Metric | Baseline | F-DPO (This Paper) | Δ (F-DPO − Baseline) |
|---|---|---|---|---|
| F-DPO consistently reduces hallucination rates across all model sizes compared to base models and standard DPO. | ||||
| Held-out Skywork | Hallucination Rate | 0.424 | 0.084 | -0.340 |
| Held-out Skywork | Hallucination Rate | 0.418 | 0.084 | -0.334 |
| Held-out Skywork | Factuality Score | 5.26 | 7.90 | +2.64 |
| Out-of-distribution evaluation on TruthfulQA shows F-DPO generalizes better than baselines. | ||||
| TruthfulQA | MC1 Accuracy | 0.500 | 0.585 | +0.085 |
| TruthfulQA | MC2 Accuracy | 0.357 | 0.531 | +0.174 |
| TruthfulQA | MC1 Accuracy | 0.472 | 0.585 | +0.113 |
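
The Δ column appears to be the proposed method's value minus the baseline value. The sketch below is a minimal consistency check under that assumption: it copies the six numeric rows from the table and recomputes each difference. The names `rows`, `recompute_delta`, `find_mismatches`, and the tolerance `TOL` are hypothetical helpers introduced for illustration, not part of the paper.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, proposed, reported_delta)
rows = [
    ("Held-out Skywork", "Hallucination Rate", 0.424, 0.084, -0.340),
    ("Held-out Skywork", "Hallucination Rate", 0.418, 0.084, -0.334),
    ("Held-out Skywork", "Factuality Score", 5.26, 7.90, 2.64),
    ("TruthfulQA", "MC1 Accuracy", 0.500, 0.585, 0.085),
    ("TruthfulQA", "MC2 Accuracy", 0.357, 0.531, 0.174),
    ("TruthfulQA", "MC1 Accuracy", 0.472, 0.585, 0.113),
]

# Assumed tolerance: half a unit in the third decimal place,
# which absorbs floating-point noise in the subtraction.
TOL = 5e-4


def recompute_delta(baseline: float, proposed: float) -> float:
    """Assumed definition of the delta column: proposed minus baseline."""
    return proposed - baseline


def find_mismatches(table_rows):
    """Return rows whose reported delta differs from the recomputed one."""
    return [
        row
        for row in table_rows
        if abs(recompute_delta(row[2], row[3]) - row[4]) > TOL
    ]


if __name__ == "__main__":
    for benchmark, metric, baseline, proposed, reported in rows:
        delta = recompute_delta(baseline, proposed)
        print(f"{benchmark:18s} {metric:20s} reported={reported:+.3f} recomputed={delta:+.3f}")
    print("mismatches:", find_mismatches(rows))
```

Note that the sign of Δ does not by itself indicate improvement: for Hallucination Rate a negative Δ is the desired direction, while for Factuality Score and the TruthfulQA accuracies a positive Δ is.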