Evaluation Setup
Models are trained on the BPO dataset and evaluated on separate hallucination and general-capability benchmarks.
Benchmarks:
- AMBER (Hallucination Benchmark)
- Object HalBench (Hallucination Benchmark)
- MMBench (General VQA/Reasoning)
- MME (Comprehensive Evaluation)
- SEED-Bench (General Multimodal Benchmark)
Metrics:
- Accuracy
- Hallucination Rate
- Area Under Gap (AUG) for overfitting analysis
Statistical methodology:
- Experiments are repeated with three different random seeds, and standard deviations are reported
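The summary does not give a formula for AUG, so the following is only a minimal sketch under a stated assumption: AUG is taken to be the trapezoidal area under the gap between the mean reward of the easy bucket and the mean reward of the hard bucket, measured over training steps. The function name `area_under_gap` and its arguments are hypothetical, not taken from the paper.

```python
from typing import Sequence


def area_under_gap(
    steps: Sequence[float],
    easy_reward: Sequence[float],
    hard_reward: Sequence[float],
) -> float:
    """Trapezoidal area under the (easy - hard) reward gap over training steps.

    ASSUMPTION: this definition of AUG is illustrative; the paper's exact
    formula is not given in the summary.
    """
    if not (len(steps) == len(easy_reward) == len(hard_reward)):
        raise ValueError("all inputs must have the same length")
    if len(steps) < 2:
        raise ValueError("need at least two checkpoints to integrate")

    area = 0.0
    for i in range(1, len(steps)):
        width = steps[i] - steps[i - 1]
        gap_prev = easy_reward[i - 1] - hard_reward[i - 1]
        gap_curr = easy_reward[i] - hard_reward[i]
        # Trapezoid rule: average of the two gaps times the step width.
        area += 0.5 * (gap_prev + gap_curr) * width
    return area


# Toy usage with made-up reward curves (not results from the paper).
toy_aug = area_under_gap(
    steps=[0, 100, 200],
    easy_reward=[0.0, 1.0, 2.0],
    hard_reward=[0.0, 0.5, 1.0],
)
```

Under this reading, a smaller value means the easy and hard buckets are rewarded more evenly during training, which is the sense in which a lower AUG would indicate less overfitting to easy samples.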
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
Main Takeaways
- DA-DPO reduces the 'Area Under Gap' (AUG) between easy and hard samples compared to vanilla DPO, which provides quantitative evidence of reduced overfitting
- The method estimates sample difficulty cheaply and without any additional training by ensembling contrastive (CLIP) and generative (LLaVA) signals
- Difficulty-aware training slows reward growth on the easy buckets, which keeps the model from overfitting to simple examples and forces it to engage with harder ones
- Empirical results (qualitatively described) show improvements in both hallucination reduction and general capabilities, suggesting the method balances alignment without catastrophic forgetting
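The difficulty estimation and difficulty-aware training described in the takeaways can be illustrated with a small sketch. Several parts are assumptions rather than the paper's formulation: each signal (CLIP or LLaVA) is assumed to yield a per-sample margin between the chosen and rejected response, where a larger margin means an easier sample; the two signals are combined by a plain average after min-max normalization; and difficulty is applied as a per-sample weight on the loss. Only the DPO loss itself follows the standard definition. All function names and the `floor` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ensemble_difficulty(
    clip_margins: Sequence[float], llava_margins: Sequence[float]
) -> List[float]:
    """Difficulty in [0, 1] from two training-free signals.

    ASSUMPTION: a larger chosen-vs-rejected margin means an easier sample,
    and the two normalized signals are simply averaged.
    """
    if len(clip_margins) != len(llava_margins):
        raise ValueError("signals must cover the same samples")
    clip_n = min_max(clip_margins)
    llava_n = min_max(llava_margins)
    return [1.0 - 0.5 * (c + g) for c, g in zip(clip_n, llava_n)]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard per-sample DPO loss: -log sigmoid(beta * log-ratio margin)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return softplus(-beta * margin)


def difficulty_weighted_loss(
    losses: Sequence[float], difficulty: Sequence[float], floor: float = 0.1
) -> float:
    """Weighted mean loss that down-weights easy samples.

    ASSUMPTION: weighting the loss is one possible way to slow reward
    growth on easy buckets; `floor` keeps easy samples from vanishing.
    """
    if len(losses) != len(difficulty):
        raise ValueError("one difficulty score per loss is required")
    weights = [floor + (1.0 - floor) * d for d in difficulty]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy usage with made-up margins and log-probabilities.
difficulty = ensemble_difficulty([0.0, 1.0], [0.0, 2.0])
losses = [dpo_loss(-1.0, -1.0, -1.0, -1.0), dpo_loss(-0.5, -2.0, -1.0, -1.0)]
batch_loss = difficulty_weighted_loss(losses, difficulty)
```

In this toy batch the second sample has the larger margin under both signals, so it receives the lowest difficulty and contributes least to the batch loss. That mirrors the intended effect of slowing reward growth on easy examples.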