Evaluation Setup
Video Question Answering on standard benchmarks
Benchmarks:
- MSVD-QA (Video QA)
- MSRVTT-QA (Video QA)
- TGIF-QA (Video QA)
Metrics:
- Accuracy (assessed by ChatGPT)
- Score (1-5 scale)
- Statistical methodology: Not explicitly reported in the paper
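The two metrics above can be aggregated from per-question judge verdicts. The sketch below is only an illustration: it assumes each verdict carries a yes/no correctness flag and a 1-5 rating, and the `Judgement` record and `summarize` helper are hypothetical names, not taken from the paper or its code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Judgement:
    """One judge verdict for a single question (hypothetical record layout)."""
    correct: bool  # judge's yes/no decision on the predicted answer
    score: int     # judge's 1-5 quality rating


def summarize(judgements):
    """Return (accuracy in percent, mean score) over a list of Judgement records."""
    if not judgements:
        raise ValueError("no judgements to summarize")
    for j in judgements:
        if not 1 <= j.score <= 5:
            raise ValueError(f"score out of range: {j.score}")
    accuracy = 100.0 * sum(j.correct for j in judgements) / len(judgements)
    avg_score = mean(j.score for j in judgements)
    return accuracy, avg_score


# Toy verdicts for illustration only; these are not results from the paper.
example = [Judgement(True, 5), Judgement(True, 4), Judgement(False, 2), Judgement(True, 4)]
accuracy, avg_score = summarize(example)
print(f"accuracy={accuracy:.2f}% score={avg_score:.2f}")  # accuracy=75.00% score=3.75
```

Because no statistical methodology is reported, the sketch returns point estimates only and does not attempt confidence intervals.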
Key Results
DPO training improves average accuracy over the SFT baseline and other SOTA models by 2.95 to 11.35 points across the combined benchmarks. Each row compares against a different baseline.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (MSVD, MSRVTT, TGIF) | Accuracy | 62.65 | 70.75 | +8.10 |
| Average (MSVD, MSRVTT, TGIF) | Accuracy | 59.40 | 70.75 | +11.35 |
| Average (MSVD, MSRVTT, TGIF) | Accuracy | 66.50 | 70.75 | +4.25 |
| Average (MSVD, MSRVTT, TGIF) | Accuracy | 67.80 | 70.75 | +2.95 |
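The Δ column is the plain difference between the two accuracy columns, in percentage points. As a quick arithmetic check, the following sketch recomputes it from the values in the table; the variable names are illustrative.

```python
# (baseline accuracy, this paper's accuracy) pairs, copied from the table rows.
rows = [(62.65, 70.75), (59.40, 70.75), (66.50, 70.75), (67.80, 70.75)]

# Absolute gain in accuracy points, rounded to the table's two decimals.
deltas = [round(ours - base, 2) for base, ours in rows]

for (base, ours), delta in zip(rows, deltas):
    print(f"{base:.2f} -> {ours:.2f}: {delta:+.2f}")
```

The recomputed gains match the reported +8.10, +11.35, +4.25, and +2.95.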
Main Takeaways
- Detailed video captions can effectively substitute for video content in reward modeling, enabling cheap and scalable DPO.
- The proposed LLaVA-Hound-DPO sets a new SOTA for Video QA, outperforming both its SFT base and other recent models like LLaMA-VID.
- Text-based reward calculation (ChatGPT + Captions) shows a moderate positive correlation (Pearson r = 0.47) with the more expensive Vision-based reward calculation (GPT-4V + Frames).
- Pre-training on large-scale video captions (ShareGPTVideo) improves generalization, particularly for out-of-domain tasks.
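The agreement between text-based and vision-based rewards mentioned above is measured with the Pearson correlation coefficient. A minimal sketch of that computation follows, assuming both reward sources score the same set of responses; the `pearson` helper and the toy scores are illustrative and are not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up 1-5 reward scores for five responses (not from the paper).
text_rewards = [1, 2, 3, 4, 5]    # e.g. judged from captions
vision_rewards = [2, 1, 4, 3, 5]  # e.g. judged from frames
r = pearson(text_rewards, vision_rewards)
print(f"Pearson r = {r:.2f}")  # Pearson r = 0.80
```

A value of 0.47, as reported, sits well below this toy example's 0.80, so the two reward sources agree in direction but are far from interchangeable on individual samples.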