Evaluation Setup
Task: Video question answering across spatial, temporal, and spatio-temporal tasks
Benchmarks:
- STI-Bench (Spatio-temporal intelligence)
- V-STaR (Spatio-temporal reasoning)
- VSI-Bench (Spatial reasoning)
- SPAR-Bench (Spatial reasoning)
- Video-MME (Temporal reasoning)
- TempCompass (Temporal reasoning)
Metrics:
- Accuracy
- Score
- Statistical methodology: Not explicitly reported in the paper
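For reference, accuracy on multiple-choice video QA is usually the fraction of questions whose predicted option matches the ground-truth option, reported per benchmark. The sketch below is a minimal illustration under that assumption. The function name `accuracy_by_benchmark`, the record layout, and the toy records are hypothetical and are not taken from the paper or from any benchmark's official evaluation script.

```python
from collections import defaultdict


def accuracy_by_benchmark(records):
    """Return {benchmark: accuracy} for (benchmark, predicted, gold) tuples.

    Assumption: answers are option letters compared case-insensitively
    after trimming whitespace. Official benchmark scripts may differ.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for benchmark, predicted, gold in records:
        total[benchmark] += 1
        if predicted.strip().upper() == gold.strip().upper():
            correct[benchmark] += 1
    return {name: correct[name] / total[name] for name in total}


# Hypothetical toy records for illustration only, not results from the paper.
records = [
    ("STI-Bench", "A", "A"),
    ("STI-Bench", "b", "B"),
    ("STI-Bench", "C", "D"),
    ("VSI-Bench", "A", "B"),
]
scores = accuracy_by_benchmark(records)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the result keyed by benchmark makes it easier to see whether a gain on one benchmark comes with a loss on another, which is the comparison the takeaways rely on.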
Main Takeaways
- Video-STR outperforms the base model (Qwen2.5-VL-7B-Instruct) by 13% on STI-Bench, which supports the graph-based RL approach.
- The method generalizes better than Supervised Fine-Tuning (SFT): SFT showed localized gains on STI-Bench and VSI-Bench but degraded performance on the other benchmarks, whereas Video-STR improved consistently across all of them.
- Ablation studies indicate that the Graph-based Reasoning Mechanism is the most critical component; removing it causes significant performance drops.
- Data quality matters: removing the spatial subset of STV-205k degrades spatial reasoning, and removing the temporal subset degrades temporal understanding.
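When reading improvement figures such as those above, it helps to distinguish an absolute gain (percentage points) from a relative gain (percent of the baseline score), because the two can differ substantially. The sketch below shows both calculations. The helper names and the input numbers are hypothetical placeholders chosen for easy arithmetic; they are not values reported in the paper.

```python
def absolute_gain(base, new):
    """Difference in percentage points between two accuracy scores."""
    return new - base


def relative_gain(base, new):
    """Improvement expressed as a percentage of the baseline score."""
    return (new - base) / base * 100.0


# Hypothetical accuracies (in percent), not results from the paper.
base_acc, new_acc = 40.0, 45.0
print(f"absolute gain: {absolute_gain(base_acc, new_acc):.1f} points")
print(f"relative gain: {relative_gain(base_acc, new_acc):.1f} %")
```

The same pair of helpers applies to ablation comparisons: computing the drop from the full model to each ablated variant on every benchmark shows which component matters most for which skill.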