Evaluation Setup
The model is evaluated on multiple egocentric video benchmarks covering QA, long-term reasoning, and grounding.
Benchmarks:
- EgoTimeQA (Egocentric QA: action, temporal)
- Ego-QA (Egocentric QA)
- EgoExoLearn (Temporal Grounding / QA)
- EK-Visor (Hand-Object Grounding)
Metrics:
- Accuracy (QA)
- mIoU (Grounding)
- Success Rate (Grounding)
- Statistical methodology: Not explicitly reported in the paper
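The three metrics above can be illustrated with a minimal sketch. It assumes exact-match scoring for QA, grounding predictions expressed as temporal intervals `(start, end)` in seconds with `start <= end`, and a success threshold of IoU >= 0.5. None of these details are stated in the summary, and the function names are hypothetical, so this is one common way to compute the quantities rather than the paper's evaluation protocol.

```python
from typing import Sequence, Tuple

# Assumption: a grounding prediction is a temporal interval (start, end) in seconds.
Interval = Tuple[float, float]


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of QA predictions that exactly match the reference answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two intervals; 0.0 if they do not overlap."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """mIoU: the average IoU over all (prediction, ground truth) pairs."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious) if ious else 0.0


def success_rate(preds: Sequence[Interval], gts: Sequence[Interval],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU meets the (assumed) threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious) if ious else 0.0
```

For spatial grounding such as hand-object boxes or masks, the same structure would apply with the interval IoU swapped for a box or mask IoU.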
Key Results
EgoThinker demonstrates superior performance on diverse egocentric QA and grounding benchmarks compared to state-of-the-art MLLMs.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| EgoTimeQA | Accuracy | Not reported in the paper | Not reported in the paper | Not reported in the paper |
| Ego-QA | Accuracy | Not reported in the paper | Not reported in the paper | Not reported in the paper |
Main Takeaways
- EgoThinker sets new state-of-the-art results across multiple benchmarks (EgoTimeQA, Ego-QA, EgoExoLearn).
- The two-stage training, supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), significantly improves fine-grained spatio-temporal localization compared to SFT alone.
- The EgoRe-5M dataset enables models to learn causal chains and intentions that are absent in standard observer-centric datasets.