
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
Nanyang Technological University, Simon Fraser University, Shanghai AI Laboratory
arXiv.org (2025)
MM Agent RAG RL Reasoning Benchmark

📝 Paper Summary

Egocentric Video Understanding, Multimodal Agents, Long-context Video Reasoning
Ego-R1 uses an RL-trained agent with Chain-of-Tool-Thought reasoning to dynamically select retrieval and visual analysis tools for answering questions about week-long egocentric videos.
Core Problem
Current video understanding models fail on ultra-long (multi-day) egocentric videos due to limited context windows, poor scaling, and inability to handle sparse events spanning extended durations.
Why it matters:
  • Egocentric videos capture daily life over days or weeks, requiring models to link widely dispersed cues (e.g., habits, long-term goals) rather than just short-term actions.
  • Existing long-context models struggle computationally with day-long videos, while token compression or sampling risks missing key events.
  • Prior video agents rely on fixed-order tool invocations, limiting flexibility and the ability to handle the complex, evolving narratives of multi-day recordings.
Concrete Example: If asked about a person's dietary habits over a week, a standard model that samples frames might miss specific meals, and an agent with a fixed tool order might search blindly. Ego-R1 instead iteratively retrieves candidate timestamps from text logs, then zooms in with a Video-LLM to verify the content, linking a breakfast on Monday to the same pattern seen on Friday.
Key Novelty
Dynamic Chain-of-Tool-Thought (CoTT) via RL
  • Instead of processing the entire video at once, an LLM agent dynamically selects tools (Retrieval, Video-LLM, VLM) step-by-step based on previous observations.
  • Uses Hierarchical RAG (H-RAG) to search text logs of the video first, narrowing down long timelines to specific timestamps before invoking expensive visual tools.
  • Trains the orchestrating agent via Reinforcement Learning (RL) on a custom dataset (Ego-R1 Data) to learn optimal tool-use sequences rather than relying solely on supervised prompts.
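The retrieve-then-verify loop described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the log contents, the `h_rag` and `video_llm` stubs, and the hand-written tool-selection rule are all hypothetical stand-ins. In the paper the selection is made by an RL-trained LLM, and a frame-level VLM is also available; both are omitted here for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy log hierarchy: day -> hour -> [(timestamp, caption)].
LOGS: Dict[str, Dict[str, List[Tuple[str, str]]]] = {
    "DAY1": {"08": [("DAY1 08:10", "eats oatmeal for breakfast")],
             "13": [("DAY1 13:05", "orders noodles for lunch")]},
    "DAY3": {"19": [("DAY3 19:30", "cooks pasta for dinner")]},
    "DAY5": {"08": [("DAY5 08:20", "eats oatmeal for breakfast again")]},
}

def h_rag(query: str) -> List[Tuple[str, str]]:
    """Coarse-to-fine keyword search over text logs (stand-in for H-RAG)."""
    words = set(query.lower().split())
    hits: List[Tuple[str, str]] = []
    for day, hours in LOGS.items():
        # Coarse level: skip a whole day if none of its captions match.
        day_words = {w for entries in hours.values()
                     for _, caption in entries for w in caption.lower().split()}
        if not words & day_words:
            continue
        # Fine level: collect the individual matching entries.
        for entries in hours.values():
            for ts, caption in entries:
                if words & set(caption.lower().split()):
                    hits.append((ts, caption))
    return hits

def video_llm(timestamp: str) -> str:
    """Stand-in for an expensive Video-LLM call on a short clip."""
    return f"clip at {timestamp} inspected"

TOOLS: Dict[str, Callable[[str], object]] = {"h_rag": h_rag, "video_llm": video_llm}

@dataclass
class Step:
    thought: str
    tool: str
    arg: str
    observation: str

def run_agent(keyword: str, max_steps: int = 6) -> Tuple[List[Step], List[str]]:
    """Pick one tool per step based on what has been observed so far."""
    history: List[Step] = []
    pending: List[str] = []    # timestamps retrieved but not yet verified
    verified: List[str] = []
    for _ in range(max_steps):
        # Hand-written rule standing in for the learned policy.
        if not history:
            thought, tool, arg = "Search the text logs first.", "h_rag", keyword
        elif pending:
            thought, tool, arg = "Verify this hit visually.", "video_llm", pending.pop(0)
        else:
            break  # nothing left to check; the agent would answer here
        obs = TOOLS[tool](arg)
        if tool == "h_rag":
            pending = [ts for ts, _ in obs]
        else:
            verified.append(arg)
        history.append(Step(thought, tool, arg, str(obs)))
    return history, verified

if __name__ == "__main__":
    steps, verified = run_agent("oatmeal")
    for s in steps:
        print(f"{s.tool}({s.arg!r}) -> {s.observation}")
    print("verified timestamps:", verified)
```

The point of the ordering is cost: the cheap text search narrows a week of footage to a handful of timestamps, and only those are passed to the visual tool.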
Evaluation Highlights
  • Extends effective reasoning coverage from a few hours to a full week on the newly curated Ego-R1 Bench.
  • Ego-R1 Agent outperforms baselines on long-horizon reasoning tasks by dynamically selecting optimal tools.
  • Ablation studies confirm the necessity of all three tools (H-RAG, Video-LLM, VLM), showing that removing visual verification tools significantly degrades performance.
Breakthrough Assessment
8/10
Significant step in scaling video understanding to 'ultra-long' (week-long) contexts by combining RL-based agentic reasoning with hierarchical retrieval, moving beyond simple context-window extension.