CoTT: Chain-of-Tool-Thought—a reasoning process in which an agent decomposes a problem into steps, invoking a specific tool (e.g., retrieval or vision) at each step to gather evidence
H-RAG: Hierarchical Retrieval-Augmented Generation—a system that summarizes video into text logs at multiple granularities (clips to days) to enable efficient top-down temporal search
Video-LLM: A multimodal model designed to process short video clips (seconds to minutes) and answer questions about temporal dynamics and actions
VLM: Vision-Language Model—a model that processes individual static images (frames) to extract fine-grained visual details like text or small objects
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs (here, reasoning traces) before applying reinforcement learning
RL: Reinforcement Learning—a training method where an agent learns to make decisions (here, tool selection) by maximizing a reward signal
Egocentric video: Video recorded from a first-person perspective (e.g., smart glasses), capturing the wearer's daily activities and interactions
ASR: Automatic Speech Recognition—converting spoken audio in the video into text transcripts
Ego-R1 Bench: A benchmark dataset created by the authors containing week-long egocentric videos with human-verified QA pairs
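
To make the H-RAG entry concrete, the top-down temporal search it describes can be sketched as below. This is only an illustration under stated assumptions: the `LogNode` structure, the level names, and the word-overlap `keyword_score` are hypothetical stand-ins, not the authors' implementation, and a real system would presumably use a learned retriever rather than word matching.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogNode:
    """One text summary in the hierarchy (hypothetical structure)."""
    level: str      # e.g. "week", "day", "clip"
    start: float    # seconds from the start of the recording
    end: float
    summary: str
    children: List["LogNode"] = field(default_factory=list)


def keyword_score(query: str, text: str) -> int:
    """Toy relevance score: how many query words appear in the text.

    Assumption: stands in for whatever retriever a real system uses.
    """
    words = set(text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)


def top_down_search(root: LogNode, query: str) -> LogNode:
    """Descend from the coarsest summary to a leaf, always following
    the child whose summary scores highest for the query."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: keyword_score(query, c.summary))
    return node


# Two days of made-up logs, each with two clip-level summaries.
week = LogNode("week", 0, 172800, "cooking and shopping", [
    LogNode("day", 0, 86400, "wearer cooks pasta in the kitchen", [
        LogNode("clip", 0, 30, "wearer boils water"),
        LogNode("clip", 30, 60, "wearer adds pasta to the pot"),
    ]),
    LogNode("day", 86400, 172800, "wearer shops for groceries", [
        LogNode("clip", 86400, 86430, "wearer picks up milk"),
        LogNode("clip", 86430, 86460, "wearer pays at the checkout"),
    ]),
])

hit = top_down_search(week, "groceries milk")
print(hit.level, hit.start, hit.summary)
```

The point of the hierarchy is that the search reads one summary per candidate at each level instead of scanning every clip, so the cost grows with the depth of the tree rather than the length of the video.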