Temporal Memory: A structured store of captions for short video segments, together with their embedding features, indexed by time
Object Memory: A database tracking object occurrences, containing a feature table for visual similarity and a SQL database for querying relationships and timelines
LaViLa: A video captioning model that generates detailed textual descriptions for video clips, used here to populate Temporal Memory
Re-ID: Re-identification, the process of determining whether object instances tracked in different video frames correspond to the same unique object
CLIP: Contrastive Language-Image Pre-training, a model that aligns text and images in a shared embedding space, used here for feature matching
ByteTrack: A multi-object tracking algorithm used to associate detection boxes across frames
ViCLIP: A video-text retrieval model used to compute similarity between text queries and video segments
Video-LLaVA: A multimodal LLM used here as a tool for Visual Question Answering on short retrieved segments
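
As an illustration of the two memory structures defined above, the following is a minimal sketch, not the system's actual implementation. It assumes Temporal Memory can be modeled as a list of time-indexed caption records with embeddings, and Object Memory as a single SQL table of object occurrences. All names (SegmentEntry, top_segments, build_object_memory, segments_with, the occurrences table and its columns) are hypothetical, and the three-dimensional embeddings are toy values standing in for real model features.

```python
import math
import sqlite3
from dataclasses import dataclass


@dataclass
class SegmentEntry:
    """One Temporal Memory record (hypothetical layout)."""
    segment_id: int
    start_s: float          # segment start time in seconds
    end_s: float            # segment end time in seconds
    caption: str            # text description of the segment
    embedding: list[float]  # toy stand-in for a learned feature vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_segments(memory, query_embedding, k=1):
    """Return the k segments whose embeddings best match the query."""
    ranked = sorted(
        memory,
        key=lambda seg: cosine(seg.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


def build_object_memory(rows):
    """Create an in-memory SQL table of (object_id, category, segment_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrences ("
        "object_id INTEGER, category TEXT, segment_id INTEGER)"
    )
    conn.executemany("INSERT INTO occurrences VALUES (?, ?, ?)", rows)
    return conn


def segments_with(conn, category):
    """List the segments in which any object of the given category appears."""
    cur = conn.execute(
        "SELECT DISTINCT segment_id FROM occurrences "
        "WHERE category = ? ORDER BY segment_id",
        (category,),
    )
    return [row[0] for row in cur.fetchall()]


# Toy data: three two-second segments and two tracked objects.
temporal_memory = [
    SegmentEntry(0, 0.0, 2.0, "a person opens the fridge", [1.0, 0.0, 0.0]),
    SegmentEntry(1, 2.0, 4.0, "a person pours milk into a cup", [0.0, 1.0, 0.0]),
    SegmentEntry(2, 4.0, 6.0, "a person washes the cup", [0.0, 0.6, 0.8]),
]
object_memory = build_object_memory(
    [(1, "cup", 1), (1, "cup", 2), (2, "fridge", 0)]
)
```

In this sketch, a text query would first be embedded and passed to top_segments to retrieve candidate segments, while segments_with answers timeline questions such as "in which segments does the cup appear". The feature table used for visual similarity in Object Memory is omitted for brevity.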