Zero-shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task
API: Application Programming Interface; here, a set of defined functions (tools) that the LLM can call in its generated code
BLIP-2: A vision-language model used here for image captioning and visual question answering on specific frames
GroundingDINO: A text-conditioned object detection model that finds bounding boxes for objects described by text
ByteTrack: A multi-object tracking algorithm that associates detected objects across video frames to maintain identity over time
LaViLa: A video-language model specialized in long-form video understanding and narration (used here for summarization)
In-context learning: Providing the LLM with example inputs and outputs (e.g., example questions and their corresponding code) in the prompt to guide its generation
FastText: A library for efficient text classification and representation learning, used here to map open-ended outputs to fixed vocabularies
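
To make the "API" and "In-context learning" entries concrete, here is a minimal sketch of how a prompt might combine tool documentation with example question/code pairs. It is an illustration under assumptions, not the prompt used by the original system: the names `build_prompt`, `detect`, `caption`, and `frames` are hypothetical, and the tool signatures are placeholders rather than the real interfaces of GroundingDINO or BLIP-2.

```python
from typing import List, Tuple

# Hypothetical tool signatures, shown to the LLM as documentation only.
# They are placeholders, not the real GroundingDINO or BLIP-2 interfaces.
API_DOC = """\
def detect(frame, text): ...   # text-conditioned object detection
def caption(frame): ...        # caption a single frame
"""

# In-context examples: (question, code that answers it) pairs.
EXAMPLES: List[Tuple[str, str]] = [
    ("Is there a dog in the first frame?",
     "answer = len(detect(frames[0], 'dog')) > 0"),
]


def build_prompt(api_doc: str,
                 examples: List[Tuple[str, str]],
                 question: str) -> str:
    """Join the API documentation, the worked examples, and the new question."""
    parts = ["# Available tools", api_doc.strip(), ""]
    for example_question, example_code in examples:
        parts += [f"# Question: {example_question}", example_code.strip(), ""]
    # The prompt ends at the new question so the LLM continues with code.
    parts.append(f"# Question: {question}")
    return "\n".join(parts)


prompt = build_prompt(API_DOC, EXAMPLES, "What is happening in the last frame?")
print(prompt)
```

The prompt deliberately ends on the new question, so the model's continuation is the code that answers it, written against the documented tools.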