Audio Description (AD): Narrative tracks added to video content to describe visual elements (actions, scenes, characters) for visually impaired audiences
In-Context Learning (ICL): A technique where a large language model learns to perform a task from a few examples provided in the prompt without parameter updates
Chain-of-Thought (CoT): A prompting technique where the model is encouraged to generate intermediate reasoning steps before producing the final answer
CLIP: Contrastive Language-Image Pre-training, a model that learns to map images and text to a shared embedding space, used here for visual feature extraction
ASR: Automatic Speech Recognition, technology that converts spoken audio into text (subtitles)
NER: Named Entity Recognition—a subtask of information extraction that seeks to locate and classify named entities (like person names) in text
Register-and-Recall: A memory mechanism where past information (visual features) is stored ('registered') and later retrieved ('recalled') based on similarity to current inputs
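
The Register-and-Recall entry can be illustrated with a minimal sketch. The names here (`FeatureMemory`, `register`, `recall`, `threshold`) are hypothetical, and so are the use of cosine similarity and a fixed cutoff; the source says only that recall is based on similarity to the current input. Short lists of floats stand in for real visual features such as CLIP embeddings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FeatureMemory:
    """Hypothetical register-and-recall store for labeled feature vectors."""

    def __init__(self, threshold: float = 0.8) -> None:
        # Assumed cutoff: matches scoring below this are treated as "not seen before".
        self.threshold = threshold
        self._entries: List[Tuple[str, List[float]]] = []

    def register(self, label: str, feature: Sequence[float]) -> None:
        """Store a past feature vector under a label."""
        self._entries.append((label, list(feature)))

    def recall(self, query: Sequence[float]) -> Optional[Tuple[str, float]]:
        """Return (label, score) of the most similar stored entry, or None if below threshold."""
        best: Optional[Tuple[str, float]] = None
        for label, feature in self._entries:
            score = cosine_similarity(query, feature)
            if best is None or score > best[1]:
                best = (label, score)
        if best is not None and best[1] >= self.threshold:
            return best
        return None


# Usage: register two toy "character" features, then query with new inputs.
memory = FeatureMemory(threshold=0.9)
memory.register("character_A", [1.0, 0.0, 0.0])
memory.register("character_B", [0.0, 1.0, 0.0])

match = memory.recall([0.9, 0.1, 0.0])     # close to character_A's feature
no_match = memory.recall([0.5, 0.5, 0.7])  # not close enough to any stored entry
```

The linear scan keeps the sketch easy to follow. A larger memory would more likely use batched matrix operations or an approximate nearest-neighbor index.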