CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes policies by comparing a group of sampled outputs for the same input, removing the need for a separate critic model
SFT: Supervised Fine-Tuning—training a model on a dataset of labeled input-output pairs (usually the first step before RL)
Prefilling: The initial phase of LLM inference where the model processes the input tokens (image/video/text) to compute Key-Value caches
Decoding: The sequential generation phase of LLM inference where the model produces output tokens one by one
Token Compression: Techniques like pruning (removing) or merging (combining) visual tokens to reduce computational cost
KV Cache: Key-Value Cache, the attention keys and values stored for already-processed tokens and reused at each decoding step so they are not recomputed
Video-R1: A baseline CoT video model that uses a two-stage SFT + RL pipeline to generate long reasoning traces
AIM: A token compression method that merges similar tokens and prunes uninformative ones
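
As a rough illustration of the group comparison described in the GRPO entry, the sketch below scores each sampled output against the other outputs for the same input. It assumes the commonly described formulation in which a reward is standardized by the group's mean and standard deviation. The function name, the `eps` constant, the choice of population standard deviation, and the sample rewards are all hypothetical, and this is not the training code of any model listed here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group (assumed GRPO-style).

    rewards: scalar rewards for several outputs sampled from the SAME input.
    The group mean acts as the baseline, so no learned critic is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every output receives the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group average get a positive advantage and those below get a negative one; when all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.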