ECR: Embedding-Centric Reasoning—generative reasoning traces (text) produced by the model to explicitly explain or describe the input before creating an embedding
TTE: Think-Then-Embed—the proposed framework where a model generates reasoning text first, then uses that text to condition the final embedding
TTEt: TTE with Teacher Reasoner—a variant using a large frozen model (e.g., Qwen2.5-72B) to generate reasoning traces
TTEs: TTE with Student Reasoner—a variant where a smaller model is fine-tuned to generate reasoning traces itself
TTEu: TTE with Unified Reasoner—a single backbone model that performs both reasoning generation and embedding extraction
MMEB: Massive Multimodal Embedding Benchmark—a benchmark of embedding tasks, all cast as ranking problems, spanning classification, VQA, retrieval, and visual grounding
InfoNCE: Information Noise-Contrastive Estimation—a loss function used to learn embeddings by pulling positive pairs together and pushing negative pairs apart
MLLM: Multimodal Large Language Model—a large language model extended to accept images (and sometimes other modalities) alongside text as input, typically producing text as output
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers
VQA: Visual Question Answering—a task where the model must answer a natural language question about an image
GradCache: Gradient Cache—a technique to scale batch size in contrastive learning by caching the gradients of the loss with respect to the embeddings, so that the encoder's backward pass can run in small, memory-bounded chunks
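
As an illustration of the InfoNCE entry above, here is a minimal NumPy sketch of the loss with in-batch negatives. The function name `info_nce`, the cosine-similarity scoring, and the temperature of 0.05 are assumptions made for this example, not details taken from the source.

```python
import numpy as np

def info_nce(queries, targets, temperature=0.05):
    """In-batch InfoNCE: row i of `queries` is the positive pair of row i of `targets`.

    Every other row of `targets` acts as a negative for query i.
    The temperature default is an illustrative assumption.
    """
    # L2-normalize so that dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch).
    logits = (q @ t.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

With correctly aligned pairs the loss is close to zero, and it grows when the positives are mismatched, which is the "pull together, push apart" behavior the definition describes.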