TTFT: Time-To-First-Token—the latency required to process the input prompt and generate the first token of the response
TTIT: Time-To-Iterative-Token—the latency to generate each subsequent token after the first
KV Cache: Key-Value Cache—memory storage for intermediate attention representations in Transformers to avoid re-computation
Block-diagonal attention: An attention pattern where tokens primarily attend to their local neighborhood (e.g., within a passage) rather than globally across all passages
Curriculum learning: A training strategy where the model starts with easy tasks (reconstructing 1 chunk) and gradually moves to hard tasks (reconstructing L chunks)
CPT: Continual Pre-training—further training a base model on specific data (here, for compression alignment) before fine-tuning
SFT: Supervised Fine-Tuning—training the model on labeled task data
Perplexity: A measure of how well a probability model predicts a sample, computed as the exponential of the average negative log-likelihood per token; lower is better
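
Two of the entries above, block-diagonal attention and perplexity, can be made concrete with a short sketch. The code below is illustrative only and is not taken from the source: the function names `block_diagonal_mask` and `perplexity` are hypothetical, the mask is a strict simplification (tokens attend only within their own passage, with no causal masking and no global tokens), and perplexity is assumed to be computed from natural-log token probabilities.

```python
import math


def block_diagonal_mask(passage_lengths):
    """Build an n x n boolean mask for a sequence of concatenated passages.

    mask[i][j] is True when token i may attend to token j, which in this
    simplified sketch means both tokens belong to the same passage.
    """
    # Label every token position with the index of its passage.
    passage_of = [
        p for p, length in enumerate(passage_lengths) for _ in range(length)
    ]
    n = len(passage_of)
    return [
        [passage_of[i] == passage_of[j] for j in range(n)]
        for i in range(n)
    ]


def perplexity(token_log_probs):
    """Return exp of the mean negative log-probability per token.

    Assumes token_log_probs holds natural-log probabilities that the
    model assigned to each observed token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Two passages of lengths 2 and 3 give a 5 x 5 mask with two blocks.
    for row in block_diagonal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))

    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among 4 options, so its perplexity is about 4.
    print(perplexity([math.log(0.25)] * 4))
```

In the mask, the attended pairs form square blocks along the diagonal, one per passage, which is where the name comes from. The perplexity helper reflects the "lower is better" reading: higher probabilities on the observed tokens give a smaller mean negative log-likelihood and therefore a smaller value.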