Gated DeltaNet: A linear attention variant that uses a gated recurrent update rule (the delta rule with Householder-like transformations) to maintain a fixed-size compressed memory state instead of a growing KV cache.
SWA: Sliding Window Attention, an attention mechanism restricted to a fixed local window of recent tokens, reducing cost from quadratic to linear in sequence length (for a fixed window size) but losing distant context.
KV cache: Key-Value cache, the stored key and value tensors of previously processed tokens in Transformer attention, reused during autoregressive decoding; grows linearly with sequence length, causing memory bottlenecks.
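RoPE: Rotary Positional Embeddings, a method for encoding position information in Transformer models by rotating query and key vectors through position-dependent angles, so that attention scores depend on relative position.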
Distillation: Training a smaller or more efficient 'student' model to mimic the outputs of a larger 'teacher' model.
SFT: Supervised Fine-Tuning—training on labeled instruction-following data.
LoRA: Low-Rank Adaptation—parameter-efficient fine-tuning method that freezes main weights and trains small rank-decomposition matrices.
Householder rotation: A linear transformation of the form I - βkkᵀ (strictly a generalized Householder reflection rather than a rotation) used in Gated DeltaNet to reorient the memory matrix, preventing low-rank collapse during updates.
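
The recurrent update named in the Gated DeltaNet and Householder entries can be illustrated with a minimal sketch. It assumes the commonly stated form of the gated delta rule, S_t = α_t · S_{t-1}(I - β_t k_t k_tᵀ) + β_t v_t k_tᵀ, with a scalar gate α_t and write strength β_t. The function names `gated_delta_step` and `read_memory` are hypothetical, and real implementations use batched, chunked tensor kernels rather than Python lists.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One assumed gated delta-rule update on a d_v x d_k memory matrix S.

    Computes alpha * S (I - beta k k^T) + beta v k^T without forming
    the d_k x d_k transition matrix explicitly.
    """
    d_v, d_k = len(S), len(k)
    # What the memory currently returns for key k: the vector S k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [
        [
            # Erase the old association along k, decay by the gate,
            # then write the new value v along k.
            alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ]
        for i in range(d_v)
    ]


def read_memory(S, q):
    """Query the memory: returns the vector S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# Illustrative values only: a 2 x 2 state and unit-norm keys.
S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
S = gated_delta_step(S, k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
latest = read_memory(S, [1.0, 0.0])  # value now stored under the first key
```

With a unit-norm key and β = 1, the second write replaces the value stored under that key rather than adding to it, and S stays 2 x 2 however many tokens are processed. That fixed size is the contrast with a KV cache, which stores one key and one value per past token.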