Two-stage retrieval: A standard industrial practice in which a lightweight model first selects a small subset of items (e.g., top-100) from a long user history; only the selected items are then fed to a heavy ranking model.
KV Cache: Key-Value Cache. Stores the precomputed attention keys and values of the user sequence so they are computed once and reused for every candidate item during ranking, rather than recalculated per candidate.
Token Merge: A technique to reduce sequence length by grouping adjacent tokens and representing them as a single vector, often using a small local model (InnerTrans).
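FLOPs: Floating-point operations. The number of arithmetic operations a model performs, used as a measure of computational cost.
Mixed Precision: Training that runs most operations in lower-precision numerical formats (such as BF16 or FP16) while keeping numerically sensitive values in FP32, saving memory and speeding up computation without significant accuracy loss.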
Activation Recomputation: A memory-saving technique in which intermediate activations are discarded during the forward pass and recomputed during the backward pass, trading extra computation for lower memory use so that larger models fit on GPUs.
Global Tokens: Special tokens (such as a CLS token or the target item) added to the sequence that can attend to the entire sequence, acting as anchors for information aggregation.
Cross-Attention: Attention mechanism where the query comes from one source (e.g., target item) and keys/values come from another (e.g., user history).
Self-Attention: Attention mechanism where queries, keys, and values all come from the same sequence, capturing internal dependencies.
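
As a rough illustration of how the KV Cache and Cross-Attention entries fit together, the following minimal NumPy sketch projects a user history into keys and values once and reuses them for several candidate items. It is a single-head toy under stated assumptions, not the method of any particular system: the function names (`build_kv_cache`, `cross_attention`), the random weights, and the sizes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_kv_cache(history, w_k, w_v):
    # Project the user history into keys and values (hypothetical helper).
    return history @ w_k, history @ w_v

def cross_attention(candidates, k_cache, v_cache, w_q):
    # Queries come from the candidate items; keys/values come from the history.
    q = candidates @ w_q
    scores = q @ k_cache.T / np.sqrt(k_cache.shape[-1])
    return softmax(scores) @ v_cache

# Illustrative sizes and random weights (assumptions, not from the source).
rng = np.random.default_rng(0)
n_hist, n_cand, d = 1000, 5, 16
history = rng.normal(size=(n_hist, d))
candidates = rng.normal(size=(n_cand, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Cached path: keys and values are computed once and shared by all candidates.
k_cache, v_cache = build_kv_cache(history, w_k, w_v)
cached_out = cross_attention(candidates, k_cache, v_cache, w_q)

# Naive path: keys and values are recomputed for every candidate.
naive_out = np.stack([
    cross_attention(c[None, :], *build_kv_cache(history, w_k, w_v), w_q)[0]
    for c in candidates
])
```

Both paths produce the same outputs; the cached path pays the key/value projection cost once instead of once per candidate. Self-attention would correspond to drawing the queries from `history` as well.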