RALM: Retrieval-Augmented Language Modeling, which conditions a language model on documents retrieved from an external corpus to extend its knowledge beyond the training data
KV cache: Key-Value cache, which stores the key and value vectors computed by each attention layer for previous tokens so they are not recomputed at every generation step
Retrieval Stride: The number of tokens generated between successive retriever queries (e.g., the model queries the retriever every s tokens)
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pre-trained weights and trains small rank-decomposition matrices whose product forms the weight update
Marking Token: Special learnable tokens (<MARK_L>, <MARK_R>) introduced by this paper to delimit retrieved content in the context
Perplexity: The exponentiated average negative log-likelihood that a model assigns to the tokens of a sample, measuring how well the model predicts it; lower values indicate better performance
BM25: Best Matching 25, a bag-of-words ranking function from the probabilistic relevance framework that scores documents by query-term frequency, inverse document frequency, and document length
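FLOPs: Floating-point operations, a count of arithmetic operations used as a measure of computational cost (distinct from FLOPS, floating-point operations per second, which measures hardware throughput)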
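To make the KV cache entry concrete, here is a minimal pure-Python sketch of single-head causal attention. It assumes toy 2-dimensional projection matrices (`W_Q`, `W_K`, `W_V` are made-up values, and every function name is hypothetical). It decodes the same sequence twice: once recomputing every key and value at each step, and once reusing a cache. It also counts the key/value projections each path performs. This illustrates the general technique only and is not this paper's implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]  # list of rows

# Toy projection weights (made-up values for illustration only).
W_Q: Matrix = [[0.5, -0.2], [0.1, 0.3]]
W_K: Matrix = [[0.4, 0.1], [-0.3, 0.2]]
W_V: Matrix = [[1.0, 0.0], [0.5, 0.5]]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def project(x: Vector, w: Matrix) -> Vector:
    # Linear projection: one output component per row of w.
    return [dot(row, x) for row in w]

def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Scaled dot-product attention of one query over the given key/value positions.
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

class KVCache:
    """Holds the key and value vectors of tokens that were already processed."""
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)

def decode_without_cache(xs: List[Vector]) -> Tuple[List[Vector], int]:
    outputs: List[Vector] = []
    kv_projections = 0
    for t in range(len(xs)):
        prefix = xs[: t + 1]  # causal: position t sees tokens 0..t
        keys = [project(x, W_K) for x in prefix]    # recomputed at every step
        values = [project(x, W_V) for x in prefix]
        kv_projections += 2 * len(prefix)
        outputs.append(attend(project(xs[t], W_Q), keys, values))
    return outputs, kv_projections

def decode_with_cache(xs: List[Vector]) -> Tuple[List[Vector], int]:
    cache = KVCache()
    outputs: List[Vector] = []
    kv_projections = 0
    for x in xs:
        cache.append(project(x, W_K), project(x, W_V))  # only the new token is projected
        kv_projections += 2
        outputs.append(attend(project(x, W_Q), cache.keys, cache.values))
    return outputs, kv_projections
```

Both paths produce the same outputs. For n tokens, the uncached path performs n(n+1) key/value projections, while the cached path performs 2n.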
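The retrieval stride and marking token entries can be illustrated together with a minimal generation loop. This is a hypothetical sketch. `retrieve` and `next_token` are stand-in callables rather than a real retriever or model, the retriever is queried with the last four tokens, and the marked passage is placed before the running text. The paper's exact query construction and context layout may differ.

```python
from typing import Callable, List, Tuple

MARK_L, MARK_R = "<MARK_L>", "<MARK_R>"

def build_context(passage: List[str], tokens: List[str]) -> List[str]:
    # Delimit the retrieved passage with marking tokens, then append the running text.
    return [MARK_L] + passage + [MARK_R] + tokens

def generate(prompt: List[str],
             retrieve: Callable[[List[str]], List[str]],
             next_token: Callable[[List[str]], str],
             num_tokens: int,
             stride: int) -> Tuple[List[str], List[int]]:
    tokens = list(prompt)
    passage: List[str] = []
    retrieval_steps: List[int] = []
    for step in range(num_tokens):
        if step % stride == 0:
            # Re-query the retriever once every `stride` generated tokens,
            # using the last four tokens as the query (an assumption).
            passage = retrieve(tokens[-4:])
            retrieval_steps.append(step)
        tokens.append(next_token(build_context(passage, tokens)))
    return tokens, retrieval_steps

# Stand-ins for a real retriever and language model (hypothetical).
def toy_retrieve(query: List[str]) -> List[str]:
    return ["passage", "about"] + query[-1:]

def toy_next_token(context: List[str]) -> str:
    return f"w{len(context)}"
```

With `stride=4` and 10 generated tokens, retrieval happens at steps 0, 4, and 8. A smaller stride retrieves more often, which costs more retriever calls and context rebuilds.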
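The LoRA entry corresponds to the standard low-rank update below. Only A and B are trained, and α is a scaling hyperparameter. The dimensions in the worked parameter count (d = k = 4096, r = 8) are illustrative assumptions, not values taken from this paper.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

\text{trainable parameters: } r(d + k) = 8 \cdot 8192 = 65{,}536
\quad\text{vs.}\quad dk = 4096^2 = 16{,}777{,}216 \;\; (\text{a factor of } 256 \text{ fewer})
```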
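The perplexity entry can be computed directly from the probabilities a model assigned to the observed tokens using PPL = exp(-(1/N) Σ log p(x_i | x_<i)). The sketch below assumes those per-token probabilities are already available. `perplexity` is a hypothetical helper name.

```python
import math
from typing import Sequence

def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity from the probability the model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

For example, `perplexity([0.5, 0.125])` is approximately 4.0. A model that spreads probability uniformly over V tokens has perplexity V, which is why perplexity is often read as an effective branching factor.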
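The BM25 entry can be sketched as follows. Each document D is scored against a query Q as Σ IDF(q) · f(q, D) · (k1 + 1) / (f(q, D) + k1 · (1 − b + b · |D| / avgdl)). This sketch assumes the common defaults k1 = 1.2 and b = 0.75, uses the non-negative IDF variant ln((N − n(q) + 0.5) / (n(q) + 0.5) + 1), and relies on pre-tokenized toy documents. The retriever configuration actually used in the paper is not stated here.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores: List[float] = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Rare terms receive a larger IDF than common ones, so a match on a distinctive query term outweighs a match on a frequent word such as "the".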