KV cache: Key-Value cache, a store of the attention key and value tensors already computed for earlier tokens, so they do not have to be recomputed at every generation step
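The caching idea above can be illustrated with a minimal plain-Python sketch of a single attention head with toy weights. All names here (`KVCache`, `step_with_cache`, `step_without_cache`) are hypothetical and not from the paper. With the cache, each step projects only the newest token but produces the same output as recomputing keys and values for the whole prefix.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Holds the key and value vectors of every token processed so far."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []


def step_with_cache(x: Vector, wq: Matrix, wk: Matrix, wv: Matrix, cache: KVCache) -> Vector:
    # Only the newest token is projected; earlier keys/values are read from the cache.
    cache.keys.append(matvec(wk, x))
    cache.values.append(matvec(wv, x))
    return attend(matvec(wq, x), cache.keys, cache.values)


def step_without_cache(prefix: List[Vector], wq: Matrix, wk: Matrix, wv: Matrix) -> Vector:
    # Recomputes keys and values for the entire prefix on every step.
    keys = [matvec(wk, x) for x in prefix]
    values = [matvec(wv, x) for x in prefix]
    return attend(matvec(wq, prefix[-1]), keys, values)
```

Running both functions step by step over the same token sequence gives matching outputs, but the cached version performs one key/value projection per step instead of one per prefix token.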
PCW: Parallel Context Windows, a method in which contexts are encoded independently (no attention between them) so that they can be processed in parallel, speeding up processing
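The "no attention between contexts" rule can be expressed as a boolean attention mask. This is a minimal sketch, assuming that the context windows come first and that a trailing block of task tokens (e.g., the question) may attend to every window, as in the original PCW method. The function name `pcw_attention_mask` and the token layout are illustrative.

```python
from typing import List, Optional


def pcw_attention_mask(window_lengths: List[int], num_task_tokens: int) -> List[List[bool]]:
    """Return mask where mask[i][j] is True if token i may attend to token j.

    Tokens are laid out as [window 0 | window 1 | ... | task tokens].
    """
    # Record which window each position belongs to; task tokens get None.
    owner: List[Optional[int]] = []
    for window_id, length in enumerate(window_lengths):
        owner.extend([window_id] * length)
    owner.extend([None] * num_task_tokens)

    n = len(owner)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions j <= i
            if owner[i] is None:
                # Task tokens see every window and the earlier task tokens.
                mask[i][j] = True
            else:
                # Context tokens see only earlier tokens of their own window.
                mask[i][j] = owner[i] == owner[j]
    return mask


if __name__ == "__main__":
    # Two windows of two tokens each, followed by one task token. Prints:
    # 1....
    # 11...
    # ..1..
    # ..11.
    # 11111
    for row in pcw_attention_mask([2, 2], 1):
        print("".join("1" if allowed else "." for allowed in row))
```

The block-diagonal structure is what allows each window to be encoded independently; only the final task tokens combine information across windows.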
FiD: Fusion-in-Decoder, an architecture that encodes each passage independently and decodes jointly, with the decoder attending over the concatenated encodings of all passages; originally designed for encoder-decoder models
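The FiD data flow can be sketched in plain Python. `toy_encode` is a hypothetical stand-in for a Transformer encoder that returns one vector per token, and the `question: ... context: ...` input format is an assumption. Each passage is encoded on its own, and fusion happens only when a decoder query cross-attends over the concatenated encoder outputs.

```python
import math
from typing import Callable, List

Vector = List[float]


def toy_encode(text: str) -> List[Vector]:
    # Hypothetical encoder stand-in: one 2-d vector per whitespace token.
    return [[float(len(tok)), float(sum(map(ord, tok)) % 7)] for tok in text.split()]


def fid_encode(question: str, passages: List[str],
               encode: Callable[[str], List[Vector]] = toy_encode) -> List[Vector]:
    # Each passage is paired with the question and encoded independently,
    # so no passage attends to another during encoding.
    per_passage = [encode(f"question: {question} context: {p}") for p in passages]
    # Fusion: concatenate all encoder outputs along the sequence axis.
    return [vec for states in per_passage for vec in states]


def cross_attend(query: Vector, memory: List[Vector]) -> Vector:
    # A decoder query attends over the fused memory, i.e. over all passages jointly.
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, mem)) / scale for mem in memory]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * mem[i] for w, mem in zip(weights, memory)) / total
            for i in range(len(memory[0]))]
```

Encoding cost grows linearly with the number of passages because each encoder call sees a single passage, while `cross_attend` still mixes information from every passage.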
Decoder-only: LLM architectures such as GPT or Llama that use only a decoder stack, typically with causal masking so that each token attends only to itself and earlier tokens
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
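A minimal plain-Python sketch of a LoRA-wrapped linear layer. The class name `LoRALinear`, the Gaussian standard deviation of 0.01 for `A`, and the `alpha / rank` scaling factor are assumptions for illustration. The pre-trained `W` stays frozen, only the low-rank factors `A` and `B` would be trained, and `B` starts at zero so the layer initially behaves exactly like the frozen one.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update B @ A."""

    def __init__(self, w: Matrix, rank: int, alpha: float, seed: int = 0) -> None:
        d_out, d_in = len(w), len(w[0])
        rng = random.Random(seed)
        self.w = w  # pre-trained weight: frozen, never updated during fine-tuning
        # Trainable factors: A is rank x d_in (small random init), B is d_out x rank (zeros),
        # so B @ A is zero and the layer starts out identical to the frozen one.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.w, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self) -> int:
        # rank * (d_in + d_out) parameters instead of d_in * d_out.
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])

    def merged_weight(self) -> Matrix:
        # After training, W + scale * B @ A can replace W, so inference has no extra cost.
        ba = matmul(self.b, self.a)
        return [[wij + self.scale * dij for wij, dij in zip(wr, dr)]
                for wr, dr in zip(self.w, ba)]
```

For an 8 x 8 weight and rank 2, only 32 of the 64 entries' worth of parameters are trainable; the savings grow quickly with layer width because the adapter cost is linear rather than quadratic in the dimension.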
Per Context Assessment (PCA): the auxiliary task introduced in this paper, in which the model predicts a relevance score for each retrieved document (e.g., the probability it assigns to the token 'Good')
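A minimal sketch of turning logits into the relevance score described above and ranking documents by it. It assumes the model emits a logit for the token 'Good' at the assessment position. The logits dictionaries and function names are hypothetical, and whether the softmax runs over the full vocabulary or a restricted label set is an assumption here.

```python
import math
from typing import Dict, List, Tuple


def token_probability(logits: Dict[str, float], token: str = "Good") -> float:
    """Softmax probability of `token` given a mapping from tokens to logits."""
    top = max(logits.values())
    total = sum(math.exp(v - top) for v in logits.values())
    return math.exp(logits[token] - top) / total


def rank_documents(doc_logits: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Score each retrieved document by P('Good') and sort from most to least relevant."""
    scores = [(doc, token_probability(logits)) for doc, logits in doc_logits.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_documents({"doc_a": {"Good": 2.0, "Bad": 0.0}, "doc_b": {"Good": -1.0, "Bad": 1.0}})` puts `doc_a` first with P('Good') of about 0.881.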
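BM25: a standard probabilistic information retrieval function that ranks documents by keyword matching, weighting each query term's frequency in a document by its inverse document frequency and normalizing for document length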
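A compact sketch of Okapi BM25 over pre-tokenized documents. The defaults `k1 = 1.5` and `b = 0.75` and the non-negative IDF variant ln((N - n + 0.5) / (n + 0.5) + 1) are common choices that are assumed here rather than taken from the paper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every tokenized document for a tokenized query."""
    num_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / num_docs
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized (controlled by b).
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue  # only matching keywords contribute
            n = doc_freq[term]
            idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1.0)
            # Saturating term-frequency component: repeated matches give diminishing returns.
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores
```

Because scoring depends only on exact token overlap, a document that paraphrases the query without sharing its words scores zero, which is the main limitation of lexical retrieval.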
Exact Match (EM): an evaluation metric that checks whether the generated answer is identical to the ground-truth answer, usually after light normalization such as lowercasing and removing punctuation
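A sketch of EM with SQuAD-style answer normalization (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace), scored against a list of acceptable answers. The exact normalization used in the paper is an assumption here.

```python
import re
import string
from typing import List

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truths: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gt) for gt in ground_truths))
```

Under this normalization, "The Eiffel Tower!" matches "eiffel tower", but "Paris, France" does not match "Paris", because EM gives no partial credit.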
RougeLSum: a summarization metric based on the longest common subsequence (LCS) between generated and reference summaries, computed at the summary level by splitting both into sentences and aggregating sentence-level LCS matches
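A sketch of the summary-level computation, using the union-LCS formulation of ROUGE-L. It assumes sentences are separated by newlines, tokens by whitespace, and adds a simple per-token budget so that no token is credited more often than it occurs. Library implementations may differ in tokenization, stemming, and other details, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set


def lcs_positions(ref: List[str], cand: List[str]) -> Set[int]:
    """Indices of `ref` tokens that lie on one longest common subsequence with `cand`."""
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref[i] == cand[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    hits, i, j = set(), m, n
    while i > 0 and j > 0:  # backtrack through the DP table
        if ref[i - 1] == cand[j - 1]:
            hits.add(i - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return hits


def rouge_lsum_f1(candidate: str, reference: str) -> float:
    """Summary-level LCS F1; summaries are split into sentences on newlines."""
    cand_sents = [s.lower().split() for s in candidate.split("\n") if s.strip()]
    ref_sents = [s.lower().split() for s in reference.split("\n") if s.strip()]
    cand_len = sum(map(len, cand_sents))
    ref_len = sum(map(len, ref_sents))
    if cand_len == 0 or ref_len == 0:
        return 0.0
    # Remaining token budgets, so no token is credited more often than it occurs.
    cand_budget = Counter(t for s in cand_sents for t in s)
    ref_budget = Counter(t for s in ref_sents for t in s)
    hits = 0
    for ref in ref_sents:
        # Union of LCS matches between this reference sentence and every candidate sentence.
        union: Set[int] = set()
        for cand in cand_sents:
            union |= lcs_positions(ref, cand)
        for pos in sorted(union):
            token = ref[pos]
            if cand_budget[token] > 0 and ref_budget[token] > 0:
                hits += 1
                cand_budget[token] -= 1
                ref_budget[token] -= 1
    if hits == 0:
        return 0.0
    precision, recall = hits / cand_len, hits / ref_len
    return 2 * precision * recall / (precision + recall)
```

Splitting into sentences lets a candidate get credit for content that appears in a different sentence order than the reference, which a single whole-text LCS would penalize.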