Perplexity Curse: The phenomenon where a model achieves low perplexity (good prediction) on training documents but fails to answer questions about the facts contained within them
Positional Bias: In this context, the model's inability to recall information located later in the training document, due to over-reliance on the long prefix of preceding tokens
Denoising Auto-Regressive (D-AR): A training objective in which a fraction of input tokens is randomly replaced with noise, so the model must predict the next token without relying on a perfectly intact history
Attention Dropout: A regularization technique that randomly drops elements of the attention matrix, preventing the model from overfitting to specific token dependencies
Exact Match (EM): A metric measuring whether the generated answer text matches the ground truth answer exactly after normalization
Auto-Regressive (AR): Modeling text by predicting the next token based on the sequence of previous tokens
RAG: Retrieval-Augmented Generation; systems that answer a question by first retrieving relevant documents and then generating the answer conditioned on them
F1 score: A metric balancing precision and recall, used here to evaluate answer quality for longer responses
BioS: Synthetic biography dataset generated for this paper so that factual attributes and their positions in each document can be controlled precisely
Wiki2023+: Real-world dataset collected from 2023 Wikipedia articles to test domain adaptation on new knowledge
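
To make the Exact Match and F1 entries above concrete, here is a minimal Python sketch of how the two metrics are commonly computed for short-answer QA. The glossary only says answers are compared "after normalization"; the specific normalization used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is an assumption borrowed from common QA evaluation practice, and the function names are hypothetical rather than taken from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

For a verbose answer such as "born in Paris, France" against the ground truth "Paris", EM is 0 while F1 gives partial credit, which is why F1 is the more informative of the two for longer responses.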