Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model

📝 Paper Summary

Knowledge Internalization, Language Modeling Objectives

Language models struggle to answer questions about facts located in the middle or end of training documents due to auto-regressive training, but adding noise or shuffling sentences mitigates this issue.

Core Problem

Despite minimizing perplexity during training, language models suffer from a 'perplexity curse' and from positional bias: when probed via question answering, they fail to extract factual knowledge that appears in the middle or at the end of training documents.

Why it matters:

Standard auto-regressive training creates excessive reliance on previous tokens, meaning facts are only memorable given their specific preceding context
This prevents models from answering flexible user queries that don't match the exact training document sequence
Models may appear to 'know' a document based on low perplexity while being unable to actually recall the information when prompted

Concrete Example: If a training document lists film facts in order (1. Genre, 2. Star, 3. Producer, 4. Release Date), an AR-trained model can answer questions about the Genre (1st sentence) but fails to answer 'When was it released?' (4th sentence), even though it saw the data many times.

Key Novelty

Positional Bias in Parametric Knowledge & Denoising Auto-Regressive Mitigation

Identifies that LMs have a specific 'positional bias' in *storage*: facts appearing later in a training document are harder to retrieve via prompts than facts at the beginning
Attributes this to auto-regressive training, where models learn to predict tokens based on specific long contexts rather than the facts themselves
Proposes simple regularization, specifically Denoising Auto-Regressive training (replacing a fraction of input tokens with random tokens), to break dependencies on previous tokens and improve recall

Architecture

Visualization of four training methods: Standard Auto-regressive (AR), Denoising AR (D-AR), Sentence Shuffling, and Attention Dropout.

Evaluation Highlights

Llama-2-7B's recall accuracy drops from ~41% (1st sentence) to ~15% (6th sentence) on Wiki2023+ when trained with standard auto-regressive loss
Denoising Auto-Regressive (D-AR) training improves Llama-2-7B's recall significantly, raising 1st-position accuracy to 60.1% and maintaining >20% accuracy in later positions
Large models are not immune: Llama-2 70B trained with AR shows a steep performance drop after the first sentence, while D-AR keeps degradation to <2%

Breakthrough Assessment

7/10

Highlights a fundamental flaw in how LMs store knowledge from training data (positional bias in storage) and offers a simple, effective fix. Strong empirical evidence, though the solution (denoising) is a known technique applied to a new problem.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering (QA) where the model must answer questions based on factual knowledge internalized during continued pre-training (fine-tuning on documents)

Inputs: A question prompt q asking for a specific fact contained in a training document d

Outputs: The correct answer a derived from the model's parameters

Pipeline Flow

Document Processing (Wiki2023+ / SynthLang)
Training (AR / D-AR / Shuffle / Attention Dropout)
Evaluation (Closed-book QA)

System Modules

Training Objective

Optimize model parameters to memorize documents and learn to extract answers

Model or implementation: Llama-2, Llama-3.1, Mistral, Zephyr (diverse sizes)

Novel Architectural Elements

Application of Denoising Auto-Regressive (D-AR) objective specifically for the purpose of mitigating positional bias in parametric knowledge storage

Modeling

Base Model: Llama-2 (7B, 13B, 70B), Llama-3.1 8B, Mistral-7B, Zephyr-7B

Training Method: Continued pre-training (fine-tuning) on new documents mixed with QA pairs

Objective Functions:

Purpose: Standard next-token prediction on documents.
Formally: $-\frac{1}{|d|} \sum_{k=1}^{|d|} \log P(d_k \mid t, d_{<k})$

Purpose: Instruction tuning loss on QA pairs.
Formally: $-\frac{1}{|a|} \sum_{k=1}^{|a|} \log P(a_k \mid q, a_{<k})$

Purpose: Denoising Auto-Regressive loss (for D-AR variant).
Formally: $-\frac{1}{|d|} \sum_{k=1}^{|d|} \log P(d_k \mid t, \tilde{d}_{<k})$, where $\tilde{d}$ is the document $d$ with random token replacements
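The D-AR objective above conditions on a corrupted history while still predicting the clean next token. A minimal sketch of that data preparation step follows. It assumes each position is corrupted independently and that replacements are drawn uniformly from the vocabulary; the function names, the vocabulary size, and the token ids are hypothetical and not taken from the released code.

```python
import random


def corrupt_inputs(token_ids, vocab_size, noise_ratio=0.2, rng=None):
    """Return a copy of token_ids in which each position is replaced by a
    random token id with probability noise_ratio (assumed: independent,
    uniform replacement)."""
    rng = rng if rng is not None else random.Random()
    corrupted = list(token_ids)
    for i in range(len(corrupted)):
        if rng.random() < noise_ratio:
            corrupted[i] = rng.randrange(vocab_size)
    return corrupted


def make_dar_example(token_ids, vocab_size, noise_ratio=0.2, rng=None):
    """Build (inputs, targets) for next-token prediction with a noisy history.

    inputs come from the corrupted sequence (the d_tilde history),
    targets come from the clean sequence (the d_k to be predicted).
    """
    noisy = corrupt_inputs(token_ids, vocab_size, noise_ratio, rng)
    inputs = noisy[:-1]
    targets = list(token_ids[1:])
    return inputs, targets


# Hypothetical token ids standing in for a tokenized training document.
doc = [11, 42, 7, 99, 23, 5, 64, 18]
inputs, targets = make_dar_example(doc, vocab_size=32000, rng=random.Random(0))
```

The key design point is that only the inputs are corrupted. The loss is still computed against the original tokens, so the model cannot rely on an exact preceding context to reproduce a fact.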

Adaptation: Full fine-tuning (implied by context of continued pre-training experiments)

Trainable Parameters: All parameters (standard fine-tuning)

Training Data:

Wiki2023+: 2385 film documents + 5493 QA pairs for training
SynthLang: 2000 synthetic documents + 10000 QA pairs

Key Hyperparameters:

learning_rate: 1e-5 (linear decay)
optimizer: Adam
batch_size: 256 (mixed sampling)
iterations: 3000 (Wiki2023+), 1800 (SynthLang)
noise_ratio_D_AR: 0.2 (20% of tokens replaced)
attention_dropout_ratio: 0.5 (SynthLang), 0.2 (Wiki2023+)
max_tokens: 512
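As an illustration of the learning-rate setting listed above, here is a small sketch of a linear decay schedule. It assumes no warmup and decay to zero at the final iteration; neither detail is stated in this summary, and the function name is hypothetical.

```python
def linear_decay_lr(step, total_steps=3000, peak_lr=1e-5, final_lr=0.0):
    """Linearly interpolate from peak_lr at step 0 to final_lr at total_steps.

    Assumptions: no warmup phase, and final_lr defaults to 0.0.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    frac = step / total_steps
    return peak_lr + (final_lr - peak_lr) * frac


# Learning rate at the start, midpoint, and end of a 3000-iteration run.
schedule = [linear_decay_lr(s) for s in (0, 1500, 3000)]
```

For the SynthLang setting, the same function would be called with total_steps=1800.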

Compute: 8 A100 GPUs (40 GB or 80 GB). Training Llama-2 7B for 3000 iterations takes ~8 hours.

Comparison to Prior Work

vs. Sentence Shuffling: D-AR is more effective at preserving intra-sentence context while breaking inter-sentence dependencies
vs. Attention Dropout: D-AR explicitly forces prediction from corrupted inputs, leading to better robustness than just dropping attention weights
vs. Standard AR: D-AR mitigates the 'perplexity curse', in which models memorize the token sequence but cannot recall the facts it contains
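For reference, the Sentence Shuffling baseline compared above can be sketched as follows. This is an illustrative version that assumes the document is already split into sentences and is reshuffled each time it is sampled; the example sentences are placeholders modeled on the film example, not real data.

```python
import random


def shuffle_sentences(sentences, rng=None):
    """Return the sentences in a random order, leaving each sentence intact.

    Token order inside a sentence is preserved; only the order between
    sentences changes, so no fact is tied to a fixed document position.
    """
    rng = rng if rng is not None else random.Random()
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


# Placeholder document with one fact per sentence.
film_doc = [
    "The genre is GENRE.",
    "It stars STAR.",
    "It was produced by PRODUCER.",
    "It was released on DATE.",
]
augmented = shuffle_sentences(film_doc, rng=random.Random(0))
```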

Limitations

Evaluation is limited to continued fine-tuning; does not test training models from scratch
D-AR might cause under-fitting if applied during early pre-training (though not observed in fine-tuning)
Focuses on short documents (summaries); performance on very long contexts is less explored
Positional bias persists to some degree even with D-AR, especially in smaller models

Reproducibility

Code: https://github.com/omron-sinicx/WhereIsTheAnswer (publicly available)

The Wiki2023+ dataset is released. Pre-trained models are standard HuggingFace checkpoints.

📊 Experiments & Results

Evaluation Setup

Closed-book QA: The model is trained on the documents, then asked questions about facts from those documents without the documents in context.

Benchmarks:

Wiki2023+: real-world factual QA, film domain [New]
SynthLang: synthetic controlled QA, spoken languages [New]

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Averaged over 3 runs, with standard deviation reported for key figures (Figs. 4 and 5).
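The two metrics can be sketched as below. The normalization shown (lowercasing, removing punctuation and English articles, collapsing whitespace) follows the common SQuAD-style convention and is an assumption; this summary does not specify the exact normalization used in the paper.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace
    (assumed SQuAD-style normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when the prediction contains the gold answer plus extra tokens, which EM scores as a miss.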

Key Results

Results on Wiki2023+ (Unmodulated) showing performance degradation by answer position (1=start, 6=end) and improvement via D-AR.
Benchmark	Metric	Baseline	This Paper	Δ
Wiki2023+	EM (Position 1)	40.9	60.1	+19.2
Wiki2023+	EM (Position 6 / End)	14.9	30.4	+15.5
Wiki2023+	Average EM (Positions 1-6)	15.7	31.0	+15.3
Wiki2023+	Average EM	15.7	24.3	+8.6
Wiki2023+	EM (Position 1)	65.3	70.8	+5.5
Wiki2023+	EM (Position 6 / End)	31.7	46.2	+14.5
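To read the table in terms of positional degradation rather than absolute gains, the snippet below computes the relative drop from Position 1 to Position 6 using the first two rows. It assumes the "Baseline" column corresponds to standard AR training and the "This Paper" column to D-AR, which matches the Llama-2-7B figures quoted under Evaluation Highlights.

```python
# EM values copied from the first two table rows (Position 1 and Position 6).
# Assumed mapping: Baseline column = AR, This Paper column = D-AR.
rows = {
    "AR": {"pos1": 40.9, "pos6": 14.9},
    "D-AR": {"pos1": 60.1, "pos6": 30.4},
}


def relative_drop(first, last):
    """Percentage of first-position accuracy lost by the last position."""
    return (first - last) / first * 100.0


drops = {name: relative_drop(r["pos1"], r["pos6"]) for name, r in rows.items()}

for name, drop in drops.items():
    print(f"{name}: relative drop from position 1 to 6 = {drop:.1f}%")
```

Under these assumptions, AR loses roughly 64% of its first-position accuracy by the sixth position, while D-AR loses roughly 49%. D-AR raises accuracy at every position but does not remove the positional gap.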

Experiment Figures

Accuracy of knowledge extraction vs. position of information in training documents for Wiki2023+ and SynthLang.

Perplexity analysis of the *first* sentence when it is virtually moved to later positions in the training document.
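Perplexity here is the standard quantity: the exponential of the mean negative log-likelihood per token. A minimal sketch follows, with hypothetical per-token log-probabilities standing in for model outputs on a sentence.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values: every token of the sentence is predicted with p = 0.5.
sentence_log_probs = [math.log(0.5)] * 4
ppl = perplexity(sentence_log_probs)
```

In the analysis described above, this quantity would be computed for the same first sentence under different preceding contexts, so a rise in perplexity reflects dependence on the original context rather than a change in the sentence itself.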

Main Takeaways

All studied LMs (Llama-2, Llama-3.1, Mistral, Zephyr) suffer from positional bias in parametric memory: facts at the end of training documents are harder to recall.
This is linked to the 'perplexity curse': AR models rely on long token histories to predict (memorize) text, making them dependent on that specific context.
Denoising Auto-Regressive (D-AR) training is the most effective regularizer, consistently outperforming Sentence Shuffling and Attention Dropout.
Mistral and Zephyr models appear inherently more robust to positional bias than the Llama family, though D-AR improves them further.

📚 Prerequisite Knowledge

Prerequisites

Understanding of auto-regressive language modeling (predicting next token given history)
Familiarity with instruction tuning and continued pre-training
Basic knowledge of model regularization techniques (dropout, noise injection)

Key Terms

Parametric Knowledge: Factual information stored within the weights (parameters) of a neural network, as opposed to information provided in an external context window

Auto-Regressive (AR) Training: Training a model to predict the next token in a sequence based on all previous tokens

Perplexity Curse: The phenomenon where a model has low perplexity (predicts text well) but fails to answer questions about that text

Denoising Auto-Regressive (D-AR): A training objective where random tokens in the input are replaced with noise, but the model must still predict the correct original next token

Positional Bias (Storage): The tendency of an LM to memorize information better if it appears at the beginning of a training document compared to the middle or end

Exact Match (EM): A metric measuring if the model's generated answer is character-for-character identical to the ground truth (after normalization)

Attention Dropout: Randomly zeroing out elements of the attention matrix during training to prevent over-reliance on specific token relationships
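To make the Attention Dropout definition concrete, here is a minimal sketch applied to one row of attention weights. It uses the inverted-dropout convention common in deep learning frameworks (surviving weights are scaled by 1/(1 - ratio)); whether the paper rescales or renormalizes the remaining weights is an assumption, and the weights below are hypothetical.

```python
import random


def attention_dropout(attn_row, drop_ratio=0.2, rng=None):
    """Zero each attention weight with probability drop_ratio and scale the
    survivors by 1 / (1 - drop_ratio) (assumed inverted-dropout convention)."""
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    rng = rng if rng is not None else random.Random()
    keep_scale = 1.0 / (1.0 - drop_ratio)
    return [0.0 if rng.random() < drop_ratio else w * keep_scale for w in attn_row]


# Hypothetical attention weights of one query token over four earlier tokens.
weights = [0.1, 0.2, 0.3, 0.4]
dropped = attention_dropout(weights, drop_ratio=0.5, rng=random.Random(0))
```

Unlike D-AR, this leaves the input tokens themselves untouched and only perturbs how much each earlier token contributes to the prediction.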