Unsupervised LLM Adaptation for Question Answering

📝 Paper Summary

Knowledge internalization, Post-training optimization

Standard auto-regressive fine-tuning creates a "perplexity curse" where models memorize document tokens sequentially but fail to extract facts located in the middle or end of documents.

Core Problem

Despite minimizing perplexity on training documents, fine-tuned LLMs often fail to answer questions about facts located in the middle or end of those documents.

Why it matters:

Updating LLMs with new domains via fine-tuning is crucial, but standard methods fail to make that knowledge reliably extractable
There is a disconnect between the training objective (predict the next token given all history) and the inference need (retrieve a specific fact given a short query)
The "perplexity curse" suggests that low training loss does not guarantee effective knowledge acquisition

Concrete Example: A model trained on a biography document accurately answers questions about the first sentence (e.g., birthday) but fails to answer questions about the last sentence (e.g., hobby), even though it can perfectly reconstruct the document text.

Key Novelty

Denoising Auto-Regressive (D-AR) training for knowledge internalization

Demonstrates that the auto-regressive objective encourages spurious correlations: the model learns to predict a fact from the entire preceding context rather than from the subject the fact is about
Proposes randomly corrupting input tokens during training (Denoising Auto-Regressive) to break the rigid dependency on exact token sequences, forcing robust association between concepts
Combines token corruption with sentence shuffling and attention dropout to further force the model to learn facts independent of their specific position in the training document
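The token-corruption idea can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper name `corrupt_tokens`, the noise ratios used here, and the choice to sample replacements uniformly from the vocabulary are all assumptions, since the exact noise settings are not reported. Only the inputs the model conditions on are noised; the prediction targets stay clean.

```python
import random

def corrupt_tokens(token_ids, vocab_size, noise_ratio=0.1, seed=None):
    """Return a copy of token_ids with some positions replaced by random ids.

    Hypothetical helper: noise_ratio=0.1 is an assumed default, not a value
    reported in the paper.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    for i in range(len(corrupted)):
        if rng.random() < noise_ratio:
            # Replace this position with a uniformly sampled token id.
            corrupted[i] = rng.randrange(vocab_size)
    return corrupted

# Toy token ids, for illustration only.
clean = [11, 42, 7, 99, 3, 58, 21, 64]
noisy_inputs = corrupt_tokens(clean, vocab_size=1000, noise_ratio=0.3, seed=0)

# Next-token prediction: the model reads the noisy prefix and is trained
# to predict the clean next token.
model_inputs = noisy_inputs[:-1]
targets = clean[1:]
```

Because the prefix is no longer guaranteed to match the training document exactly, the model cannot minimize the loss purely by memorizing the exact token sequence.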

Architecture

Illustration of three regularization techniques: Denoising Auto-Regressive (D-AR), Shuffling Sentences, and Attention Dropout
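The other two regularizers named above, sentence shuffling and attention dropout, can be sketched in plain Python. Both functions are hypothetical illustrations: the naive punctuation-based sentence splitter, the function names, and the inverted-dropout rescaling (a common convention in deep learning frameworks) are assumptions rather than details confirmed by the paper.

```python
import random
import re

def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]

def shuffle_sentences(document, seed=None):
    """Return the document with its sentences in a random order."""
    sentences = split_sentences(document)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)

def attention_dropout(weights, drop_prob, seed=None):
    """Zero each attention weight with probability drop_prob.

    Surviving weights are rescaled by 1 / (1 - drop_prob) (assumed
    inverted-dropout convention).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_prob)
    return [
        [0.0 if rng.random() < drop_prob else w * scale for w in row]
        for row in weights
    ]

doc = "Alice was born in 1990. She studied physics. Her hobby is chess."
shuffled_doc = shuffle_sentences(doc, seed=3)

attn = [[1.0, 0.0], [0.4, 0.6]]  # toy causal attention weights
dropped = attention_dropout(attn, drop_prob=0.5, seed=0)
```

Shuffling changes which facts appear late in the document from one epoch to the next, while attention dropout discourages reliance on any single preceding token.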

Evaluation Highlights

Denoising Auto-Regressive (D-AR) training improves Exact Match accuracy by +39.7 points (9.9 to 49.6) over standard Auto-Regressive training on the Wiki2023+ film dataset
D-AR training enables a 13B model to outperform a standard Auto-Regressive 70B model in retrieving facts from all document positions
On the MedQuAD medical dataset, D-AR improves F1 scores by +5.7 points compared to standard fine-tuning

Breakthrough Assessment

7/10

Identifies a critical, under-explored failure mode in fine-tuning (positional bias in training data) and provides a simple, highly effective fix. The insight about the disconnect between perplexity and extractability is valuable.

⚙️ Technical Details

Problem Definition

Setting: Continual pre-training / Fine-tuning LLMs on new document corpora to acquire new factual knowledge

Inputs: A set of new documents D containing factual statements

Outputs: Answers to questions q probing facts contained within D

Pipeline Flow

Data Preparation (Document + QA Pairs)
Input Corruption (Token Replacement / Shuffling)
Model Training (Next Token Prediction on Answers & Documents)
Inference (Question Answering)

System Modules

Input Corrupter

Apply noise to input embeddings or tokens to prevent overfitting to sequence order

Model or implementation: Rule-based noise injection

Language Model

Learn to predict original tokens from corrupted context

Model or implementation: Llama-2 (7B, 13B, 70B)

Novel Architectural Elements

Application of Denoising Auto-Encoder logic (typically used in BERT-style masking) to Causal Decoder-only fine-tuning for knowledge extraction

Modeling

Base Model: Llama-2 Chat (7B, 13B, 70B)

Training Method: Continued Pre-training / Fine-tuning with Regularization

Objective Functions:

Purpose: Minimize the negative log-likelihood of the document tokens given the corrupted context.

Formally: $-\frac{1}{|d|} \sum_{k} \log P(d_k \mid \tilde{t}, \tilde{d}_{<k})$, where $\tilde{t}$ and $\tilde{d}_{<k}$ are the corrupted title and the corrupted preceding tokens.

Purpose: Standard instruction-tuning loss for QA pairs (optional mix).

Formally: $-\frac{1}{|a|} \sum_{k} \log P(a_k \mid q, a_{<k})$
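Both objectives are length-normalized negative log-likelihoods, so they can be computed with the same helper. The sketch below is illustrative only: the probabilities are invented, `mean_nll` is a hypothetical name, and the equal weighting of the two losses is an assumption, since the summary only says the QA loss is an optional mix.

```python
import math

def mean_nll(target_probs):
    """Average negative log-likelihood of the correct tokens."""
    if not target_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

# Invented probabilities the model assigns to each correct token.
doc_probs = [0.9, 0.8, 0.95, 0.7]   # P(d_k | corrupted title, corrupted prefix)
answer_probs = [0.6, 0.85]          # P(a_k | q, a_<k)

doc_loss = mean_nll(doc_probs)      # document (D-AR) term
qa_loss = mean_nll(answer_probs)    # instruction-tuning term

# Assumed equal weighting of the two terms.
total_loss = doc_loss + qa_loss
```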

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

Wiki2023+: 1785 training documents (film domain), 100 validation, 500 test
Synthetic Bio: 18000 training questions, 4500 validation/test

Key Hyperparameters:

learning_rate: 1e-5
optimizer: Adam
training_steps: 3000
batch_size: 256
noise_ratio_R: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAG: Fine-tuning internalizes knowledge into weights rather than retrieving it; D-AR makes this internalization more robust
vs. Standard Fine-tuning: D-AR adds noise/regularization to prevent overfitting to the specific document sequence
vs. Lost-in-the-Middle (Context): This paper addresses training data positional bias (internal knowledge), not prompt context bias

Limitations

Evaluation primarily uses Exact Match, which may be too strict for generative tasks
Analysis limited to relatively simple factual triplets (Subject-Attribute-Value)
Requires fine-tuning the entire model, which is computationally expensive compared to RAG
Exact noise hyperparameters for the Denoising Auto-Regressive method are missing from the text

Reproducibility

The authors state that they release both the synthetic and the real-world (Wiki2023+) datasets, but no explicit URL is provided in the text. Llama-2 models are public. The exact noise ratio R for D-AR is not specified in the main text.

📊 Experiments & Results

Evaluation Setup

Question Answering based on fine-tuned documents

Benchmarks:

Wiki2023+ (Real-world factual QA, film domain) [New]
Synthetic Bio (Controlled factual QA) [New]
MedQuAD (Medical domain QA)

Metrics:

Exact Match (EM)
F1 score
GPT-Eval (Accuracy via ChatGPT)
Statistical methodology: Not explicitly reported in the paper
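For reference, Exact Match and token-level F1 are commonly computed as in the sketch below. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the widely used SQuAD-style convention and are an assumption here; the summary does not specify the exact normalization applied in the paper.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction such as "The Matrix Reloaded" against the reference "Matrix" scores 0 on Exact Match but receives partial credit under F1, which is why the two metrics can diverge on longer answers.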

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Wiki2023+ (Film domain) showing the impact of regularization techniques compared to standard Auto-Regressive (AR) baselines.
Wiki2023+	Exact Match (EM)	9.9	49.6	+39.7
Wiki2023+	F1	18.3	59.9	+41.6
Scaling analysis showing that smaller regularized models can outperform larger unregularized models.
Wiki2023+	Exact Match (EM)	45.7	59.8	+14.1
Domain adaptation results on medical data (MedQuAD).
MedQuAD	F1	34.8	40.5	+5.7
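The Δ column reports absolute differences in score points, which the following short check recomputes from the table rows. The relative gain is computed alongside for comparison; the variable names are illustrative.

```python
# (benchmark, metric, baseline, this_paper, reported_delta) from the table above
rows = [
    ("Wiki2023+", "EM", 9.9, 49.6, 39.7),
    ("Wiki2023+", "F1", 18.3, 59.9, 41.6),
    ("Wiki2023+", "EM", 45.7, 59.8, 14.1),
    ("MedQuAD", "F1", 34.8, 40.5, 5.7),
]

summary = []
for benchmark, metric, baseline, ours, reported in rows:
    absolute_gain = ours - baseline   # difference in score points
    relative_gain = ours / baseline   # multiplicative improvement
    consistent = abs(absolute_gain - reported) < 1e-6
    summary.append((benchmark, metric, round(absolute_gain, 1), relative_gain, consistent))

# Every reported delta matches the recomputed absolute difference.
all_consistent = all(row[-1] for row in summary)
```

For the first row the absolute gain is 39.7 points, while the score itself grows roughly fivefold, so the two ways of stating the improvement should not be confused.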

Experiment Figures

Accuracy of answering questions about a specific sentence depending on its position in the training document (1st to 10th)

Perplexity of the first sentence when it is moved to different positions in the training document
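A per-position accuracy curve like the one in the first figure can be produced by grouping QA outcomes by the position of the sentence that holds the answer. The sketch below uses made-up outcomes purely to show the grouping; the record format and the function name `accuracy_by_position` are assumptions, and the numbers are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_position(records):
    """records: iterable of (sentence_position, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for position, is_correct in records:
        totals[position] += 1
        correct[position] += int(is_correct)
    # Accuracy per sentence position, ordered from first to last.
    return {pos: correct[pos] / totals[pos] for pos in sorted(totals)}

# Made-up outcomes for illustration only.
records = [
    (1, True), (1, True),
    (5, True), (5, False),
    (10, False), (10, False),
]
per_position = accuracy_by_position(records)
```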

Main Takeaways

Standard Auto-Regressive training suffers severely from positional bias; facts at the end of training documents are learned but not extractable.
Lower perplexity on training documents does not correlate with better knowledge extraction; in fact, models can minimize perplexity by relying on spurious correlations (previous tokens) rather than the fact itself.
Simple regularization techniques like Denoising Auto-Regressive (token corruption) and Attention Dropout significantly mitigate the perplexity curse.
The proposed regularization allows smaller models (13B) to outperform significantly larger models (70B) that use standard training, indicating that the training objective, not only model size, limits knowledge extraction.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Auto-Regressive (AR) language modeling objectives
Familiarity with Instruction Tuning
Basic knowledge of Perplexity as a metric
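As a quick refresher on the last prerequisite, perplexity is the exponential of the average negative log-likelihood per token. The sketch below is a generic definition, not code from the paper, and the log-probabilities are invented.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.5 to every correct token has perplexity 2.
uniform_half = [math.log(0.5)] * 4
ppl = perplexity(uniform_half)
```

A perplexity close to 1 on a training document means the model reproduces it almost perfectly, which, per the perplexity curse, still does not guarantee that its facts can be retrieved from a question.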

Key Terms

Perplexity Curse: The phenomenon where a model achieves low perplexity (good prediction) on training documents but fails to answer questions about the facts contained within them

Positional Bias: In this context, the model's inability to recall information located later in the training document, due to over-reliance on the long prefix of preceding tokens

Denoising Auto-Regressive (D-AR): A training objective where a percentage of input tokens are randomly replaced with noise, forcing the model to predict the next token without perfect reliance on the history

Attention Dropout: Regularization technique that randomly drops elements of the attention matrix, preventing the model from over-fitting to specific token dependencies

Exact Match (EM): A metric measuring whether the generated answer text matches the ground truth answer exactly after normalization

Auto-Regressive (AR): Modeling text by predicting the next token based on the sequence of previous tokens

RAG: Retrieval-Augmented Generation. Systems that answer a question by first retrieving relevant documents and then generating the answer conditioned on them

F1 score: A metric balancing precision and recall, used here to evaluate answer quality for longer responses

BioS (Synthetic Bio): Synthetic biography dataset generated for this paper, giving full control over the factual attributes and their positions in each document

Wiki2023+: Real-world dataset collected from 2023 Wikipedia articles to test domain adaptation on new knowledge