Quantifying reliance on external information over parametric knowledge during Retrieval Augmented Generation (RAG) using mechanistic analysis

📝 Paper Summary

Mechanistic Interpretability, RAG Behavior Analysis

Language models use a "shortcut" mechanism in RAG settings, heavily biasing attention toward retrieved context tokens while suppressing reliance on internal parametric memory for factual predictions.

Core Problem

While RAG is widely used to improve factual accuracy, the internal mechanism by which LLMs prioritize retrieved context over their pre-trained (parametric) knowledge is not well understood.

Why it matters:

Understanding how models balance internal vs. external knowledge is crucial for diagnosing hallucinations and inconsistencies in RAG systems
Previous work focused on editing knowledge (ROME, MEMIT) or system-level RAG performance, leaving a gap in mechanistic understanding of the inference process itself

Concrete Example: When a model answers 'Paris' for 'The Eiffel Tower is located in...', it typically relies on internal weights. When provided with a document saying 'The Eiffel Tower is in Las Vegas', we do not mechanistically know if the model suppresses its internal 'Paris' weights or simply overwrites the output at the last layer.

Key Novelty

Mechanistic "Shortcut" Analysis of RAG

Applies Causal Mediation Analysis to compare internal activation patterns between standard generation and RAG-based generation
Demonstrates that the presence of context causes a 'shortcut' effect: the model effectively bypasses the usual internal computation path rooted in the subject token (e.g., 'Eiffel Tower') and instead attends directly to the answer token provided in the context

Evaluation Highlights

~10x decrease in Average Indirect Effect (AIE) on the Last Subject Token for Llama-2-7B when RAG context is added, indicating reduced reliance on internal memory
~35x decrease in AIE on the Last Subject Token for Phi-2 (2.7B) in RAG settings compared to vanilla generation
Knocking out attention from the subject token reduces answer probability by ~20-25% in vanilla models, but less than 5% in RAG settings, confirming the shift in reliance

Breakthrough Assessment

7/10

Provides valuable mechanistic evidence confirming intuitions about RAG behavior (context bias). While the findings are expected, quantifying them via causal tracing and attention knockouts adds rigorous interpretability.

⚙️ Technical Details

Problem Definition

Setting: Factual question answering under two conditions: Vanilla (parametric only) and RAG (context provided)

Inputs: Prompt p containing a subject s and optionally retrieved context c

Outputs: Predicted answer token y
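
The two input conditions can be illustrated with a minimal sketch. The prompt template, the whitespace tokenization, and the helper names `build_prompt` and `last_subject_token_index` are assumptions made for illustration; the summary does not state the exact prompt format or tokenizer handling used in the paper.

```python
from typing import List, Optional


def build_prompt(query: str, context: Optional[str] = None) -> str:
    """Return the vanilla prompt, or a RAG prompt when context is given."""
    if context is None:
        return query
    # Hypothetical template: the exact format is not given in the summary.
    return f"Context: {context}\nQuestion: {query}"


def last_subject_token_index(tokens: List[str], subject_tokens: List[str]) -> int:
    """Index of the last subject token, taken from the final occurrence of the subject.

    The final occurrence is used on the assumption that the query follows the
    context, so the subject mention inside the query comes last.
    """
    n = len(subject_tokens)
    for start in range(len(tokens) - n, -1, -1):
        if tokens[start:start + n] == subject_tokens:
            return start + n - 1
    raise ValueError("subject not found in prompt")


query = "The Eiffel Tower is located in"
context = "The Eiffel Tower is in Las Vegas."
subject = "Eiffel Tower".split()

# Whitespace splitting stands in for a real tokenizer here.
vanilla_tokens = build_prompt(query).split()
rag_tokens = build_prompt(query, context).split()

print(last_subject_token_index(vanilla_tokens, subject))  # 2
print(last_subject_token_index(rag_tokens, subject))      # 11
```

With a real subword tokenizer the indices would differ, but the idea is the same: the position tracked in both conditions is the last subject token inside the query, not the subject mention inside the retrieved context.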

Pipeline Flow

Input Generation (Knowns dataset + GPT-4 synthetic context)
Model Inference (Clean Run)
Model Inference (Corrupted Run with noise)
Causal Tracing / Attention Analysis (Patching & Knockouts)

System Modules

Context Generator

Generate synthetic RAG context containing the attribute or object for the Knowns dataset

Model or implementation: GPT-4

Target Model

Perform inference on prompts to analyze internal states

Model or implementation: Llama-2 (7B) and Phi-2 (2.7B)

Causal Tracer

Compute Indirect Effect of hidden states by patching clean activations into corrupted runs

Model or implementation: Algorithm (ROME methodology)
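
The patching procedure can be sketched on a toy network. This is not the paper's code and it does not use Llama-2 or Phi-2: the model below is a small random NumPy network with causal mixing, and all sizes, the noise scale, and the names `forward` and `indirect_effects` are assumptions chosen for illustration. It only shows the three runs (clean, corrupted, restored) and how the Indirect Effect is read off them.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, L, V = 6, 8, 3, 5  # tokens, hidden size, layers, vocabulary (toy sizes)
W = rng.normal(size=(L, D, D)) / np.sqrt(D)
U = rng.normal(size=(D, V))


def forward(emb, patch=None):
    """Run the toy model; optionally overwrite one hidden state (layer, token, vector)."""
    h, states = emb, []
    for layer in range(L):
        # Causal mixing: each position sees the running mean of itself and earlier positions.
        mixed = np.cumsum(h, axis=0) / np.arange(1, T + 1)[:, None]
        h = np.tanh(mixed @ W[layer])
        if patch is not None and patch[0] == layer:
            h = h.copy()
            h[patch[1]] = patch[2]  # restore the clean state at this position
        states.append(h)
    logits = h[-1] @ U  # prediction is read from the last token
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), states


def indirect_effects(emb, subject_pos, answer, noise_std=3.0):
    """IE[layer, token] = P_restored(answer) - P_corrupted(answer)."""
    _, clean_states = forward(emb)                      # clean run
    corrupted = emb.copy()
    corrupted[subject_pos] += rng.normal(scale=noise_std, size=(len(subject_pos), D))
    p_corrupt, _ = forward(corrupted)                   # corrupted run
    ie = np.zeros((L, T))
    for layer in range(L):
        for tok in range(T):
            patch = (layer, tok, clean_states[layer][tok])
            p_restored, _ = forward(corrupted, patch)   # restored run
            ie[layer, tok] = p_restored[answer] - p_corrupt[answer]
    return ie


# AIE: average the per-prompt Indirect Effect over a set of prompts.
prompts = [rng.normal(size=(T, D)) for _ in range(20)]
answers = [int(np.argmax(forward(e)[0])) for e in prompts]  # clean prediction as target
aie = np.mean([indirect_effects(e, [1, 2], a) for e, a in zip(prompts, answers)], axis=0)
print(aie.shape)  # (3, 6)
```

In this toy setup the subject occupies positions 1 and 2, so restoring position 0 has no effect: the causal structure means its hidden state is never touched by the noise. In a real model the same grid of (layer, token) effects is what reveals whether the last subject token carries the prediction.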

Modeling

Base Model: Llama-2 (7B) and Phi-2 (2.7B)

Comparison to Prior Work

vs. ROME: This paper applies ROME's causal tracing diagnostics to RAG specifically, rather than for model editing
vs. Liu et al. (2023): Goes beyond observing inconsistent responses to mechanistically explaining *why* via internal state analysis

Limitations

Analysis restricted to relatively small models (Llama-2 7B, Phi-2 2.7B), not tested on >13B parameters
Uses synthetic GPT-4 generated context rather than a real retrieval pipeline
Focuses on short-context factual QA; does not analyze long-context or complex reasoning behaviors

Reproducibility

Data: Uses 'Knowns 1000' dataset (public) augmented with GPT-4 synthetic context.
Code: Not provided in the paper.
Models: Llama-2 and Phi-2 are open weights.

📊 Experiments & Results

Evaluation Setup

Comparison of internal model mechanics between 'Vanilla' (no context) and 'RAG' (context provided) settings on factual queries.

Benchmarks:

Knowns 1000 (Factual knowledge completion)

Metrics:

Average Indirect Effect (AIE)
Attention Contribution Norm
Probability drop after Attention Knockout
Statistical methodology: Not explicitly reported in the paper
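
The two attention-based metrics can be sketched for a single attention head. This is an illustrative NumPy example under stated assumptions, not the paper's implementation: the function name `attention`, the random weights, and the choice of position 2 as the subject token are all hypothetical. The knockout is implemented by setting the chosen score to negative infinity before the softmax, and the contribution of a source token to a target token is taken as the norm of its attention-weighted, output-projected value vector.

```python
import numpy as np


def attention(x, Wq, Wk, Wv, Wo, blocked=()):
    """Single-head causal attention; `blocked` lists (target, source) edges to cut."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -np.inf   # causal mask: no attention to future tokens
    for tgt, src in blocked:
        scores[tgt, src] = -np.inf              # knockout: cut this edge before the softmax
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = weights @ v @ Wo
    # Contribution of source j to target i: || a_ij * (v_j @ Wo) ||
    contrib = weights * np.linalg.norm(v @ Wo, axis=1)[None, :]
    return out, weights, contrib


rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

last, subject = T - 1, 2  # hypothetical positions of the last token and the subject token
out_clean, w_clean, c_clean = attention(x, Wq, Wk, Wv, Wo)
out_cut, w_cut, c_cut = attention(x, Wq, Wk, Wv, Wo, blocked=[(last, subject)])

print(w_cut[last, subject])  # 0.0
```

After the knockout, the remaining weights in the last token's row renormalize to sum to one, and rows for other target positions are unchanged. The probability-drop metric would then be obtained by running the rest of the model on the modified output and comparing the answer probability with the unmodified run; that step is omitted here because it depends on the full model. Blocking every allowed source for a target would leave an empty softmax, so the sketch assumes at least one edge per row stays open.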

Key Results

Benchmark	Metric	Vanilla	RAG	Δ (Vanilla to RAG)
Causal Tracing reveals a massive drop in reliance on the Last Subject Token (LST) when RAG context is available, indicating the model stops looking 'inside' for the fact associated with the subject.
Knowns 1000, Llama-2 7B	AIE on LST	Not explicitly reported in the paper	Not explicitly reported in the paper	~10x decrease
Knowns 1000, Phi-2 2.7B	AIE on LST	Not explicitly reported in the paper	Not explicitly reported in the paper	~35x decrease
Attention Analysis confirms the 'shortcut': the model shifts attention away from the subject token (internal lookup) toward the answer token in the context.
Knowns 1000	Mean Attention Contribution (ST to LT)	Not explicitly reported in the paper	Not explicitly reported in the paper	~1.6x decrease
Knowns 1000	Mean Attention Contribution (ST to LT)	Not explicitly reported in the paper	Not explicitly reported in the paper	~7x decrease
Knowns 1000	Prediction Probability Drop after Subject-Token Knockout	20%	<5%	-15% (approx)
Knowns 1000	Prediction Probability Drop after Subject-Token Knockout	25%	<5%	-20% (approx)

Main Takeaways

LMs exhibit a 'shortcut' effect in RAG: they bypass the computation path usually involved in recalling facts from parametric memory (via the subject token).
Reliance on the Last Subject Token (LST) drops precipitously (10x-35x) when context is present, verified by Causal Tracing.
The residual stream of the Last Token (LT) is enriched by the retrieved context tokens rather than the subject token in the question.
These behaviors hold true across both Large Language Models (Llama-2 7B) and Small Language Models (Phi-2 2.7B).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (attention mechanisms, residual streams)
Mechanistic Interpretability concepts (activations, hidden states)
Causal Tracing methodologies (corrupted runs, restoration runs)

Key Terms

RAG: Retrieval-Augmented Generation—providing external documents in the prompt to help the model answer questions

Parametric Memory: Knowledge stored within the model's pre-trained weights (e.g., facts it 'knows' from training)

Causal Tracing: A technique to identify which specific internal neural activations cause a model to output a specific prediction by corrupting and restoring states

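Average Indirect Effect (AIE): A metric quantifying how much a specific hidden state contributes to the probability of the correct answer. The Indirect Effect of a state is the gain in correct-answer probability when that state alone is restored to its clean value in a corrupted run; AIE is this gain averaged over prompts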

Subject Token (ST): A token in the query that belongs to the entity being asked about (e.g., 'Eiffel' and 'Tower' in 'Where is the Eiffel Tower?'). The Last Subject Token (LST) is the final one of these (here, 'Tower')

Last Token (LT): The final token position in the prompt sequence, from which the next token (the answer) is predicted

Residual Stream: The primary vector pathway in a Transformer where information is processed and passed between layers

Attention Knockout: A probing method where specific attention edges (connections between tokens) are zeroed out to measure their importance to the prediction

SLM: Small Language Model—typically models with fewer than ~7 billion parameters (e.g., Phi-2)