From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

📝 Paper Summary

Mechanistic Interpretability, Retrieval-Augmented Generation (RAG)

Mechanistic probing reveals that language models exhibit a 'shortcut' behavior in RAG settings, bypassing internal parametric knowledge in favor of attending directly to context tokens.

Core Problem

While RAG is widely used to mitigate hallucinations, it is unclear mechanistically how models balance their internal parametric knowledge against external retrieved context when answering factual queries.

Why it matters:

Understanding the interplay between internal priors and external context is crucial for preventing model drift and ensuring robust reasoning
Existing knowledge editing techniques focus on updating parameters, but lack insight into how RAG context overrides these parameters dynamically during inference
Blindly trusting RAG without understanding the mechanism can lead to inconsistent predictions even with perfect retrieval

Concrete Example: For the query 'The Space Needle is located in the city of', a vanilla model relies on internal MLPs to retrieve 'Seattle'. When RAG context is added, the model might ignore its internal knowledge entirely and copy the answer from the context, but the internal mechanism behind this switch had not previously been demonstrated.

Key Novelty

Mechanistic Evidence of RAG Shortcuts

Demonstrates via Causal Tracing that the average indirect effect of subject tokens (which usually trigger fact retrieval) drops significantly when RAG context is present
Uses Attention Knockouts to show that the model's last token stops attending to the query subject and instead attends strongly to the answer token in the context
Quantifies the 'shortcut' mechanism: models effectively turn off their internal factual retrieval circuits in favor of a copy-mechanism from the context

Evaluation Highlights

In Llama-2 (7B), the Average Indirect Effect (AIE) of subject tokens on the prediction drops ~5x (from ~0.20 to ~0.0375) when RAG context is introduced
For Phi-2, Attention Contribution from the query Subject Token to the Last Token drops ~7x in the RAG setting (10.7 vs 72.6 in vanilla)
Knocking out attention from the subject token reduces prediction probability by <5% in RAG settings, compared to ~20-25% in vanilla settings, indicating that the model relies on the context rather than the query subject

Breakthrough Assessment

7/10

Provides the first mechanistic proof of 'shortcut' behavior in RAG, validating common intuitions with hard evidence from causal tracing and attention analysis.

⚙️ Technical Details

Problem Definition

Setting: Factual query answering under two conditions: Vanilla (parametric memory only) vs. RAG (context augmented)

Inputs: A factual query q = (s, r, o) where s is subject, r is relation, o is object; optionally augmented with context c containing o

Outputs: Probability distribution over the next token (target object o)

Pipeline Flow

Input Construction (Vanilla vs. RAG)
Forward Pass with Causal Tracing / Attention Analysis
Corrupted Runs (Noise added to subject/context)
Restoration/Knockout Runs (Patching or masking activations)
Effect Measurement (AIE / Attention Norms)
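The pipeline steps above can be sketched on a toy model. This is a minimal illustration, not the paper's code: the "model" is a hypothetical per-position tanh layer with a sigmoid read-out, and the names `hidden_states`, `predict`, and `causal_trace` are invented for this example. Real causal tracing samples Gaussian noise for the subject embeddings and patches activations inside a Transformer; fixed noise offsets are used here only to keep the sketch deterministic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_states(x):
    """Toy 'layer': one hidden state per token position."""
    return [math.tanh(v) for v in x]

def predict(h, weights, bias=-1.0):
    """Toy read-out: probability assigned to the correct object token."""
    return sigmoid(sum(w * v for w, v in zip(weights, h)) + bias)

def causal_trace(x_clean, noise, weights):
    """Run clean, corrupted, and restoration passes on the toy model.

    noise maps a token position to the offset added to its input
    (an assumption standing in for sampled Gaussian noise).
    """
    # Clean run.
    h_clean = hidden_states(x_clean)
    p_clean = predict(h_clean, weights)

    # Corrupted run: perturb the chosen (subject) positions.
    x_corrupt = [v + noise.get(i, 0.0) for i, v in enumerate(x_clean)]
    h_corrupt = hidden_states(x_corrupt)
    p_corrupt = predict(h_corrupt, weights)

    # Restoration runs: patch one clean hidden state at a time
    # into the corrupted run and measure how much probability returns.
    indirect_effects = []
    for i in range(len(x_clean)):
        h_patched = list(h_corrupt)
        h_patched[i] = h_clean[i]
        indirect_effects.append(predict(h_patched, weights) - p_corrupt)
    return p_clean, p_corrupt, indirect_effects

# Positions 0 and 1 play the role of subject tokens.
p_clean, p_corrupt, ie = causal_trace(
    x_clean=[1.5, 1.2, 0.3, 0.1],
    noise={0: -3.0, 1: -3.0},
    weights=[2.0, 2.0, 0.5, 0.5],
)
print(p_clean, p_corrupt, ie)
```

In this toy setup only the corrupted positions can have a non-zero indirect effect, because restoring an uncorrupted state changes nothing. In a real model, later layers mix positions, so the effect has to be traced per layer and per token.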

System Modules

Synthetic RAG Generator

Generate controlled RAG contexts ensuring the attribute token appears exactly once

Model or implementation: GPT-4
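The "exactly once" constraint on generated contexts can be checked with a small filter. The sketch below is an assumption about how such a check might look, not the authors' implementation: the helper names and the example context are hypothetical, and the match is done on whole words in the raw string rather than on tokenizer output, which a token-level analysis would likely need.

```python
import re

def count_attribute(context, attribute):
    """Count whole-word, case-sensitive occurrences of the attribute."""
    pattern = r"\b" + re.escape(attribute) + r"\b"
    return len(re.findall(pattern, context))

def is_valid_context(context, attribute):
    """Accept a generated context only if the attribute appears exactly once."""
    return count_attribute(context, attribute) == 1

def build_rag_prompt(context, query):
    """Prepend the retrieved context to the factual query."""
    return context.strip() + "\n" + query.strip()

# Hypothetical generated context for the Space Needle example.
context = "The Space Needle is an observation tower in Seattle, Washington."
query = "The Space Needle is located in the city of"

if is_valid_context(context, "Seattle"):
    print(build_rag_prompt(context, query))
```

Keeping a single occurrence of the attribute makes it unambiguous which context position the last token would have to attend to when copying the answer.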

Probing Framework

Perform causal mediation analysis and attention knockouts

Model or implementation: Custom implementation (ROME/MEMIT based)

Modeling

Base Models: Llama-2 (7B) and Phi-2 (2.7B)

Reproducibility

Code: https://github.com/hiteshw/rag-mechanistic-interpretability

📊 Experiments & Results

Evaluation Setup

Mechanistic probing of factual queries under Vanilla (no context) vs. RAG (context provided) conditions

Benchmarks:

Knowns Fact Dataset: factual knowledge completion over (Subject, Relation, Object) triples

Metrics:

Average Indirect Effect (AIE)
Attention Contribution (Norm of attention weights)
Probability Drop (under Attention Knockout)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Causal Tracing (AIE) analysis reveals a roughly 5x reduction in reliance on subject tokens when RAG context is available, indicating the model bypasses parametric memory.
Knowns Fact Dataset (subset)	Average Indirect Effect (AIE)	0.20	0.0375	-0.1625
Attention Contribution analysis shows the last token shifts attention from the query subject to the context attribute (answer).
Knowns Fact Dataset	Attention Contribution (ST to LT)	72.5961	10.6650	-61.9311
Knowns Fact Dataset	Attention Contribution (ST to LT)	9.0054	5.6094	-3.3960
Knowns Fact Dataset	Attention Contribution (RAG setting)	10.6650	20.8902	+10.2252
Attention Knockout experiments confirm that cutting the link from the subject token noticeably lowers prediction probability in Vanilla (roughly 20-25%) but barely matters in RAG (under 5%).
Knowns Fact Dataset	Probability Drop (Prediction)	0.25	0.05	-0.20
Knowns Fact Dataset	Probability Drop (Prediction)	0.20	0.05	-0.15
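The deltas and the "~5x" and "~7x" figures quoted in this summary can be rechecked from the reported values. The short script below only redoes that arithmetic; the row labels are descriptive placeholders because the table itself does not name the model for each row, and the row comparing two RAG-setting contributions is left out since it is not a Vanilla vs. RAG pair.

```python
# (label, baseline value, RAG value) copied from the rows above.
rows = [
    ("AIE of subject tokens", 0.20, 0.0375),
    ("Attention contribution ST to LT, first row", 72.5961, 10.6650),
    ("Attention contribution ST to LT, second row", 9.0054, 5.6094),
    ("Probability drop under knockout, first row", 0.25, 0.05),
    ("Probability drop under knockout, second row", 0.20, 0.05),
]

def summarize(baseline, rag):
    """Return the absolute change and the baseline-to-RAG ratio."""
    return {
        "delta": round(rag - baseline, 4),
        "ratio": round(baseline / rag, 2),
    }

for label, baseline, rag in rows:
    stats = summarize(baseline, rag)
    print(f"{label}: delta={stats['delta']}, ratio={stats['ratio']}x")
```

The ratios come out at about 5.33 for AIE and about 6.81 for the first attention-contribution row, which is where the rounded "~5x" and "~7x" descriptions come from.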

Main Takeaways

Language models exhibit a 'shortcut' mechanism in RAG: they actively decouple from parametric memory (MLPs associated with subject tokens) and switch to a context-copying mode.
The 'Last Token' residual stream stops aggregating information from the Subject Token in the query and instead aggregates from the Attribute Token (answer) in the context.
This behavior is consistent across the two model families studied (Llama-2 and Phi-2) and their sizes (7B and 2.7B), suggesting it may be a general property of how Transformers handle context vs. priors.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention, MLPs, Residual Streams)
Mechanistic Interpretability basics (Causal Tracing, Activation Patching)
RAG (Retrieval-Augmented Generation) concepts

Key Terms

Causal Tracing: A technique to identify which hidden states significantly influence a model's output by corrupting inputs and restoring specific activations to recover the correct prediction

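Average Indirect Effect (AIE): A metric quantifying the causal importance of a specific model component (like a hidden state) on the final prediction probability, measured as the probability recovered when that component's clean activation is restored in a corrupted run, averaged over prompts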
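For reference, the indirect effect is commonly written as follows in the ROME-style causal tracing setup that the probing framework is described as building on. The notation is an assumption borrowed from that line of work, not quoted from the paper: $h_i^{(l)}$ is the hidden state at token position $i$ and layer $l$, $\mathbb{P}_{*}[o]$ is the probability of the object $o$ in the corrupted run, and $q_1, \dots, q_N$ are the evaluated queries.

```latex
\mathrm{IE}_{i}^{(l)}(q) = \mathbb{P}_{*,\,\text{clean } h_i^{(l)}}[o] - \mathbb{P}_{*}[o],
\qquad
\mathrm{AIE}_{i}^{(l)} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{IE}_{i}^{(l)}(q_n)
```

Under this reading, a drop in the AIE of subject tokens means that restoring their clean hidden states recovers less of the correct-answer probability on average.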

Subject Token (ST): The token(s) in the query representing the entity about which the fact is being asked (e.g., 'Space Needle')

Attribute Token (AT): The token in the RAG context that contains the answer to the query (also referred to as the object)

Last Token (LT): The final token position in the input sequence, from which the next token prediction is generated

Attention Knockout: A method to test the importance of attention between specific token positions by masking (setting to negative infinity) the pre-softmax attention scores between those positions
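The masking step can be illustrated on a single row of attention scores. This is a minimal sketch under simplifying assumptions: it handles one query position (standing in for the last token) and one head, the function names are hypothetical, and a real experiment would apply the mask inside the model's attention layers, typically across a chosen set of layers and heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    if m == -math.inf:
        raise ValueError("all positions are masked")
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def knockout(scores, blocked_positions):
    """Set the scores toward the blocked key positions to negative infinity."""
    return [
        -math.inf if i in blocked_positions else s
        for i, s in enumerate(scores)
    ]

# Scores from the last token toward four earlier positions;
# position 0 plays the role of the subject token.
scores = [2.0, 1.0, 0.5, 0.0]
before = softmax(scores)
after = softmax(knockout(scores, {0}))
print(before)
print(after)
```

After the knockout the blocked position receives zero attention weight and the remaining weights are renormalized, so the last token can no longer read information directly from the subject token at that head.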

Parametric Memory: Knowledge stored in the model's fixed weights (parameters), typically accessed via MLPs

Residual Stream: The primary vector pathway in Transformers where information is accumulated across layers