CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG

📝 Paper Summary

Modularized RAG pipeline: Answer generation

CrAM reduces the influence of retrieved misinformation on RAG answers by identifying influential attention heads and scaling down their attention to tokens from low-credibility documents at inference time, without fine-tuning.

Core Problem

RAG systems often retrieve documents containing misinformation, which misleads LLMs into generating incorrect answers because standard models lack mechanisms to down-weight low-credibility sources.

Why it matters:

Maliciously generated misinformation in external corpora can significantly degrade LLM performance
Simply filtering out low-credibility documents risks losing relevant information, leading to inferior performance compared to soft adjustment
Existing solutions like Supervised Fine-Tuning (SFT) require expensive resources and curated data, limiting applicability

Concrete Example: For the question 'Who was the first person to win the Nobel Prize in Physics?', a retrieval system might fetch a misinformation document claiming it was Einstein (the correct answer is Roentgen). A standard LLM attends to this false document and answers 'Einstein', whereas CrAM suppresses attention to the false document based on its low credibility score.

Key Novelty

Credibility-aware Attention Modification (CrAM)

Identify 'influential' attention heads that contribute most to generating incorrect answers when misinformation is present, using a modified causal tracing method
Modify the attention weights of these specific heads during inference by element-wise multiplication with normalized document credibility scores
Allows the LLM to 'pay less attention' to tokens from low-credibility documents without retraining the model or discarding the documents entirely
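The attention modification described above can be sketched on a single attention row. This is a minimal illustration, not the authors' implementation: the function name `modify_attention_row` is hypothetical, the per-token scores are assumed to be already normalized to [0, 1], and the renormalization step (so the row still sums to 1) is an assumption.

```python
def modify_attention_row(attn_row, token_scores):
    """Scale one attention row by per-token credibility, then renormalize.

    attn_row: attention weights of one query position (sums to 1).
    token_scores: normalized credibility of the document each key token
        belongs to; tokens outside any retrieved document get 1.0.
    """
    if len(attn_row) != len(token_scores):
        raise ValueError("attn_row and token_scores must have equal length")
    # Element-wise multiplication with the credibility scores.
    scaled = [a * s for a, s in zip(attn_row, token_scores)]
    total = sum(scaled)
    if total == 0:
        # Assumption: if every token is suppressed, keep the original row.
        return list(attn_row)
    # Assumption: renormalize so the row remains a probability distribution.
    return [w / total for w in scaled]


# Toy row over three key tokens; the last token belongs to a document
# with credibility 0 (illustrative values, not taken from the paper).
row = [0.2, 0.3, 0.5]
print(modify_attention_row(row, [1.0, 1.0, 0.0]))
```

With a score of 0 the low-credibility token receives no attention, while intermediate scores only reduce it, which is what distinguishes this soft weighting from discarding the document.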

Architecture

The CrAM workflow: 1) Identification of influential attention heads using causal tracing on a small set, 2) Modification of attention weights for those heads during inference based on document credibility scores.

Evaluation Highlights

+31.9 points Exact Match (EM) over the Prompt Based baseline on TriviaQA using Llama2-13B in the presence of misinformation
+21.1 points EM over Naive RAG on Natural Questions using Llama2-13B when one misinformation document is present
Surpasses Supervised Fine-Tuning (SFT) methods like CAG in robustness against misinformation while remaining training-free

Breakthrough Assessment

7/10

Effective plug-and-play solution for a critical RAG problem (misinformation). Outperforming SFT methods without training is significant, though the method depends on externally supplied credibility scores.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering with Retrieval-Augmented Generation under misinformation pollution

Inputs: User query x, retrieved documents D, and document credibility scores S

Outputs: Generated answer y

Pipeline Flow

Influential Head Identification (Offline): Causal tracing on small dataset → Rank heads by contribution to errors
Inference (Online): Retrieve documents → Estimate Credibility → Modify Attention Weights of identified heads → Generate Answer
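Between "Estimate Credibility" and "Modify Attention Weights", document-level scores have to be mapped onto token positions. The sketch below shows one way this could be done. The function name `token_credibility`, the span representation, and the choice to give non-document tokens (instruction, question) a score of 1.0 are assumptions made for illustration, as is normalizing by the maximum score.

```python
def token_credibility(seq_len, doc_spans, doc_scores):
    """Expand per-document credibility scores into one score per token.

    seq_len: number of tokens in the full prompt.
    doc_spans: list of (start, end) token indices per document, end exclusive.
    doc_scores: one raw credibility score per document.
    """
    if len(doc_spans) != len(doc_scores):
        raise ValueError("need exactly one score per document span")
    top = max(doc_scores)
    if top <= 0:
        raise ValueError("at least one credibility score must be positive")
    # Tokens outside every document span keep full weight.
    scores = [1.0] * seq_len
    for (start, end), score in zip(doc_spans, doc_scores):
        for i in range(start, end):
            # Assumption: normalize so the most credible document gets 1.0.
            scores[i] = score / top
    return scores


# Toy prompt of 8 tokens with two documents (illustrative values only).
print(token_credibility(8, [(1, 3), (3, 6)], [0.8, 0.2]))
```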

System Modules

Influential Head Identifier

Select top-ranked attention heads that contribute most to generating incorrect answers from misinformation
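A rough sketch of this ranking step follows. It assumes the indirect effect of a head is measured as the average drop in the probability of the incorrect answer when that head's attention to the misinformation document is suppressed. That reading, the function names, and the toy values are assumptions for illustration, not the paper's code or results.

```python
def mean_indirect_effect(p_wrong_base, p_wrong_modified):
    """Average drop in wrong-answer probability after intervening on a head."""
    if len(p_wrong_base) != len(p_wrong_modified) or not p_wrong_base:
        raise ValueError("need two non-empty lists of equal length")
    drops = [b - m for b, m in zip(p_wrong_base, p_wrong_modified)]
    return sum(drops) / len(drops)


def select_top_heads(ie_by_head, k):
    """Return the k (layer, head) pairs with the largest indirect effect."""
    ranked = sorted(ie_by_head, key=ie_by_head.get, reverse=True)
    return ranked[:k]


# Toy indirect effects keyed by (layer, head); values are made up.
toy_ie = {(0, 0): 0.02, (0, 1): 0.31, (1, 0): -0.05, (1, 1): 0.12}
print(select_top_heads(toy_ie, k=2))
```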

Model or implementation: Same as base LLM (e.g., Llama3-8B)

Attention Modifier (Generation)

Adjust attention matrices of influential heads during the forward pass

Model or implementation: Intervention hook inside LLM attention layers

Generator (Generation)

Generate final answer using modified attention

Model or implementation: Llama2-13B, Llama3-8B, or Qwen1.5-7B

Novel Architectural Elements

Inference-time attention weight modification layer that injects external credibility scores directly into the attention mechanism of specific heads
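Written out, the modification for a selected head h might look like the following. The notation is an assumption introduced here: s_i is the credibility score of document i, d(t) is the document containing key token t (with the normalized score taken as 1 for tokens outside any document), and A^{(h)}_{k,t} is the attention weight from query position k to key token t. The renormalization in the denominator is likewise assumed, so that each row still sums to 1.

```latex
\bar{s}_i = \frac{s_i}{\max_j s_j},
\qquad
\tilde{A}^{(h)}_{k,t}
  = \frac{A^{(h)}_{k,t}\,\bar{s}_{d(t)}}
         {\sum_{t'} A^{(h)}_{k,t'}\,\bar{s}_{d(t')}}
```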

Modeling

Base Model: Evaluated on Llama2-13B, Llama3-8B, and Qwen1.5-7B

Training Method: Training-free inference-time intervention (influential heads are identified, and the number of modified heads k is chosen, on small held-out sample sets)

Key Hyperparameters:

head_selection_data_size: 100 samples (randomly selected)
validation_data_size: 100 samples (to determine k top heads)
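Choosing k on the validation samples amounts to a small search. The sketch below assumes a caller-supplied `evaluate_em(k)` callback that returns validation EM when the top-k heads are modified. The callback, the candidate values, and the toy scores are hypothetical and are not figures from the paper.

```python
def choose_num_heads(candidate_ks, evaluate_em):
    """Return (best_k, best_em); ties go to the smaller k."""
    best_k, best_em = None, float("-inf")
    for k in sorted(candidate_ks):
        em = evaluate_em(k)
        if em > best_em:  # strict '>' keeps the smaller k on ties
            best_k, best_em = k, em
    return best_k, best_em


# Made-up validation scores standing in for a real evaluation run.
toy_em = {0: 0.30, 50: 0.41, 100: 0.47, 200: 0.47}
print(choose_num_heads(toy_em.keys(), toy_em.get))
```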

Compute: Requires a one-time forward-pass analysis on a small dataset (100 samples) to identify heads; negligible overhead during inference

Comparison to Prior Work

vs. Exclusion: CrAM retains information from low-credibility docs through soft weighting, which prevents loss of useful context, whereas Exclusion discards them entirely
vs. CAG: CrAM is training-free and plug-and-play, whereas CAG requires fine-tuning resources
vs. Prompt Based: CrAM modifies internal model mechanics (attention), offering stronger control than surface-level prompting

Limitations

Relies on the availability of accurate credibility scores (either ground truth or estimated)
Performance depends on the quality of the external credibility estimator
Involves a hyperparameter search for the number of influential heads to modify
Experiments limited to single-hop QA datasets (NQ, TriviaQA)

Reproducibility

Code: https://github.com/Aatrox103/CrAM

The implementation uses open-source models (Llama, Qwen) and standard datasets (NQ, TriviaQA). Misinformation documents are generated via GPT-3.5 prompting (prompts provided in the paper's Appendix).

📊 Experiments & Results

Evaluation Setup

Open-domain QA with injected misinformation documents

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper
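For reference, EM and token-level F1 for open-domain QA are commonly computed along the following lines. This sketch uses SQuAD-style answer normalization, which is an assumption here, since the summary does not state the exact normalization or matching rule used in the paper.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))


def f1_score(prediction, gold):
    """Token-overlap F1 between a prediction and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Roentgen.", ["Roentgen"]))
print(f1_score("Wilhelm Roentgen", "Roentgen"))
```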

Key Results

Main results comparing CrAM against non-SFT baselines (Naive RAG, Prompt Based) under 'Ideal' credibility scores (1 for high-quality, 0 for misinformation).
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	Exact Match (EM)	23.10	59.90	+36.80
Natural Questions	Exact Match (EM)	12.50	33.60	+21.10
TriviaQA	Exact Match (EM)	36.80	64.40	+27.60

Results using GPT-generated credibility scores (realistic setting) comparing against Exclusion baseline.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	Exact Match (EM)	54.40	56.20	+1.80
Natural Questions	Exact Match (EM)	26.60	30.70	+4.10
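The Δ column is the absolute difference in EM points (This Paper minus Baseline), not a relative percentage. A quick check over the reported rows, with values copied from the table:

```python
# (benchmark, baseline EM, CrAM EM) copied from the rows above.
rows = [
    ("TriviaQA", 23.10, 59.90),
    ("Natural Questions", 12.50, 33.60),
    ("TriviaQA", 36.80, 64.40),
    ("TriviaQA", 54.40, 56.20),
    ("Natural Questions", 26.60, 30.70),
]

# Round to two decimals to hide floating-point noise in the subtraction.
deltas = [round(cram - baseline, 2) for _, baseline, cram in rows]
for (benchmark, baseline, cram), delta in zip(rows, deltas):
    print(f"{benchmark}: {baseline:.2f} -> {cram:.2f} (+{delta:.2f} points)")
```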

Experiment Figures

F1 scores of CrAM vs. CAG (SFT method) on NQ as the number of misinformation documents increases (1 to 3).

Density distribution of Indirect Effect (IE) values for attention heads in Llama3-8B.

Main Takeaways

CrAM consistently mitigates the impact of misinformation, restoring performance close to or exceeding clean-retrieval baselines.
Soft attention modification (CrAM) outperforms hard document filtering (Exclusion), suggesting that even low-credibility documents may contain useful context.
Identifying influential heads is crucial; modifying all heads (CrAM-all) degrades performance compared to selecting specific heads.
The method is robust across different model sizes (7B to 13B) and families (Llama, Qwen).

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanisms (query, key, value)
Retrieval-Augmented Generation (RAG) basics
Causal Tracing (mechanistic interpretability)

Key Terms

CrAM: Credibility-aware Attention Modification—the proposed method to adjust attention weights based on document credibility

Credibility score: A probability score indicating the likelihood that a document does not contain misinformation

Indirect Effect (IE): A metric from causal tracing quantifying the contribution of a specific model component (e.g., attention head) to a model's output probability

Attention head: A component in Transformer models that learns to focus on different parts of the input sequence

Causal tracing: A technique to locate which parts of a neural network are responsible for specific factual predictions by adding noise and observing output changes

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset to adapt it to a specific task

EM: Exact Match—an evaluation metric measuring if the generated answer exactly matches the ground truth

F1 score: A metric balancing precision and recall, measuring word overlap between the prediction and ground truth

Naive RAG: Standard RAG pipeline that retrieves documents and generates answers without special handling for credibility or misinformation