AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge

📝 Paper Summary

Knowledge Conflict in LLMs, Contrastive Decoding

AdaCAD dynamically adjusts the decoding weight for every token based on the Jensen-Shannon divergence between contextual and parametric distributions to resolve knowledge conflicts.

Core Problem

Existing contrastive decoding methods use a fixed weight to balance context and parametric knowledge, but real-world data contains varying degrees of conflict (or no conflict).

Why it matters:

Fixed-weight methods (like CAD) over-correct on low-conflict examples, degrading performance on standard queries where the model is already correct
LLMs struggle to prioritize retrieved context over outdated parametric memory when conflicts arise (e.g., outdated Olympic host counts)
Current dynamic methods rely on coarse binary classification (high/low conflict) or heuristics that require additional noisy contexts

Concrete Example: If an LLM knows France hosted the Olympics 2 times (parametric) but a retrieved document says 3 times (context), CAD helps. However, if the document also says 2 times (no conflict), CAD over-adjusts the distribution, leading to a nonsensical answer, whereas AdaCAD detects low conflict and reduces adjustment.

Key Novelty

Adaptive Context-Aware Decoding (AdaCAD)

Measures the 'degree of conflict' at each decoding step by calculating the Jensen-Shannon Divergence (JSD) between the model's output distribution with context vs. without context
Uses this JSD value to dynamically scale the decoding adjustment weight (alpha) per token: high divergence implies high conflict (needs strong adjustment), low divergence implies agreement (needs weak adjustment)
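The two points above can be written compactly. This is a sketch that assumes AdaCAD keeps the usual CAD form of the adjusted distribution and replaces the fixed weight with the per-step JSD; the notation (y_{<t}, \alpha_t, \tilde{p}) is introduced here for illustration and is not taken from this summary.

```latex
\alpha_t = \mathrm{JSD}\big(p_\theta(\cdot \mid x, y_{<t}) \,\|\, p_\theta(\cdot \mid c, x, y_{<t})\big)

\tilde{p}(y_t \mid c, x, y_{<t}) \;\propto\; p_\theta(y_t \mid c, x, y_{<t})
\left( \frac{p_\theta(y_t \mid c, x, y_{<t})}{p_\theta(y_t \mid x, y_{<t})} \right)^{\alpha_t}
```

Under this reading, \alpha_t = 0 (identical distributions) leaves the contextual distribution unchanged, while larger \alpha_t pushes probability toward tokens that the context makes more likely than the model's parametric prior does.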

Architecture

Conceptual illustration of how AdaCAD handles varying knowledge conflicts compared to CAD and Greedy decoding.

Evaluation Highlights

+14.21 points average exact-match accuracy over static Context-Aware Decoding (CAD) across four LLMs and six QA datasets (39.26 → 53.47)
+10.29 points exact-match accuracy over the COIECD baseline on the high-conflict NQ-Swap dataset (63.60 → 73.89)
+6.19 AlignScore improvement in summarization factuality compared to standard decoding

Breakthrough Assessment

7/10

Simple, training-free, and effective solution to a known limitation of contrastive decoding. The dynamic JSD-based weighting is intuitive and outperforms more complex baselines like COIECD.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering and Summarization with retrieved/provided context

Inputs: Query x and relevant context c

Outputs: Generated response y

Pipeline Flow

Compute Parametric Dist P(y|x)
Compute Contextual Dist P(y|c,x)
Calculate JSD(Parametric, Contextual)
Compute Dynamic Alpha based on JSD
Generate Final Token using Adjusted Dist
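The pipeline above can be sketched for a single decoding step in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`jsd`, `adacad_step`), the toy three-token vocabulary, and the probabilities are all made up for the example, and the re-weighting assumes the CAD-style form with alpha set to the JSD.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adacad_step(p_ctx, p_par, eps=1e-12):
    """Return (alpha, adjusted distribution) for one decoding step."""
    alpha = jsd(p_par, p_ctx)  # degree of conflict at this step
    # CAD-style re-weighting: p_ctx * (p_ctx / p_par) ** alpha, then renormalize.
    scores = [pc * ((pc + eps) / (pp + eps)) ** alpha for pc, pp in zip(p_ctx, p_par)]
    total = sum(scores)
    return alpha, [s / total for s in scores]

# Toy vocabulary: ["2", "3", "other"] (hypothetical numbers)
p_par = [0.70, 0.20, 0.10]           # model alone prefers "2"
p_ctx_conflict = [0.45, 0.45, 0.10]  # context pulls toward "3"
p_ctx_agree = [0.70, 0.20, 0.10]     # context agrees with the model

alpha_hi, adj_hi = adacad_step(p_ctx_conflict, p_par)
alpha_lo, adj_lo = adacad_step(p_ctx_agree, p_par)
```

In the agreeing case the divergence is zero, so the adjusted distribution equals the contextual one; in the conflicting case the positive alpha breaks the tie in favor of the token supported by the context.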

System Modules

Parametric Distribution Computer (Distribution Computation)

Calculate next-token probabilities based only on the query (ignoring context)

Model or implementation: Base LLM (e.g., Llama-3-70B)

Contextual Distribution Computer (Distribution Computation)

Calculate next-token probabilities based on query AND context

Model or implementation: Base LLM (e.g., Llama-3-70B)

Conflict Measurer (JSD)

Quantify conflict between parametric and contextual distributions using Jensen-Shannon Divergence

Model or implementation: Mathematical Function

Decoder

Sample token from the re-weighted distribution

Model or implementation: Mathematical Function

Novel Architectural Elements

Instance-level and token-level dynamic alpha calculation using Jensen-Shannon Divergence (JSD) instead of fixed hyperparameters

Modeling

Base Model: Llama-2 (13B), Llama-3 (8B, 70B), Mistral (7B) - both base and instruct versions

Compute: Inference-only method. Requires two forward passes per token (one with context, one without). Evaluation performed on NVIDIA A6000 GPUs.
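To make the two-passes-per-token cost concrete, here is a toy greedy decoding loop that counts forward passes. `CountingModel` is a hypothetical stand-in for an LLM (it returns deterministic fake logits), `alpha_fn` is a placeholder for whatever rule sets the weight, and the adjustment uses the logit-space form of the CAD-style re-weighting. Real implementations would typically batch the two passes and reuse a KV cache, which this sketch omits.

```python
import math

class CountingModel:
    """Hypothetical stand-in for an LLM: maps a token list to next-token logits."""

    def __init__(self, vocab_size=4):
        self.vocab_size = vocab_size
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        # Deterministic toy logits that depend on the input tokens.
        seed = sum(tokens) + len(tokens)
        return [math.sin(seed + i) for i in range(self.vocab_size)]

def contrastive_decode(model, context, query, max_new_tokens, alpha_fn):
    generated = []
    for _ in range(max_new_tokens):
        logits_ctx = model(context + query + generated)  # pass 1: with context
        logits_par = model(query + generated)            # pass 2: without context
        alpha = alpha_fn(logits_par, logits_ctx)
        # Logit-space adjustment: (1 + alpha) * contextual - alpha * parametric.
        adjusted = [(1 + alpha) * lc - alpha * lp
                    for lc, lp in zip(logits_ctx, logits_par)]
        # Greedy choice of the next token id.
        generated.append(max(range(len(adjusted)), key=adjusted.__getitem__))
    return generated

model = CountingModel()
tokens = contrastive_decode(
    model, context=[1, 2, 3], query=[0, 1], max_new_tokens=5,
    alpha_fn=lambda lp, lc: 0.5,  # fixed alpha here; an adaptive rule would go in its place
)
```

After generating five tokens, `model.calls` is ten: two forward passes per generated token.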

Comparison to Prior Work

vs. CAD: AdaCAD uses dynamic, per-token weights based on JSD rather than a fixed global alpha
vs. COIECD: AdaCAD uses a continuous measure of conflict rather than binary classification/bins
vs. ConfCD: AdaCAD measures the divergence between the two distributions (JSD) rather than only the model's raw confidence
vs. DoLa (Decoding by Contrasting Layers) [not cited in paper]: DoLa contrasts layers within a single model run, while AdaCAD contrasts inputs (with vs. without context)

Limitations

Incurs 2x inference cost (requires forward pass with and without context)
Warmup parameter (lambda) required for long-form generation to handle low initial JSD
Relies on the assumption that JSD is a reliable proxy for when the context should be trusted over parametric knowledge (this holds for typical RAG settings but may not hold for adversarial contexts)
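The summary does not spell out how the warmup parameter enters the computation. One simple reading, assumed here purely for illustration, is a floor on the per-token weight so that a low initial JSD cannot switch the adjustment off entirely. The function name `warmed_up_alpha` is hypothetical; the default 0.3 is the long-form value reported under Reproducibility.

```python
def warmed_up_alpha(jsd_value, lam=0.3):
    """Floor the per-token weight at lam (an assumed form of the warmup)."""
    return max(jsd_value, lam)

early_step = warmed_up_alpha(0.05)  # low divergence: the floor applies
later_step = warmed_up_alpha(0.60)  # high divergence: JSD is used as is
```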

Reproducibility

Code: https://github.com/HanNight/AdaCAD

The method is training-free and relies only on standard next-token probability outputs from the LLM. The warmup hyperparameter (lambda) is specified: 0.3 for long-form generation.

📊 Experiments & Results

Evaluation Setup

Zero-shot QA and Summarization

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
NQ-Swap (QA with synthetic knowledge conflict)
TriviaQA (Open-domain QA)
PopQA (Long-tail QA)
HotpotQA (Multi-hop QA)
TabMWP (Table-based QA)
CNN-DM / XSum / TofuEval (Summarization)

Metrics:

Exact Match (QA)
ROUGE-L (Summarization)
BERT-P (Summarization)
AlignScore (Summarization Factuality)
Statistical methodology: Not explicitly reported in the paper
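For readers unfamiliar with the Exact Match metric listed above, a common formulation is sketched below. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention; whether the paper applies exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)

hit = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
miss = exact_match("3 times", ["2 times"])
```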

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
QA performance across varying conflict levels shows AdaCAD generalizing better than baselines.
Average across 6 QA datasets	Exact Match Accuracy	39.26 (CAD)	53.47	+14.21
Average across 6 QA datasets	Exact Match Accuracy	48.65	53.47	+4.82
Average across 6 QA datasets	Exact Match Accuracy	51.06	53.47	+2.41
Performance on High-Conflict scenarios specifically.
NQ-Swap (High Conflict)	Exact Match Accuracy	63.60 (COIECD)	73.89	+10.29
Summarization factuality results.
CNN-DM	AlignScore	90.81	94.97	+4.16
TofuEval (Marginal Topic)	AlignScore	62.58	80.06	+17.48
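The Δ column in the table above is an absolute difference in score points, not a relative percentage. A short check using only the numbers from the table (the variable names are illustrative):

```python
# (benchmark, baseline score, AdaCAD score), copied from the table above
rows = [
    ("Average across 6 QA datasets", 39.26, 53.47),
    ("Average across 6 QA datasets", 48.65, 53.47),
    ("Average across 6 QA datasets", 51.06, 53.47),
    ("NQ-Swap (High Conflict)", 63.60, 73.89),
    ("CNN-DM", 90.81, 94.97),
    ("TofuEval (Marginal Topic)", 62.58, 80.06),
]

# Absolute gain in points, matching the table's delta column.
deltas = [round(ours - base, 2) for _, base, ours in rows]

# Relative gain as a fraction of the baseline score.
relative = [(ours - base) / base for _, base, ours in rows]
```

For the first row, the 14.21-point absolute gain corresponds to a relative improvement of roughly 36% over the baseline.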

Experiment Figures

Comparison of probability adjustments by CAD vs AdaCAD on high vs low conflict examples.

Main Takeaways

AdaCAD consistently outperforms static baselines such as CAD, which degrade performance on low-conflict data by over-adjusting.
The method is particularly effective on mixed datasets where only some examples contain a conflict: it behaves like greedy decoding when no conflict exists and like CAD when a conflict exists.
In summarization, AdaCAD significantly improves factuality (AlignScore), reducing hallucinations particularly on marginal topics in TofuEval.

📚 Prerequisite Knowledge

Prerequisites

Language Modeling probability distributions
Contrastive Decoding / Context-Aware Decoding (CAD)
Information Theory (Divergence metrics)

Key Terms

CAD: Context-Aware Decoding—a method that amplifies the difference between output probabilities with and without context to favor contextual knowledge

Jensen-Shannon Divergence: A symmetric, bounded measure of dissimilarity between two probability distributions, used here to quantify knowledge conflict
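For reference, the standard definition for two distributions P and Q over a vocabulary V is given below (the symbols P, Q, M, and V are generic notation, not taken from this summary):

```latex
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q),
\qquad \mathrm{KL}(P \,\|\, M) = \sum_{v \in V} P(v) \log_2 \frac{P(v)}{M(v)}
```

With base-2 logarithms the value lies in [0, 1]: it is 0 when P = Q and 1 when P and Q have disjoint support, which is what allows it to serve directly as a weight.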

Parametric Knowledge: Information stored within the LLM's pre-trained weights

Contextual Knowledge: Information provided in the input prompt (e.g., from retrieval)

Knowledge Conflict: Situations where the information in the context contradicts the model's parametric knowledge

PMI: Pointwise Mutual Information. In CAD, the PMI between the context and the next token is the factor used to re-weight the contextual probability

AlignScore: A metric for evaluating the factual consistency of text generation against a source document

COIECD: A baseline method that bins instances into high/low conflict using entropy constraints

ConfCD: A baseline method that adjusts decoding weights based on model confidence

NQ-Swap: A dataset constructed by swapping answers in Natural Questions to create synthetic knowledge conflicts