CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models

📝 Paper Summary

Modularized RAG pipeline
Hallucination suppression

CoCoA dynamically adjusts reliance on external context during decoding by measuring the entropy gap, Rényi divergence, and contextual peakedness to resolve conflicts between model priors and retrieved evidence.

Core Problem

Standard RAG decoding often fails when retrieved context conflicts with the model's internal memory (model stubbornness), while existing contrastive methods use static weights or uniform divergence metrics that over-correct in low-conflict scenarios.

Why it matters:

Language models frequently prioritize outdated internal knowledge over up-to-date retrieved context, leading to hallucinations in RAG systems
Current adaptive methods like AdaCAD saturate on peaked distributions and fail to distinguish meaningful context signals from noise, degrading performance when context and memory actually agree

Concrete Example: In a QA task, if the model 'knows' the answer is A but the context says B, standard decoding might output A. AdaCAD might force B even if the context is noisy. CoCoA detects the specific 'peakedness' of B in the context distribution to trust B only when the signal is strong and the conflict is meaningful.

Key Novelty

Confidence- and Context-Aware Adaptive Decoding (CoCoA)

Uses Rényi divergence instead of Jensen-Shannon Divergence to detect 'tail-heavy' shifts, making the model sensitive to subtle conflicts where the context boosts a low-probability token
Introduces 'contextual peakedness' (the probability margin between the top-1 and top-2 tokens of the context-aware distribution), combined with the entropy gap, to measure how certain the context is, so the model yields to context only when the context is confident
Employs a dynamic gating mechanism that blends prior and context distributions based on a conflict score derived from divergence and uncertainty measures

Architecture

Conceptual framework of CoCoA comparing standard decoding, CAD, and CoCoA. It illustrates how CoCoA blends distributions using conflict and confidence measures.

Evaluation Highlights

Improves average accuracy by up to 9.2 points over the strong AdaCAD baseline across QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, TabMWP)
Outperforms GPT-4o-mini by +4.43 ROUGE-L and +2.10 FaithScore on CLAPNQ when applied to Llama-3.1-70B
Attains 86.32 AlignScore on TofuEval summarization, surpassing greedy decoding by 9.66 points and AdaCAD by 1.25 points

Breakthrough Assessment

8/10

Strong, consistent improvements over state-of-the-art adaptive decoding methods (AdaCAD) across diverse tasks. The use of Rényi divergence and entropy gap offers a theoretically grounded improvement for handling subtle distribution shifts.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive generation where the model has access to both a prior distribution p(y|x) and a context-aware distribution p(y|c, x)

Inputs: Query x and retrieved context c

Outputs: Generated sequence y = (y_1, ..., y_T)
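
Both distributions factorize token by token in the usual autoregressive way. Writing y_{<t} for the tokens generated so far (notation assumed here for illustration, not taken from the paper), the setting can be written as:

```latex
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x),
\qquad
p(y \mid c, x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, c, x)
```

CoCoA operates on the per-step factors: at each step t it compares the prior factor with the context-aware factor before choosing y_t.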

Pipeline Flow

Compute Prior & Context Distributions
Calculate Conflict Signals (Rényi & Entropy)
Determine Adaptive Weight
Token Selection

System Modules

Distribution Computation

Generate next-token logits for both p(y|x) (prior) and p(y|c, x) (context)

Model or implementation: Llama-2/3 or Mistral (frozen)
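
As a minimal sketch of this module, the snippet below turns two sets of next-token logits into probability distributions. The logits are made-up numbers over a toy 4-token vocabulary; in a real system they would come from two forward passes of the same frozen model, one without and one with the retrieved context.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits over a 4-token vocabulary (illustrative values).
prior_logits = [2.0, 0.5, 0.1, -1.0]    # stands in for p(y | x)
context_logits = [0.3, 2.5, 0.0, -1.0]  # stands in for p(y | c, x)

p_prior = softmax(prior_logits)
p_context = softmax(context_logits)

# The two passes disagree on the most likely token: a knowledge conflict.
prior_top = max(range(len(p_prior)), key=p_prior.__getitem__)
context_top = max(range(len(p_context)), key=p_context.__getitem__)
```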

Conflict Detector (Gating Mechanism)

Measure disagreement using Rényi divergence (tail sensitivity) and Entropy Gap (uncertainty change)

Model or implementation: Mathematical function
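
The two conflict signals can be sketched with the standard textbook definitions. Two choices below are assumptions rather than details confirmed by this summary: the divergence is taken from the context distribution to the prior, and the entropy gap is taken as prior entropy minus context entropy. The small `eps` constant is only there for numerical safety.

```python
import math


def renyi_divergence(p, q, alpha=0.5, eps=1e-8):
    """Renyi divergence D_alpha(p || q) = log(sum p^alpha * q^(1-alpha)) / (alpha - 1)."""
    if alpha <= 0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    s = sum((pi + eps) ** alpha * (qi + eps) ** (1.0 - alpha) for pi, qi in zip(p, q))
    # Clamp tiny negative values introduced by the eps smoothing.
    return max(0.0, math.log(s) / (alpha - 1.0))


def entropy(p, eps=1e-8):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def entropy_gap(p_prior, p_context):
    """Assumed sign convention: positive when the context makes the model more certain."""
    return entropy(p_prior) - entropy(p_context)


# Toy distributions (illustrative values only).
p_prior = [0.7, 0.2, 0.1]
p_context = [0.1, 0.8, 0.1]

conflict = renyi_divergence(p_context, p_prior, alpha=0.5)
gap = entropy_gap(p_prior, p_context)
```

At alpha = 0.5 the Rényi divergence is symmetric in its two arguments, so the direction assumption does not affect the value for that particular setting.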

Confidence Estimator (Gating Mechanism)

Calculate contextual peakedness (margin between top-2 tokens) to assess context certainty

Model or implementation: Mathematical function
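
Contextual peakedness, as described above, is simply the gap between the two largest probabilities of the context-aware distribution. A minimal sketch (the function name is chosen here for illustration):

```python
def contextual_peakedness(p_context):
    """Probability margin between the top-1 and top-2 tokens of a distribution."""
    if len(p_context) < 2:
        raise ValueError("need at least two tokens to compute a margin")
    top1, top2 = sorted(p_context, reverse=True)[:2]
    return top1 - top2


# A confident context puts most mass on one token; an unsure one does not.
confident = contextual_peakedness([0.8, 0.1, 0.1])
unsure = contextual_peakedness([0.25, 0.25, 0.25, 0.25])
```

The margin lies in [0, 1]: it is 0 when the top two tokens tie and approaches 1 when a single token dominates.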

Adaptive Blender

Compute the blending weight lambda and interpolate the two sets of logits to select the next token (greedy decoding in the reported experiments)

Model or implementation: Power interpolation formula
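
One common reading of "power interpolation" is a geometric blend, p_blend ∝ p_prior^(1 - lambda) * p_context^lambda, which is equivalent to linearly interpolating log-probabilities. The sketch below assumes that reading; the exact formula used by CoCoA is not reproduced in this summary and may differ.

```python
import math


def power_interpolate(p_prior, p_context, lam, eps=1e-8):
    """Geometric blend of two distributions, renormalized to sum to 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    log_mix = [
        (1.0 - lam) * math.log(a + eps) + lam * math.log(b + eps)
        for a, b in zip(p_prior, p_context)
    ]
    m = max(log_mix)  # subtract the max before exponentiating for stability
    weights = [math.exp(v - m) for v in log_mix]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_token(p):
    """Index of the most probable token."""
    return max(range(len(p)), key=p.__getitem__)


# Toy conflict: the prior prefers token 0 ("A"), the context prefers token 1 ("B").
p_prior = [0.7, 0.2, 0.1]
p_context = [0.1, 0.8, 0.1]

keeps_prior = greedy_token(power_interpolate(p_prior, p_context, lam=0.0))
follows_context = greedy_token(power_interpolate(p_prior, p_context, lam=1.0))
```

With lam = 0 the blend reduces to the prior and with lam = 1 to the context distribution, so the gate alone decides which source wins.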

Novel Architectural Elements

Use of Rényi divergence (alpha < 1) specifically to detect low-probability (tail) conflicts
Integration of 'contextual peakedness' margin directly into the gating function to trust confident contexts
Hybrid conflict score combining divergence and entropy gap
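
To make the idea of a hybrid gate concrete, here is one hypothetical way to combine the three signals into a weight in [0, 1). This is not the paper's gating function, which this summary does not spell out; the functional form, the name `adaptive_weight`, and the way `z` and `gamma` enter are all assumptions made for illustration.

```python
import math


def adaptive_weight(divergence, entropy_gap, peakedness, z=5.0, gamma=1.0):
    """Hypothetical gate: context weight that is high only when conflict AND confidence are high."""
    if divergence < 0.0 or not 0.0 <= peakedness <= 1.0:
        raise ValueError("divergence must be >= 0 and peakedness must lie in [0, 1]")
    # Hybrid conflict score: divergence plus a weighted entropy-gap magnitude (assumed form).
    conflict = divergence + gamma * abs(entropy_gap)
    conflict_gate = 1.0 - math.exp(-conflict)          # 0 when there is no conflict
    confidence_gate = 1.0 - math.exp(-z * peakedness)  # 0 when the context is unsure
    return conflict_gate * confidence_gate


# Same conflict level, different context confidence (illustrative inputs).
lam_confident = adaptive_weight(divergence=1.0, entropy_gap=0.2, peakedness=0.7)
lam_unsure = adaptive_weight(divergence=1.0, entropy_gap=0.2, peakedness=0.1)
```

The product form mirrors the stated intent: the model yields to the context only when the disagreement is meaningful and the context itself is confident.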

Modeling

Base Model: Llama-2 (13B), Llama-3 (8B, 70B), Mistral (7B)

Compute: Inference-only method; no training reported

Comparison to Prior Work

vs. AdaCAD: Uses Rényi divergence instead of JSD to capture tail shifts; incorporates contextual peakedness to avoid saturating on heavy-tailed distributions
vs. CAD: Dynamic token-level weighting instead of a fixed global contrast weight (CAD's alpha, which is unrelated to the Rényi order alpha used here)
vs. DoLa [not cited in paper]: CoCoA focuses on context vs. prior conflict, whereas DoLa contrasts layers within the same model to reduce hallucination
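
For contrast with the baselines above, the sketch below shows contrastive decoding with a fixed weight, in the form CAD is commonly stated: softmax((1 + alpha) * context_logits - alpha * prior_logits). The adaptive variant at the end assumes, following the AdaCAD description in this summary, that the Jensen-Shannon divergence between the two distributions is used directly as the weight. Logit values are made up.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cad_distribution(prior_logits, context_logits, alpha):
    """Contrastive adjustment with a single weight alpha applied to every token."""
    adjusted = [(1.0 + alpha) * c - alpha * p for p, c in zip(prior_logits, context_logits)]
    return softmax(adjusted)


def jsd(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


prior_logits = [2.0, 0.5, 0.1]
context_logits = [0.3, 2.5, 0.0]

# Fixed weight: the same alpha at every step, whether or not there is a conflict.
p_fixed = cad_distribution(prior_logits, context_logits, alpha=0.5)

# Adaptive weight (assumed form): alpha set per step from the measured disagreement.
alpha_t = jsd(softmax(prior_logits), softmax(context_logits))
p_adaptive = cad_distribution(prior_logits, context_logits, alpha=alpha_t)
```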

Limitations

Relies on the quality of retrieved context; if context is confidently wrong (high peakedness), the model may be misled
Adds inference cost: each decoding step needs two forward passes (prior and context) plus the divergence metrics
Gains on very large models (Llama-3-70B) persist but are sometimes smaller than on smaller models

Reproducibility

Code: https://github.com/infusion-zero-edit/CoCoA/

Hyperparameters are stated explicitly: alpha=0.5 (Rényi order), z=5.0 (peakedness weight), gamma=1.0 (entropy weight), delta=1e-8. Experiments use the gold contexts provided by the datasets.
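
For convenience, the reported values can be collected in a small configuration object. The class and field names below are hypothetical and are not taken from the released repository; the role of delta as a numerical-stability constant is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoCoAConfig:
    """Hypothetical container for the hyperparameters reported in the summary."""
    renyi_alpha: float = 0.5        # order of the Renyi divergence
    peakedness_weight: float = 5.0  # z
    entropy_weight: float = 1.0     # gamma
    delta: float = 1e-8             # assumed to be a numerical-stability constant


config = CoCoAConfig()
```

Freezing the dataclass keeps the settings immutable, which makes it harder to change a value by accident between runs.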

📊 Experiments & Results

Evaluation Setup

Zero-shot greedy decoding on QA, Summarization, and LFQA tasks using gold contexts

Benchmarks:

Natural Questions (NQ), TriviaQA, PopQA, HotpotQA, NQ-SWAP (Question Answering)
TabMWP (Table-based QA)
CNN-DM, XSum, TofuEval (Summarization)
CLAPNQ, ExpertQA, HAGRID, ELI5-WebGPT, QuoteSum (Long-Form QA)

Metrics:

Exact Match (QA)
ROUGE-L
BERT-P
AlignScore (Factuality)
FaithScore (MiniCheck consistency)

Statistical methodology: Not explicitly reported in the paper
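
Of the metrics listed above, Exact Match is the simplest to illustrate. The sketch below uses a SQuAD-style normalization (lowercasing, stripping punctuation and English articles); whether the paper normalizes answers in exactly this way is an assumption.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


hit = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
miss = exact_match("Paris", ["London", "Berlin"])
```

A benchmark score is then the percentage of questions for which `exact_match` returns True.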

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
QA Performance: CoCoA consistently outperforms baselines on Llama-3-8B.
PopQA	Exact Match	46.10	58.52	+12.42
TriviaQA	Exact Match	78.43	86.33	+7.90
Summarization Factuality: CoCoA improves factual alignment in summarization tasks (the TofuEval baseline shown is AdaCAD).
TofuEval (Main Topics)	AlignScore	85.07	86.32	+1.25
XSum	AlignScore	85.81	87.94	+2.13
Long-Form QA: CoCoA on Llama-3.1-70B beats the closed GPT-4o-mini baseline on CLAPNQ.
CLAPNQ	ROUGE-L	37.72	42.15	+4.43
CLAPNQ	FaithScore	90.35	92.45	+2.10
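
As a quick arithmetic check, the Δ column can be recomputed from the Baseline and This Paper columns. The rows below are copied from the table above; the variable names are illustrative.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table.
rows = [
    ("PopQA", "Exact Match", 46.10, 58.52, 12.42),
    ("TriviaQA", "Exact Match", 78.43, 86.33, 7.90),
    ("TofuEval (Main Topics)", "AlignScore", 85.07, 86.32, 1.25),
    ("XSum", "AlignScore", 85.81, 87.94, 2.13),
    ("CLAPNQ", "ROUGE-L", 37.72, 42.15, 4.43),
    ("CLAPNQ", "FaithScore", 90.35, 92.45, 2.10),
]

# A row is flagged when the recomputed difference disagrees with the reported one.
mismatches = [
    (name, metric)
    for name, metric, baseline, score, reported in rows
    if abs((score - baseline) - reported) > 1e-6
]

largest_gain = max(rows, key=lambda r: r[3] - r[2])[0]
```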

Main Takeaways

Consistent gains in low-conflict settings: Unlike CAD, which degrades performance when context and memory agree, CoCoA maintains or improves accuracy (e.g., NQ, TriviaQA).
Superiority in high-conflict settings: On NQ-SWAP (synthetic conflict), CoCoA significantly outperforms AdaCAD, showing better sensitivity to contradictions.
Generalization to Long-Form: CoCoA effectively reduces hallucinations in LFQA and summarization, evidenced by higher AlignScore and FaithScore.

📚 Prerequisite Knowledge

Prerequisites

Autoregressive decoding
KL Divergence and Jensen-Shannon Divergence
Entropy and information theory concepts
Retrieval-Augmented Generation (RAG)

Key Terms

Rényi divergence: A generalization of KL divergence with a tunable order alpha (KL is recovered in the limit alpha → 1); used here with alpha < 1 to emphasize differences in the tails of probability distributions

entropy gap: The difference in uncertainty (entropy) between the model's prior distribution and the context-aware distribution

contextual peakedness: A measure of confidence defined as the probability margin between the top-1 and top-2 tokens in the context-aware distribution

AdaCAD: A baseline adaptive contrastive decoding method that uses Jensen-Shannon Divergence to dynamically weight context

model stubbornness: The tendency of LLMs to stick to their pre-trained parametric knowledge even when presented with contradictory retrieved evidence

AlignScore: A metric for evaluating factual consistency between a source document and a generated summary

FaithScore: A metric (based on MiniCheck) measuring how well generated responses are grounded in the provided context