Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

📝 Paper Summary

Modularized RAG pipeline
Hallucination suppression

CK-PLUG is an inference-time method that detects knowledge conflicts via entropy shifts and modulates token probabilities to control whether an LLM relies on its internal parameters or retrieved context.

Core Problem

RAG systems struggle to balance reliance between internal parametric knowledge and external retrieved context, especially when conflicts arise (e.g., outdated internal knowledge vs. noisy retrieval).

Why it matters:

Excessive reliance on noisy retrieval leads to hallucinations, while ignoring retrieval when the model's internal knowledge is outdated causes factual errors
Current alignment methods for factuality or faithfulness are often unidirectional and lack flexibility to adapt to varying retrieval qualities at inference time
Users need customizable control to prioritize either internal reliability or external evidence depending on the deployment scenario (e.g., trusted professional retrieval vs. adversarial web data)

Concrete Example: When asked 'Where is London?', an LLM might internally know 'England' but retrieve a counterfactual context saying 'London is in France'. Without control, the model may blend the two sources unpredictably. CK-PLUG lets users set a parameter α to steer the answer toward 'England' (parametric) or 'France' (contextual) as needed.

Key Novelty

Confidence Gain (CG) driven decoding modulation

Introduces 'Confidence Gain', a metric measuring the entropy shift in token distributions before and after context injection to detect knowledge conflicts
Uses a plug-and-play decoding strategy that blends parameter-aware and context-aware probability distributions only when conflicts are detected
Provides a single scalar α to manually tune reliance, or an adaptive mode that self-regulates based on model confidence without retraining

Architecture

The CK-PLUG inference pipeline showing the parallel computation of parametric and context-aware distributions and their fusion.

Evaluation Highlights

Adjusts Memory Recall (MR) on LLaMA3-8B from 9.9% to 71.9% in counterfactual scenarios, significantly widening the control range compared to the fixed baseline of 42.1%
Achieves consistent performance improvements across six diverse RAG tasks (including NQ, HotpotQA, FEVER) using the adaptive auto-configuration mode
Maintains generation fluency and accuracy while modulating knowledge preference, validated by hit rates comparable to baselines even under strong control settings

Breakthrough Assessment

7/10

Offers a lightweight, training-free solution to a critical RAG problem (knowledge conflicts). The ability to linearly control reliance is practical, though the core mechanism is a decoding heuristic rather than a fundamental architectural change.

⚙️ Technical Details

Problem Definition

Setting: Next-token prediction in RAG where the model has access to both a query X_q and retrieved context X_r

Inputs: Query X_q, Retrieved Context X_r, Control parameter α (optional)

Outputs: Generated token sequence x reflecting desired knowledge reliance

Pipeline Flow

Input Processing: Receive query X_q and context X_r
Dual Forward Pass: Compute logits for P(x|X_q) (parametric) and P(x|X_q + X_r) (contextual)
Conflict Detection: Calculate Confidence Gain (CG) based on entropy difference between the two distributions
Distribution Modulation: If CG is negative (conflict), fuse the two distributions using weight α; otherwise keep the context-aware distribution unchanged
Decoding: Sample next token from modulated distribution
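
The pipeline steps above can be sketched for a single decoding step. This is a minimal illustration, not the authors' implementation: the function names `entropy` and `ck_plug_step`, the three-word vocabulary, and the probability values are all hypothetical. It assumes the fusion is done directly on probabilities using the expression P_rag + α · (P_para − P_rag) quoted in this summary, and it picks the next token greedily instead of sampling so the result is deterministic.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # skip zero entries to avoid log(0)
    return float(-(nz * np.log(nz)).sum())

def ck_plug_step(p_para, p_rag, alpha):
    """One decoding step: return (distribution to decode from, confidence gain).

    p_para: P(x | X_q), query only (hypothetical input)
    p_rag:  P(x | X_q + X_r), query plus retrieved context (hypothetical input)
    alpha:  0 favors the context, 1 favors the parameters (assumed range [0, 1])
    """
    p_para = np.asarray(p_para, dtype=float)
    p_rag = np.asarray(p_rag, dtype=float)
    cg = entropy(p_para) - entropy(p_rag)  # Confidence Gain
    if cg >= 0:
        # Context made the model more certain: no conflict, keep the RAG distribution.
        return p_rag, cg
    fused = p_rag + alpha * (p_para - p_rag)
    return fused / fused.sum(), cg

# Toy version of the 'Where is London?' example (made-up numbers).
vocab = ["England", "France", "Other"]
p_para = [0.90, 0.05, 0.05]  # model alone is confident in 'England'
p_rag = [0.45, 0.50, 0.05]   # counterfactual context pulls toward 'France'

for alpha in (0.0, 1.0):
    dist, cg = ck_plug_step(p_para, p_rag, alpha)
    print(alpha, round(cg, 3), vocab[int(np.argmax(dist))])
```

In this toy case the context raises the entropy, so CG is negative and the modulation is triggered; α = 0 leaves the context-aware choice in place, while α = 1 restores the parametric one.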

System Modules

Parametric Encoder (Inference)

Generate token probabilities based solely on internal knowledge

Model or implementation: LLM (e.g., LLaMA-3-8B)

Context-Aware Encoder (Inference)

Generate token probabilities based on query + retrieved context

Model or implementation: LLM (Shared weights with Parametric Encoder)

Conflict Detector (Control)

Identify tokens where context contradicts internal knowledge via entropy shift

Model or implementation: Entropy Calculation Formula

Distribution Modulator (Control)

Adjust final token probabilities to favor parameters or context based on α

Model or implementation: Weighted Logit Fusion

Novel Architectural Elements

Inference-time subtractive logit modulation, P_final ∝ P_rag + α · (P_para − P_rag), applied strictly to conflict tokens
Entropy-based trigger mechanism (Confidence Gain) to selectively activate modulation only on conflicting tokens
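
A short derivation may help in reading the modulation expression. It assumes that P_rag and P_para are normalized probability distributions over the same vocabulary and that α lies in [0, 1]; neither assumption is stated explicitly in this summary. If the fusion is in fact carried out on logits or log-probabilities, a softmax renormalization would be needed instead.

```latex
\begin{align*}
P_{\text{final}}(x)
  &\propto P_{\text{rag}}(x) + \alpha\,\bigl(P_{\text{para}}(x) - P_{\text{rag}}(x)\bigr)
   = (1-\alpha)\,P_{\text{rag}}(x) + \alpha\,P_{\text{para}}(x), \\
\sum_{x} \bigl[(1-\alpha)\,P_{\text{rag}}(x) + \alpha\,P_{\text{para}}(x)\bigr]
  &= (1-\alpha) + \alpha = 1, \\
\alpha = 0 \;&\Rightarrow\; P_{\text{final}} = P_{\text{rag}},
\qquad
\alpha = 1 \;\Rightarrow\; P_{\text{final}} = P_{\text{para}}.
\end{align*}
```

Under these assumptions the expression is a convex combination that is already normalized, with α moving the output from fully contextual (α = 0) to fully parametric (α = 1).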

Modeling

Base Models: LLaMA-2-7B, LLaMA-3-8B, Mistral-v0.3-7B, Qwen-2.5-7B

Comparison to Prior Work

vs. CAD (Context-Aware Decoding): CAD globally amplifies context; CK-PLUG allows bidirectional control (towards parameters OR context) via α [not cited in paper]
vs. Standard RAG: Adds a dynamic control layer to resolve conflicts rather than blindly trusting retrieval or parameters
vs. Training-based alignment (e.g., RAG-tuning): CK-PLUG is plug-and-play at inference time requiring no weight updates

Limitations

Requires two forward passes (one with context, one without) for every generated token, increasing inference cost
Effectiveness depends on the base model's ability to represent knowledge in its parameters; smaller models might have weak parametric signals
Linear modulation assumption (using scalar α) might not capture complex non-linear relationships between knowledge sources
Conflict detection relies on entropy, which is a proxy for uncertainty and may not perfectly correlate with factual correctness in all cases

Reproducibility

Code: https://github.com/byronBBL/CK-PLUG

The code is publicly available. The method is training-free and relies on standard logit manipulation during inference. Datasets (NQ, HotpotQA, etc.) are standard public benchmarks.

📊 Experiments & Results

Evaluation Setup

RAG with counterfactual contexts to force knowledge conflicts, plus general RAG benchmarks

Benchmarks:

Natural Questions (NQ) (Open-Domain QA)
ConFiQA (Long-context QA with counterfactuals)
MQuAKE (Multi-hop QA with knowledge editing)
KILT Benchmark (HotpotQA, FEVER, T-REx, ELI5, WoW) (Various RAG tasks)

Metrics:

Memory Recall (MR)
Context Recall (ConR)
Parameter Recall (ParR)
Normalized Accuracy
ROUGE-L
F1
Statistical methodology: Not explicitly reported in the paper

Key Results

Control experiments on LLaMA3-8B using Natural Questions (NQ) with counterfactual contexts show CK-PLUG's ability to shift reliance.
Benchmark	Metric	Baseline	This Paper	Δ
NQ (Counterfactual)	Memory Recall (MR)	42.09	9.89	-32.20
NQ (Counterfactual)	Memory Recall (MR)	42.09	71.93	+29.84

General RAG performance using the adaptive (α-free) mode across KILT tasks shows consistent improvements over baselines.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Accuracy	34.6	35.8	+1.2
FEVER	Accuracy	69.6	73.2	+3.6
NQ (Counterfactual)	Hit Rate	29.2	46.2	+17.0
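
The Δ column and the width of the MR control range can be re-derived from the reported numbers. The snippet below is only an arithmetic check on the values in the table; the variable names are illustrative.

```python
# (benchmark, metric, baseline, this paper), copied from the table above
rows = [
    ("NQ (Counterfactual)", "Memory Recall (MR)", 42.09, 9.89),
    ("NQ (Counterfactual)", "Memory Recall (MR)", 42.09, 71.93),
    ("HotpotQA", "Accuracy", 34.6, 35.8),
    ("FEVER", "Accuracy", 69.6, 73.2),
    ("NQ (Counterfactual)", "Hit Rate", 29.2, 46.2),
]

# Difference between the reported result and the baseline, in percentage points
deltas = [round(ours - base, 2) for _, _, base, ours in rows]

# Span between the lowest and highest MR reached by tuning alpha
control_range = round(71.93 - 9.89, 2)

for (bench, metric, _, _), delta in zip(rows, deltas):
    print(f"{bench:22s} {metric:20s} {delta:+.2f}")
print("MR control range (points):", control_range)
```

The MR span of roughly 62 points sits on both sides of the 42.09 baseline, which is what allows control in either direction.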

Experiment Figures

Entropy changes of probability distributions before and after context insertion.

Probability of first parametric vs. contextual token as α varies.

Main Takeaways

CK-PLUG enables fine-grained control of knowledge reliance: MR can be swept from 9.9% to 71.9% on LLaMA3-8B by tuning α, against a fixed baseline of 42.1%.
The adaptive mode, which sets α based on perplexity ratios, consistently outperforms standard RAG across diverse downstream tasks without manual tuning.
Conflict Detection (ConD) is crucial; indiscriminately modulating all tokens degrades generation quality, while targeted modulation maintains fluency.
Qwen models exhibit higher inherent confidence in parametric knowledge, leading to a different modulation curve compared to LLaMA models.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer decoding and next-token probability distributions
Familiarity with RAG and the problem of knowledge conflicts (parametric vs. contextual)
Basic information theory (Shannon entropy)

Key Terms

Confidence Gain (CG): The entropy of the model's parametric prediction (query only) minus the entropy of the RAG prediction (query + context); a negative CG means the context made the model less certain, which is taken as a sign of knowledge conflict

Memory Recall (MR): A metric measuring how often the model generates an answer based on its internal parameters rather than the retrieved context

Parametric Knowledge: Information stored in the LLM's pre-trained weights

Contextual Knowledge: Information provided in the retrieved documents or prompt

Perplexity: A measurement of how uncertain a probability model is about its predictions; calculated here using the entropy of the token distribution

ConR: Recall of Context, the percentage of answers matching the retrieved information

ParR: Recall of Parameters, the percentage of answers matching the model's internal knowledge

Contrastive Decoding: A decoding strategy that manipulates logits by contrasting two distributions (e.g., strong vs. weak model, or here, context-aware vs. parameter-aware)
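
This summary does not spell out how MR relates to ConR and ParR. One plausible reading, stated here purely as an assumption, is that MR is the share of parametric answers among all answers that match either source, i.e. ParR / (ParR + ConR). The function name and the input values below are hypothetical.

```python
def memory_recall(par_r: float, con_r: float) -> float:
    """Assumed definition: MR = ParR / (ParR + ConR).

    par_r: percentage of answers matching the model's internal knowledge
    con_r: percentage of answers matching the retrieved context
    Returns a fraction in [0, 1]; 0.0 if neither source is matched.
    """
    total = par_r + con_r
    if total == 0:
        return 0.0
    return par_r / total

# Made-up values: 30% parametric answers, 70% contextual answers
print(memory_recall(30, 70))
```

Under this reading, a higher MR means the model leans more on its parameters, which matches the direction of control described for α.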