In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

📝 Paper Summary

Hallucination Suppression, Mechanistic Interpretability, Constrained Decoding

Activation Decoding mitigates hallucinations by favoring tokens that exhibit sharper activation patterns (lower entropy) relative to the prompt's context tokens in intermediate model layers.

Core Problem

Large Language Models (LLMs) frequently generate hallucinations and factual errors, and existing mitigation methods often require expensive external retrieval, fine-tuning, or high-quality knowledge bases.

Why it matters:

Hallucinations undermine the trustworthiness and reliability of LLMs in critical applications.
Resource-intensive methods (retrieval/fine-tuning) are often unavailable for domain-specific cases or constrained environments.
Understanding the mechanistic cause of hallucinations within hidden states remains an open challenge.

Concrete Example: Given the prompt 'The twin city of Boston is', an LLM might answer 'Manila' (incorrect) instead of 'Athens' (correct). The paper shows that in intermediate layers 'Athens' has sharp, distinct activations against context words like 'Boston', while 'Manila' has vague, high-entropy activations.

Key Novelty

In-Context Sharpness as a proxy for Factuality (Activation Decoding)

Discovers that correct tokens tend to trigger 'sharp' activations (concentrated on a few specific context words) in intermediate layers, while hallucinated tokens have 'flat', high-entropy activations spread across the context.
Proposes 'Contextual Entropy' to quantify this sharpness: low entropy implies the token is strongly grounded in the prompt's context.
Modifies the decoding process (Activation Decoding) to penalize tokens with high contextual entropy, pushing the model toward factually grounded outputs without external data.
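
The difference between a 'sharp' and a 'flat' activation pattern can be made concrete with a small numeric sketch. The activation values below are hypothetical and chosen only for illustration, and the function name `contextual_entropy` is an assumption of this summary rather than an identifier from the paper's code.

```python
import math

def contextual_entropy(activations):
    """Shannon entropy (in nats) of activations normalized over context tokens."""
    total = sum(activations)
    probs = [a / total for a in activations]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical activations of two candidate tokens against four context tokens.
sharp = [0.85, 0.05, 0.05, 0.05]  # concentrated on one context token
flat = [0.25, 0.25, 0.25, 0.25]   # spread evenly across the context

sharp_entropy = contextual_entropy(sharp)
flat_entropy = contextual_entropy(flat)

# The flat pattern reaches the maximum possible entropy, log(4).
print(f"sharp={sharp_entropy:.3f} flat={flat_entropy:.3f}")
```

Under this reading, the candidate with the lower value is the one treated as better grounded in the prompt.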

Architecture

Figure: The Activation Decoding process, in which the model's original wrong prediction ('Hearing') is corrected to 'Smell' by analyzing activation sharpness.

Evaluation Highlights

Improves the Truth*Info metric on TruthfulQA by up to 8.6 points with LLaMA-2-70B-chat compared to greedy decoding.
Outperforms DoLa and ITI baselines on knowledge-seeking datasets (TriviaQA, HotpotQA, NQ), improving F1 by up to 4.8 points on TriviaQA.
Increases inference latency by only 23.4% over greedy decoding, while being 7.3% faster than DoLa.

Breakthrough Assessment

7/10

Offers a strong, lightweight mechanistic solution to hallucination without external knowledge. Performance gains are consistent, though it relies on the assumption that the model internally 'knows' the fact.

⚙️ Technical Details

Problem Definition

Setting: Constrained decoding for open-ended text generation tasks aimed at improving factuality

Inputs: Input prompt sequence of tokens

Outputs: Generated text sequence with minimized factual errors

Pipeline Flow

Input Processing (Standard forward pass)
Candidate Selection (Filter top-k tokens)
Entropy Calculation (Compute contextual entropy for candidates against prompt tokens)
Logit Adjustment (Penalize high-entropy tokens)
Token Selection (Select next token based on adjusted probabilities)
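
The five steps above can be sketched as a single decoding step. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-token activations are passed in precomputed, the penalty is subtracted from the log-probability, and the names `activation_decoding_step`, `top_k`, and `alpha` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(weights):
    """Entropy of positive weights after normalizing them to sum to 1."""
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)

def activation_decoding_step(logits, context_activations, top_k=3, alpha=1.0):
    """Pick the next token id.

    logits: final-layer next-token logits, one per vocabulary id.
    context_activations: dict mapping token id -> list of positive activations,
        one per prompt token, read from the chosen intermediate layer.
    top_k, alpha: hypothetical hyperparameters (candidate count, penalty weight).
    """
    probs = softmax(logits)
    # Candidate selection: keep only the top-k tokens by probability.
    candidates = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    scores = {}
    for tok in candidates:
        # Logit adjustment: penalize candidates whose activations are spread out.
        scores[tok] = math.log(probs[tok]) - alpha * entropy(context_activations[tok])
    # Token selection: greedy choice over the adjusted scores.
    return max(scores, key=scores.get)

# Toy example: token 0 is slightly more probable but has flat activations.
logits = [2.0, 1.8, 0.1, -1.0]
acts = {0: [1.0, 1.0, 1.0], 1: [0.9, 0.05, 0.05]}
greedy_choice = activation_decoding_step(logits, acts, top_k=2, alpha=0.0)
adjusted_choice = activation_decoding_step(logits, acts, top_k=2, alpha=1.0)
```

With `alpha=0.0` the step reduces to greedy decoding over the candidates; raising `alpha` lets the sharper candidate overtake a slightly more probable but flatter one.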

System Modules

Base LLM

Generate initial logits and hidden states for the sequence

Model or implementation: LLaMA-2-chat (7B, 13B, 70B)

Entropy Calculator

Calculate the contextual entropy of candidate tokens relative to input prompt tokens at a specific 'informative layer'

Model or implementation: Mathematical function (Eq. 3 & 4)

Logit Adjuster

Modify the probability distribution of the next token to favor low-entropy (sharp) activations

Model or implementation: Adjustment formula (Eq. 5)

Novel Architectural Elements

Inclusion of an entropy-based penalty term in the decoding objective derived specifically from in-context hidden state projections
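
One way to write such an objective is sketched below. The notation is an assumption of this summary and may differ from the paper's Eqs. 3 to 5: $W$ denotes the LM head, $h_i^{(l)}$ the hidden state of the $i$-th prompt token at the informative layer $l$, $n$ the prompt length, and $\alpha \ge 0$ a penalty weight.

```latex
% Activation of candidate token v against the i-th prompt token.
a_i(v) = \mathrm{softmax}\big(W h_i^{(l)}\big)_v, \qquad i = 1, \dots, n

% Normalize over the n prompt tokens and take the Shannon entropy.
\tilde{a}_i(v) = \frac{a_i(v)}{\sum_{j=1}^{n} a_j(v)}, \qquad
E(v) = -\sum_{i=1}^{n} \tilde{a}_i(v) \log \tilde{a}_i(v)

% Penalize high-entropy candidates when scoring the next token.
\mathrm{score}(v) = \log p(v \mid x_{<t}) - \alpha \, E(v)
```

Here $E(v)$ ranges from $0$ (all activation on one prompt token) to $\log n$ (activation spread uniformly), so the penalty is largest for candidates with no specific anchor in the context.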

Modeling

Base Model: LLaMA-2-chat (7B, 13B, 70B)

Comparison to Prior Work

vs. DoLa: Focuses on 'sharpness' of activation against context tokens at a single informative layer, rather than contrasting output distributions between layers.
vs. ITI: Does not require training probes or labeled data to find intervention directions; purely inference-time statistic.
vs. CAD (Context-Aware Decoding) [not cited in paper]: CAD contrasts logits with/without context; Activation Decoding measures entropy within the context's hidden states directly.

Limitations

Cannot correct errors where the model fundamentally lacks the knowledge (training data errors or outdated facts).
Requires selecting a specific 'informative layer', which is a hyperparameter.
Slightly lowers the 'Truth' metric on TruthfulQA for the 13B and 70B models (though 'Truth*Info' improves), because safe refusals are converted into attempted answers.
Relies on the assumption that ground truth is encoded in in-context token hidden states.

Reproducibility

Code: https://github.com/hkust-nlp/ActivationDecoding

Uses standard datasets (TruthfulQA, TriviaQA, etc.) and open models (LLaMA-2). Hyperparameters for specific benchmarks are provided in the paper.

📊 Experiments & Results

Evaluation Setup

Open-ended text generation and multiple-choice QA

Benchmarks:

TruthfulQA (Truthfulness evaluation, generative and multiple-choice)
TriviaQA (Knowledge-seeking QA)
HotpotQA (Multi-hop QA)
Natural Questions (NQ) (Open-domain QA)
COUNTERFACT (Factuality analysis, case study)

Metrics:

Truth*Info (TruthfulQA)
Truth (TruthfulQA)
Info (TruthfulQA)
Exact Match (EM)
F1 score
AUROC (for error detection analysis)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on TruthfulQA (Generative) showing improvements in balancing truthfulness and informativeness.
TruthfulQA	Truth*Info	47.1	55.7	+8.6
TruthfulQA	Info	78.3	90.0	+11.7
Performance on Knowledge-Seeking Datasets (TriviaQA, HotpotQA, NQ) demonstrating consistent gains in F1 and EM.
TriviaQA	F1 score	68.4	73.2	+4.8
HotpotQA	F1 score	21.7	26.4	+4.7
Natural Questions	F1 score	28.9	32.5	+3.6
Validation of the core hypothesis using AUROC to detect factual errors based on entropy.
GF-CFT (CounterFact)	AUROC	66.83	70.79	+3.96
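
The AUROC row treats entropy as a score for detecting factual errors. A minimal sketch of that evaluation follows, assuming higher entropy should flag an error; the entropy values and labels are hypothetical and are not taken from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical contextual entropies; label 1 marks a factually wrong answer.
entropies = [0.2, 0.4, 0.9, 1.1, 0.7, 0.3]
is_error = [0, 0, 1, 1, 0, 1]

detector_auroc = auroc(entropies, is_error)
print(f"AUROC = {detector_auroc:.3f}")
```

A value of 0.5 would mean entropy carries no signal about correctness, and 1.0 would mean every wrong answer has higher entropy than every correct one.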

Experiment Figures

Visualization of activation maps for correct vs incorrect tokens across layers.

Distribution of entropy values for Ground Truth vs Ground False answers.

Main Takeaways

Contextual entropy is a reliable indicator of hallucination: correct answers have 'sharp' activations against context; incorrect ones have high entropy.
Activation Decoding consistently improves factuality (F1/EM) across model sizes (7B, 13B, 70B) and datasets.
The method reduces uninformative refusals (e.g., 'I have no comment'), significantly boosting the Informativeness score on TruthfulQA.
Combining Activation Decoding with DoLa yields further marginal improvements, suggesting the methods capture complementary factual signals.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (embeddings, hidden states, LM head)
Language model decoding strategies (greedy, beam search)
Entropy (information theory)
Mechanistic interpretability (projections, activations)

Key Terms

contextual entropy: A metric quantifying how 'spread out' a candidate token's activation is across the input prompt tokens; lower entropy indicates sharper, more specific focus.

in-context sharpness: The phenomenon where correct tokens exhibit strong, distinct activations with specific tokens in the context prompt at intermediate layers.

DoLa: Decoding by Contrasting Layers, a baseline method that contrasts logits from different layers to amplify factual signals.

ITI: Inference-Time Intervention, a baseline that shifts model activations along 'truthful' directions discovered via probing.

activation score: The projection of a hidden state onto the vocabulary embedding of a specific token, measuring how likely that hidden state encodes the token.

informative layer: A specific intermediate transformer layer (e.g., layer 26 in LLaMA-2-7B) selected for calculating activation patterns, believed to contain factual associations.

Truth*Info: A composite metric for TruthfulQA that multiplies the truthfulness score by the informativeness score to penalize non-answers like 'I have no comment'.
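
The 'activation score' term above can be illustrated with a toy projection. This is a sketch under assumptions: the softmax after the projection is one plausible way to obtain non-negative scores, the two-dimensional hidden states and three-token vocabulary are made up, and `activation_scores` is a hypothetical helper name.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def activation_scores(hidden_states, unembedding, token_id):
    """Return one activation per prompt position for `token_id`.

    hidden_states: list of d-dimensional vectors, one per prompt token,
        taken from the chosen intermediate layer.
    unembedding: list of d-dimensional rows, one per vocabulary id.
    """
    scores = []
    for h in hidden_states:
        # Project the hidden state onto every vocabulary row.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        scores.append(softmax(logits)[token_id])
    return scores

# Toy setup: 3 vocabulary items, 3 prompt positions, hidden size 2.
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[3.0, 0.0], [0.0, 0.0], [0.0, 3.0]]

scores_for_token_0 = activation_scores(hidden_states, unembedding, token_id=0)
```

The resulting per-position scores are the quantities that contextual entropy is computed over: a candidate whose scores peak at one prompt position has low entropy.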