Inference-Time Intervention (ITI): Eliciting Truthful Answers from a Language Model

📝 Paper Summary

Hallucination suppression, Mechanistic interpretability, Activation editing

Inference-Time Intervention (ITI) improves LLM truthfulness by identifying attention heads highly correlated with factual accuracy and shifting their activations towards 'truthful' directions during inference.

Core Problem

LLMs often 'know' the correct answer (evidenced by internal activations) but fail to produce it, instead generating hallucinations or mimicking common misconceptions.

Why it matters:

Correctness is critical in high-stakes contexts, yet models often hallucinate despite having the necessary knowledge internally
Existing methods like RLHF require massive annotation resources and compute, and may encourage sycophancy rather than truthfulness
There is a 'generation-discrimination gap' where models can classify truth accurately but fail to generate it

Concrete Example: When asked a question from TruthfulQA, a standard LLaMA model might mimic a human misconception. However, a probe on its internal activations can classify the true answer with 83.3% accuracy, showing the model 'knows' the truth but doesn't 'say' it.

Key Novelty

Inference-Time Intervention (ITI)

Identifies a sparse set of attention heads that have high linear probing accuracy for distinguishing truth from falsehood
During inference, shifts the activations of these specific heads along a learned 'truthful direction' (vector) to steer generation toward accuracy
Minimally invasive approach that requires no parameter updates (fine-tuning) and uses very few labeled examples (as few as 40)
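The shift described above can be written compactly. The notation here is an assumption of this summary rather than a quotation: x_l is the residual stream at layer l, P_l^h and Q_l^h project into and out of head h, theta_l^h is the unit 'truthful direction' for that head, sigma_l^h is the standard deviation of activations along that direction, and alpha is the intervention strength.

```latex
x_{l+1} = x_l + \sum_{h=1}^{H} Q_l^h \left( \operatorname{Att}_l^h\!\left(P_l^h x_l\right) + \alpha\, \sigma_l^h\, \theta_l^h \right)
```

For heads outside the selected top-K set, the added term is taken to be zero, so those heads compute exactly what the unmodified model would.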

Architecture

Overview of the ITI method. It shows the process of identifying truthful heads using probes and then shifting activations during inference.

Evaluation Highlights

Improves Alpaca model's truthfulness on TruthfulQA from 32.5% to 65.1%
Reveals a gap of about 40 percentage points between probe accuracy and generation accuracy on LLaMA-7B, the gap ITI aims to close
Locates truthful directions using as few as 40 labeled examples, far fewer than RLHF requires

Breakthrough Assessment

8/10

Significant improvement in truthfulness with a lightweight, inference-only method. Provides evidence that LLMs carry an internal representation (a 'world model') of truth that is distinct from their output behavior.

⚙️ Technical Details

Problem Definition

Setting: Controlling generation behavior of auto-regressive language models to favor truthful outputs

Inputs: Natural language question q

Outputs: Generated answer a

Pipeline Flow

Probe Training (Offline): Identify truthful heads and directions
Inference: Shift activations in targeted heads

System Modules

Probe Trainer

Train linear classifiers on attention head activations to distinguish true/false answers

Model or implementation: Linear Logistic Regression
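A minimal sketch of the probe-training step, assuming per-head activations have already been extracted into an array of shape (n_examples, n_heads, head_dim). The function names (`train_probe`, `rank_heads`), the plain gradient-descent fit, the 80/20 split, and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(true)
        grad = p - y                            # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches y."""
    preds = ((X @ w + b) > 0).astype(int)
    return float((preds == y).mean())

def rank_heads(acts, labels, k, train_frac=0.8):
    """Train one probe per head; return the k heads with highest held-out accuracy."""
    split = int(acts.shape[0] * train_frac)
    accs = []
    for h in range(acts.shape[1]):
        w, b = train_probe(acts[:split, h], labels[:split])
        accs.append(probe_accuracy(acts[split:, h], labels[split:], w, b))
    accs = np.array(accs)
    top_k = np.argsort(accs)[::-1][:k]
    return top_k, accs

# Synthetic stand-in for real activations: only head 2 carries a truth signal.
rng = np.random.default_rng(0)
n, n_heads, head_dim = 400, 6, 8
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, n_heads, head_dim))
acts[:, 2, 0] += 3.0 * (2 * labels - 1)

top_k, accs = rank_heads(acts, labels, k=1)
print("selected heads:", top_k, "accuracies:", accs.round(2))
```

On this toy data the head with the injected signal should rank first, while the remaining heads hover near chance.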

Activation Shifter

Shift activations of selected heads along the truthful direction during forward pass

Model or implementation: LLaMA/Alpaca (frozen weights)
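The shifting step can be sketched on a single token's per-head activations. This is an illustration under stated assumptions: `shift_heads` is a hypothetical helper, the scaling by a per-head standard deviation `sigmas` is assumed, and random arrays stand in for real attention-head outputs.

```python
import numpy as np

def shift_heads(head_out, directions, sigmas, selected, alpha):
    """Return a copy of head_out with the selected heads shifted.

    head_out:   (n_heads, head_dim) per-head activations for one token
    directions: (n_heads, head_dim) unit 'truthful' direction per head
    sigmas:     (n_heads,) activation std along each direction (assumed scaling)
    selected:   indices of the top-K heads to intervene on
    alpha:      intervention strength
    """
    shifted = head_out.copy()
    for h in selected:
        shifted[h] = shifted[h] + alpha * sigmas[h] * directions[h]
    return shifted

rng = np.random.default_rng(0)
n_heads, head_dim = 4, 3
head_out = rng.normal(size=(n_heads, head_dim))
directions = rng.normal(size=(n_heads, head_dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # make unit vectors
sigmas = np.ones(n_heads)
selected = [1, 3]

shifted = shift_heads(head_out, directions, sigmas, selected, alpha=2.0)

# Change in projection onto each head's direction: alpha * sigma for selected heads, 0 otherwise.
delta_proj = np.einsum("hd,hd->h", shifted - head_out, directions)
print(delta_proj)
```

Because the model weights are never touched, the same shift would be reapplied at every generated token in an actual forward pass.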

Novel Architectural Elements

Head-specific activation editing: Intervening only on a sparse subset of attention heads (top-K) rather than the entire residual stream
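Geometry-aware shifting: Using directions derived from the probed activations (either the probe's weight vector, which is orthogonal to its separating hyperplane, or the difference between the mean activations of true and false answers) to bias the computation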
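As an illustration of the mean-difference variant, the sketch below computes a unit direction for one head from labeled activations, together with the spread of activations along it. The helper name `mean_difference_direction`, the returned `sigma`, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def mean_difference_direction(acts, labels):
    """Unit vector from the mean 'false' activation to the mean 'true' activation.

    acts:   (n_examples, head_dim) activations of one head
    labels: (n_examples,) 1 for true answers, 0 for false answers
    Also returns the standard deviation of all activations projected onto the direction.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    diff = mu_true - mu_false
    theta = diff / np.linalg.norm(diff)
    sigma = float((acts @ theta).std())
    return theta, sigma

# Synthetic head whose first coordinate separates true from false answers.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
acts = rng.normal(size=(200, 5))
acts[:, 0] += 4.0 * labels

theta, sigma = mean_difference_direction(acts, labels)
print(theta.round(2), round(sigma, 2))
```

The recovered direction should point almost entirely along the first coordinate, which is where the toy signal was injected.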

Modeling

Base Model: LLaMA-7B, Alpaca, Vicuna

Comparison to Prior Work

vs. RLHF: ITI is inference-only (no weight updates), highly data-efficient (40 examples vs. thousands), and computationally cheaper
vs. CCS: ITI uses supervised labels (TruthfulQA) to define 'truth' regarding real-world facts, whereas CCS checks logical consistency
vs. PPLM: ITI modifies activations directly via vector addition, avoiding expensive backward passes during inference
vs. Activation Addition [not cited in paper]: ITI targets specific attention heads based on probe accuracy, rather than adding vectors to the global residual stream

Limitations

Trade-off between truthfulness and helpfulness/instruction following (CE loss increases with intervention strength)
Relies on the existence of a 'truthful direction' which might not capture complex truths
Only tested on TruthfulQA and a few other benchmarks; generalization to all types of 'truth' is unproven
Requires labeled data (though only a small amount) to identify directions, unlike unsupervised methods

Reproducibility

Code: https://github.com/likenneth/honest_llama

Experiments use the TruthfulQA dataset. The hyperparameters K (number of heads intervened on) and alpha (intervention strength) are swept.

📊 Experiments & Results

Evaluation Setup

Open-ended question answering evaluated for truthfulness

Benchmarks:

TruthfulQA (Adversarial QA designed to elicit misconceptions)

Metrics:

Truthfulness (% of true answers)
True*Informative (% of true and informative answers)
CE Loss (Cross-Entropy Loss, proxy for helpfulness/coherence)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark (model)	Metric	Baseline	This Paper	Δ (points)
TruthfulQA (Alpaca)	Truthfulness (%)	32.5	65.1	+32.6
TruthfulQA (LLaMA)	Probe accuracy (%)	Not reported in the paper	83.3	Not reported in the paper
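The truthfulness row can be checked with a few lines of arithmetic. The variable names are illustrative; the two input values come from the table above.

```python
baseline = 32.5   # Alpaca truthfulness on TruthfulQA without ITI (%)
with_iti = 65.1   # Alpaca truthfulness on TruthfulQA with ITI (%)

abs_gain = with_iti - baseline   # gain in percentage points
ratio = with_iti / baseline      # relative improvement factor

print(f"absolute gain: {abs_gain:.1f} points, ratio: {ratio:.2f}x")
```

The absolute gain is a difference in percentage points, while the ratio is what supports describing the result as roughly a doubling.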

Experiment Figures

Heatmap of probe accuracy across layers and heads (A) and visualization of activation geometry (B).

Main Takeaways

A significant gap (approximately 40 percentage points) exists between what LLaMA 'knows' (probe accuracy) and what it generates
Information about truthfulness is spatially localized: mostly in early-to-middle layers and specific heads
ITI effectively bridges the gap, doubling truthfulness scores on TruthfulQA for Alpaca
Performance depends on intervention strength (alpha); excessive intervention harms helpfulness (measured by CE loss)

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Multi-Head Attention)
Linear probing / Classifiers
Activation engineering / steering vectors

Key Terms

ITI: Inference-Time Intervention—a technique that shifts model activations during inference to steer behavior without changing weights

Attention Head: A component of the Transformer architecture that attends to different parts of the input sequence; ITI targets specific heads

Linear Probe: A simple classifier trained on the internal activations of a neural network to predict a property (here, truthfulness)

Residual Stream: The main vector path in a Transformer where layers read from and write to; ITI modifies activations before they merge back into this stream

TruthfulQA: A benchmark dataset designed to test whether language models mimic human falsehoods and misconceptions

RLHF: Reinforcement Learning from Human Feedback, a standard method for aligning models; ITI is positioned as a cheaper alternative to it

MHA: Multi-Head Attention—the mechanism in Transformers allowing the model to focus on different positions

Activation Space: The high-dimensional vector space containing the intermediate outputs of neurons within the model