Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts

📝 Paper Summary

Long-context LLMs, Attention mechanisms

The paper identifies 'contextual heads' that control attention to relevant information in long contexts and proposes 'focus directions', vectors added to the key and query activations of these heads, which steer them toward relevant contexts without requiring relevance labels at inference time.

Core Problem

Long-context LLMs often get distracted by irrelevant information within their large context windows, leading to incorrect answers or hallucinations, and the specific attention mechanisms causing this are poorly understood.

Why it matters:

Distraction in long-context tasks (like RAG or summarization) causes models to generate false information or erroneous reasoning despite having the correct answer in context.
Current methods often rely on external labels or retraining to fix this, whereas understanding the internal mechanism could allow for inference-time correction.

Concrete Example: In a multi-document QA task with 20 documents where only one is relevant, the model may answer incorrectly because its attention heads give the 19 irrelevant documents (distractors) as much attention as the single relevant one, or more.

Key Novelty

Focus Directions for Contextual Heads

Identifies a sparse set of 'contextual heads' that are primarily responsible for focusing on relevant information during generation.
Discovers 'focus directions' in the key and query activation spaces of these heads; adding these vectors at inference time steers the model to attend more to relevant contexts, and no ground-truth relevance labels are needed at that stage.

Architecture

Conceptual workflow: (1) Measuring attention to identify contextual heads using labeled data, (2) Training focus directions (vectors) on Key/Query activations, and (3) Injecting these directions at inference to boost attention to relevant contexts.

Evaluation Highlights

+7.7 percentage points of Exact Match accuracy (from 59.4% to 67.1%) on a multi-document QA task when focus directions are applied to the top-20 contextual heads.
Demonstrates that intervening on just 20 'contextual heads' (out of 672) is more effective than intervening on random heads, verifying the sparsity and specificity of the mechanism.
Shows that 'negative' focus directions (subtracting the vector instead of adding it) reduce EM accuracy sharply, to about 32%, which is consistent with these directions controlling how strongly the heads attend to relevant context.

Breakthrough Assessment

7/10

Provides a strong mechanistic explanation for long-context distraction and a novel inference-time intervention (focus directions) that improves performance without retraining, though tested primarily on one specific QA format.

⚙️ Technical Details

Problem Definition

Setting: Multi-document Question Answering where the input contains instructions, a set of documents (one relevant, many irrelevant), and a query.

Inputs: Prompt containing instructions, multiple context documents, and a question.

Outputs: Generated answer based on the single relevant document.

Pipeline Flow

Contextual Scoring (Identify Heads) -> Focus Direction Training (Find Vectors) -> Inference Intervention (Apply Vectors)

System Modules

Contextual Head Identifier (Analysis / Training)

Identify which attention heads pay the most attention to relevant contexts using a scoring metric based on attention weights.
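The scoring step can be illustrated with a minimal NumPy sketch. It assumes the contextual score of a head is the attention mass that one response token places on the relevant span, and that the top-k heads by score are kept. The function names, the array layout, and the toy data are illustrative assumptions, not the paper's code; the paper presumably aggregates over response tokens and samples as well.

```python
import numpy as np

def contextual_scores(attn, relevant_span):
    """Attention mass on the relevant span, per head.

    attn: array of shape (layers, heads, seq_len) holding the attention
          weights from one response token to every prompt position.
    relevant_span: (start, end) token indices, end exclusive.
    """
    start, end = relevant_span
    return attn[:, :, start:end].sum(axis=-1)  # shape (layers, heads)

def top_k_heads(scores, k):
    """Return the k (layer, head) pairs with the highest score, best first."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(idx, scores.shape))
            for idx in flat]

# Toy example: 2 layers, 3 heads, 6 prompt tokens, relevant span = tokens 2..3.
attn = np.full((2, 3, 6), 1 / 6)                   # uniform attention everywhere
attn[1, 2] = [0.05, 0.05, 0.4, 0.4, 0.05, 0.05]    # one head focuses on the span
scores = contextual_scores(attn, (2, 4))
best = top_k_heads(scores, 1)
print(best)
```

In a real model the `attn` tensor would come from the attention weights of the LLM, averaged over many labeled samples before ranking.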

Model or implementation: Llama-3.2-3B-Instruct

Focus Direction Trainer (Analysis / Training)

Learn vectors d_K and d_Q for the identified heads that, when added, maximize attention to relevant contexts.

Model or implementation: Llama-3.2-3B-Instruct

Inference Intervener

Modify Key and Query activations during inference by adding the learned focus direction vectors scaled by alpha.
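A minimal single-head sketch of this intervention follows, assuming the directions are added as `q + alpha * d_q` and `k + alpha * d_k` before scaled dot-product attention. The function name, the shapes, and the omission of positional encodings are simplifying assumptions; the summary does not specify where in the head the addition happens.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_focus(q, K, d_q, d_k, alpha):
    """Attention weights of one head after adding focus directions.

    q: (d,) query of the current token.  K: (seq_len, d) keys.
    d_q, d_k: (d,) focus directions.  alpha: intervention strength.
    """
    q_steered = q + alpha * d_q
    K_steered = K + alpha * d_k            # same shift broadcast to every key
    logits = K_steered @ q_steered / np.sqrt(q.shape[-1])
    return softmax(logits)

# Toy example: 4 tokens with orthogonal keys; token 2 is the "relevant" one.
K = np.eye(4)
q = np.zeros(4)
d_q = np.array([0.0, 0.0, 1.0, 0.0])       # hypothetical query direction
d_k = np.zeros(4)

base = attention_with_focus(q, K, d_q, d_k, alpha=0.0)
steered = attention_with_focus(q, K, d_q, d_k, alpha=0.3)
print(base, steered)
```

In this position-free toy, a shift applied uniformly to all keys adds the same constant to every logit and cancels in the softmax, so only the query direction changes the weights. Whether that also holds in a real model depends on details such as where rotary position embeddings are applied, which the summary does not state.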

Model or implementation: Llama-3.2-3B-Instruct

Novel Architectural Elements

Focus Direction Injection: Modifying the Key and Query activations of specific 'contextual heads' at inference time to steer attention, rather than modifying weights or residual streams.

Modeling

Base Model: Llama-3.2-3B-Instruct

Training Method: Vector learning via gradient descent on cached activations (lightweight training)

Objective Functions:

Purpose: Maximize attention to relevant context.

Formally: L = -S^d_C, where S^d_C is the summed attention weight on relevant context spans after adding direction vectors.
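One way to write this objective out for a single head is given below. The symbols q (query of a response token), k_j (key at position j), n (number of prompt positions), d_h (head dimension), and C (set of relevant-context positions) are notation introduced here for illustration, and the exact placement of the direction vectors inside the attention computation is an assumption rather than the paper's stated formula.

```latex
\[
a^{d}_{j} = \frac{\exp\!\big((q + d_Q)^{\top}(k_j + d_K)/\sqrt{d_h}\big)}
                 {\sum_{m=1}^{n}\exp\!\big((q + d_Q)^{\top}(k_m + d_K)/\sqrt{d_h}\big)},
\qquad
S^{d}_{C} = \sum_{j \in C} a^{d}_{j},
\qquad
L = -S^{d}_{C}.
\]
```

Minimizing L by gradient descent over d_K and d_Q, with the model weights frozen, pushes attention mass toward the positions in C.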

Training Data:

Multi-Document Question Answering data derived from NaturalQuestions-Open
2654 samples total, split 50/50 for training/testing
Input constructed with 1 relevant document and 19 irrelevant documents
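The input construction described above can be sketched as follows. The prompt template, function names, and placeholder strings are hypothetical; only the 1-relevant-plus-19-distractors layout and the idea of a controllable position for the relevant document come from the summary.

```python
def build_documents(relevant_doc, distractors, position):
    """Insert relevant_doc at a 1-indexed position among the distractors."""
    total = len(distractors) + 1
    if not 1 <= position <= total:
        raise ValueError(f"position must be in 1..{total}, got {position}")
    docs = list(distractors)               # copy so the caller's list is untouched
    docs.insert(position - 1, relevant_doc)
    return docs

def format_prompt(instruction, docs, question):
    """Hypothetical template: instruction, numbered documents, then the question."""
    lines = [instruction, ""]
    for i, doc in enumerate(docs, start=1):
        lines.append(f"Document [{i}] {doc}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

# Toy example: 19 placeholder distractors, relevant document at position 10.
distractors = [f"distractor text {i}" for i in range(19)]
docs = build_documents("GOLD passage", distractors, position=10)
prompt = format_prompt("Answer using the documents.", docs, "Who wrote it?")
print(prompt.splitlines()[11])
```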

Key Hyperparameters:

learning_rate: 10^-3
optimizer: AdamW
epochs: 10
alpha_intervention_strength: 0.3 (optimal)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Residual Stream Activation Addition: This paper intervenes specifically on Key and Query activations inside attention heads, rather than the residual stream, to directly control attention patterns.
vs. Split-Softmax [Li et al. 2024a]: Focus directions allow the model to find relevant context implicitly without needing explicit span labels at inference time, whereas Split-Softmax requires knowing the target span.

Limitations

Tested primarily on one model size (3B parameters) and one dataset type (Multi-document QA).
Requires a labeled dataset (relevant vs. irrelevant) to train the focus directions initially.
Overly strong intervention (alpha > 0.5) can increase attention to irrelevant contexts, degrading performance.
Focus directions are applied to a fixed set of heads identified offline, not dynamically adapted per sample.

Reproducibility

The paper provides dataset construction details (based on NaturalQuestions-Open and Liu et al., 2024). It specifies the exact model (Llama-3.2-3B-Instruct). Code URL is not provided.

📊 Experiments & Results

Evaluation Setup

Multi-Document QA with 20 total documents (1 relevant, 19 distractors). Relevant document placed at varying positions (1, 5, 10, 15, 20).

Benchmarks:

Multi-Document QA (based on NaturalQuestions-Open) (Long-context retrieval-augmented generation) [New]

Metrics:

Exact Match (EM) accuracy
Contextual Score (Attention weight on relevant span)
Statistical methodology: Not explicitly reported in the paper
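A minimal sketch of Exact Match scoring is shown below. The normalization (lowercasing, stripping punctuation and English articles) follows a common QA convention and is an assumption here; the summary does not say which normalization the paper uses.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)

def em_accuracy(predictions, gold_lists):
    """Percentage of predictions that exactly match one of their gold answers."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_lists))
    return 100.0 * hits / len(predictions)

acc = em_accuracy(["The Eiffel Tower.", "Rome"], [["Eiffel Tower"], ["Milan"]])
print(acc)
```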

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Multi-Document QA	EM (%)	59.4	67.1	+7.7
Multi-Document QA	EM (%)	59.4	32.0	-27.4
Multi-Document QA	EM (%)	59.4	45.8	-13.6
Multi-Document QA	Relevant Contextual Score	0.36	0.41	+0.05

Experiment Figures

Distribution of 'Relevant Contextual Scores' across all attention heads in the model.

EM Accuracy vs. Number of Heads Intervened using Split-Softmax (explicit label guidance).

EM Accuracy using Focus Directions (inference-time steering) across different intervention strengths (alpha).

Main Takeaways

Contextual heads are sparse (only ~5.5% of heads) and mostly located in middle-to-late layers (8-18).
Increasing attention on contextual heads significantly improves performance, while intervening on random heads has little to no positive effect.
Focus directions function primarily by shifting attention away from 'attention sinks' (start tokens) toward the relevant context.
There is a 'sweet spot' for intervention strength (alpha=0.3); too much intervention amplifies attention to irrelevant distractors as well.

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanism (Keys, Queries, Values)
Mechanistic interpretability (activation steering/addition)
Long-context LLM challenges ('Lost in the Middle')

Key Terms

contextual heads: A specific subset of attention heads in a Transformer that allocate the most attention weight to relevant context spans during correct answer generation.

focus directions: Vectors found in the key and query activation spaces that, when added to the model's activations, increase the attention weights assigned to relevant context spans.

split-softmax: A technique to artificially re-weight attention distributions by applying different scaling factors to specific token spans (e.g., boosting relevant context) before the softmax normalization.

contextual scoring: A metric proposed in this paper to quantify how much an attention head focuses on the gold-standard relevant context tokens during the generation of response tokens.

activation addition: A steering method where a specific vector is added to the internal representations (activations) of a model during inference to modify its behavior.

attention sink: The phenomenon where attention heads allocate a large amount of attention to the initial tokens (like the start-of-sequence token) regardless of their semantic importance.

Exact Match (EM): An evaluation metric that counts a prediction as correct only if it exactly matches one of the ground truth answers.