
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models

Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi
Microsoft Research, Stanford University, University of Illinois Urbana-Champaign
arXiv (2023)
Factuality Benchmark

📝 Paper Summary

Hallucination suppression, Mechanistic interpretability
SAT Probe predicts factual errors in Large Language Models by monitoring how strongly the model attends to specific constraint tokens in the prompt during generation.
Core Problem
LLMs frequently generate confident but factually incorrect text (hallucinations). Existing detection methods treat the model as a black box, which is expensive, while mechanistic studies focus only on how facts are recalled correctly.
Why it matters:
  • Safety-critical applications require reliable factuality, but LLMs often produce hallucinations that look confident.
  • Current black-box verification methods (e.g., self-critique) are often unreliable or prohibitively expensive due to multiple generation steps.
  • Prior mechanistic work focuses on how facts are retrieved correctly, leaving the mechanisms of failure and error generation largely unexplored.
Concrete Example: In a query like 'What year was basketball player [Name] born?', if the model fails to attend strongly to the player's name (the constraint) while generating the year, it is likely to hallucinate an incorrect year.
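To make the example concrete, here is a minimal sketch of how attention to a constraint could be measured. It assumes raw attention weights from the final prompt position are available as an array of shape (layers, heads, seq_len); the function name `constraint_attention`, the toy attention values, and the token positions are all hypothetical, and raw attention weights are used here as a simple stand-in for whatever attention statistic the paper actually computes.

```python
import numpy as np

def constraint_attention(attn, constraint_idx):
    """Sum attention mass placed on the constraint tokens.

    attn: array of shape (layers, heads, seq_len) holding the attention
          weights from the final prompt position to every prompt token.
    constraint_idx: positions of the constraint tokens (e.g. the player's name).
    Returns an array of shape (layers, heads).
    """
    attn = np.asarray(attn, dtype=float)
    return attn[:, :, constraint_idx].sum(axis=-1)

# Toy attention weights (random, softmax-normalised); not real model output.
rng = np.random.default_rng(0)
layers, heads, seq_len = 4, 2, 10
logits = rng.normal(size=(layers, heads, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

constraint_idx = [5, 6]  # hypothetical positions of the name tokens
mass = constraint_attention(attn, constraint_idx)
print(mass.shape)
```

Low values in `mass` across layers and heads would correspond to the "fails to attend strongly to the name" case described above.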
Key Novelty
Constraint Satisfaction Problem (CSP) framework for Factuality
  • Models factual queries as Constraint Satisfaction Problems (CSPs), where specific entities in the prompt (e.g., a director's name) act as constraints that the answer must satisfy.
  • Identifies a strong correlation between the intensity of attention paid to these constraint tokens and the factual accuracy of the output.
  • Proposes SAT Probe, a lightweight classifier that uses these internal attention patterns to predict whether a generated response will be factually correct or incorrect.
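The probe idea in the bullets above can be sketched as a small linear classifier over attention features. This is an illustrative assumption, not the authors' implementation: it fits a plain logistic regression by gradient descent on synthetic data in which correct answers have higher attention to the constraint. The helper names (`fit_probe`, `predict_proba`), the feature dimensions, and the data itself are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_probe(X, y, lr=0.5, steps=500):
    """Fit logistic regression with batch gradient descent.

    X: (n_samples, n_features) attention-to-constraint features,
       one feature per (layer, head) pair.
    y: (n_samples,) labels, 1.0 = factually correct, 0.0 = incorrect.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(X, w, b):
    """Predicted probability that the response is factually correct."""
    return sigmoid(X @ w + b)

# Synthetic data (an assumption for illustration only): correct answers
# are given higher attention to the constraint tokens than incorrect ones.
rng = np.random.default_rng(0)
n, d = 400, 8
y = rng.integers(0, 2, size=n).astype(float)
X = 0.2 + 0.5 * y[:, None] + rng.normal(scale=0.3, size=(n, d))

w, b = fit_probe(X[:300], y[:300])
pred = predict_proba(X[300:], w, b) > 0.5
acc = np.mean(pred == (y[300:] == 1.0))
print(f"held-out accuracy on synthetic data: {acc:.2f}")
```

The accuracy printed here only reflects how the synthetic data was constructed; it says nothing about the results reported in the paper.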
Evaluation Highlights
  • SAT Probe predicts factual errors with performance comparable to the LLM's own confidence scores, but using only attention patterns.
  • Can predict factual errors halfway through the forward pass, allowing computation to be stopped early to save costs.
  • Validated across the Llama-2 family (7B, 13B, 70B) on a suite of 10 datasets containing over 40,000 prompts.
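The early-stopping idea mentioned above might look roughly like the following sketch: apply a probe to features from only the first few layers and abort if the predicted chance of a correct answer is low. The function `early_failure_check`, the probe weights, the threshold, and the attention values are hypothetical placeholders chosen for illustration.

```python
import numpy as np

def early_failure_check(attn_mass, w, b, n_layers_used, threshold=0.5):
    """Decide whether to stop early using only the first layers.

    attn_mass: (layers, heads) attention mass on the constraint tokens.
    w, b: weights and bias of a linear probe trained on the first
          n_layers_used layers (flattened layer-major).
    Returns (stop, p_correct).
    """
    feats = np.asarray(attn_mass, dtype=float)[:n_layers_used].ravel()
    p_correct = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    return bool(p_correct < threshold), float(p_correct)

# Hypothetical probe for 2 layers x 2 heads; values are illustrative only.
w = np.array([2.0, 2.0, 2.0, 2.0])
b = -2.0

weak = np.full((4, 2), 0.05)   # little attention to the constraint
strong = np.full((4, 2), 0.6)  # strong attention to the constraint

stop_weak, p_weak = early_failure_check(weak, w, b, n_layers_used=2)
stop_strong, p_strong = early_failure_check(strong, w, b, n_layers_used=2)
print(stop_weak, stop_strong)
```

In a real pipeline the remaining layers would simply not be computed when `stop` is true; how the threshold is chosen would depend on the acceptable trade-off between saved compute and missed correct answers.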
Breakthrough Assessment
7/10
Establishes a novel link between attention patterns and hallucination, offering a white-box alternative to confidence scores. Valuable for efficient error detection, though primarily validated on specific constraint-based tasks.