← Back to Paper List

LLMs Must Be Taught to Know What They Don’t Know

(NYU) Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah GOldblum, Andrew Gordon Wilson
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Fudan University, Carnegie Mellon University, Generative AI Research Lab (GAIR)
NeurIPS (2024)
Factuality Benchmark QA

📝 Paper Summary

Hallucination suppression Knowledge internalization
This paper establishes a framework for honest AI alignment by training models to distinguish known from unknown questions and appropriately refuse to answer the latter without becoming overly conservative.
Core Problem
LLMs frequently hallucinate answers to questions they do not know, and aligning them for honesty is difficult because models lack internal transparency regarding their own knowledge boundaries.
Why it matters:
  • Users cannot trust model outputs if the AI fabricates coherent but factually incorrect information (hallucinations)
  • Current alignment efforts focus heavily on helpfulness and harmlessness, leaving honesty under-explored despite its critical role in AI safety
  • Simply training models to refuse answers can lead to excessive caution, where models refuse to answer questions they actually know
Concrete Example: A model might confidently provide a fabricated birthdate for a minor historical figure it wasn't trained on. Conversely, an overly cautious model might refuse to answer 'What is the capital of France?' despite knowing the answer.
Key Novelty
External Approximation of Knowledge Boundaries for Honesty Alignment
  • Defines honesty based on Confucian philosophy: saying 'I know' when you know, and 'I don't know' when you don't
  • Approximates a model's internal 'knowledge' by checking if it can answer a question correctly; if it can't, it should be trained to refuse
  • Introduces 'prudence' and 'over-conservativeness' metrics to measure the trade-off between refusing unknown questions and answering known ones
Architecture
Architecture Figure Figure 2
The framework for honesty alignment, contrasting general alignment with honesty-specific alignment.
Evaluation Highlights
  • Proposed fine-tuning methods significantly improve honesty scores compared to unaligned baselines on knowledge-intensive tasks
  • Achieves high prudence (refusing unknown questions) while maintaining low over-conservativeness (still answering known questions)
  • Demonstrates that alignment for honesty has a low 'tax' on general helpfulness
Breakthrough Assessment
7/10
Establishes a solid formal definition and metric framework for a vague concept (honesty) and provides practical fine-tuning baselines, though the method relies on external correctness rather than internal state.
×