LLMs Must Be Taught to Know What They Don’t Know

📝 Paper Summary

Hallucination suppression Knowledge internalization

This paper establishes a framework for honest AI alignment by training models to distinguish known from unknown questions and appropriately refuse to answer the latter without becoming overly conservative.

Core Problem

LLMs frequently hallucinate answers to questions they do not know, and aligning them for honesty is difficult because models lack internal transparency regarding their own knowledge boundaries.

Why it matters:

Users cannot trust model outputs if the AI fabricates coherent but factually incorrect information (hallucinations)
Current alignment efforts focus heavily on helpfulness and harmlessness, leaving honesty under-explored despite its critical role in AI safety
Simply training models to refuse answers can lead to excessive caution, where models refuse to answer questions they actually know

Concrete Example: A model might confidently provide a fabricated birthdate for a minor historical figure it wasn't trained on. Conversely, an overly cautious model might refuse to answer 'What is the capital of France?' despite knowing the answer.

Key Novelty

External Approximation of Knowledge Boundaries for Honesty Alignment

Defines honesty based on Confucian philosophy: saying 'I know' when you know, and 'I don't know' when you don't
Approximates a model's internal 'knowledge' by checking if it can answer a question correctly; if it can't, it should be trained to refuse
Introduces 'prudence' and 'over-conservativeness' metrics to measure the trade-off between refusing unknown questions and answering known ones

Architecture

The framework for honesty alignment, contrasting general alignment with honesty-specific alignment.

Evaluation Highlights

Proposed fine-tuning methods significantly improve honesty scores compared to unaligned baselines on knowledge-intensive tasks
Achieves high prudence (refusing unknown questions) while maintaining low over-conservativeness (still answering known questions)
Demonstrates that alignment for honesty has a low 'tax' on general helpfulness

Breakthrough Assessment

7/10

Establishes a solid formal definition and metric framework for a vague concept (honesty) and provides practical fine-tuning baselines, though the method relies on external correctness rather than internal state.

⚙️ Technical Details

Problem Definition

Setting: Aligning a base model M_t to an honest version M_{t+1} using supervised fine-tuning

Inputs: Natural language question x

Outputs: Response y, categorized as 'correct', 'wrong', or 'idk' (I don't know)

Pipeline Flow

Knowledge Assessment (Determine if model knows x)
Data Synthesis (Generate training data based on knowledge boundaries)
Fine-tuning (Train model to output correct answers or 'idk')

System Modules

Categorization Function c(x, y)

Classifies a response as 'correct', 'wrong', or 'idk' based on substring matching of the ground truth or idk keywords

Model or implementation: Rule-based matcher

SFT Trainer

Fine-tunes the base model on synthesized honesty data

Model or implementation: LLaMA-2-7B-Chat or LLaMA-2-13B-Chat

Modeling

Base Model: LLaMA-2-7B-Chat and LLaMA-2-13B-Chat

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Maximize likelihood of honest responses (correct answer if known, 'idk' if unknown).

Formally: Standard Cross-Entropy Loss on synthesized data.

Adaptation: Full fine-tuning (implied by context of SFT alignment)

Training Data:

Determine known/unknown questions using the base model's zero-shot performance on training sets (e.g., TriviaQA, NQ, SQuAD)
Construct (x, idk) pairs for unknown questions
Construct (x, correct_answer) pairs for known questions

Comparison to Prior Work

vs. Prompting: The paper shows prompting alone is insufficient for honesty compared to fine-tuning
vs. Calibration methods (e.g., Lin et al., 2022a): Focuses on binary refusal (known/unknown) rather than confidence scores/probabilities

Limitations

Relies on external behavior (correctness) to approximate internal knowledge, which is an imperfect proxy
Does not address cases where alignment helps the model 'recover' knowledge it previously got wrong (the 'unknown -> correct' transition)
Does not explicitly model the 'wrong -> wrong' case (catastrophic forgetting) as dishonesty, but tracks it via accuracy

Reproducibility

Code: https://github.com/GAIR-NLP/alignment-for-honesty

Publicly available code and resources at https://github.com/GAIR-NLP/alignment-for-honesty. Detailed prompts provided in Table 2.

📊 Experiments & Results

Evaluation Setup

Knowledge-intensive Question Answering

Benchmarks:

TriviaQA (Open-domain QA)
Natural Questions (NQ) (Open-domain QA)
SQuAD (Reading Comprehension/QA)
TruthfulQA (Truthfulness benchmark)
MMLU (General Knowledge)

Metrics:

Prudence Score (ability to refuse unknown)
Over-conservativeness Score (tendency to refuse known)
Honesty Score (harmonic mean of prudence and inverse over-conservativeness)
Accuracy (on known questions)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
The paper introduces metrics but specific numeric tables for the performance of their proposed fine-tuning method vs baselines are not provided in the text snippet. The snippet focuses on definitions and methodology. Result tables are referenced (e.g., 'extensive experiments reveal...') but not fully content-dumped in the provided text. Therefore, specific delta values cannot be extracted.

Main Takeaways

Prompting alone is insufficient for enforcing honesty in LLMs.
Honesty-oriented Supervised Fine-Tuning (SFT) effectively improves the model's ability to refuse unknown questions (Prudence).
The proposed methods generalize well across various knowledge-intensive QA tasks.
Alignment for honesty does not significantly reduce the model's helpfulness (low 'alignment tax').

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT)
Language Model Alignment (Helpful, Harmless, Honest - HHH)
Knowledge Boundaries in LLMs

Key Terms

idk response: A response where the model explicitly admits inability to answer (e.g., 'I'm not able to', 'I'm not familiar with')

prudence score: A metric measuring the model's ability to refuse answering questions it does not know or would answer incorrectly

over-conservativeness score: A metric measuring the tendency of the model to refuse answering questions it actually knows and could answer correctly

alignment tax: The potential degradation in performance on other capabilities (like helpfulness) caused by aligning the model for a specific trait (like honesty)

knowledge boundary: The distinction between what information a model has accurately encoded during pre-training and what it has not

HHH: Helpful, Harmless, Honest—the three key criteria for AI alignment