CoKE: Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals

📝 Paper Summary

Hallucination suppression, Knowledge internalization

CoKE teaches LLMs to identify their own knowledge boundaries using internal confidence signals and fine-tunes them to honestly decline unknown questions while answering known ones.

Core Problem

LLMs struggle to admit ignorance because their training encourages generating text even when they lack relevant parametric knowledge, leading to hallucinations in 'Unknown Unknowns' scenarios.

Why it matters:

Hallucinations severely undermine user trust, preventing the adoption of LLMs in high-stakes domains like medicine and law
Current methods often rely on costly external annotations or complex uncertainty thresholds that are hard for users to manage
Models often lack consistency, expressing ignorance in one phrasing but hallucinating an answer when prompted differently about the same fact

Concrete Example: When asked 'panda is a national animal of which country', an LLM that lacks the relevant knowledge may hallucinate an answer. CoKE trains it to recognize low internal confidence (measured by a signal such as Min-Prob) and respond 'I don't know' instead.

Key Novelty

Confidence-derived Knowledge Boundary Expression (CoKE)

Probes the model's internal confidence (using minimum token probability) on unlabeled questions to automatically classify them as 'known' or 'unknown' without external supervision
Fine-tunes the model to align its verbal responses with these internal signals, teaching it to refuse 'unknown' questions and answer 'known' ones
Uses a multi-prompt consistency loss to ensure the model maintains the same knowledge boundary across different question formulations (prior, direct, and posterior awareness)

Architecture

The CoKE workflow consists of two stages: Probing and Training.

Evaluation Highlights

Significant improvement in knowledge boundary awareness (S_aware) on in-domain data (e.g., +0.263 absolute on TriviaQA with Llama-3-8B-Instruct, from 0.589 to 0.852)
Strong generalization to out-of-domain datasets (e.g., +0.136 absolute on TruthfulQA with Llama-3-8B-Instruct, from 0.542 to 0.678)
Reduces 'Unknown Unknowns' (hallucinations where the model is unaware of its ignorance) while maintaining high accuracy on known questions

Breakthrough Assessment

7/10

Offers a practical, unsupervised method for hallucination reduction via refusal. The consistency regularization is a clever addition, though reliance on simple probability thresholds is a known technique.

⚙️ Technical Details

Problem Definition

Setting: Single-hop factual question answering where the model must decide to answer or decline based on parametric knowledge

Inputs: Natural language question q

Outputs: Answer y or a refusal response (expression of ignorance)

Pipeline Flow

Probing Stage: Generate answers for dataset Q and calculate confidence signals (Min-Prob)
Data Splitting: Categorize Q into Known (D_k) and Unknown (D_unk) based on confidence thresholds
Instruction Tuning: Fine-tune model using LoRA with consistency regularization across prompt types
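The probing and splitting steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy probabilities, the threshold values, the inclusive comparisons, and the choice to set aside questions that fall between the two thresholds are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def min_prob(token_probs: List[float]) -> float:
    """Return the smallest per-token probability of a generated answer."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return min(token_probs)


def split_by_confidence(
    probed: Dict[str, List[float]],
    delta_k: float,
    delta_unk: float,
) -> Tuple[List[str], List[str], List[str]]:
    """Split questions into known, unknown, and in-between sets.

    Assumption: scores >= delta_k count as known, scores <= delta_unk
    count as unknown, and everything in between is set aside.
    """
    known, unknown, in_between = [], [], []
    for question, probs in probed.items():
        score = min_prob(probs)
        if score >= delta_k:
            known.append(question)
        elif score <= delta_unk:
            unknown.append(question)
        else:
            in_between.append(question)
    return known, unknown, in_between


# Hypothetical probing output: question id -> probabilities of the answer tokens.
probed = {
    "q1": [0.98, 0.95, 0.97],
    "q2": [0.90, 0.12, 0.88],
    "q3": [0.80, 0.60, 0.75],
}

# Threshold values are illustrative only; the summary does not state them.
d_k, d_unk, set_aside = split_by_confidence(probed, delta_k=0.9, delta_unk=0.3)
print(d_k, d_unk, set_aside)
```

With these toy numbers, q1 lands in D_k, q2 in D_unk, and q3 falls between the thresholds.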

System Modules

Confidence Prober

Generate answers and extract confidence scores to label data automatically

Model or implementation: Target LLM (e.g., Llama-3-8B-Instruct)

Instruction Tuner

Fine-tune the model to express ignorance for D_unk and confidence for D_k

Model or implementation: Target LLM (e.g., Llama-3-8B-Instruct) with LoRA

Novel Architectural Elements

Consistency Regularization Loss: A specific loss term penalizing the squared difference in confidence probabilities for the same question across three different prompt types (Prior, Direct, Posterior awareness)

Modeling

Base Model: Llama-3-8B-Instruct (the experiments also mention Llama-2-7b-chat and Vicuna-7b-v1.5)

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Objective Functions:

Purpose: Ensure the model generates correct answers for known facts and refusals for unknown facts.

Formally: Standard Cross-Entropy Loss (L_sft) on the target response.

Purpose: Enforce semantic consistency across different prompt types for the same knowledge.

Formally: L_cons = Mean Squared Error of the probability of 'affirmative' tokens (e.g., 'Yes') across Prior, Direct, and Posterior prompts.
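A small sketch of how the two objectives might be combined is given below. The exact form of L_cons is not spelled out in this summary, so the mean of squared pairwise differences is an assumption, as is the additive combination L_sft + alpha * L_cons. The function names and the toy probabilities are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def consistency_loss(affirmative_probs: Sequence[float]) -> float:
    """Mean squared pairwise difference of per-prompt 'affirmative' probabilities.

    Assumption: one probability per prompt type (prior, direct, posterior).
    """
    pairs = list(combinations(affirmative_probs, 2))
    if not pairs:
        raise ValueError("need probabilities from at least two prompt types")
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def total_loss(l_sft: float, affirmative_probs: Sequence[float], alpha: float) -> float:
    """Assumed combination: cross-entropy term plus weighted consistency term."""
    return l_sft + alpha * consistency_loss(affirmative_probs)


# Same confidence under all three prompt types: no consistency penalty.
consistent = consistency_loss([0.9, 0.9, 0.9])

# Confidence that swings with the prompt type is penalized.
inconsistent = consistency_loss([0.9, 0.5, 0.1])

print(consistent, inconsistent)
print(total_loss(l_sft=1.0, affirmative_probs=[0.9, 0.5, 0.1], alpha=0.5))
```

The penalty is zero only when all three prompt types yield the same probability, which is the behavior the consistency term is meant to encourage.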

Adaptation: LoRA (Low-Rank Adaptation) applied to attention layer weight matrices

Trainable Parameters: Low-rank adapter matrices attached to the attention layer weights (the base weights stay frozen under LoRA)
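To give a sense of scale, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. For a matrix of shape d_out x d_in and rank r, the two adapter matrices together hold r * (d_in + d_out) values. The dimensions and rank used here are illustrative assumptions, not values reported in the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out


# Illustrative sizes only (hypothetical projection matrix and rank).
d_in, d_out, rank = 4096, 4096, 8
adapter = lora_trainable_params(d_in, d_out, rank)
dense = full_matrix_params(d_in, d_out)
print(adapter, dense, adapter / dense)
```

Under these assumed sizes the adapter holds well under one percent of the parameters of the matrix it modifies.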

Training Data:

TriviaQA (divided into train/test)
Questions split into D_k and D_unk using Min-Prob thresholds (delta_k, delta_unk)

Key Hyperparameters:

delta_k: Min-Prob threshold for labeling a question as known (value not stated in the source text; presumably derived from the confidence distribution)
delta_unk: Min-Prob threshold for labeling a question as unknown
loss_function_alpha: Weighting factor for the consistency loss (value not stated in the source text)

Compute: Not reported in the paper

Comparison to Prior Work

vs. R-tuning: CoKE is unsupervised regarding ground-truth correctness; it relies on internal confidence signals rather than external labels
vs. RL-based methods: CoKE uses SFT with consistency regularization, avoiding the complexity of RL pipelines
vs. P(True) [not cited in paper]: CoKE fine-tunes the model to verbally refuse, whereas P(True) typically involves post-hoc calibration of probability scores without changing model weights

Limitations

Relies on single-hop factual questions; may not generalize to reasoning or multi-hop tasks
Depends on the quality of the base model's calibration (internal signals must somewhat reflect reality)
Thresholds (delta_k, delta_unk) for separating known/unknown data likely require careful tuning per model/domain
Does not inject new knowledge, only better manages existing parametric knowledge

Reproducibility

No code URL provided in the paper text. Datasets (TriviaQA, TruthfulQA, NQ, PopQA) are public. Method relies on self-generated confidence signals, making it relatively self-contained if thresholds are tuned.

📊 Experiments & Results

Evaluation Setup

Open-domain QA (single-hop) evaluating both answer accuracy and refusal capabilities

Benchmarks:

TriviaQA (factual QA, in-domain)
Natural Questions (NQ) (factual QA, out-of-domain)
PopQA (long-tail factual QA, out-of-domain)
TruthfulQA (factuality/hallucination, out-of-domain)

Metrics:

S_aware (Awareness Score)
R_k (Recall of Knowns: fraction of known questions answered correctly)
R_unk (Recall of Unknowns: fraction of unknown questions refused)
Statistical methodology: Not explicitly reported in the paper

Key Results

CoKE significantly improves knowledge boundary awareness (S_aware) compared to the base Llama-3-8B-Instruct model across various datasets.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	S_aware	0.589	0.852	+0.263
TruthfulQA	S_aware	0.542	0.678	+0.136

Ablation studies reveal that 'Min-Prob' (minimum token probability) is the most effective signal for estimating model confidence compared to First-Token or Product probabilities.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	Correlation/Performance (Qualitative)	Lower	Higher	Positive

Experiment Figures

A quadrant diagram categorizing Knowledge vs. Awareness.

Main Takeaways

Min-Prob is a superior proxy for sequence-level confidence compared to product of probabilities or first-token probability
Consistency regularization across different prompts (Prior, Direct, Posterior) helps the model internalize the concept of 'knowing' rather than just overfitting to specific phrasings
The method effectively converts 'Unknown Unknowns' (hallucinations) into 'Known Unknowns' (refusals) without degrading performance on 'Known Knowns'
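The three confidence signals compared in the first takeaway can be computed from the same list of token probabilities. The sketch below is only an illustration with made-up numbers; the intuition in the comments is one plausible reading and is not a claim taken from the paper.

```python
import math
from typing import Dict, List


def confidence_signals(token_probs: List[float]) -> Dict[str, float]:
    """Compute three sequence-level confidence proxies from token probabilities."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return {
        "min_prob": min(token_probs),
        "first_token": token_probs[0],
        "product": math.prod(token_probs),
    }


# Hypothetical answer with one uncertain token in the middle:
# the first-token signal does not see it, Min-Prob does.
weak_middle = confidence_signals([0.99, 0.20, 0.98])

# Hypothetical long answer whose tokens are all fairly confident:
# the product shrinks with length, Min-Prob does not.
long_answer = confidence_signals([0.95] * 20)

print(weak_middle)
print(long_answer)
```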

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) fine-tuning (specifically LoRA)
Basic probability concepts in language generation (token probabilities, logits)
Familiarity with hallucination and calibration in LLMs

Key Terms

Knowledge Boundary: The demarcation between what an LLM explicitly knows (parametric knowledge) and what it does not know

Min-Prob: The minimum probability assigned to any single token within a generated sequence, used here as a proxy for the model's confidence in the whole sequence

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers

S_aware: A custom metric that averages the model's ability to answer known questions and to refuse unknown questions: (Recall of Knowns + Recall of Unknowns) / 2
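As a worked example of this definition, the snippet below computes S_aware from raw counts. The function names and the counts are hypothetical and are not results from the paper.

```python
def recall(hits: int, total: int) -> float:
    """Fraction of items in a group that were handled correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


def s_aware(known_correct: int, n_known: int, unknown_refused: int, n_unknown: int) -> float:
    """Average of Recall of Knowns (R_k) and Recall of Unknowns (R_unk)."""
    r_k = recall(known_correct, n_known)
    r_unk = recall(unknown_refused, n_unknown)
    return (r_k + r_unk) / 2


# Made-up counts: 80 of 100 known questions answered correctly,
# 60 of 100 unknown questions refused.
score = s_aware(known_correct=80, n_known=100, unknown_refused=60, n_unknown=100)
print(round(score, 3))
```

Because the two recalls are averaged, a model that refuses everything (R_unk = 1, R_k = 0) scores no better than one that answers everything.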

Parametric Knowledge: Information stored within the model's pre-trained weights, as opposed to information provided in a prompt context

Greedy Decoding: A generation strategy where the model always selects the token with the highest probability at each step
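A toy sketch of greedy decoding follows. It also records the probability of each chosen token, which is the kind of per-token information a signal such as Min-Prob is computed from. The lookup-table "model", its probabilities, and the function names are hypothetical stand-ins for a real LLM.

```python
from typing import Callable, Dict, List, Tuple


def greedy_decode(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    max_steps: int,
    eos: str = "<eos>",
) -> Tuple[List[str], List[float]]:
    """Pick the most probable token at every step until eos or max_steps."""
    tokens: List[str] = []
    probs: List[float] = []
    for _ in range(max_steps):
        dist = next_token_probs(tokens)
        token = max(dist, key=dist.get)  # highest-probability token
        if token == eos:
            break
        tokens.append(token)
        probs.append(dist[token])
    return tokens, probs


# Toy next-token distributions keyed by the tokens generated so far.
TOY_TABLE = {
    (): {"China": 0.7, "India": 0.2, "<eos>": 0.1},
    ("China",): {"<eos>": 0.9, "and": 0.1},
}


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical model: look up the distribution for the current prefix."""
    return TOY_TABLE[tuple(prefix)]


tokens, probs = greedy_decode(toy_model, max_steps=5)
print(tokens, probs)
```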