Alleviating Hallucinations from Knowledge Misalignment in Large Language Models via Selective Abstention Learning

📝 Paper Summary

Hallucination suppression, Knowledge misalignment, Factuality in LLMs

SEAL introduces a training objective that allows models to reject tokens misaligned with their internal knowledge via a special abstention token, combined with a decoding strategy that penalizes uncertainty.

Core Problem

Supervised fine-tuning (SFT) often introduces new factual knowledge that the pre-trained model does not possess (knowledge misalignment). Forcing the model to learn these unknown samples encourages fabrication and hallucination.

Why it matters:

Knowledge misalignment between pre-training and fine-tuning is a primary cause of hallucinations in LLMs
Standard SFT forces models to blindly imitate all ground-truth answers, even for facts they don't know, leading to overfitting on misaligned knowledge
Existing solutions like filtering data or using self-generated data are either hard to scale (model-specific annotation) or lack quality guarantees

Concrete Example: If an LLM knows 'Barack Obama' but not 'Javier Alva Orlandini', standard SFT forces it to memorize facts about Javier. Later, when asked about another unknown entity like 'Bob Dylan's occupation', the model might hallucinate 'Australia Journalist' because it learned to fabricate facts rather than relying on its parametric knowledge.

Key Novelty

SEAL (SElective Abstention Learning)

Introduce a special [REJ] token during training that absorbs probability mass when the model's prediction conflicts with the ground truth, allowing the model to 'abstain' rather than memorize unknown facts
Use the learned probability of this [REJ] token during inference as a proxy for uncertainty to penalize low-confidence generation paths in beam search

Architecture

Overview of SEAL method illustrating Abstention Tuning and Abstention-aware Decoding.

Evaluation Highlights

+10.98% average accuracy improvement on short-form QA benchmarks for Llama-3-8B compared to standard SFT
+19.24% relative improvement in FActScore (24.95 → 29.75) on the Biography long-form QA dataset for Llama-3-8B compared to standard SFT
Outperforms strong baselines like POPULAR and FLAME across Llama-3-8B, Mistral-7B, and Mistral-Nemo-12B on six datasets

Breakthrough Assessment

7/10

Strong empirical results tackling a specific, well-motivated problem (knowledge misalignment). The method is elegant and effective, though it relies on standard beam search modifications.

⚙️ Technical Details

Problem Definition

Setting: Supervised fine-tuning of pre-trained LLMs on factual QA pairs (instruction-response)

Inputs: Instruction x (question)

Outputs: Response y (answer)

Pipeline Flow

Input Question
Generation with Beam Search
Uncertainty Penalty Calculation (via [REJ] token)
Token Selection

System Modules

LLM Backbone (Generation)

Predict next token logits

Model or implementation: Llama-3-8B, Mistral-7B-v0.3, or Mistral-Nemo-12B

Abstention-aware Decoder (Generation)

Modify beam search scores by penalizing sequences with high [REJ] probability

Model or implementation: Search algorithm modification
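The abstention-aware scoring idea can be sketched as a rescoring step over finished beams. This is a minimal illustration, not the paper's implementation: the exact penalty form is not given in this summary, so the sketch assumes the penalty is lambda times the summed per-step [REJ] probability, and the names `abstention_aware_score`, `pick_best`, and the beam dictionaries are hypothetical. A real decoder would apply the penalty during beam pruning rather than after the fact.

```python
def abstention_aware_score(token_logprobs, rej_probs, lam=1.0):
    """Sequence log-probability minus an uncertainty penalty.

    ASSUMPTION: the penalty is lam * sum of per-step [REJ] probabilities.
    The summary only states that a [REJ]-based term is subtracted.
    """
    return sum(token_logprobs) - lam * sum(rej_probs)


def pick_best(beams, lam=1.0):
    """Return the candidate with the highest penalized score."""
    return max(
        beams,
        key=lambda b: abstention_aware_score(b["logprobs"], b["rej_probs"], lam),
    )


# Two hypothetical candidates: A is more likely but the model is unsure
# (high [REJ] probability); B is slightly less likely but confident.
beams = [
    {"text": "A", "logprobs": [-0.2, -0.3], "rej_probs": [0.40, 0.35]},
    {"text": "B", "logprobs": [-0.4, -0.5], "rej_probs": [0.02, 0.03]},
]

with_penalty = pick_best(beams, lam=1.0)["text"]
without_penalty = pick_best(beams, lam=0.0)["text"]
print(with_penalty, without_penalty)
```

With the penalty switched off (lam=0.0) the higher-likelihood candidate A is chosen; with lam=1.0 the confident candidate B overtakes it, which is the behaviour the decoder is meant to encourage.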

Novel Architectural Elements

Introduction of a specialized [REJ] token in the vocabulary specifically for fine-tuning uncertainty calibration
Dynamic target distribution shifting mechanism during training based on prediction confidence

Modeling

Base Model: Llama-3-8B, Mistral-7B-v0.3, Mistral-Nemo-12B

Training Method: SEAL (Selective Abstention Learning)

Objective Functions:

Purpose: Allow model to shift probability to [REJ] token when ground truth is hard to predict.

Formally: $\mathcal{L}_{\text{nll}} = -\sum_t \big[ (1-\alpha_t)\log p(y_t) + \alpha_t \log p(\texttt{[REJ]}) \big]$, where $\alpha_t$ depends on the gap between the max logit and the ground-truth logit.

Purpose: Regularize to prevent excessive abstention when prediction is easy.

Formally: $\mathcal{L}_{\text{reg}} = -\sum_t \mathbb{I}_{\text{correct}} \log\big(1 - p(\texttt{[REJ]})\big)$, where $\mathbb{I}_{\text{correct}}$ is 1 if the model predicts correctly.
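The two objectives above can be illustrated for a single token position. This is a rough sketch under stated assumptions: the summary does not give the exact formula for alpha_t, so the code assumes alpha_t = min(tau, 1 - p(y_t) / max_v p(v)), with the max taken over ordinary tokens, and it does not model how the two terms are weighted or summed. The function name `seal_token_terms` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def seal_token_terms(logits, target, rej, tau=0.5):
    """Return (nll_term, reg_term, alpha) for one position.

    logits: scores over the vocabulary, including the [REJ] slot
    target: index of the ground-truth token
    rej:    index of the [REJ] token
    """
    probs = softmax(logits)
    # Highest probability among ordinary tokens (ASSUMPTION: [REJ] excluded).
    top = max(p for i, p in enumerate(probs) if i != rej)
    # ASSUMPTION: alpha grows as the ground truth falls behind the top
    # prediction, and is capped at tau.
    alpha = min(tau, 1.0 - probs[target] / top)
    nll = -((1.0 - alpha) * math.log(probs[target]) + alpha * math.log(probs[rej]))
    # Regularizer is active only when the model already predicts correctly.
    correct = probs[target] == top
    reg = -math.log(1.0 - probs[rej]) if correct else 0.0
    return nll, reg, alpha


# Vocabulary of 4 slots; index 3 plays the role of [REJ].
known = seal_token_terms([4.0, 1.0, 0.0, -2.0], target=0, rej=3)
unknown = seal_token_terms([0.0, 4.0, 1.0, -2.0], target=0, rej=3)
print(known, unknown)
```

In the "known" case the ground truth is already the top prediction, so alpha is 0 and only the regularizer discourages [REJ]. In the "unknown" case alpha saturates at tau, so half of the target mass is routed to [REJ] instead of forcing the model to memorize the token.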

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

10,000 short-form QA pairs (6k model-known, 4k model-unknown)
2,000 long-form QA pairs (1.2k model-known, 800 model-unknown)
Generated using GPT-4o/Llama-3.1-70B based on Wikipedia snapshots

Key Hyperparameters:

tau (upper bound for alpha): 0.5
learning_rate: 5e-6
batch_size: 128 (short-form), 32 (long-form)
epochs: 3
lambda (decoding penalty): 1.0
beam_size: 8

Compute: 8 NVIDIA A100-80GB GPUs

Comparison to Prior Work

vs. POPULAR: SEAL works at token-level abstention rather than sample-level filtering, making it more robust for long-form QA where mixed knowledge exists
vs. FLAME: SEAL doesn't rely on potentially hallucinated self-generated data
vs. R-TUNING: SEAL uses [REJ] as an internal uncertainty signal for regularization and decoding, rather than just an explicit output label
vs. DoLa: SEAL modifies the training objective to calibrate uncertainty, whereas DoLa is inference-only

Limitations

Relies on a heuristic ratio (ground-truth prob vs max prob) to determine abstention levels during training
Inference overhead is higher due to beam search compared to greedy decoding
Potential risk of generating biased or harmful content remains as with all LLMs

Reproducibility

Training data construction detailed in Appendix. Hyperparameters explicitly listed. Code availability not provided in paper text.

📊 Experiments & Results

Evaluation Setup

Short-form and Long-form Question Answering

Benchmarks:

TriviaQA (Short-form QA)
Natural Questions (NQ) (Short-form QA)
PopQA (Short-form QA, long-tail entities)
SimpleQA (Short-form QA, fact-seeking)
Biography (Long-form QA, biographies)
LongFact (Long-form QA, concepts/objects)

Metrics:

Accuracy (Short-form)
FActScore (Long-form)
F1@48 (Long-form)
Precision (Long-form)
Recall@48 (Long-form)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
SEAL consistently outperforms baselines on short-form QA across multiple models.
TriviaQA	Accuracy	64.09	66.52	+2.43
Natural Questions (NQ)	Accuracy	35.68	39.50	+3.82
PopQA	Accuracy	32.30	38.95	+6.65
SEAL shows significant gains in long-form factuality.
Biography	FActScore	24.95	29.75	+4.80
LongFact	F1@48	64.52	67.21	+2.69
Ablation studies confirm the contribution of both tuning and decoding components.
TriviaQA	Accuracy	65.61	66.52	+0.91
TriviaQA	Accuracy	64.09	66.52	+2.43
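The Δ column above is in absolute points, while some headline figures read more naturally as relative gains. The short script below recomputes both from the table values for the main comparison rows; the helper name `deltas` is hypothetical and the numbers are copied from the table, not newly measured.

```python
# (benchmark, metric, baseline, this paper) copied from the table above.
rows = [
    ("TriviaQA", "Accuracy", 64.09, 66.52),
    ("Natural Questions (NQ)", "Accuracy", 35.68, 39.50),
    ("PopQA", "Accuracy", 32.30, 38.95),
    ("Biography", "FActScore", 24.95, 29.75),
    ("LongFact", "F1@48", 64.52, 67.21),
]


def deltas(baseline, ours):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100.0 * (ours - baseline) / baseline, 2)
    return absolute, relative


summary = {name: deltas(base, ours) for name, _, base, ours in rows}
for name, (abs_d, rel_d) in summary.items():
    print(f"{name}: +{abs_d:.2f} points, +{rel_d:.2f}% relative")
```

For Biography this gives +4.80 points, which is a 19.24% relative gain over the baseline FActScore of 24.95.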

Experiment Figures

Histograms of predicted [REJ] token probability for model-known vs. model-unknown samples.

Ablation on hyperparameter tau (upper bound of alpha) and regularization loss.

Main Takeaways

Standard SFT on unknown knowledge degrades factuality significantly (e.g., Llama-3-8B FActScore drops from 29.26 to 24.95 after SFT)
SEAL effectively mitigates this degradation, recovering or surpassing pre-trained performance
The method generalizes well across different model sizes (7B, 8B, 12B) and families (Llama, Mistral)
The [REJ] token probability correlates well with hallucinations, serving as an effective uncertainty calibrator
SEAL maintains instruction-following capabilities (AlpacaEval, IFEval) unlike some aggressive filtering methods

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT) with Cross-Entropy Loss
Beam Search decoding strategies
Concept of Parametric Knowledge in LLMs

Key Terms

Knowledge Misalignment: Discrepancy between the factual knowledge embedded in a pre-trained model and the new knowledge introduced during fine-tuning

[REJ] token: A special token added to the vocabulary that the model learns to predict when it is uncertain or lacks knowledge about the ground truth

Abstention Tuning: A modified training objective where the model can minimize loss by assigning probability to a rejection token instead of the ground truth if the ground truth is hard to predict

Abstention-aware Decoding: A decoding strategy that subtracts a penalty term based on the [REJ] token's probability from the sequence score to avoid uncertain generation paths

FActScore: A metric for long-form generation that breaks text into atomic claims and verifies them against a knowledge base (Wikipedia)

Parametric Knowledge: Knowledge stored in the model's weights during pre-training, as opposed to knowledge provided in context or external retrieval

SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs

MLE: Maximum Likelihood Estimation—the standard training objective maximizing the probability of the ground truth tokens

DoLa: Decoding by Contrasting Layers—a method to improve factuality by contrasting logits from different layers

DPO: Direct Preference Optimization—a method to align language models to preferences without a reward model
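The FActScore definition above reduces, for a single response, to the fraction of atomic claims that are supported by the knowledge base. The sketch below assumes the claims have already been extracted and verified (in practice both steps are model-assisted), so it only shows the final aggregation; `factscore` and the example labels are hypothetical.

```python
def factscore(claim_supported):
    """Fraction of atomic claims marked as supported.

    claim_supported: list of booleans, one per atomic claim, where True
    means the claim was verified against the knowledge base.
    """
    if not claim_supported:
        raise ValueError("need at least one atomic claim")
    return sum(claim_supported) / len(claim_supported)


# A hypothetical biography split into four atomic claims, one unsupported.
labels = [True, True, False, True]
score = factscore(labels)
print(score)
```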