Unfamiliar Finetuning Examples Control How Language Models Hallucinate

📝 Paper Summary

Hallucination suppression, RLHF (Reinforcement Learning from Human Feedback)

The paper demonstrates that LLMs hallucinate on unfamiliar queries by defaulting to the behavior learned from unfamiliar training examples, and proposes using 'conservative' reward models to mitigate this during RL fine-tuning.

Core Problem

When LLMs face queries about concepts outside their pre-training knowledge ('unfamiliar' inputs), they tend to hallucinate plausible-sounding but incorrect answers instead of expressing uncertainty.

Why it matters:

Users rely on LLMs for factual information, but current models confidently fabricate details about obscure topics rather than admitting ignorance.
Standard RLHF approaches often fail to fix this because reward models themselves hallucinate on unfamiliar inputs (they overestimate response quality), which reinforces incorrect generations.
Understanding the specific mechanism of hallucination allows for targeted interventions in training data rather than generic scaling solutions.

Concrete Example: If a user asks for a biography of a non-existent or obscure person, a standard model invents a fake life story. The paper shows this happens because the model's fine-tuning data included similar unfamiliar questions labeled with confident answers, teaching the model to 'guess' rather than abstain.

Key Novelty

Conservative Reward Models for RL Factuality Finetuning

Identifies that LLM hallucinations on unfamiliar inputs mimic the distribution of responses in the model's 'unfamiliar' fine-tuning examples (e.g., if trained to guess on unknowns, it guesses; if trained to say 'I don't know', it abstains).
Proposes training 'conservative' reward models that explicitly avoid overestimating rewards for unfamiliar inputs, unlike standard reward models that often confidently rate hallucinations as correct.
Uses these conservative rewards during RL fine-tuning to teach the generator model to abstain or provide safer responses when facing knowledge gaps.

Architecture

Conceptual illustration of the 'intelligent blind guess' mechanism. It compares two SFT models: one trained on dataset A (unfamiliar examples = 'I don't know') and one on dataset B (unfamiliar examples = random hallucinations).

Evaluation Highlights

Using conservative reward models for RL fine-tuning reduces factual error rates significantly compared to standard SFT and RL baselines on biography generation tasks.
Controlled experiments on TriviaQA show that relabeling unfamiliar SFT examples to 'I don't know' successfully steers the model to abstain on new unfamiliar queries.
On MMLU, modifying the fine-tuning label distribution for unfamiliar questions (e.g., 50% B, 50% C) causes the model's test-time predictions on unrelated unfamiliar questions to converge to that exact distribution.
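
The relabeling experiment described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Example` class, the `unfamiliar` flag, and the toy questions are hypothetical, and the flag is assumed to have been computed beforehand (for instance from few-shot accuracy).

```python
from dataclasses import dataclass, replace

ABSTAIN = "I don't know"  # abstention string quoted in the summary


@dataclass(frozen=True)
class Example:
    question: str
    answer: str
    unfamiliar: bool  # hypothetical flag, assumed to be computed upstream


def relabel_unfamiliar(examples, abstain=ABSTAIN):
    """Return a new SFT set in which unfamiliar examples are labeled with an abstention."""
    return [replace(ex, answer=abstain) if ex.unfamiliar else ex for ex in examples]


# Toy data, invented purely for illustration.
sft_data = [
    Example("What is the capital of France?", "Paris", unfamiliar=False),
    Example("Who won the 1907 village chess cup?", "J. Smith", unfamiliar=True),
]
relabeled = relabel_unfamiliar(sft_data)
```

Familiar examples keep their original labels, so only the supervision on the unfamiliar subset changes. According to the summary, that subset is what drives test-time behavior on new unfamiliar queries.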

Breakthrough Assessment

7/10

Provides a strong mechanistic explanation for hallucination patterns and a practical RL-based solution (conservative reward models). The finding that test-time hallucinations mirror fine-tuning distributions is a significant insight.

⚙️ Technical Details

Problem Definition

Setting: Open-ended long-form generation and short-form QA under distribution shift (unfamiliar queries)

Inputs: Natural language query x (specifically 'unfamiliar' queries where the concept is not in the pre-training knowledge)

Outputs: Response y (either a direct answer, a long-form text, or an abstention)

Pipeline Flow

Input Query x
LLM Generation (f_theta)
Output Response y

System Modules

Generator Model

Generates response to the query

Model or implementation: Llama 2 7B

Novel Architectural Elements

Conservative Reward Model training pipeline: Splits reward model training data into familiar/unfamiliar sets and optimizes to avoid overestimation on the unfamiliar set.
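
One way to picture the familiar/unfamiliar split is the sketch below. It is an assumption-laden illustration, not the paper's procedure: the record fields (`unfamiliarity`, `reward`), the threshold, and the idea of capping unfamiliar targets at a pessimistic value are all hypothetical stand-ins for "avoid overestimation on the unfamiliar set".

```python
def split_by_familiarity(records, threshold):
    """Split reward-model training records into (familiar, unfamiliar) lists."""
    familiar = [r for r in records if r["unfamiliarity"] < threshold]
    unfamiliar = [r for r in records if r["unfamiliarity"] >= threshold]
    return familiar, unfamiliar


def conservative_targets(records, threshold, pessimistic_reward=0.0):
    """Cap reward targets on unfamiliar records at a pessimistic value.

    Assumption: capping is a simple stand-in for a conservative objective;
    the summary does not specify how overestimation is actually avoided.
    """
    familiar, unfamiliar = split_by_familiarity(records, threshold)
    capped = [{**r, "reward": min(r["reward"], pessimistic_reward)} for r in unfamiliar]
    return familiar + capped


# Toy records, invented for illustration.
rm_data = [
    {"unfamiliarity": 0.1, "reward": 0.9},
    {"unfamiliarity": 0.8, "reward": 0.7},
    {"unfamiliarity": 0.9, "reward": -1.0},
]
rm_targets = conservative_targets(rm_data, threshold=0.5)
```

The design intent is that a reward model fit to such targets has no high-reward unfamiliar examples to imitate, so its default prediction on unfamiliar inputs stays low.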

Modeling

Base Model: Llama 2 7B

Training Method: RL (PPO) and SFT

Objective Functions:

Purpose: Minimize aggregate loss over unfamiliar examples.

Formally: $P_{\mathrm{unf}}(y) = \arg\min_{P(y)} \sum_{i \in D_{\mathrm{unf}}} \mathrm{Loss}(P(y), s_i)$

Purpose: Standard RL optimization.

Formally: Maximize expected reward using PPO.
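
The first objective can be made concrete with a small worked example. Assuming the loss is cross-entropy (the summary leaves `Loss` generic), the input-independent distribution that minimizes aggregate loss over a label set is the empirical label distribution. The helper names below are hypothetical.

```python
import math
from collections import Counter


def blind_guess_distribution(labels):
    """Empirical label distribution: the cross-entropy minimizer that ignores the input."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def mean_cross_entropy(dist, labels, eps=1e-12):
    """Average negative log-likelihood of the labels under a fixed distribution."""
    return -sum(math.log(dist.get(y, 0.0) + eps) for y in labels) / len(labels)


# Unfamiliar training labels split evenly between B and C, as in the MMLU experiment.
unfamiliar_labels = ["B", "C", "B", "C"]
p_unf = blind_guess_distribution(unfamiliar_labels)
uniform = {k: 0.25 for k in "ABCD"}

loss_p_unf = mean_cross_entropy(p_unf, unfamiliar_labels)      # about log(2)
loss_uniform = mean_cross_entropy(uniform, unfamiliar_labels)  # about log(4)
```

Here `p_unf` puts half its mass on B and half on C, and it scores a lower loss than the uniform guess. This is consistent with the reported behavior of predictions converging to the unfamiliar training distribution.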

Adaptation: Full fine-tuning

Training Data:

TriviaQA and MMLU for controlled experiments
WikiBio and Book/Movie plot datasets for long-form generation

Key Hyperparameters:

unfamiliarity_threshold: Top 40% most difficult examples defined as unfamiliar (for TriviaQA/MMLU)
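
A minimal sketch of this thresholding, assuming difficulty is measured by the pre-trained model's few-shot accuracy on each example (lower accuracy means harder). The function name and the toy accuracies are hypothetical.

```python
def mark_unfamiliar(few_shot_accuracy, fraction=0.4):
    """Flag the `fraction` of examples with the lowest few-shot accuracy as unfamiliar."""
    n = len(few_shot_accuracy)
    k = round(fraction * n)
    # Sort indices from hardest (lowest accuracy) to easiest.
    order = sorted(range(n), key=lambda i: few_shot_accuracy[i])
    hardest = set(order[:k])
    return [i in hardest for i in range(n)]


# Toy per-example few-shot accuracies, invented for illustration.
accuracy = [0.9, 0.1, 0.5, 0.0, 1.0]
flags = mark_unfamiliar(accuracy)
# flags == [False, True, False, True, False]
```

With five examples and a 40% threshold, the two lowest-accuracy examples are flagged as unfamiliar.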

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard SFT: This paper explicitly manipulates the distribution of unfamiliar examples to control hallucination behavior.
vs. Standard RLHF: Identifies that standard reward models hallucinate on unfamiliar data; proposes 'conservative' reward models to mitigate this.
vs. Calibration methods [not cited in paper]: Focuses on modifying the training data/reward signal rather than post-hoc calibration or sampling strategies.

Limitations

Requires a method to identify 'unfamiliar' queries during training (e.g., via few-shot probing), which adds computational cost.
Conservative reward models might become overly risk-averse, potentially reducing the model's willingness to answer difficult but known questions.
Experiments are primarily conducted on Llama 2 7B; scaling effects to larger models are not explicitly tested.
Defining 'unfamiliarity' relies on the pre-trained model's own capabilities, which might be noisy.

Reproducibility

Code: https://github.com/katiekang1998/llm_hallucinations

The paper describes how the data is split into familiar and unfamiliar subsets based on the pre-trained model's few-shot accuracy.

📊 Experiments & Results

Evaluation Setup

Evaluation on held-out test queries split by 'unfamiliarity' level.

Benchmarks:

MMLU (Multiple-choice QA)
TriviaQA (Short-form QA)
WikiBio (Long-form biography generation)

Metrics:

Accuracy (MMLU/TriviaQA)
Prediction distribution entropy
Factuality (FActScore or a similar metric, for long-form generation)
Statistical methodology: Not explicitly reported in the paper
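
The prediction-distribution entropy metric listed above can be computed as follows. This sketch assumes Shannon entropy over the empirical distribution of predicted labels; the summary does not give the paper's exact definition, and the function name is hypothetical.

```python
import math
from collections import Counter


def prediction_entropy(predictions):
    """Shannon entropy (in bits) of the empirical distribution of predicted labels."""
    counts = Counter(predictions)
    total = len(predictions)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform_entropy = prediction_entropy(["A", "B", "C", "D"])  # 2.0 bits
two_way_entropy = prediction_entropy(["B", "C", "B", "C"])  # 1.0 bit
```

A model guessing uniformly over four options scores 2 bits, while one that has collapsed onto an even B/C split scores 1 bit, so lower entropy on unfamiliar queries signals a narrower default guess.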

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MMLU	Prediction distribution	Uniform (approx. 25% each, A-D)	50% B / 50% C	Matches unfamiliar training data
TriviaQA	Response type	Incorrect hallucinations	"I don't know"	Shift to abstention

MMLU: Controlled experiments demonstrating that LLM predictions on unfamiliar test queries drift toward the label distribution of unfamiliar training examples.
TriviaQA: Controlled experiments showing that modifying unfamiliar training labels controls the model's tendency to abstain.

Experiment Figures

Plots showing model prediction distributions on MMLU (top) and TriviaQA (bottom) as a function of test query unfamiliarity.

RL fine-tuned model behavior on MMLU and TriviaQA under different reward functions.

Main Takeaways

LLM predictions for unfamiliar queries default to the aggregate label distribution of unfamiliar examples in the fine-tuning set (the 'intelligent blind guess').
This mechanism holds true across SFT, RL (PPO), and Reward Modeling tasks.
Standard reward models 'hallucinate' by overestimating rewards for unfamiliar inputs, which corrupts RL fine-tuning.
Using 'conservative' reward models that avoid overestimation on unfamiliar data effectively reduces hallucinations in downstream RL fine-tuning for long-form tasks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Supervised Fine-Tuning (SFT)
Reward Modeling

Key Terms

unfamiliar inputs: Queries asking about concepts or entities that are not present or well-represented in the model's pre-training data.

unfamiliarity score: A metric quantifying how unknown a query is to the model, typically measured by the pre-trained model's few-shot performance or likelihood on that query.

conservative reward models: Reward models trained to avoid overestimating the quality of responses to unfamiliar queries, often by treating prediction on unfamiliar data as a distinct optimization target.

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, task-specific labeled dataset.

RL: Reinforcement Learning—a training method where an agent learns to make decisions by receiving rewards or penalties.

PPO: Proximal Policy Optimization—a specific reinforcement learning algorithm used to update the language model policy.

MMLU: Massive Multitask Language Understanding—a benchmark dataset testing knowledge across many subjects.

TriviaQA: A reading comprehension dataset containing question-answer pairs with evidence documents; used here as a short-form QA benchmark.

reward model hallucinations: Instances where the reward model assigns a high score to a factually incorrect response generated by the LLM.

intelligent blind guess: The response distribution that minimizes aggregate loss over a set of unfamiliar examples without relying on specific input features (e.g., always guessing 'C' or always saying 'I don't know').