Teaching Language Models to Hallucinate Less with Synthetic Tasks

📝 Paper Summary

Hallucination suppression, Synthetic data for alignment

SYNTRA reduces hallucination on hard-to-evaluate realistic tasks by optimizing system message prompts on a synthetic retrieval task where hallucination is easy to measure.

Core Problem

Optimizing LLMs to reduce hallucination on realistic tasks (like clinical reporting) is intractable because evaluating hallucination during training is expensive, slow, and error-prone.

Why it matters:

LLMs frequently fabricate entities and details even when necessary information is in context (e.g., medical reports, meeting summaries).
Current methods cannot evaluate hallucination efficiently at every optimization step (each gradient update), which makes direct optimization against hallucination infeasible for real-world tasks.

Concrete Example: If an LLM generates the fictional term 'fixed liability response' while summarizing a meeting, rule-based verifiers cannot easily catch it, and human checking is too slow for training loops.

Key Novelty

SYNTRA (Synthetic Transfer)

Designs a synthetic task (retrieving names from a list) where hallucination is mechanically easy to detect (is the output name in the input list?).
Optimizes the LLM's system message (via prefix-tuning) on this synthetic task to reduce hallucination.
Transfers the learned system message to realistic, hard-to-optimize tasks like clinical report generation.

Architecture

Overview of the SYNTRA framework pipeline.

Evaluation Highlights

Reduces hallucination rate by over 16 percentage points on the ACI-Bench clinical report task using Orca 13B.
Reduces ungrounded entities by 36.5% on ACI-Bench using Vicuna 13B, while preserving grounded entities.
Outperforms full model fine-tuning on the synthetic task, which counterintuitively increases hallucination on Orca.

Breakthrough Assessment

7/10

Novel approach using synthetic tasks as a proxy for transferable anti-hallucination behavior. Strong empirical results on specific tasks, but relies on the assumption that 'hallucination behavior' transfers universally.

⚙️ Technical Details

Problem Definition

Setting: Abstractive summarization where context c contains all necessary information to answer query q. The goal is to minimize hallucination rate l_hal.

Inputs: Prompt p consisting of context c and query q.

Outputs: Long-form text output generated by LLM.

Pipeline Flow

Synthetic Task Design: Create 'names retrieval' task
Optimization: Train system message postfix via prefix-tuning on synthetic task
Transfer: Apply learned system message to realistic tasks

System Modules

Synthetic Task Generator

Generates prompts containing random lists of names and queries asking to retrieve specific subsets.

Model or implementation: Procedural generation
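The generator and the mechanical hallucination check can be illustrated with a minimal sketch. The name pool, the prompt wording, and the "starts with a letter" query are assumptions made for illustration only; the summary says just that prompts contain random lists of names and ask for specific subsets.

```python
import random

# Hypothetical name pool; the actual source of names is not given in this summary.
NAME_POOL = ["Alice", "Bob", "Carla", "Dmitri", "Elena",
             "Farid", "Grace", "Hiro", "Ines", "Jamal"]

def make_example(rng, list_size=6):
    """Build one synthetic example: a random name list, a retrieval prompt, and its target."""
    names = rng.sample(NAME_POOL, list_size)
    letter = rng.choice(names)[0]  # picking from the list guarantees at least one match
    prompt = ("Names: " + ", ".join(names)
              + f"\nList every name above that starts with '{letter}'.")
    target = [n for n in names if n.startswith(letter)]
    return names, prompt, target

def hallucinated(output_names, input_names):
    """Return the output names that do not appear in the input list."""
    allowed = set(input_names)
    return [n for n in output_names if n not in allowed]

rng = random.Random(0)
names, prompt, target = make_example(rng)
print(prompt)
print(hallucinated(target + ["Zelda"], names))  # "Zelda" is not in NAME_POOL, so it is flagged
```

Because the check is a set-membership test, it can run at every training step, which is the property the realistic tasks lack.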

System Message Optimizer

Learns a continuous postfix to the system message that minimizes hallucination on synthetic data while keeping the model's output distribution on reference data close to that of the original model.

Model or implementation: Prefix-tuning on Vicuna-13B or Orca-13B

Inference Transfer

Applies the optimized system message to real-world tasks.

Model or implementation: LLM with learned prefix
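How a learned postfix might be reused across tasks can be sketched as below. This is a simplified illustration that treats the postfix as extra input-layer embeddings placed between the system message and the user prompt; the function name, the toy 2-dimensional vectors, and the input-layer-only treatment are assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]

def build_input_embeddings(system_emb: List[Vector],
                           postfix: List[Vector],
                           user_emb: List[Vector]) -> List[Vector]:
    """Concatenate system message embeddings, the learned postfix, and the user prompt."""
    return system_emb + postfix + user_emb

# Toy 2-dimensional embeddings; real models use thousands of dimensions.
system_emb = [[0.1, 0.2], [0.3, 0.4]]             # frozen system message tokens
postfix = [[0.9, -0.1], [0.2, 0.7], [-0.4, 0.5]]  # stand-in for values learned on the synthetic task

# The same postfix is reused unchanged for prompts from different realistic tasks.
clinical_prompt = [[0.5, 0.6]]
meeting_prompt = [[0.7, 0.8], [0.9, 1.0]]
seq_clinical = build_input_embeddings(system_emb, postfix, clinical_prompt)
seq_meeting = build_input_embeddings(system_emb, postfix, meeting_prompt)
print(len(seq_clinical), len(seq_meeting))
```

Only the postfix would be trained; the model weights and the original system message stay fixed, so transfer amounts to reusing the same vectors with a new prompt.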

Modeling

Base Model: Vicuna v1.1 (13B) and Orca (13B)

Training Method: Prefix-tuning (optimizing system message) OR Full fine-tuning

Objective Functions:

Purpose: Minimize hallucination on synthetic task.

Formally: l_hal(phi, theta; tau_syn) based on negative log-likelihood of the unique correct output.

Purpose: Preserve general model capabilities (regularization).

Formally: l_ref(phi, theta; D_ref) = E[KL(LLM_opt || LLM_orig)] on SQuAD reference data.
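The two objectives can be made concrete with a small numeric sketch. The helper names are hypothetical, and the convex combination `alpha * l_hal + (1 - alpha) * l_ref` is an assumed reading of the alpha weighting; the summary does not state the exact form.

```python
import math

def nll_of_target(token_probs):
    """Negative log-likelihood of the single correct output, from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(l_hal, l_ref, alpha=0.5):
    # Assumption: a convex combination weighted by alpha.
    return alpha * l_hal + (1 - alpha) * l_ref

# Toy values: probabilities the optimized model assigns to each correct target token.
l_hal = nll_of_target([0.9, 0.8, 0.95])

# Toy next-token distributions on one reference prompt (optimized vs. original model).
p_opt = [0.7, 0.2, 0.1]
p_orig = [0.6, 0.3, 0.1]
l_ref = kl_divergence(p_opt, p_orig)

print(combined_loss(l_hal, l_ref, alpha=0.5))
```

The KL term is zero when the optimized model matches the original on reference data, so it penalizes only drift away from the original behavior.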

Adaptation: Prefix-tuning (continuous postfix appended to system message)

Training Data:

Synthetic task: 100,000 examples of name retrieval
Reference data: 50,000 prompts from SQuAD (reading comprehension)

Key Hyperparameters:

alpha: 0.5 (weighting between synthetic and reference loss)
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. RLHF: Optimizes an exact objective on a synthetic domain rather than an approximate reward model.
vs. Contrastive learning: Uses a synthetic proxy task for supervision rather than real-world data pairs.
vs. Prompt engineering: Learns continuous embeddings rather than discrete text prompts.

Limitations

Depends on the design of a suitable synthetic task; only one task (names retrieval) was thoroughly tested.
Gains are uneven across models: the method reduces hallucination more effectively on Orca than on Vicuna.
Requires a transfer step that might still carry over spurious attributes despite regularization.
Evaluation relies heavily on GPT-4 as a proxy for human judgment.

Reproducibility

No code URL provided. Synthetic task generation logic is described in text. Hyperparameters for optimization (LR, batch size) are not detailed.

📊 Experiments & Results

Evaluation Setup

Abstractive summarization on realistic tasks.

Benchmarks:

MS MARCO (Search-and-retrieve QA from documents)
QMSum (Meeting summarization)
ACI-Bench (Automated clinical report generation)

Metrics:

Hallucination Rate (evaluated by GPT-4)
Ungrounded/Grounded Entity Count (NER-based)
BLEU
ROUGE-1/2/L

Statistical methodology: Standard deviation reported across examples, but no specific significance tests (e.g., t-tests) explicitly detailed.
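The grounded/ungrounded entity metric can be sketched as follows. The sketch assumes entities have already been extracted from the model output by some NER system, and it uses case-insensitive substring matching against the source as the grounding rule; both the function name and the matching rule are illustrative assumptions rather than the paper's exact procedure.

```python
def split_entities(entities, source_text):
    """Partition output entities into those found in the source and those that are not."""
    src = source_text.lower()
    grounded = [e for e in entities if e.lower() in src]
    ungrounded = [e for e in entities if e.lower() not in src]
    return grounded, ungrounded

source = "The patient reports chest pain and was prescribed aspirin."
output_entities = ["Aspirin", "chest pain", "ibuprofen"]  # hypothetical NER output

grounded, ungrounded = split_entities(output_entities, source)
print(len(grounded), len(ungrounded))  # prints: 2 1
```

Tracking both counts matters: a model could trivially lower the ungrounded count by mentioning fewer entities overall, so the grounded count shows whether correct details are preserved.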

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ACI-Bench	Hallucination Rate (GPT-4)	47.0	28.6	-18.4
ACI-Bench	Hallucination Rate (GPT-4)	50.2	42.3	-7.9
MS MARCO	Hallucination Rate (GPT-4)	12.2	10.5	-1.7
Average across 3 tasks	Hallucination Rate (GPT-4)	25.8	31.5	+5.7
ACI-Bench	Ungrounded Entities	9.4	6.3	-3.1
MS MARCO	ROUGE-L	25.3	29.7	+4.4
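As a quick consistency check, the Δ column can be recomputed from the Baseline and This Paper columns. The snippet below copies the rows from the table above; the tuple layout and helper name are illustrative choices.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table above
rows = [
    ("ACI-Bench", "Hallucination Rate (GPT-4)", 47.0, 28.6, -18.4),
    ("ACI-Bench", "Hallucination Rate (GPT-4)", 50.2, 42.3, -7.9),
    ("MS MARCO", "Hallucination Rate (GPT-4)", 12.2, 10.5, -1.7),
    ("Average across 3 tasks", "Hallucination Rate (GPT-4)", 25.8, 31.5, 5.7),
    ("ACI-Bench", "Ungrounded Entities", 9.4, 6.3, -3.1),
    ("MS MARCO", "ROUGE-L", 25.3, 29.7, 4.4),
]

def recompute_delta(baseline, ours):
    """Difference 'this paper minus baseline', rounded to one decimal like the table."""
    return round(ours - baseline, 1)

mismatches = [r for r in rows if recompute_delta(r[2], r[3]) != r[4]]
print(mismatches)  # prints: []
```

For the hallucination and ungrounded-entity rows a negative Δ is an improvement, while for ROUGE-L a positive Δ is.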

Main Takeaways

Optimizing the system message is more effective than fine-tuning weights for reducing hallucination on Orca; fine-tuning can actually worsen hallucination.
Reference data (SQuAD) is critical to prevent the model from learning spurious attributes (like never outputting newlines) from the synthetic task.
SYNTRA successfully transfers 'hallucinate less' behavior from a simple names-retrieval task to complex tasks like clinical report generation.
The method significantly reduces the number of ungrounded entities (hallucinations) while maintaining the number of grounded entities (correct details).

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM fine-tuning vs. prompt engineering
Familiarity with prefix-tuning (optimizing continuous embeddings prepended to input)
Concept of hallucination/grounding in abstractive summarization

Key Terms

Prefix-tuning: A method that optimizes continuous vectors (embeddings) instead of updating model weights; here the vectors are appended to the system message as a postfix.

System message: High-level instructions given to an LLM (e.g., 'You are a helpful AI assistant') that govern its general behavior.

Abstractive summarization: Generating a summary that captures the main ideas of a text, often using new phrasing rather than extracting sentences.

Ungrounded entities: Named entities (e.g., people, places, medical terms) appearing in the output that do not exist in the source context.

Spurious attributes: Features of the synthetic task (like lack of newlines) that the model might overfit to, hurting performance on real tasks.

KL divergence: A statistical measure used here to ensure the optimized model does not drift too far from the original model's behavior on general reference data.