Towards Mitigating LLM Hallucination via Self Reflection

📝 Paper Summary

Hallucination mitigation, Medical Generative QA

A medical QA system that mitigates hallucinations by iteratively generating, scoring, and refining background knowledge and answers using the LLM's own self-reflective capabilities.

Core Problem

Large Language Models often generate plausible-sounding but unfaithful or nonsensical information (hallucinations) in medical contexts, where accuracy is critical.

Why it matters:

Inaccurate medical information can have severe consequences for patient care and safety
Rare, specialized medical concepts make it hard for general-purpose LLMs to stay faithful without external aid
N-gram overlap metrics (e.g., unigram F1, ROUGE-L) often fail to distinguish hallucinated answers from correct ones

Concrete Example: When asked 'What causes Noonan syndrome?', an LLM might confidently claim it is caused by a PTEN gene mutation (hallucination). The proposed method catches this fact inconsistency, prompts the model to self-correct, and eventually produces the correct answer involving PTPN11/SOS1 mutations.

Key Novelty

Interactive Self-Reflection Loop

Iterative process where the model generates background knowledge, scores its own factuality, and refines it until a threshold is met
Applied to both knowledge generation and final answer generation to ensure consistency and entailment
Uses the LLM itself as a scorer via in-context instruction learning rather than requiring external reward models or retrievers

Architecture

The interactive self-reflection workflow comprises three loops: Knowledge Acquiring, Knowledge-Consistent Answering, and Question-Entailment Answering.

Evaluation Highlights

Achieves up to ~5x higher sample-level Med-NLI (entailment) scores than direct generation baselines on PubMedQA (e.g., Alpaca-LoRA-7B improves from 0.0940 to 0.4640)
Reduces Query Inconsistency on PubMedQA to 0.00% for Vicuna-7B (down from 0.67%); ChatGPT stays at 0.00% while its tangentiality decreases
Consistent improvements across 5 medical datasets (PubMedQA, MedQuAD, MEDIQA2019, LiveMedQA2017, MASH-QA) and multiple LLMs (Vicuna, Alpaca-LoRA, ChatGPT, MedAlpaca)

Breakthrough Assessment

7/10

Effective, reference-free mitigation strategy that significantly improves faithfulness in medical QA without external retrieval, though it relies heavily on the model's ability to self-evaluate.

⚙️ Technical Details

Problem Definition

Setting: Generative Question Answering (GQA) in the medical domain

Inputs: Medical question Q

Outputs: Generated answer A grounded in self-generated factual knowledge

Pipeline Flow

Group 1: Knowledge Generation Loop (Generate Knowledge → Score Factuality → Refine)
Group 2: Answer Generation Loop (Generate Answer → Score Consistency → Refine)
Group 3: Entailment Verification Loop (Check Entailment → Restart if fail)
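
The three groups above can be read as two refinement loops nested inside a restart loop. The following is a minimal sketch of that control flow, not the authors' implementation: the `Steps` bundle, every function name, the thresholds, the iteration caps, and the "score >= threshold passes" convention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Steps:
    """Hypothetical bundle of LLM/metric calls; these names are not from the paper."""
    generate_knowledge: Callable[[str], str]
    score_factuality: Callable[[str, str], float]
    refine_knowledge: Callable[[str, str, float], str]
    generate_answer: Callable[[str, str], str]
    score_consistency: Callable[[str, str], float]
    refine_answer: Callable[[str, str, str, float], str]
    score_entailment: Callable[[str, str], float]


def self_reflect(
    question: str,
    steps: Steps,
    fact_threshold: float = 0.5,    # assumed value
    cons_threshold: float = 0.5,    # assumed value
    entail_threshold: float = 0.5,  # assumed value
    max_refine: int = 3,            # assumed cap on refinements per loop
    max_restarts: int = 2,          # assumed cap on full restarts
) -> str:
    answer = ""
    for _ in range(max_restarts + 1):
        # Group 1: generate knowledge, score factuality, refine until it passes.
        knowledge = steps.generate_knowledge(question)
        for _ in range(max_refine):
            score = steps.score_factuality(question, knowledge)
            if score >= fact_threshold:
                break
            knowledge = steps.refine_knowledge(question, knowledge, score)

        # Group 2: generate answer, score consistency with knowledge, refine.
        answer = steps.generate_answer(question, knowledge)
        for _ in range(max_refine):
            score = steps.score_consistency(knowledge, answer)
            if score >= cons_threshold:
                break
            answer = steps.refine_answer(question, knowledge, answer, score)

        # Group 3: accept the answer only if it is judged to address the question;
        # otherwise restart from knowledge generation.
        if steps.score_entailment(question, answer) >= entail_threshold:
            return answer
    # Budget exhausted: return the last answer rather than nothing.
    return answer
```

Passing the scorers in as callables keeps the sketch independent of any particular LLM API; in practice each one would wrap a prompt to the target model or a call to a metric model.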

System Modules

Knowledge Generator (Knowledge Acquisition)

Generate background knowledge based on the question

Model or implementation: Target LLM (e.g., Vicuna-7B, ChatGPT)

Factuality Scorer (Knowledge Acquisition)

Evaluate the factuality of the generated knowledge

Model or implementation: Target LLM (via in-context instruction)

Knowledge Refiner (Knowledge Acquisition)

Refine knowledge if score is below threshold

Model or implementation: Target LLM

Answer Generator (Answer Generation)

Generate answer based on refined knowledge

Model or implementation: Target LLM

Consistency Scorer (Answer Generation)

Evaluate consistency between answer and knowledge using CTRLEval

Model or implementation: CTRLEval metric model

Entailment Scorer

Check entailment between answer and question

Model or implementation: Sentence-BERT embedding similarity
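
As an illustration of an embedding-similarity check, the sketch below computes cosine similarity between two vectors and compares it with a threshold. It assumes the question and answer have already been encoded by a sentence encoder; the function names and the 0.8 threshold are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def passes_entailment(
    question_emb: Sequence[float],
    answer_emb: Sequence[float],
    threshold: float = 0.8,  # assumed value
) -> bool:
    """True if the answer embedding is close enough to the question embedding."""
    return cosine_similarity(question_emb, answer_emb) >= threshold
```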

Novel Architectural Elements

Double-loop feedback mechanism: Distinct loops for Knowledge Refinement (factuality) and Answer Refinement (consistency)
Self-scoring prompt template: Uses the LLM itself to generate a scalar factuality score based on defined criteria (Verifiability, Objectivity)
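
A self-scoring step of this kind can be sketched as a prompt builder plus a parser for the scalar the model returns. The template wording below is invented for illustration and is not the paper's prompt; only the two criteria names (Verifiability, Objectivity) come from the summary above. The `Score: <number>` reply format is likewise an assumption.

```python
import re
from typing import Optional

# Hypothetical template; the actual prompt is given in the paper's appendix.
SCORING_TEMPLATE = (
    "Rate the factuality of the background knowledge for the question below.\n"
    "Criteria: Verifiability, Objectivity.\n"
    "Reply on the first line as 'Score: <number>'.\n\n"
    "Question: {question}\n"
    "Knowledge: {knowledge}\n"
)

_SCORE_RE = re.compile(r"score\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def build_scoring_prompt(question: str, knowledge: str) -> str:
    """Fill the template with the question and the generated knowledge."""
    return SCORING_TEMPLATE.format(question=question, knowledge=knowledge)


def parse_score(reply: str) -> Optional[float]:
    """Extract the scalar after 'Score:'; return None if the reply has none."""
    match = _SCORE_RE.search(reply)
    return float(match.group(1)) if match else None
```

Returning `None` for an unparseable reply lets the caller decide whether to re-prompt or treat the sample as failing the threshold, which matters because the approach depends on the model's ability to score itself.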

Modeling

Base Model: Evaluated on multiple models: Vicuna-7B, Alpaca-LoRA-7B, ChatGPT (gpt-3.5-turbo), MedAlpaca-7B, Robin-medical-7B

Compute: Not reported in the paper

Comparison to Prior Work

vs. Read-before-Generate: Uses self-generated knowledge and iterative refinement instead of just extracting spans
vs. Standard RAG: Does not require an external document corpus or retriever; relies on parametric knowledge within the LLM
vs. Self-Consistency (Wang et al. 2022) [not cited in paper]: Focuses on iterative refinement via feedback scores rather than majority voting over multiple reasoning paths

Limitations

Does not entirely eliminate hallucination; models can still generate ungrounded information in complex scenarios
Computationally expensive due to multiple generation and scoring loops per question
Focus restricted to English medical queries; generalizability to other languages/domains not tested
Dependent on the model's capability to self-evaluate accurately (weaker models might fail to score correctly)

Reproducibility

Code: https://github.com/ziweiji/Self_Reflection_Medical

Code available on GitHub. Prompts for self-reflection and scoring are provided in the paper/appendix. Uses public datasets (PubMedQA, MedQuAD, etc.). Relies on specific LLM versions (e.g., Vicuna weights) and external metrics (CTRLEval, Med-NLI model).

📊 Experiments & Results

Evaluation Setup

Zero-shot generative QA on 5 medical datasets

Benchmarks:

PubMedQA: Biomedical QA (reasoning on abstracts)
MedQuAD: Medical QA (consumer health questions)
MEDIQA2019: Medical QA (clinical)
LiveMedQA2017: Medical QA
MASH-QA: Healthcare QA (multiple answer spans)

Metrics:

Med-NLI (Sample-level and Sentence-level entailment)
CTRLEval (Consistency aspect)
Unigram F1
ROUGE-L
Statistical methodology: Not explicitly reported in the paper
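
The two n-gram metrics listed above can be sketched in a few lines. This is a simplified illustration, assuming lowercase whitespace tokenization and an F1-style ROUGE-L (equal weight on precision and recall); the paper's exact tokenization and ROUGE settings may differ. The two example strings are toy inputs written for this sketch, loosely based on the Noonan syndrome example.

```python
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    return text.lower().split()


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (multiset overlap)."""
    pred, ref = _tokens(prediction), _tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as an F1 over LCS-based precision and recall."""
    pred, ref = _tokens(prediction), _tokens(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy strings: one wrong gene name barely moves either score.
hallucinated = "noonan syndrome is caused by pten mutations"
reference = "noonan syndrome is caused by ptpn11 mutations"
print(unigram_f1(hallucinated, reference), rouge_l_f1(hallucinated, reference))
```

Both scores come out at 6/7 for the toy pair even though the single differing token changes the medical claim, which is the kind of case where overlap metrics cannot separate a hallucinated answer from a correct one.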

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on PubMedQA showing significant improvement in NLI-based faithfulness metrics.
PubMedQA	Sample-level Med-NLI	0.0940	0.4640	+0.3700
PubMedQA	Sample-level Med-NLI	0.4684	0.6380	+0.1696
PubMedQA	Sample-level Med-NLI	0.5850	0.6824	+0.0974
Performance on MEDIQA2019 showing consistent gains.
MEDIQA2019	Sample-level Med-NLI	0.8400	0.8933	+0.0533
Ablation study on PubMedQA using Vicuna-7B_L, showing the necessity of refinement steps.
PubMedQA	Sample-level Med-NLI	0.4520	0.6380	+0.1860
PubMedQA	Sample-level Med-NLI	0.4940	0.6380	+0.1440
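
The Δ column reports absolute gains; the relative gain differs a lot between rows because the baselines differ. The short sketch below recomputes both from the main-result rows of the table. The row labels are placeholders added here, since the table does not name the model for each row.

```python
from typing import Tuple

# (label, baseline, this paper) copied from the main-result rows of the table above.
ROWS = [
    ("PubMedQA row 1", 0.0940, 0.4640),
    ("PubMedQA row 2", 0.4684, 0.6380),
    ("PubMedQA row 3", 0.5850, 0.6824),
    ("MEDIQA2019 row 1", 0.8400, 0.8933),
]


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain, ratio ours/baseline), rounded for display."""
    return round(ours - baseline, 4), round(ours / baseline, 2)


for label, baseline, ours in ROWS:
    absolute, ratio = gains(baseline, ours)
    print(f"{label}: +{absolute:.4f} absolute, {ratio:.2f}x relative")
```

The first PubMedQA row is the largest jump in both absolute and relative terms, while the MEDIQA2019 row starts from a high baseline and so shows a small ratio despite a positive gain.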

Experiment Figures

Bar chart showing the incidence of three types of problematic answers (Fact Inconsistency, Query Inconsistency, Tangentiality) across five different models (Vicuna, Alpaca-L, ChatGPT, MedAlpaca, Robin-M).

Comparison of Google Ngram frequencies for different answer categories.

Main Takeaways

Iterative self-reflection consistently improves entailment (Med-NLI) and consistency (CTRLEval) across all tested models and datasets
Explicit feedback (providing specific scores and aspect descriptions) is crucial; removing these details significantly degrades performance in ablation studies
The method works for both general LLMs (ChatGPT, Vicuna) and medical-specific LLMs (MedAlpaca), showing generalizability
Human evaluation confirms automatic metrics: significant reduction in 'Query Inconsistency' and 'Tangentiality' for Vicuna and ChatGPT

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and their hallucination issues
Familiarity with in-context learning and zero-shot prompting
Basic knowledge of Natural Language Inference (NLI) as an evaluation metric

Key Terms

Hallucination: The generation of plausible-sounding but unfaithful or nonsensical information by a language model

Self-Reflection: The process where a model evaluates its own output and generates feedback to refine it

Med-NLI: Medical Natural Language Inference, a task that determines whether a hypothesis (answer) is entailed by, neutral to, or contradicted by a premise (context/ground truth)

CTRLEval: An unsupervised, reference-free metric that evaluates text generation quality (consistency, coherence) using text-infilling tasks

Zero-shot: Asking the model to perform a task without providing any task-specific examples in the prompt

In-context learning: Providing instructions or demonstrations within the prompt to guide the model's behavior without updating weights

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique for LLMs

Entailment: A logical relationship where the truth of one statement guarantees the truth of another