LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models

📝 Paper Summary

Hallucination suppression, Fact-checking for LLMs, Medical Question Answering

LEAF enhances LLM factuality in medical domains by using an automated fact-checker to guide retrieval during inference and to rank/filter responses for self-training updates.

Core Problem

LLMs frequently generate plausible but factually incorrect content (hallucinations), which is dangerous in high-stakes domains like healthcare where accuracy is critical.

Why it matters:

Standard RAG methods sometimes degrade performance by retrieving irrelevant noise (as seen in MedRAG results on USMLE)
Proprietary fact-checking tools cannot be deployed on private medical data due to privacy concerns
Specialized domains like medicine lack sufficient labeled data for traditional supervised fine-tuning

Concrete Example: When answering a medical question, a standard LLM might confidently state an incorrect treatment. A standard RAG pipeline might retrieve documents that confuse the model further. LEAF detects the error via fact-checking, retrieves documents specifically targeting the unsupported facts, and prompts the model to correct itself.

Key Novelty

Dual-strategy Factuality Enhancement (Inference-time RAG & Training-time Optimization)

Fact-Check-Then-RAG: Instead of retrieving before generation, the system generates an answer first, fact-checks it, and only performs retrieval if specific facts are unsupported, using those gaps to guide the search.
Learning from Fact-Checks: Uses the automated fact-checker as a reward signal for self-training, either by fine-tuning on passing responses (SFT) or using fact-check scores as preference rankings for optimization (SimPO).

Architecture

The LEAF workflow comparing conventional LLM generation, standard RAG, and the proposed Fact-Check-Then-RAG and Self-Training pipelines.

Evaluation Highlights

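Fact-Check-Then-RAG improves Llama-3-70B-Instruct accuracy by 13.00 percentage points on PubMedQA (73.20 to 86.20) and 4.99 points on USMLE (62.58 to 67.57) compared to the original model.
Self-training with SimPO using LEAF rankings improves Llama-3-8B-Instruct accuracy by 6.80 percentage points on PubMedQA (71.00 to 77.80) and 4.08 points on USMLE (45.15 to 49.23).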
The proposed Fact-Check-Then-RAG method outperforms standard MedRAG on all five tested medical datasets (USMLE, MMLU-Medical, PubMedQA, BioASQ, MedMCQA).

Breakthrough Assessment

7/10

Solid application of automated fact-checking to close the loop on both inference (RAG) and training (SimPO). While the components (RAG, SimPO, SAFE) exist, integrating them for medical domain adaptation without human labels is a practical advance.

⚙️ Technical Details

Problem Definition

Setting: Medical Question Answering (QA) where responses must be factually accurate according to retrieved medical corpora.

Inputs: Medical question q

Outputs: Factually verified text response r

Pipeline Flow

Mechanism I (Inference): Generation → Fact-Check → (If Fail) Retrieval → RAG Generation
Mechanism II (Training): Generation → Fact-Check Ranking → Parameter Update (SFT or SimPO)
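
Mechanism I can be read as a short control loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `fact_check`, and `retrieve`, the `FactCheckResult` type, and the `max_rounds` parameter are all hypothetical stand-ins for the generator, rater, and retriever described in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FactCheckResult:
    """Hypothetical container for the rater's verdict on one response."""
    unsupported_facts: List[str]

    @property
    def passed(self) -> bool:
        # A response passes when no fact is left unsupported.
        return not self.unsupported_facts


def fact_check_then_rag(
    question: str,
    generate: Callable[[str, Sequence[str]], str],
    fact_check: Callable[[str], FactCheckResult],
    retrieve: Callable[[str], List[str]],
    max_rounds: int = 1,  # assumption: the number of rounds is illustrative
) -> str:
    # Step 1: answer without any retrieved context.
    response = generate(question, [])
    for _ in range(max_rounds):
        # Step 2: verify the current answer.
        result = fact_check(response)
        if result.passed:
            break  # every fact is supported, so no retrieval is needed
        # Step 3: retrieve only for the facts that failed verification.
        context: List[str] = []
        for fact in result.unsupported_facts:
            context.extend(retrieve(fact))
        # Step 4: regenerate with the targeted context.
        response = generate(question, context)
    return response
```

The point of the design is that `retrieve` is never called when the first draft passes, which is what distinguishes this flow from retrieving on every query.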

System Modules

Initial Generator (Generation)

Generate initial candidate response to the medical query

Model or implementation: Llama-3-70B-Instruct or Llama-3-8B-Instruct

Fact-Checker (Rater)

Decompose response into facts and verify against corpus

Model or implementation: Qwen2-72B-Instruct (acting as the SAFE rater)
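
One natural way to turn per-fact verdicts into a single fact-check score (such as the score = 1.0 threshold used for SFT filtering) is the fraction of supported facts. This is an assumed scoring rule for illustration, not a formula quoted from the paper; the function name and the handling of empty responses are hypothetical.

```python
from typing import Sequence


def fact_check_score(verdicts: Sequence[bool]) -> float:
    """Fraction of atomic facts judged supported (assumed scoring rule)."""
    if not verdicts:
        # Assumption: a response with no checkable facts scores 0.
        return 0.0
    return sum(verdicts) / len(verdicts)
```

Under this rule a response scores 1.0 only when every extracted fact is supported.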

Targeted Retriever

Retrieve documents specifically for facts that failed verification

Model or implementation: ColBERT
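
ColBERT scores a query against a document by late interaction: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The sketch below shows only that scoring rule on plain Python lists; it does not use the ColBERT library, and the function names and toy vectors are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: Sequence[Vector],
                   docs: Sequence[Sequence[Vector]]) -> List[int]:
    """Return document indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In the targeted-retrieval setting, the query tokens would come from an unsupported fact rather than from the original question.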

RAG Generator (Generation)

Regenerate answer using retrieved context

Model or implementation: Llama-3-70B-Instruct

Novel Architectural Elements

Fact-Check-Then-RAG topology: Retrieval is conditional and targeted based on verification failure of specific facts, rather than always retrieving based on the query alone.

Modeling

Base Model: Llama-3-8B-Instruct (for self-training experiments)

Training Method: Two variants: (1) Supervised Fine-Tuning (SFT) and (2) Simple Preference Optimization (SimPO)

Objective Functions:

SFT
Purpose: Maximize the likelihood of factually verified responses.
Formally: Standard cross-entropy loss on responses with fact-check score = 1.0.

SimPO
Purpose: Optimize the model to prefer responses with higher factuality scores.
Formally: SimPO objective, with the fact-check score defining the 'chosen' (highest-scoring) and 'rejected' (lowest-scoring) response in each pair.
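
For reference, the SimPO loss for one preference pair is -log sigmoid(beta * (log p(chosen) / |chosen| - log p(rejected) / |rejected|) - gamma), where each log-probability is length-normalized and no reference model appears. The sketch below evaluates that expression on scalar inputs; the `beta` and `gamma` defaults are illustrative placeholders, not values reported for LEAF.

```python
import math


def simpo_loss(chosen_logprob: float, chosen_len: int,
               rejected_logprob: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss for a single (chosen, rejected) pair.

    chosen_logprob / rejected_logprob are summed token log-probabilities
    under the policy; the *_len arguments are response lengths in tokens.
    """
    margin = beta * (chosen_logprob / chosen_len
                     - rejected_logprob / rejected_len) - gamma
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the length-normalized log-probability of the chosen (higher fact-check score) response exceeds that of the rejected one by more than the margin `gamma`.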

Adaptation: Full fine-tuning (implied, as SFT/SimPO typically update base weights)

Trainable Parameters: All parameters of Llama-3-8B-Instruct (assuming full fine-tuning, which is implied rather than stated)

Training Data:

Prompts from USMLE, MMLU-Medical, PubMedQA, BioASQ, MedMCQA
Responses generated by the model itself (temperature 0.8), then scored by LEAF fact-checker
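
The step from scored samples to training sets can be sketched as follows. This is an illustration under assumptions, not the paper's code: the data layout, the function names, and the choice to skip prompts whose samples all share one score are hypothetical.

```python
from typing import Dict, List, Tuple

# For each prompt: a list of (response, fact-check score) samples.
Scored = List[Tuple[str, float]]


def build_sft_examples(samples: Dict[str, Scored]) -> List[Tuple[str, str]]:
    """Keep only (prompt, response) pairs whose fact-check score is 1.0."""
    return [(prompt, resp)
            for prompt, scored in samples.items()
            for resp, score in scored
            if score == 1.0]


def build_preference_pairs(samples: Dict[str, Scored]) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) from highest- and lowest-scoring samples."""
    pairs = []
    for prompt, scored in samples.items():
        if not scored:
            continue
        chosen = max(scored, key=lambda rs: rs[1])
        rejected = min(scored, key=lambda rs: rs[1])
        # Assumption: a pair is only informative if the scores differ.
        if chosen[1] > rejected[1]:
            pairs.append((prompt, chosen[0], rejected[0]))
    return pairs
```

Sampling at a nonzero temperature matters here: without variation among the responses to a prompt, there is no score gap from which to form a preference pair.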

Key Hyperparameters:

temperature: 0.8 (for generation during data creation)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Factcheck-GPT: Uses open-weights model (Qwen2) and domain-specific corpus (MedRAG) instead of proprietary API/Search; allows fine-tuning on private data.
vs. MedRAG: LEAF retrieves only *after* generation based on specific factual errors, whereas MedRAG retrieves *before* generation based on the query.
vs. ArmoRM: LEAF optimizes for factual correctness via verification, whereas ArmoRM optimizes for general human preference.
vs. Self-RAG: LEAF uses an external verifier loop rather than learning internal reflection tokens [not cited in paper].

Limitations

Computational cost of iterative fact-checking and regeneration during inference is likely high.
Reliance on the quality of the Qwen2-based rater; if the rater is wrong, the signal is wrong.
The 'zero-cost' claim in the text likely refers to API costs (using open models) rather than compute/latency costs, which are significant.
No latency analysis provided for the multi-step inference pipeline.

Reproducibility

The paper does not provide a code release. The method relies on open-weights models (Llama-3, Qwen2) and public datasets (MedRAG corpus, medical QA benchmarks), which aids reproducibility, but the exact training hyperparameters (learning rate, batch size) are not reported.

📊 Experiments & Results

Evaluation Setup

Medical Question Answering on 5 standard benchmarks.

Benchmarks:

USMLE (Medical Licensing Exam Questions)
MMLU-Medical (Medical knowledge multiple choice)
PubMedQA (Biomedical QA)
BioASQ (Biomedical QA)
MedMCQA (Medical entrance exam questions)

Metrics:

Accuracy (Standard)
Filtered Accuracy (Accuracy on responses that pass fact-check)
Statistical methodology: Not explicitly reported in the paper
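
The two accuracy metrics can be computed as below. The reading of Filtered Accuracy as "accuracy restricted to responses that pass the fact-check" follows the description above; the function names and the choice to return `None` on an empty set are assumptions made for the sketch.

```python
from typing import Optional, Sequence


def accuracy(correct: Sequence[bool]) -> Optional[float]:
    """Percentage of correct answers, or None if there is nothing to score."""
    if not correct:
        return None
    return 100.0 * sum(correct) / len(correct)


def filtered_accuracy(correct: Sequence[bool],
                      passed: Sequence[bool]) -> Optional[float]:
    """Accuracy computed only over responses that passed the fact-check."""
    kept = [c for c, p in zip(correct, passed) if p]
    return accuracy(kept)
```

Because the denominator shrinks to the passing responses, Filtered Accuracy should be reported alongside the share of responses that pass.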

Key Results

Evaluation of Mechanism I (Fact-Check-Then-RAG) on Llama-3-70B-Instruct shows improvements over both the base model and standard MedRAG.

Benchmark	Metric	Baseline (base model)	This Paper	Δ
USMLE	Accuracy	62.58	67.57	+4.99
PubMedQA	Accuracy	73.20	86.20	+13.00

Evaluation of Mechanism II (Self-Training with SimPO) on Llama-3-8B-Instruct shows that using fact-checks as preference signals improves base model performance.

Benchmark	Metric	Baseline (base model)	This Paper	Δ
USMLE	Accuracy	45.15	49.23	+4.08
PubMedQA	Accuracy	71.00	77.80	+6.80
BioASQ	Accuracy	74.04	81.49	+7.45

Main Takeaways

Fact-Check-Then-RAG avoids the performance degradation seen in standard MedRAG on some datasets (like USMLE), likely by only retrieving when necessary.
Self-training works effectively with fact-checking signals: both SFT (on verified responses) and SimPO (ranking by factuality) substantially improve the 8B model, although no statistical significance testing is reported.
SimPO with LEAF ranking generally outperforms SimPO with ArmoRM (a general reward model), suggesting domain-specific factuality is a better signal for medical QA than general preference.
The gap between the best and worst responses ranked by LEAF is larger than that of ArmoRM, indicating LEAF is more discriminative for correctness.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Supervised Fine-Tuning (SFT)
Preference Optimization (SimPO/DPO)
Automated Fact-Checking (SAFE)

Key Terms

LEAF: Learning and Evaluation Augmented by Fact-Checking—the proposed framework combining inference-time verification and training-time optimization

SAFE: Search-Augmented Factuality Evaluator—a method that breaks responses into atomic facts and verifies them using search queries

SimPO: Simple Preference Optimization—an offline preference optimization algorithm that aligns model outputs with a reward signal (here, factuality scores) without a reference model

ColBERT: A dense retrieval model that uses late interaction to match query and document tokens efficiently

MedRAG: A toolkit and benchmark for medical RAG, used here as the retrieval corpus source and baseline

SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality examples

RAG: Retrieval-Augmented Generation—enhancing LLM inputs with retrieved documents