FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data

📝 Paper Summary

Hallucination detection, Fact verification

FactCG improves hallucination detection by training small classifiers on synthetic data generated from document context graphs, forcing models to learn multi-hop reasoning rather than simple sentence matching.

Core Problem

Existing synthetic training data for factuality classifiers lacks the complexity of real LLM hallucinations, which often require multi-hop reasoning across a document.

Why it matters:

LLM-based judges are too expensive and slow for real-time applications
Current small classifiers trained on NLI or simple synthetic data fail to detect complex hallucinations where reasoning spans multiple sentences
Real LLM hallucinations contain 2-4 reasoning hops, whereas previous synthetic datasets (like MiniCheck's D2C) mostly contain single-hop claims

Concrete Example: A real LLM might hallucinate a connection between two entities mentioned in different paragraphs (e.g., 'X is the director of Y') based on a document where the link is only implied via a third entity. Simple synthetic generators often just swap entities in a single sentence, failing to teach the classifier this multi-step verification.

Key Novelty

Context Graph to Claim (CG2C) Data Generation

Extracts a knowledge graph from a document and identifies connected sub-graphs (chains of entities and relations)
Generates synthetic claims based on these sub-graphs to ensure they require multi-hop reasoning to verify
Creates negative samples by programmatically removing specific relations from the document while keeping the claim, forcing the model to detect the missing link
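The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the triple format, the helper names `find_chains` and `make_samples`, and the templated claim string are all assumptions made for the example (in the paper, graph extraction and claim writing are done by an LLM). The sketch only shows the core idea of chaining relations into a multi-hop claim and building a negative sample by deleting one supporting relation from the document.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def find_chains(triples: List[Triple], hops: int) -> List[Tuple[Triple, ...]]:
    """Return every path of `hops` triples where each tail is the next head."""
    by_head: Dict[str, List[Triple]] = {}
    for triple in triples:
        by_head.setdefault(triple[0], []).append(triple)

    chains: List[Tuple[Triple, ...]] = []

    def extend(chain: List[Triple], visited: frozenset) -> None:
        if len(chain) == hops:
            chains.append(tuple(chain))
            return
        for nxt in by_head.get(chain[-1][2], []):
            if nxt[2] not in visited:  # avoid revisiting an entity
                extend(chain + [nxt], visited | {nxt[2]})

    for triple in triples:
        extend([triple], frozenset({triple[0], triple[2]}))
    return chains


def make_samples(sentences: Dict[Triple, str], chain: Tuple[Triple, ...],
                 drop_index: int = 0):
    """Build (document, claim, label) pairs for one chain.

    The positive sample keeps the full document. The negative sample keeps
    the same claim but removes the sentence stating one relation of the chain.
    """
    # Templated claim text is a stand-in for LLM-written claims.
    claim = "; ".join(f"{h} {r} {t}" for h, r, t in chain)
    full_doc = " ".join(sentences.values())
    removed = chain[drop_index]
    reduced_doc = " ".join(s for tr, s in sentences.items() if tr != removed)
    return (full_doc, claim, 1), (reduced_doc, claim, 0)


# Toy document: each sentence states exactly one relation (an assumption).
SENTENCES: Dict[Triple, str] = {
    ("Alice", "founded", "Acme"): "Alice founded Acme.",
    ("Acme", "acquired", "Beta"): "Acme acquired Beta.",
    ("Beta", "is based in", "Oslo"): "Beta is based in Oslo.",
}
chains = find_chains(list(SENTENCES), hops=2)
positive, negative = make_samples(SENTENCES, chains[0])
```

Because the claim is identical in both samples, a classifier can only separate them by checking whether every link in the chain is still present in the document.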

Architecture

The Context Graph to Claim (CG2C) data generation process.

Evaluation Highlights

Outperforms GPT-4o on the LLM-AGGREFACT benchmark by 1.1 points of balanced accuracy while using a much smaller model (FactCG-DeBERTa)
Achieves state-of-the-art performance among comparable small models, beating MiniCheck-DeBERTa by +1.2 points on LLM-AGGREFACT
Demonstrates better connected reasoning: performance drops less than baselines when supporting sentences are shuffled, indicating less reliance on disconnected shortcuts

Breakthrough Assessment

8/10

Significantly closes the gap between open-source classifiers and GPT-4 for fact-checking by addressing the specific structural deficit (multi-hop reasoning) in synthetic training data.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of a claim c against a document doc

Inputs: Document doc and Claim c

Outputs: Binary label: Grounded (1) or Ungrounded (0)

Pipeline Flow

Chunking (splits document into chunks)
Inference (Model predicts score for Claim vs. each Chunk)
Aggregation (Max score across chunks represents final groundedness)
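The chunk, score, and max-aggregate flow can be sketched as follows. This is an illustrative sketch under stated assumptions: chunking here splits on whitespace words, whereas a real classifier would chunk by tokenizer length, and `toy_score` is a hypothetical word-overlap stand-in for the trained FactCG model. The function names are not from the paper or its repository.

```python
from typing import Callable, List


def chunk_document(doc: str, max_words: int) -> List[str]:
    """Split a document into consecutive chunks of at most `max_words` words."""
    words = doc.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    return chunks or [""]  # keep one empty chunk for an empty document


def groundedness(doc: str, claim: str,
                 score_fn: Callable[[str, str], float],
                 max_words: int = 400) -> float:
    """Score the claim against every chunk and keep the maximum."""
    return max(score_fn(chunk, claim)
               for chunk in chunk_document(doc, max_words))


def toy_score(chunk: str, claim: str) -> float:
    """Hypothetical scorer: fraction of claim words that appear in the chunk."""
    claim_words = set(claim.lower().split())
    chunk_words = set(chunk.lower().split())
    if not claim_words:
        return 0.0
    return len(claim_words & chunk_words) / len(claim_words)


doc = "the cat sat on the mat the dog ran in the park"
claim = "the dog ran"
score = groundedness(doc, claim, toy_score, max_words=6)
```

Taking the maximum means a claim counts as grounded if any single chunk supports it, so evidence that is split across chunk boundaries may be missed.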

System Modules

FactCG Classifier

Predict groundedness score for a claim given a document chunk

Model or implementation: DeBERTa-v3-large / RoBERTa-large / Flan-T5-large

Novel Architectural Elements

None (Novelty lies in the synthetic data generation pipeline, not the inference model architecture)

Modeling

Base Model: DeBERTa-v3-large, RoBERTa-large, Flan-T5-large

Training Method: Supervised Fine-Tuning (SFT) on synthetic dataset

Training Data:

MNLI: 390k samples
ANLI: 160k samples
MiniCheck-D2C: 10k samples (baseline synthetic data)
CG2C-Doc: 10k samples (proposed graph-based synthetic data)
CG2C-MHQA: 20k samples (derived from HotpotQA/MuSiQue)

Key Hyperparameters:

learning_rate: 2e-5 (DeBERTa/RoBERTa), 1e-5 (Flan-T5)
batch_size: 64 (RoBERTa/Flan-T5), 32 (DeBERTa)
epochs: 1
warmup_ratio: 0.01
max_seq_length: 512 (RoBERTa/Flan-T5), 1024 (DeBERTa)
weight_decay: 0.01
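For reference, the hyperparameters listed above can be collected into one configuration object per backbone. The values are taken from the list; the `TrainConfig` class and the dictionary keys are hypothetical names chosen for this sketch and do not come from the FactCG repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Per-backbone settings; shared values are given as defaults."""
    learning_rate: float
    batch_size: int
    max_seq_length: int
    epochs: int = 1
    warmup_ratio: float = 0.01
    weight_decay: float = 0.01


# Keys are illustrative labels for the three backbones named in the summary.
CONFIGS = {
    "deberta-v3-large": TrainConfig(learning_rate=2e-5, batch_size=32,
                                    max_seq_length=1024),
    "roberta-large": TrainConfig(learning_rate=2e-5, batch_size=64,
                                 max_seq_length=512),
    "flan-t5-large": TrainConfig(learning_rate=1e-5, batch_size=64,
                                 max_seq_length=512),
}
```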

Compute: Training on 8x V100 32GB GPUs for approximately 2 hours

Comparison to Prior Work

vs. MiniCheck: Uses graph-based generation to create multi-hop claims (2-4 hops) instead of simple sentence edits
vs. TrueTeacher: Generates synthetic claims from scratch rather than annotating existing model outputs
vs. GPT-4o: Achieves superior performance with a fraction of the parameter count and cost
vs. G-Eval: Does not require LLM prompting at inference time, running as a standard classifier

Limitations

Depends on LLMs (GPT-4) for the graph extraction and synthetic data generation step
Graph extraction can be noisy or incomplete, potentially missing subtle document nuances
Focuses on document-grounded factuality; does not address world-knowledge hallucinations

Reproducibility

Code: https://github.com/derenlei/FactCG

Code and data generation scripts are publicly available at https://github.com/derenlei/FactCG. The paper provides detailed prompts and filtering heuristics for the graph extraction and claim generation process.

📊 Experiments & Results

Evaluation Setup

Grounded factuality detection across multiple datasets

Benchmarks:

LLM-AGGREFACT (Aggregated benchmark of real LLM hallucinations)

Metrics:

Balanced Accuracy (BAcc)
Statistical methodology: Not explicitly reported in the paper
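Balanced accuracy is the mean of the true positive rate and the true negative rate, BAcc = (TPR + TNR) / 2, so it is not inflated when one class dominates a dataset. A minimal sketch of the computation, assuming labels use 1 for Grounded and 0 for Ungrounded as in the problem definition (the function name is illustrative):

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Four grounded and two ungrounded examples: TPR = 3/4, TNR = 1/2.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
bacc = balanced_accuracy(y_true, y_pred)
```

On this toy input plain accuracy is 4/6, while balanced accuracy weights the two classes equally and comes out lower.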

Key Results

Main results on LLM-AGGREFACT showing FactCG outperforms previous SOTA models of similar size and even GPT-4o.
Benchmark	Metric	Baseline	This Paper	Δ
LLM-AGGREFACT	BAcc	82.4	83.6	+1.2
LLM-AGGREFACT	BAcc	82.5	83.6	+1.1
LLM-AGGREFACT	BAcc	73.9	83.6	+9.7

Ablation study demonstrating the effectiveness of the CG2C (Context Graph to Claim) synthetic data components.
Benchmark	Metric	Baseline	This Paper	Δ
LLM-AGGREFACT	BAcc	82.5	83.6	+1.1

Main Takeaways

Training on complex, multi-hop synthetic data (CG2C) significantly improves detection of real LLM hallucinations.
The proposed method generalizes well across different backbone architectures (DeBERTa, RoBERTa, Flan-T5).
Document-based synthetic data (CG2C-Doc) is more effective than adapting existing QA datasets (CG2C-MHQA) because it preserves the original document distribution.
FactCG models show less performance degradation when sentence order is shuffled, suggesting they rely more on connected reasoning than dataset artifacts.
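One simple way to probe the sentence-order sensitivity mentioned in the last takeaway is to score a claim against the original document and against a shuffled copy, then compare. The sketch below is a generic probe and not the paper's evaluation protocol; `shuffle_sentences`, `score_drop`, and the order-insensitive `bag_of_words_score` stand-in are hypothetical names introduced for illustration.

```python
import random
from typing import Callable, List


def shuffle_sentences(sentences: List[str], seed: int) -> List[str]:
    """Return a shuffled copy of the sentences, reproducible via `seed`."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def score_drop(sentences: List[str], claim: str,
               score_fn: Callable[[str, str], float], seed: int = 0) -> float:
    """Score on the original order minus score on the shuffled order."""
    original = score_fn(" ".join(sentences), claim)
    shuffled = score_fn(" ".join(shuffle_sentences(sentences, seed)), claim)
    return original - shuffled


def bag_of_words_score(doc: str, claim: str) -> float:
    """Hypothetical scorer that ignores word order entirely."""
    claim_words = set(claim.lower().split())
    doc_words = set(doc.lower().split())
    return len(claim_words & doc_words) / len(claim_words) if claim_words else 0.0


sentences = ["Alice founded Acme.", "Acme acquired Beta.", "Beta is based in Oslo."]
drop = score_drop(sentences, "Acme acquired Beta.", bag_of_words_score)
```

A scorer that ignores order, like the stand-in here, shows no drop by construction; with a trained classifier, the size of the drop is the quantity of interest.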

📚 Prerequisite Knowledge

Prerequisites

Natural Language Inference (NLI)
Retrieval-Augmented Generation (RAG)
Knowledge Graphs

Key Terms

NLI: Natural Language Inference, the task of deciding whether a premise supports a hypothesis (entailment), contradicts it (contradiction), or does neither (neutral)

CG2C: Context Graph to Claim—the proposed method for generating synthetic training data by extracting graph structures from documents

MHQA: Multi-Hop Question Answering—tasks requiring reasoning across multiple documents or paragraphs to find an answer

FactCG: The fact-checking model trained using the CG2C synthetic data

Context Graph: A graph representation of a document where nodes are entities and edges are relations described in the text

LLM-AGGREFACT: A benchmark dataset of real LLM hallucinations across various tasks (summarization, QA, data-to-text)

DiRe: Disconnected Reasoning—a phenomenon where models solve multi-hop tasks using shortcuts rather than connecting multiple facts