Contrastive Learning to Improve Retrieval for Real-world Fact Checking

📝 Paper Summary

Modularized RAG pipeline Fact-Checking

Contrastive Fact-Checking Reranker (CFR) improves fact-checking retrieval by fine-tuning a dense retriever on evidence pairs distilled from GPT-4 relevance judgments and answer equivalence metrics.

Core Problem

Retrieving evidence for fact-checking is difficult because relevant documents often address claims obliquely or require inference, causing standard retrievers to fail even when topical documents are found.

Why it matters:

Standard retrieval bottlenecks fact-checking pipelines; without the right evidence, downstream veracity judgments are impossible
Existing dense retrievers are optimized for simple factoid questions (like NQ) and struggle with the nuanced, open-ended queries required for complex real-world claims
Gold-standard evidence is scarce and sometimes lacks lexical overlap with the claim, making supervised training difficult

Concrete Example: For the subquestion 'How was REGN-COV2 developed?', a standard retriever selects a topical document about clinical trials. CFR instead selects a document about 'mice' and 'human antibodies', because it has learned that the subquestion is really probing for 'fetal tissue' usage, even though that document never explicitly mentions the claim's context.

Key Novelty

Contrastive Fact-Checking Reranker (CFR)

Fine-tunes a dense retriever (Contriever) using contrastive learning on dataset-specific hard negatives and positives derived from weak supervision
Generates supervision signals by distilling GPT-4 relevance judgments and measuring answer equivalence (LERC) between retrieved documents and gold answers
Constructs training pairs that encourage the retriever to prefer documents supporting the correct answer, even if they lack high lexical overlap with the query

Architecture

The pipeline for generating positive and negative examples for contrastive fine-tuning. It illustrates three data sources: distilling GPT-4 relevance, LERC answer equivalence, and gold annotations.

Evaluation Highlights

+6% improvement in veracity classification accuracy on the AVeriTeC dataset compared to the baseline Contriever
+9% increase in Top-1 document relevance on AVeriTeC, as determined by GPT-4 relevance judgments
Achieves 0.79 MRR on a synthetic dataset requiring reasoning, significantly outperforming baseline Contriever (0.68 MRR)
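For readers unfamiliar with the MRR figure quoted above, the following is a minimal sketch of how mean reciprocal rank is usually computed. The function name, the document ids, and the toy rankings are illustrative assumptions, not taken from the paper or its code.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first relevant document per query.

    rankings: list of ranked doc-id lists, one per query.
    relevant: list of sets of relevant doc ids, aligned with rankings.
    """
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        reciprocal_rank = 0.0  # stays 0 if no relevant doc is retrieved
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in gold:
                reciprocal_rank = 1.0 / position
                break
        total += reciprocal_rank
    return total / len(rankings) if rankings else 0.0


# Toy data (invented): relevant doc at rank 1, at rank 2, and missing.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d5"}, {"d0"}]
print(mean_reciprocal_rank(rankings, relevant))  # (1 + 0.5 + 0) / 3 = 0.5
```

An MRR of 0.79 versus 0.68 therefore means the first relevant document tends to sit closer to the top of the ranked list.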

Breakthrough Assessment

7/10

Strong practical improvements on complex fact-checking by leveraging LLM distillation for retriever training. Demonstrates that answer-equivalence is a better signal than gold-document IDs for retrieval.

⚙️ Technical Details

Problem Definition

Setting: Re-ranking a set of documents D retrieved for a claim c and subquestion q to maximize utility for veracity judgment

Inputs: Query y = [claim c; subquestion q] and a set of candidate documents D from first-stage retrieval

Outputs: Ranked list of documents where the top document r(y) maximizes relevance
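The setting above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `build_query`, `toy_embed`, and `rerank` are hypothetical names, the separator between claim and subquestion is a guess, and the bag-of-words embedding is only a stand-in for the fine-tuned bi-encoder (a real system would plug the trained encoder in through the `embed` argument).

```python
from collections import Counter


def build_query(claim, subquestion, sep=" "):
    # y = [claim c; subquestion q]; the separator is an assumption.
    return f"{claim}{sep}{subquestion}"


def toy_embed(text):
    # Stand-in "embedding": sparse bag-of-words counts.
    # A dense retriever would return a learned vector here instead.
    return Counter(text.lower().split())


def dot(u, v):
    # Inner product between two sparse count vectors.
    return sum(weight * v[token] for token, weight in u.items() if token in v)


def rerank(claim, subquestion, candidates, embed=toy_embed):
    """Order first-stage candidates by similarity to the query, best first."""
    query_vec = embed(build_query(claim, subquestion))
    scored = [(dot(query_vec, embed(doc)), doc) for doc in candidates]
    # Stable sort: ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Invented toy inputs, used only to exercise the code.
claim = "the city bridge opened in 1990"
subquestion = "when did the city bridge open"
candidates = [
    "local weather report for the weekend",
    "the city bridge opened to traffic in 1990",
]
ranked = rerank(claim, subquestion, candidates)
print(ranked[0])
```

Note that the toy embedding rewards lexical overlap, which is exactly the behavior the paper tries to move beyond; only the interface carries over to the real reranker.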

Pipeline Flow

First-stage Retrieval (BM25 web search)
Training Data Generation (GPT-4 Distillation + LERC)
Second-stage Reranking (Contrastive Fine-tuning)
Veracity Prediction (Reader Model)

System Modules

First-stage Retrieval (Retrieval & Selection)

Gather initial candidate documents from the web

Model or implementation: Bing Search API (BM25)

Training Data Generator

Create positive/negative pairs for contrastive training

Model or implementation: GPT-4 (Teacher)

CFR (Retriever) (Retrieval & Selection)

Re-rank candidate documents based on learned relevance

Model or implementation: Contriever (BERT-base uncased backbone)

Reader / Veracity Classifier

Predict final veracity label

Model or implementation: GPT-4 (Zero-shot/Few-shot) or RoBERTa (depending on dataset)

Novel Architectural Elements

Integration of LERC (Learned Evaluation Metric for Reading Comprehension) as a supervision signal for retrieval training, filtering documents based on whether they yield answers equivalent to the gold answer

Modeling

Base Model: Contriever (BERT-base uncased, 110M parameters)

Training Method: Contrastive Learning

Objective Functions:

Purpose: Maximize similarity between query and positive documents while minimizing similarity to negatives.

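Formally: InfoNCE loss L(y, d+) = -log( exp(sim(y,d+)/τ) / [ exp(sim(y,d+)/τ) + Σ_{d- ∈ D-} exp(sim(y,d-)/τ) ] ), where sim is the query-document similarity score, τ is the temperature, and the sum runs over the negative documents D- for query y.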
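A minimal numeric sketch of this loss for a single query follows. It takes precomputed similarity scores rather than running an encoder, and the default temperature of 0.05 is purely an assumption, since the value is not reported. The function name `info_nce` is illustrative.

```python
import math


def info_nce(sim_pos, sim_negs, tau=0.05):
    """InfoNCE loss for one query with one positive and a list of negatives.

    sim_pos:  similarity between the query and the positive document.
    sim_negs: similarities between the query and each negative document.
    tau:      temperature (0.05 is an assumed value, not from the paper).
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log(exp(pos) / sum(exp(all))) = log_denominator - pos_logit
    return log_denominator - logits[0]


# The loss shrinks as the positive pulls ahead of the negatives.
print(info_nce(0.2, [0.3, 0.1]))
print(info_nce(0.8, [0.3, 0.1]))
```

In actual training the same quantity would be computed over batches with an autodiff framework; the scalar version is only meant to make each term of the formula concrete.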

Adaptation: Full fine-tuning of the bi-encoder

Training Data:

AVeriTeC training set (approx 1229 subquestions used)
Positive examples D+: Distilled relevant docs from GPT-4 + High LERC docs + Gold docs
Negative examples D-: Low LERC docs + Irrelevant docs identified by GPT-4
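One way to picture how the three signals combine into D+ and D- is the sketch below. Everything specific in it is an assumption made for illustration: the `Candidate` fields, the LERC thresholds (0.7 and 0.3), the rule that positive evidence wins when signals conflict, and the choice to drop ambiguous documents are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    doc_id: str
    lerc: float                           # answer-equivalence score vs. the gold answer
    gpt4_relevant: Optional[bool] = None  # teacher judgment; None = not judged
    is_gold: bool = False                 # human-annotated gold evidence


def split_training_pairs(candidates, lerc_high=0.7, lerc_low=0.3):
    """Return (positives, negatives) doc ids; thresholds are assumed values."""
    positives, negatives = [], []
    for c in candidates:
        if c.is_gold or c.gpt4_relevant is True or c.lerc >= lerc_high:
            # Gold docs, distilled GPT-4 positives, and high-LERC docs.
            positives.append(c.doc_id)
        elif c.gpt4_relevant is False or c.lerc <= lerc_low:
            # GPT-4-irrelevant docs and low-LERC docs.
            negatives.append(c.doc_id)
        # Anything else is ambiguous and left out of training (an assumption).
    return positives, negatives


# Invented toy candidates for a single subquestion.
pool = [
    Candidate("gold1", lerc=0.5, is_gold=True),
    Candidate("a", lerc=0.2, gpt4_relevant=True),
    Candidate("b", lerc=0.9),
    Candidate("c", lerc=0.5, gpt4_relevant=False),
    Candidate("d", lerc=0.1),
    Candidate("e", lerc=0.5),
]
positives, negatives = split_training_pairs(pool)
print(positives, negatives)  # ['gold1', 'a', 'b'] ['c', 'd']
```

Each (query, positive, negatives) triple produced this way would then feed the contrastive objective described above.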

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 32
epochs: 12
temperature_tau: Not explicitly reported in the paper

Compute: Training: approx. 3 hours on 2 NVIDIA Quadro RTX 8000 GPUs. Eval: approx. 1 hour.

Comparison to Prior Work

vs. Contriever-MSM: CFR is fine-tuned on claim-specific subquestions with distillation from a reasoning model (GPT-4), whereas MSM is trained on general web search queries.
vs. Gold-only training: CFR adds 'distilled' positives (documents GPT-4 judges relevant) and LERC positives (documents yielding correct answers), which together provide more, and more useful, training signal than human-annotated gold documents alone.

Limitations

Relies on LERC, which requires shortening answers; this compression can lose information for complex claims.
Focuses only on second-stage reranking; improvements to first-stage web search are not addressed.
Evaluation is limited to English-language political claims.
Requires GPT-4 for training data generation, adding cost and dependency.

Reproducibility

Code: https://github.com/jifan-chen/Fact-checking-via-Raw-Evidence

Code and data are publicly available. The paper uses GPT-4 for data generation (distillation) and evaluation, which is a closed-source dependency. Hyperparameters for grid search are provided.

📊 Experiments & Results

Evaluation Setup

Retrieval and downstream veracity classification on fact-checking datasets.

Benchmarks:

AVeriTeC (Real-world claim verification)
ClaimDecomp (Complex political claim verification)
FEVER (Fact Extraction and VERification, Wikipedia-based)
HotpotQA (Multi-hop question answering)

Metrics:

LERC (Answer Equivalence)
Top Doc Relevance (GPT-4 judgment)
Gold@10 (Recall of gold evidence)
Veracity (Classification Accuracy)
Statistical methodology: Bootstrapping with 10,000 samples; significance reported at the p < 0.05 or p < 0.10 level.
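The bootstrap procedure mentioned above is not specified in detail here, so the following is a sketch of one common variant, a paired bootstrap over per-example scores. The function name, the one-sided formulation, and the toy score lists are assumptions, not the paper's exact protocol.

```python
import random


def paired_bootstrap(baseline, system, n_samples=10_000, seed=0):
    """Estimate a one-sided p-value that `system` does not beat `baseline`.

    baseline, system: per-example scores (e.g. 0/1 correctness), aligned.
    Returns the fraction of resamples where the system's total score is
    not strictly higher than the baseline's.
    """
    if len(baseline) != len(system):
        raise ValueError("score lists must be aligned")
    rng = random.Random(seed)
    n = len(baseline)
    not_better = 0
    for _ in range(n_samples):
        # Resample example indices with replacement, keeping pairs together.
        sample = [rng.randrange(n) for _ in range(n)]
        delta = sum(system[i] - baseline[i] for i in sample)
        if delta <= 0:
            not_better += 1
    return not_better / n_samples


# Invented per-example correctness for two systems on the same 20 examples.
baseline_scores = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
system_scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
p_value = paired_bootstrap(baseline_scores, system_scores)
print(p_value < 0.05)
```

A result below the chosen threshold (0.05 or 0.10) would be read as a significant improvement of the system over the baseline.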

Key Results

Main results on AVeriTeC showing improvements in both retrieval quality and downstream veracity judgments.
Benchmark	Metric	Baseline	This Paper	Δ
AVeriTeC	Veracity	0.54	0.60	+0.06
AVeriTeC	Top Doc Relevance	0.54	0.62	+0.08
AVeriTeC	LERC	0.48	0.53	+0.05

Transfer learning results on out-of-domain datasets (ClaimDecomp, FEVER, HotpotQA).
Benchmark	Metric	Baseline	This Paper	Δ
ClaimDecomp	Top Doc Relevance	0.32	0.32	0.00
FEVER	Top Doc Relevance	0.58	0.63	+0.05
FEVER	Veracity	0.49	0.57	+0.08

Experiment Figures

A comparison between the top-1 document retrieved by base Contriever vs. CFR for a query about REGN-COV2.

Main Takeaways

Distilling relevance judgments from GPT-4 provides a better training signal than using human-annotated gold documents alone (which often lack lexical overlap).
LERC-based supervision (matching answer equivalence) effectively filters false negatives in training data, leading to robust gains in downstream veracity.
The proposed CFR model generalizes well to out-of-domain datasets like FEVER and HotpotQA without further fine-tuning.
Synthetic experiments confirm the model's improved ability to handle 'reasoning hops' where the answer is not explicitly stated in the text.

📚 Prerequisite Knowledge

Prerequisites

Contrastive learning (positive/negative pairs)
Dense retrieval (dual encoders)
Fact-checking pipelines (Claim-Decomposition-Retrieval-Veracity)
Knowledge distillation from LLMs

Key Terms

Contriever: A dense information retrieval model pre-trained using contrastive learning, used here as the backbone encoder

LERC: Learned Evaluation Metric for Reading Comprehension, a metric that scores how semantically equivalent a candidate answer is to a gold answer

distillation: The process of training a smaller model (the retriever) to mimic the behavior or knowledge of a larger model (GPT-4)

veracity classification: The downstream task of determining if a claim is Supported, Refuted, or Not Enough Info based on evidence

hard negatives: Documents that look relevant (e.g., high lexical overlap) but do not actually contain the answer; critical for training effective retrievers

BM25: A probabilistic retrieval function based on exact keyword matching, used here for first-stage retrieval

dense retriever: A retrieval system that uses vector embeddings to find semantically similar documents, rather than just keyword matching