ClaimVer: Explainable Claim-Level Verification and Evidence Attribution of Text Through Knowledge Graphs

📝 Paper Summary

Fact verification, Evidence attribution

ClaimVer decomposes text into individual claims and verifies them against a Knowledge Graph to provide granular attribution labels, evidence triplets, and natural language explanations.

Core Problem

Existing fact-checkers assign a single blanket label to an entire paragraph, so they cannot distinguish accurate from inaccurate sub-claims, and they often lack the granular explanations users need to trust a verdict.

Why it matters:

Blanket labels mislead users when a text contains a mix of true and false statements (e.g., dismissing a mostly true text due to one error)
Users distrust AI systems that do not provide specific rationales or evidence for their verification decisions
Traditional methods relying on one-to-one document mapping fail when information is spread across multiple sources or requires multi-hop reasoning

Concrete Example: A text claims 'Autism cases increased due to vaccines.' A standard tool might label the whole paragraph 'False' or 'Misleading.' ClaimVer separates it into: (1) 'Autism cases increased' (True, attributed to testing changes) and (2) 'due to vaccines' (False, contradicted by medical data), preventing the user from rejecting the valid statistic along with the false cause.

Key Novelty

Claim-Level Granular Verification via Knowledge Graphs

Decomposes input text into atomic claims rather than verifying sentences or paragraphs as a whole
Uses a Knowledge Graph (Wikidata) as a consolidated truth source, enabling multi-hop verification without needing a one-to-one mapping to reference documents
Introduces a continuous 'KG Attribution Score' that penalizes contradicted claims more heavily than unsupported (extrapolatory) ones, aiding downstream ranking tasks

Evaluation Highlights

Fine-tuned 8 open-source LLMs (2B-10B parameters) using a custom dataset distilled from GPT-4, achieving ROUGE-L scores > 0.658 across all models
Proposed KG Attribution Score (KAS) successfully quantifies claim validity using a modified sigmoid function to penalize contradictions

Breakthrough Assessment

7/10

Strong practical contribution by shifting verification to the claim level and using KGs for explainability. The approach of distilling GPT-4 reasoning into smaller models for this specific task is valuable, though the core novelty relies on integrating existing components (BFS, KGs, LLMs).

⚙️ Technical Details

Problem Definition

Setting: Given input text and a Knowledge Graph, decompose text into claims and verify each against the KG

Inputs: Natural language text input_text

Outputs: Set of tuples {(claim_span, claim_pred, rel_triplets, rationale)} for each extracted claim
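
As a minimal sketch of the output shape described above, the tuple can be modelled as a small record type. The field names come from the problem definition; the class name `ClaimResult`, the string-based triplet type, and the example values (loosely based on the vaccine example earlier in this summary) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A KG triplet as (subject, predicate, object). Plain strings are an assumption.
Triplet = Tuple[str, str, str]


@dataclass
class ClaimResult:
    """One verified claim; hypothetical container, not the paper's code."""
    claim_span: str                      # text span of the extracted claim
    claim_pred: str                      # verdict label for this claim
    rel_triplets: List[Triplet] = field(default_factory=list)  # evidence triplets
    rationale: str = ""                  # natural language explanation


# Illustrative values only; labels and rationales are placeholders.
results = [
    ClaimResult(
        claim_span="Autism cases increased",
        claim_pred="True",
        rationale="Attributed to testing changes.",
    ),
    ClaimResult(
        claim_span="due to vaccines",
        claim_pred="False",
        rationale="Contradicted by medical data.",
    ),
]

for r in results:
    print(f"{r.claim_span!r}: {r.claim_pred} ({r.rationale})")
```

Keeping one record per claim, rather than one label per paragraph, is what lets a reader accept the first claim while rejecting the second.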

Pipeline Flow

Preprocessing: NER → Coreference Resolution → Entity Linking
Retrieval: Woolnet BFS to find relevant KG triplets
Verification: Fine-tuned LLM (Claim Decomposition + Verification + Rationale Generation)
Scoring: Compute KG Attribution Score (KAS)
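
The four stages above can be wired together roughly as follows. This is a sketch under stated assumptions: `run_pipeline` and every `toy_*` function are hypothetical stand-ins (the real stages use NER, coreference resolution, entity linking, Woolnet, a fine-tuned LLM, and the KAS formula), and the labels `supported` and `no_evidence` are placeholders rather than the paper's label set.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]


def run_pipeline(
    input_text: str,
    preprocess: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[Triplet]],
    verify: Callable[[str, List[Triplet]], List[Dict]],
    score: Callable[[List[Dict]], float],
) -> Dict:
    """Chain the four stages; each stage is injected so it can be swapped."""
    entities = preprocess(input_text)        # NER -> coreference -> entity linking
    triplets = retrieve(entities)            # KG triplet retrieval
    claims = verify(input_text, triplets)    # decomposition + verification + rationale
    return {"claims": claims, "kas": score(claims)}


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---
def toy_preprocess(text: str) -> List[str]:
    return [e for e in ("Obama", "Hawaii") if e in text]


def toy_retrieve(entities: List[str]) -> List[Triplet]:
    if {"Obama", "Hawaii"} <= set(entities):
        return [("Obama", "born_in", "Hawaii")]
    return []


def toy_verify(text: str, triplets: List[Triplet]) -> List[Dict]:
    label = "supported" if triplets else "no_evidence"
    return [{"claim_span": text, "claim_pred": label,
             "rel_triplets": triplets, "rationale": "toy rationale"}]


def toy_score(claims: List[Dict]) -> float:
    # Placeholder: fraction of supported claims, NOT the paper's KAS formula.
    return sum(c["claim_pred"] == "supported" for c in claims) / max(len(claims), 1)


out = run_pipeline("Obama was born in Hawaii.",
                   toy_preprocess, toy_retrieve, toy_verify, toy_score)
print(out["claims"][0]["claim_pred"], out["kas"])
```

Passing the stages in as callables keeps the orchestration independent of any particular NER tool, retriever, or model.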

System Modules

Preprocessing Module

Prepare text for KG lookup by resolving pronouns and identifying entities

Model or implementation: Standard NLP tools (Wiki-specific NER, Coreference resolution)

Triplet Retriever

Find multi-hop connections between entities in the claim to serve as evidence

Model or implementation: Woolnet (Multi-node BFS algorithm)
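
To illustrate the idea of a multi-node BFS over a KG, the sketch below runs a hop-limited breadth-first search from each linked entity and collects the triplets on paths that connect one entity to another. It is not the Woolnet implementation: the function names, the undirected treatment of edges, and the toy graph with made-up predicate names are all assumptions. The default limit of 3 hops mirrors the limit noted under Limitations.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Triplet = Tuple[str, str, str]


def build_adjacency(triplets: List[Triplet]) -> Dict[str, List[Tuple[str, Triplet]]]:
    """Index triplets by both endpoints so paths can be followed in either direction."""
    adj: Dict[str, List[Tuple[str, Triplet]]] = {}
    for t in triplets:
        s, _, o = t
        adj.setdefault(s, []).append((o, t))
        adj.setdefault(o, []).append((s, t))
    return adj


def connecting_triplets(triplets: List[Triplet], entities: List[str],
                        max_hops: int = 3) -> Set[Triplet]:
    """Return triplets lying on BFS paths (up to max_hops) between the given entities."""
    adj = build_adjacency(triplets)
    targets = set(entities)
    evidence: Set[Triplet] = set()
    for source in entities:
        # parent[node] = (previous node, triplet used to reach node)
        parent: Dict[str, Optional[Tuple[str, Triplet]]] = {source: None}
        queue = deque([(source, 0)])
        while queue:
            node, depth = queue.popleft()
            if node != source and node in targets:
                cur = node
                while parent[cur] is not None:   # walk back to the source
                    prev, trip = parent[cur]
                    evidence.add(trip)
                    cur = prev
            if depth == max_hops:
                continue                          # hop limit reached, do not expand
            for nxt, trip in adj.get(node, []):
                if nxt not in parent:
                    parent[nxt] = (node, trip)
                    queue.append((nxt, depth + 1))
    return evidence


# Toy graph; predicate names are illustrative, not Wikidata property IDs.
kg = [
    ("Barack Obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
    ("Hawaii", "part_of", "United States"),
    ("Barack Obama", "spouse", "Michelle Obama"),
]
found = connecting_triplets(kg, ["Barack Obama", "Hawaii"])
print(sorted(found))
```

In this toy graph the two entities are linked only through an intermediate node, so the evidence is a two-hop path; with `max_hops=1` nothing would be returned.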

Verifier & Generator

Decompose text into claims, verify against triplets, and generate explanations

Model or implementation: Fine-tuned LLM (e.g., Llama3-8B-Chat, Mistral-7B)

Novel Architectural Elements

Single-pass architecture for simultaneous claim decomposition, verification, and rationale generation to reduce inference latency

Modeling

Base Model: Evaluated 8 models including Llama3-8B-Chat, Mistral-7B-v0.3-Chat, Zephyr-7B-Beta-Chat

Training Method: Supervised Fine-Tuning (SFT) using LoRA

Adaptation: LoRA (4-bit quantization, rank 8 adapters)

Trainable Parameters: Not reported in the paper

Training Data:

Custom dataset generated using GPT-4 via a two-step complex prompt pipeline
Used to distill GPT-4's behavior on this task into smaller open-source models

Key Hyperparameters:

context_length: 4096 tokens
epochs: 2 (training converged)
quantization: 4-bit

Compute: Not reported in the paper
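
To give a sense of scale for the rank-8 LoRA adaptation above: for a weight matrix of shape d_out x d_in, LoRA trains two low-rank factors (B of shape d_out x r and A of shape r x d_in), adding r * (d_out + d_in) parameters. The sketch below is only arithmetic for a single hypothetical 4096 x 4096 projection; it is not the total trainable parameter count of any model in the paper, which is not reported.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


d = 4096                      # hypothetical hidden size, chosen for illustration
full = d * d                  # parameters in the frozen weight matrix
adapter = lora_trainable_params(d, d, rank=8)

print(full, adapter, adapter / full)
# prints: 16777216 65536 0.00390625
```

For this one matrix the adapter is 1/256 of the frozen weight, which is why rank-8 adapters combined with 4-bit quantization fit on modest hardware.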

Comparison to Prior Work

vs. Google Fact Check: ClaimVer provides granular claim-level verdicts rather than blanket document labels
vs. FactScore: ClaimVer utilizes Knowledge Graphs for multi-hop evidence rather than just text retrieval [not cited in paper]
vs. AIS/Yue et al.: ClaimVer works on general text (not just QA) and integrates claim decomposition into the pipeline

Limitations

Reliance on Wikidata means coverage is limited to entities present in the KG; obscure or very new information may be missing
KG retrieval depends on accurate NER and Entity Linking; errors upstream propagate to verification
The Woolnet BFS retrieval is more computationally expensive than simple vector search
The current method limits the evidence search to 3 hops, so connections that require longer paths may be missed

Reproducibility

Code: https://huggingface.co/ClaimVer

Fine-tuned model weights are available on HuggingFace. The instruction prompts used for fine-tuning and the mathematical formulation for the attribution score are fully detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Fine-tuning performance evaluation on claim verification task

Benchmarks:

Custom Distilled Dataset (Claim verification and rationale generation) [New]

Metrics:

ROUGE-L
Statistical methodology: Not explicitly reported in the paper
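
For readers unfamiliar with the metric, ROUGE-L can be sketched as an F-measure over the longest common subsequence (LCS) of tokens. The summary does not say which ROUGE-L variant or tokenizer the paper uses, so the whitespace tokenization, lowercasing, and balanced F1 below are assumptions.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


reference = "the claim is contradicted by the evidence"
candidate = "the claim is supported by the evidence"
print(round(rouge_l_f1(candidate, reference), 3))
# prints: 0.857  (LCS of 6 tokens out of 7 on each side)
```

The example also shows a limitation of the metric for this task: the two rationales differ only in the verdict word, yet the overlap score stays high.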

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Custom Dataset	ROUGE-L	Not reported in the paper	> 0.658 (all 8 models)	Not reported in the paper

Experiment Figures

Comparison between standard blanket fact-checking (HealthFeedback, Google) and ClaimVer's granular output

Main Takeaways

Small open-source models (2B-10B) fine-tuned on GPT-4-distilled data can perform complex claim decomposition and verification, which suggests that distillation from a larger proprietary model works for this task
Single-pass processing (decomposition + verification) is viable and reduces the computational overhead compared to processing claims sequentially
The proposed KG Attribution Score (KAS) provides a nuanced metric that distinguishes between 'unsupported' (extrapolatory) and 'false' (contradictory) information
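
The asymmetry described in the last takeaway can be illustrated with a toy score. This is not the paper's KAS formula, which this summary does not reproduce: the label weights, the label name `attributable`, the plain logistic squashing, and the function name are all hypothetical. The only property carried over from the text is that a contradictory claim is penalized more heavily than an extrapolatory one.

```python
import math
from typing import Iterable

# Hypothetical per-label weights: contradictions cost more than unsupported claims.
LABEL_WEIGHTS = {
    "attributable": 1.0,
    "extrapolatory": -0.5,
    "contradictory": -2.0,
}


def toy_attribution_score(labels: Iterable[str], steepness: float = 1.0) -> float:
    """Squash the summed label weights into (0, 1) with a logistic function."""
    labels = list(labels)
    if not labels:
        return 0.5                      # neutral score when there are no claims
    total = sum(LABEL_WEIGHTS[label] for label in labels)
    return 1.0 / (1.0 + math.exp(-steepness * total))


unsupported = toy_attribution_score(["attributable", "extrapolatory"])
contradicted = toy_attribution_score(["attributable", "contradictory"])
print(unsupported > contradicted)
# prints: True
```

Because the score is continuous, texts can be ranked by it, which matches the downstream ranking use mentioned under Key Novelty.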

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (structure, entities, triplets)
Named Entity Recognition (NER)
Large Language Models (fine-tuning, prompting)
Breadth-First Search (BFS) algorithms

Key Terms

Knowledge Graph (KG): A structured representation of knowledge where entities are nodes and relationships are edges (triplets)

Triplets: Atomic facts in a KG consisting of (Subject, Predicate, Object), e.g., (Obama, born_in, Hawaii)

Entity Linking: The process of identifying entity mentions in text and mapping them to unique entries in a Knowledge Graph

Woolnet: A multi-node Breadth-First Search algorithm used to find paths/connections between entities in a Knowledge Graph

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable low-rank decomposition matrices

ROUGE-L: A metric for evaluating text generation based on the longest common subsequence between the generated output and reference text

Hallucination: When a model generates plausible-sounding but factually incorrect or unverifiable information

Attribution: The task of linking a generated claim to a specific source that supports it

Zero-shot: The ability of a model to perform a task without seeing any specific training examples for that task