Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

📝 Paper Summary

Metrics and evaluation

ICAT is a modular evaluation framework that decomposes long-form text into atomic claims to simultaneously measure both factual accuracy and the coverage of diverse aspects of a topic.

Core Problem

Existing evaluation metrics for long-form generation typically focus on isolated dimensions like factuality or surface-level fluency, failing to assess whether a response comprehensively covers diverse relevant perspectives.

Why it matters:

Applications like medical systems or policy analysis require balanced, complete information; a factually correct but incomplete answer (e.g., listing benefits but omitting side effects) is misleading
Current metrics like ROUGE or BERTScore cannot verify factuality or semantic coverage, while newer factuality metrics (FActScore) ignore information completeness
Optimizing for factuality alone might encourage models to produce short, safe, but uninformative responses

Concrete Example: If a user asks 'What are the health effects of coffee?', an LLM might list only benefits (alertness, disease protection). While factually true, this response is incomplete because it omits risks (anxiety, sleep disruption), presenting a biased view that current metrics would not penalize.

Key Novelty

Information Coverage and Accuracy for Text generation (ICAT)

Decomposes long text into atomic claims, verifies each against a knowledge source, and aligns verified claims to specific topic aspects to calculate a unified score
Introduces a weighted harmonic mean of 'Factuality Score' (ratio of accurate claims) and 'Coverage Score' (ratio of covered aspects), penalizing accurate but narrow responses
Provides three implementation variants (ICAT-M, ICAT-S, ICAT-A) ranging from fully manual ground-truth reliance to fully automated LLM-based aspect generation and alignment
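
As a hedged sketch of the scoring described above: let C be the set of atomic claims, C_T the verified claims, A the expected aspects, and A_cov the aspects covered by at least one verified claim (A and A_cov are notation assumed here for illustration). The two ratios and their weighted harmonic mean can then be written in the usual F-beta form. The exact placement of beta is an assumption, since this summary only states that the combination is a weighted harmonic mean.

```latex
S_{\text{fact}} = \frac{|C_T|}{|C|}, \qquad
S_{\text{coverage}} = \frac{|A_{\text{cov}}|}{|A|}, \qquad
\text{ICAT}_{\beta} = \frac{(1+\beta^{2})\, S_{\text{fact}}\, S_{\text{coverage}}}{\beta^{2}\, S_{\text{fact}} + S_{\text{coverage}}}
```

With beta = 1 this reduces to the plain harmonic mean 2 * S_fact * S_coverage / (S_fact + S_coverage), so a response that is fully accurate but covers few aspects still receives a low combined score.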

Architecture

The ICAT framework workflow (implied from text description).

Evaluation Highlights

ICAT correlates strongly with human judgments, demonstrating its utility for automated evaluation without human input
Llama-3-70B achieves the highest ICAT scores among open models with Web-based retrieval (0.714 Factuality, 0.556 Coverage)
GPT-4 shows superior performance with Web-based retrieval (0.748 Factuality, 0.551 Coverage), outperforming smaller models like Openchat 3.5

Breakthrough Assessment

8/10

Significant contribution by unifying factuality and coverage into a single interpretable metric. Addresses a critical blind spot in current LLM evaluation (omission bias) with a practical, modular framework.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of long-form text output y generated in response to input query x

Inputs: Input query x, generated text y, and a knowledge source K (corpus or web)

Outputs: Three scalar scores: Factuality Score, Coverage Score, and their weighted harmonic mean ICAT_beta

Pipeline Flow

Atomic Claim Generation: y → set of claims C
Claim Grounding: C → set of factually verified claims C_T
Aspect Identification: Input x → set of expected aspects/subtopics
Alignment: Map C_T to aspects
Scoring: Compute Factuality, Coverage, and ICAT_beta
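
The five steps above can be sketched as a single function with each module passed in as a callable. This is a minimal illustration, not the authors' implementation: the function names (`evaluate`, `icat_beta`), the callable signatures, the F-beta placement of beta, and the toy stand-ins in the usage lines are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Set


def icat_beta(s_fact: float, s_cov: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean in F-beta form (placement of beta is an assumption)."""
    denom = beta**2 * s_fact + s_cov
    return 0.0 if denom == 0 else (1 + beta**2) * s_fact * s_cov / denom


def evaluate(
    query: str,
    text: str,
    split_claims: Callable[[str], List[str]],      # y -> C
    is_supported: Callable[[str], bool],           # claim grounding against K
    expected_aspects: Callable[[str], List[str]],  # x -> aspects
    align: Callable[[str, List[str]], Set[str]],   # verified claim -> aspects it covers
    beta: float = 1.0,
) -> Dict[str, float]:
    claims = split_claims(text)
    verified = [c for c in claims if is_supported(c)]
    aspects = expected_aspects(query)
    covered: Set[str] = set()
    for claim in verified:
        # Only count labels that are in the expected aspect list.
        covered |= align(claim, aspects) & set(aspects)
    s_fact = len(verified) / len(claims) if claims else 0.0
    s_cov = len(covered) / len(aspects) if aspects else 0.0
    return {"s_fact": s_fact, "s_coverage": s_cov, "icat": icat_beta(s_fact, s_cov, beta)}


# Toy stand-ins (hypothetical): a sentence splitter, a keyword "verifier",
# two hand-written aspects, and an aligner that maps every claim to "benefits".
result = evaluate(
    query="What are the health effects of coffee?",
    text="Coffee improves alertness. Coffee may lower the risk of some diseases. Coffee cures insomnia.",
    split_claims=lambda t: [s.strip() + "." for s in t.split(".") if s.strip()],
    is_supported=lambda c: "cures" not in c,
    expected_aspects=lambda q: ["benefits", "risks"],
    align=lambda c, aspects: {"benefits"},
)
print(result)
```

In this toy run two of three claims are verified and one of two aspects is covered, so the response is penalized for omitting risks even though most of its claims pass verification.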

System Modules

Atomic Claim Generator

Break down long text into self-contained atomic statements

Model or implementation: Llama 3.1 8B (fine-tuned with QLoRA)

Retriever (Verification)

Find relevant evidence for each claim

Model or implementation: hf.co/Snowflake/snowflake-arctic-embed-m (dense embedding model)
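
Dense retrieval of this kind ranks candidate snippets by the similarity between the claim embedding and each snippet embedding. The sketch below uses cosine similarity over hand-made three-dimensional vectors; in a real pipeline the vectors would come from the embedding model. The function names, the choice of cosine similarity, and all vector values are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(claim_vec: Sequence[float],
          snippets: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the k snippet texts most similar to the claim vector."""
    ranked = sorted(snippets, key=lambda s: cosine(claim_vec, s[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy embeddings (made up); a real system would embed text with the model.
snippets = [
    ("Caffeine can disrupt sleep.", [0.9, 0.1, 0.0]),
    ("Coffee is grown in Brazil.", [0.0, 0.2, 0.9]),
    ("Caffeine raises alertness.", [0.7, 0.6, 0.1]),
]
claim_vec = [1.0, 0.2, 0.0]  # toy embedding of "Coffee affects sleep."
evidence = top_k(claim_vec, snippets, k=2)
print(evidence)
```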

NLI Verifier (Verification)

Verify if retrieved snippets support the claim

Model or implementation: DeBERTa V3 (fine-tuned on MultiNLI, FEVER, ANLI)
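
One plausible decision rule for this step is to count a claim as verified when at least one retrieved snippet entails it. The sketch below shows that rule with a hypothetical `entail_prob` callable standing in for the NLI model. The 0.5 threshold, the "any snippet" aggregation, and the toy word-overlap scorer are assumptions, not details taken from the paper.

```python
from typing import Callable, List


def verify_claim(claim: str,
                 evidence: List[str],
                 entail_prob: Callable[[str, str], float],
                 threshold: float = 0.5) -> bool:
    """A claim is supported if any evidence snippet entails it above the threshold."""
    return any(entail_prob(premise, claim) >= threshold for premise in evidence)


def verified_claims(claims: List[str],
                    retrieve: Callable[[str], List[str]],
                    entail_prob: Callable[[str, str], float],
                    threshold: float = 0.5) -> List[str]:
    """C -> C_T: keep only the claims supported by their retrieved evidence."""
    return [c for c in claims if verify_claim(c, retrieve(c), entail_prob, threshold)]


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Toy scorer: fraction of hypothesis words found in the premise (NOT a real NLI model)."""
    hyp = set(hypothesis.lower().rstrip(".").split())
    prem = set(premise.lower().rstrip(".").split())
    return len(hyp & prem) / len(hyp) if hyp else 0.0


evidence = ["Caffeine can disrupt sleep in adults."]
claims = ["Caffeine can disrupt sleep.", "Coffee cures insomnia."]
c_t = verified_claims(claims, lambda c: evidence, toy_entail_prob)
print(c_t)
```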

Aspect Generator (Coverage Analysis)

Generate expected aspects for the input query (if not provided)

Model or implementation: Llama 3.1 8B (Base model via vLLM)

Alignment Module (Coverage Analysis)

Map verified claims to the aspects they cover

Model or implementation: Llama 3.1 8B (Base model via vLLM)
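
Once verified claims have been mapped to aspects, the coverage score and a per-aspect explanation follow from inverting that mapping. The sketch below assumes the alignment output is a dictionary from claim text to a list of aspect labels. That data shape, the function name, and the example aspects (taken from the coffee example) are illustrative assumptions.

```python
from typing import Dict, List


def coverage_report(aspects: List[str], alignment: Dict[str, List[str]]) -> dict:
    """Invert claim -> aspects into aspect -> claims and compute the coverage ratio."""
    by_aspect: Dict[str, List[str]] = {a: [] for a in aspects}
    for claim, covered in alignment.items():
        for aspect in covered:
            if aspect in by_aspect:  # ignore labels outside the expected aspect list
                by_aspect[aspect].append(claim)
    missing = [a for a in aspects if not by_aspect[a]]
    score = (len(aspects) - len(missing)) / len(aspects) if aspects else 0.0
    return {"by_aspect": by_aspect, "missing": missing, "s_coverage": score}


aspects = ["alertness", "disease protection", "anxiety", "sleep disruption"]
alignment = {  # hypothetical output of the alignment module
    "Coffee improves alertness.": ["alertness"],
    "Coffee may lower the risk of some diseases.": ["disease protection"],
}
report = coverage_report(aspects, alignment)
print(report["s_coverage"], report["missing"])
```

The `missing` list makes the omission explicit: here the response covers the two benefits but none of the risks.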

Novel Architectural Elements

Integration of aspect-alignment logic directly into the factuality evaluation pipeline
Three-variant design (ICAT-M, ICAT-S, ICAT-A) allowing flexibility between ground-truth reliance and full automation

Modeling

Base Model: Llama 3.1 8B used for Claim Generation, Aspect Generation, and Alignment

Training Method: Supervised Fine-Tuning (SFT) with QLoRA

Adaptation: LoRA (rank=64, alpha=16)

Trainable Parameters: Not reported in the paper

Training Data:

1000 synthetic examples generated by Llama 3.1 405B
Process: Generated 200 topics → 5 entities each → paragraphs → atomic claims

Key Hyperparameters:

learning_rate: 2e-4
batch_size: 16
epochs: 1
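
Two quantities follow from the settings above by simple arithmetic, sketched here under stated assumptions: the standard LoRA scaling factor alpha / rank, and the number of optimizer steps in one epoch if all 1000 examples are used for training and 16 is the effective batch size (no gradient accumulation). Neither derived number is reported in the summary, and the dictionary keys are names chosen for this example.

```python
import math

# Values copied from the summary; key names are illustrative.
config = {
    "lora_rank": 64,
    "lora_alpha": 16,
    "learning_rate": 2e-4,
    "batch_size": 16,
    "epochs": 1,
    "num_examples": 1000,
}

# Standard LoRA scales the adapter update by alpha / rank.
lora_scaling = config["lora_alpha"] / config["lora_rank"]

# Optimizer steps, assuming every example is seen once per epoch
# and the last partial batch is kept.
steps = math.ceil(config["num_examples"] / config["batch_size"]) * config["epochs"]

print(lora_scaling, steps)
```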

Compute: Fine-tuned 8B model uses ~8.75x less memory than 70B zero-shot alternative

Comparison to Prior Work

vs. FActScore: Adds coverage dimension; FActScore rewards safe, short answers, while ICAT penalizes omission of relevant aspects
vs. BERTScore/ROUGE: Evaluates semantic facts and coverage rather than surface-level or embedding similarity
vs. AutoNuggetizer: ICAT provides finer-grained explainability by aligning specific atomic claims to aspects rather than just checking nugget presence
vs. EXAM++: Validates factual accuracy of the claims supporting the answer, whereas EXAM++ focuses on whether the response answers exam questions

Limitations

Dependency on the quality of the retrieval corpus (e.g., ClueWeb or Web search results)
NLI models and dense embeddings based on small transformers (BERT/DeBERTa) degrade with input length
Smaller 8B models need fine-tuning to generate atomic claims effectively, whereas 70B+ models can be used zero-shot
Computational cost of running retrieval and NLI verification for every generated claim

Reproducibility

Code: https://github.com/algoprog/ICAT

Source code is available at https://github.com/algoprog/ICAT. Fine-tuned adapter weights for claim generation are mentioned, but no specific URL is given in the text (they are likely in the repository). The work uses public datasets (TREC Web Track, ClueWeb) and open models (Llama 3.1).

📊 Experiments & Results

Evaluation Setup

Evaluate outputs of 4 LLMs using ICAT framework on TREC Web Track queries

Benchmarks:

TREC Web Track (2009-2012) (Search result diversification / Long-form generation)

Metrics:

S_fact (Factuality Score)
S_coverage (Coverage Score)
ICAT_1 (Harmonic mean with beta=1)
Statistical methodology: Pearson/Spearman correlation with human judgments (implied by the text; specific coefficient values are not given in this summary)
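
Correlation with human judgments of this kind can be computed as in the sketch below, which implements Pearson and Spearman coefficients in plain Python. The score lists are made-up toy values, not results from the paper, and the function names are chosen for this example.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


metric_scores = [0.2, 0.4, 0.5, 0.9]  # toy metric outputs (made up)
human_scores = [1, 2, 4, 5]           # toy human ratings (made up)
r = pearson(metric_scores, human_scores)
rho = spearman(metric_scores, human_scores)
print(r, rho)
```

Spearman depends only on the ordering of the scores, so it suits the case where the metric and the human ratings are on different scales.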

Key Results

Performance of LLMs using Web-based retrieval for grounding (Brave Search API). The factuality and coverage values in the Baseline and This Paper columns match those reported above for Llama-3-70B and GPT-4, respectively.

Benchmark	Metric	Baseline	This Paper	Δ
TREC Web Track	Factuality Score (S_fact)	0.714	0.748	+0.034
TREC Web Track	Coverage Score (S_coverage)	0.556	0.551	-0.005
TREC Web Track	ICAT-A1 Score	0.578	0.627	+0.049

Performance of LLMs using Corpus-based retrieval (ClueWeb).

Benchmark	Metric	Baseline	This Paper	Δ
TREC Web Track (ClueWeb)	Factuality Score (S_fact)	0.327	0.343	+0.016
TREC Web Track (ClueWeb)	Coverage Score (S_coverage)	0.335	0.327	-0.008

Experiment Figures

Example of atomic claim generation process

Main Takeaways

Web-based retrieval yields substantially higher factuality scores than the ClueWeb corpus (roughly 0.71-0.75 vs. 0.33-0.34), likely due to better information currency and breadth
There is a trade-off between models: GPT-4 leads in factuality and overall ICAT scores, while Llama-3-70B remains very competitive in coverage
Fine-tuning a small model (Llama 3.1 8B) for claim extraction is more efficient and produces better-formulated claims than prompting a large model zero-shot (decontextualization precision of 0.838 vs. 0.753)
ICAT scores correlate with human judgments, validating the framework's ability to automate quality assessment

📚 Prerequisite Knowledge

Prerequisites

Familiarity with RAG (Retrieval-Augmented Generation) concepts
Understanding of atomic claim decomposition
Basic knowledge of NLI (Natural Language Inference) for fact verification

Key Terms

atomic claim: A decomposed, self-contained, and decontextualized sentence containing a single fact derived from the original long-form text

ICAT: Information Coverage and Accuracy for Text generation—the proposed framework evaluating both factuality and aspect coverage

NLI: Natural Language Inference—a task determining if a hypothesis (claim) is entailed by, contradicts, or is neutral to a premise (retrieved evidence)

FActScore: A prior metric measuring the ratio of atomic claims supported by a knowledge source, which ICAT extends by adding coverage analysis

DeBERTa: Decoding-enhanced BERT with disentangled attention—a transformer model used here for the NLI task of verifying claims against evidence

TREC Web Track: A standard information retrieval benchmark used here because its queries include aspect-level relevance annotations suitable for measuring diversity

ClueWeb: A large web crawl dataset used as the retrieval corpus for grounding claims in the experiments

QLoRA: Quantized Low-Rank Adaptation, an efficient fine-tuning method that reduces memory usage by quantizing the frozen base model (typically to 4-bit precision) and training only small low-rank adapters