
VeriFastScore: Speeding up long-form factuality evaluation

Rishanth Rajendhran, Amir Zadeh, Matthew Sarte, Chuan Li, Mohit Iyyer
University of Maryland, College Park, Lambda Labs
arXiv (2025)
Factuality Benchmark RL

📝 Paper Summary

Long-form factuality evaluation Automated evaluation metrics
VeriFastScore accelerates long-form factuality evaluation by training a single Llama-3 model to simultaneously extract and verify claims against bulk evidence, replacing slow multi-step pipelines.
Core Problem
Existing long-form factuality metrics like VeriScore require a slow, multi-stage pipeline (claim decomposition → per-claim retrieval → verification), often incurring ~60 LLM/API calls per response.
Why it matters:
  • High latency (~100 seconds per response) makes current metrics impractical for real-time evaluation or large-scale benchmarks
  • Excessive API costs limit the use of factuality metrics as reward signals in reinforcement learning (e.g., RLHF)
  • Standard few-shot prompting of closed models (e.g., GPT-4o) fails at this complex task, achieving low correlation with ground truth
Concrete Example: A 14-sentence response typically yields ~23 claims. VeriScore triggers 14 extraction calls, 23 Google searches, and 23 verification calls. VeriFastScore replaces this with 1 search step and 1 model inference pass.
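The call count in this example can be checked with a few lines of arithmetic. The sketch below assumes one extraction call per sentence and one search plus one verification call per claim, as the example describes; the function name is hypothetical and not part of VeriScore.

```python
def veriscore_call_count(num_sentences: int, num_claims: int) -> int:
    """Count LLM/API calls for the multi-stage pipeline described above.

    Assumption (from the example): one extraction call per sentence,
    one search and one verification call per extracted claim.
    """
    extraction_calls = num_sentences
    search_calls = num_claims
    verification_calls = num_claims
    return extraction_calls + search_calls + verification_calls


total_calls = veriscore_call_count(num_sentences=14, num_claims=23)
print(total_calls)  # 14 + 23 + 23 = 60
```

The total matches the figure of roughly 60 calls per response quoted under Core Problem.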
Key Novelty
Single-Pass Decompose-and-Verify Evaluator
  • Replaces the sequential pipeline of extracting claims then verifying them individually with a single model pass that does both simultaneously using consolidated evidence
  • Retrieves evidence using full sentences rather than atomic claims, so evidence is gathered before decomposition and the model can verify claims in the context of the search results
  • Trains on high-quality synthetic data generated by the slower, rigorous VeriScore pipeline to distill its capability into a faster, open-weights model
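The control flow described in the bullets above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `split_sentences`, `single_pass_evaluate`, and the `search` and `model` callables are hypothetical names, the sentence splitter is deliberately naive, and issuing one query per sentence inside the search step is an assumption.

```python
from typing import Callable, List, Tuple

Verdict = Tuple[str, str]  # (claim text, label)


def split_sentences(text: str) -> List[str]:
    # Naive period-based splitter, for illustration only.
    return [s.strip() for s in text.split(".") if s.strip()]


def single_pass_evaluate(
    response: str,
    search: Callable[[str], List[str]],
    model: Callable[[str, str], List[Verdict]],
) -> List[Verdict]:
    # Step 1: gather evidence with sentence-level queries, before any
    # claim decomposition has happened.
    evidence: List[str] = []
    for sentence in split_sentences(response):
        evidence.extend(search(sentence))
    consolidated = "\n".join(evidence)
    # Step 2: a single model pass that both extracts and verifies claims
    # against the consolidated evidence.
    return model(response, consolidated)


# Stubs standing in for a real search backend and a fine-tuned model.
search_log: List[str] = []
model_log: List[str] = []


def fake_search(query: str) -> List[str]:
    search_log.append(query)
    return [f"snippet about: {query}"]


def fake_model(response: str, evidence: str) -> List[Verdict]:
    model_log.append(response)
    return [(s, "supported") for s in split_sentences(response)]


verdicts = single_pass_evaluate(
    "Paris is the capital of France. The Eiffel Tower opened in 1889.",
    fake_search,
    fake_model,
)
print(len(search_log), len(model_log), verdicts)
```

The point of the design is visible in the stubs: the model is invoked once per response regardless of how many claims it returns, whereas a per-claim pipeline scales its call count with the number of claims.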
Architecture
Figure 1: Comparison of the VeriScore pipeline vs. the VeriFastScore pipeline.
Evaluation Highlights
  • Achieves 0.80 Pearson correlation with the rigorous VeriScore pipeline, significantly outperforming GPT-4o few-shot (0.33 correlation)
  • Delivers a 6.64x overall wall-clock speedup (9.9x modeling speedup) compared to the original VeriScore pipeline
  • Maintains strong system-level correlation (r=0.94) with VeriScore rankings, ensuring reliable model comparison at a fraction of the cost
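For readers unfamiliar with the metric, the correlations above are Pearson coefficients between scores from the two evaluators. The sketch below computes one from scratch; the score lists are made-up toy values for illustration, not data from the paper, and the variable names are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-response scores (illustrative only, not results from the paper).
reference_scores = [0.20, 0.50, 0.70, 0.90]  # slow pipeline
fast_scores = [0.25, 0.45, 0.75, 0.85]       # single-pass evaluator

r = pearson(reference_scores, fast_scores)
print(f"Pearson r = {r:.3f}")
```

A system-level correlation would apply the same formula to per-model average scores instead of per-response scores.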
Breakthrough Assessment
8/10
Significantly lowers the cost and time barrier for high-quality factuality evaluation (6.64x overall speedup) while maintaining high correlation with VeriScore. Distilling a slow pipeline into a fast single model is a practical engineering breakthrough.