
FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs

Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo
University of California, Los Angeles; Hong Kong University of Science and Technology (Guangzhou); University of Cambridge
arXiv (2025)
Factuality, Benchmark, RAG

📝 Paper Summary

Long-form factuality evaluation, Hallucination detection
FaStfact improves long-form factuality evaluation by using dynamic chunking and confidence-based pre-verification to reduce costs, while scraping full web pages instead of snippets to ensure sufficient evidence.
Core Problem
Existing long-form factuality evaluators are inefficient due to redundant sentence-level processing and ineffective due to insufficient evidence (short snippets) and inaccurate claim extraction.
Why it matters:
  • Inefficient pipelines (high time/token costs) cannot scale to evaluate long documents generated by modern LLMs
  • Existing methods frequently produce unverifiable, redundant, or missing claims because sentence-level processing misses global context
  • Reliance on short search snippets (20-40 words) often leads to 'inconclusive' verification even when ample evidence exists online
Concrete Example: In a case study of SAFE, 68% of claims extracted from a GPT-3.5 response were problematic (redundant or unverifiable). Furthermore, verifiers often lack context because they only see short Google search snippets, leading to false negatives where true claims are labeled 'not enough evidence'.
Key Novelty
FaStfact (Chunk-based Extraction + Pre-verification + Full-Page Evidence)
  • Replaces sentence-level extraction with dynamic chunking to process larger contexts at once, reducing inference calls and capturing inter-sentence dependencies
  • Introduces confidence-based pre-verification where the LLM verifies 'easy' claims using internal knowledge, skipping external search if confidence is high
  • Fetches full web page content instead of short search snippets to create a comprehensive document-level knowledge base for 'hard' claims requiring external verification
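
The first two ideas above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed word budget (standing in for the paper's dynamic chunk-sizing rule, which this summary does not specify), the 0.9 confidence threshold, and the `pre_verify` callable (a placeholder for an LLM call) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


def chunk_sentences(text: str, max_words: int = 50) -> List[str]:
    """Group consecutive sentences into chunks of at most max_words words.

    Assumption: a fixed word budget stands in for the dynamic sizing rule.
    A single sentence longer than the budget becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


@dataclass
class PreVerdict:
    label: str         # e.g. "supported" or "not supported"
    confidence: float  # hypothetical self-reported confidence in [0, 1]


def route_claims(
    claims: List[str],
    pre_verify: Callable[[str], PreVerdict],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split claims into 'easy' (resolved internally) and 'hard' (need search)."""
    easy: List[Tuple[str, str]] = []
    hard: List[str] = []
    for claim in claims:
        verdict = pre_verify(claim)
        if verdict.confidence >= threshold:
            easy.append((claim, verdict.label))  # skip external search
        else:
            hard.append(claim)  # verify later against full-page evidence
    return easy, hard
```

In this sketch, only the claims returned in `hard` would go on to web search and full-page retrieval, which is where the summary says the cost savings come from.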
Evaluation Highlights
  • Achieves the highest alignment with human evaluation among the compared baselines, including FActScore and SAFE, on the new FaStfact-Bench
  • Significantly reduces processing time and token costs compared to SAFE and FActScore due to dynamic chunking and pre-verification
  • Reduces the rate of inconclusive verifications by providing full document-level evidence rather than truncated snippets
Breakthrough Assessment
8/10
Strong engineering contribution that fixes major efficiency bottlenecks in the standard decompose-then-verify pipeline while simultaneously improving evidence quality. The release of a fine-grained annotated benchmark is also valuable.