← Back to Paper List

OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

Lucas Fonseca Lage, Simon Ostermann
Philipps-Universität Marburg, German Research Centre for Artificial Intelligence, Centre for European Research in Trusted AI, Saarland University
arXiv (2025)
Factuality Benchmark RAG

📝 Paper Summary

Factuality Evaluation Model-based Evaluation
OpenFActScore is an open-source reimplementation of the FActScore framework that replaces proprietary dependencies (InstructGPT/ChatGPT) with open models like Olmo and Gemma while maintaining high correlation with the original metric.
Core Problem
The original FActScore metric relies on expensive, closed-source models (InstructGPT, ChatGPT) for atomic fact generation and validation, limiting reproducibility and increasing costs.
Why it matters:
  • High costs of proprietary API calls prevent large-scale factuality evaluation
  • Closed-source model updates (e.g., GPT versions changing) make historical comparisons unreliable
  • The original reliance on legacy LLaMA weights that are no longer easily accessible hinders adoption
Concrete Example: To evaluate a biography generated by a model, the original FActScore requires sending the text to InstructGPT to break it into facts, then to ChatGPT to verify them against Wikipedia. If OpenAI changes these models or raises prices, the evaluation standard shifts or becomes unaffordable.
Key Novelty
Fully Open-Source Factuality Pipeline
  • Replaces the InstructGPT dependency for Atomic Fact Generation with open models (specifically Olmo) using system prompts and in-context demonstrations
  • Replaces the ChatGPT/fine-tuned LLaMA dependency for Atomic Fact Validation with open models (specifically Gemma) using retrieval-augmented verification
  • Refactors the codebase to support any Hugging Face compatible model via chat templates, removing hardcoded dependencies on specific proprietary APIs
Evaluation Highlights
  • 0.99 Pearson correlation with the original FActScore implementation when ranking 10 different LLMs
  • Gemma and Olmo achieve high semantic overlap (BERTScore ~0.86) with human-annotated atomic facts, approximating closed-source performance
  • Gemma achieves a cumulative Error Rate of 12.2 relative to human judgment for fact validation, outperforming other open alternatives like Qwen
Breakthrough Assessment
7/10
While not introducing a new metric per se, it democratizes a critical evaluation framework by removing proprietary barriers. The high correlation (0.99) confirms it is a valid drop-in replacement.
×