OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

📝 Paper Summary

Factuality Evaluation Model-based Evaluation

OpenFActScore is an open-source reimplementation of the FActScore framework that replaces proprietary dependencies (InstructGPT/ChatGPT) with open models like Olmo and Gemma while maintaining high correlation with the original metric.

Core Problem

The original FActScore metric relies on expensive, closed-source models (InstructGPT, ChatGPT) for atomic fact generation and validation, limiting reproducibility and increasing costs.

Why it matters:

High costs of proprietary API calls prevent large-scale factuality evaluation
Closed-source model updates (e.g., GPT versions changing) make historical comparisons unreliable
The original reliance on legacy LLaMA weights that are no longer easily accessible hinders adoption

Concrete Example: To evaluate a biography generated by a model, the original FActScore requires sending the text to InstructGPT to break it into facts, then to ChatGPT to verify them against Wikipedia. If OpenAI changes these models or raises prices, the evaluation standard shifts or becomes unaffordable.

Key Novelty

Fully Open-Source Factuality Pipeline

Replaces the InstructGPT dependency for Atomic Fact Generation with open models (specifically Olmo) using system prompts and in-context demonstrations
Replaces the ChatGPT/fine-tuned LLaMA dependency for Atomic Fact Validation with open models (specifically Gemma) using retrieval-augmented verification
Refactors the codebase to support any Hugging Face compatible model via chat templates, removing hardcoded dependencies on specific proprietary APIs

Evaluation Highlights

0.99 Pearson correlation with the original FActScore implementation when ranking 10 different LLMs
Gemma and Olmo achieve high semantic overlap (BERTScore ~0.86) with human-annotated atomic facts, approximating closed-source performance
Gemma achieves a cumulative Error Rate of 12.2 relative to human judgment for fact validation, outperforming other open alternatives like Qwen

Breakthrough Assessment

7/10

While not introducing a new metric per se, it democratizes a critical evaluation framework by removing proprietary barriers. The high correlation (0.99) confirms it is a valid drop-in replacement.

⚙️ Technical Details

Problem Definition

Setting: Model-based evaluation of long-form text factuality (specifically biographies)

Inputs: A generated text (biography) and a topic entity (e.g., 'Barack Obama')

Outputs: A scalar FActScore (precision of supported atomic facts / total atomic facts)

Pipeline Flow

Input Generation (Biography)
Atomic Fact Generation (AFG)
Retrieval (GTR)
Atomic Fact Validation (AFV)
Score Calculation

System Modules

Atomic Fact Generator

Decompose sentences into atomic facts using in-context learning

Model or implementation: Olmo (specifically OLMo-7B-Instruct presumably, based on context)

Retriever (Verification)

Find relevant knowledge source documents for a specific fact

Model or implementation: GTR (Generalizable T5-based Retriever)

Atomic Fact Validator (Verification)

Determine if the atomic fact is supported by the retrieved passages

Model or implementation: Gemma (specifically Gemma-7B-Instruct presumably)

Novel Architectural Elements

HFModel Class Refactoring: A unified wrapper for Hugging Face models enabling support for Chat Templates and System Prompts, which were absent in the original OpenAI-centric implementation

Modeling

Base Model: Olmo (for AFG) and Gemma (for AFV)

Comparison to Prior Work

vs. Original FActScore: Replaces all proprietary model calls with open weights (Olmo/Gemma) while maintaining correlation.
vs. SAFE [not cited in paper]: SAFE uses a search engine; OpenFActScore relies on a static Wikipedia dump and dense retrieval, focusing on self-contained open-weight reproducibility.

Limitations

Open models generally perform worse than proprietary models on the raw metrics (BERTScore, Error Rate), even if rankings correlate.
Qwen model failed to follow instructions concisely, producing verbose 'thinking' outputs that broke the evaluation format.
Olmo showed high generation quality but poor validation capability (high Error Rate), necessitating a hybrid pipeline (Olmo for generation, Gemma for validation).

Reproducibility

Code: https://github.com/lflage/OpenFActScore

📊 Experiments & Results

Evaluation Setup

Re-evaluation of the original FActScore benchmark (183 Wikipedia entities)

Benchmarks:

FActScore Benchmark (Human Annotations) (Factuality Evaluation)

Metrics:

BERTScore-F1 (for semantic similarity of generated atomic facts)
Error Rate (deviation from human-annotated FActScore)
Pearson Correlation (ranking agreement with original metric)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Atomic Fact Generation (AFG) evaluation measuring semantic overlap with human-annotated facts.
FActScore Data	BERTScore-F1	0.619	0.867	+0.248
Atomic Fact Validation (AFV) evaluation measuring deviation (Error Rate) from human judgment.
FActScore Data	Cumulative Error Rate	55.6	12.2	-43.4
System-level correlation comparing the proposed OpenFActScore pipeline against the original proprietary FActScore.
10 LLM Biography Outputs	Pearson Correlation	1.0	0.99	-0.01

Main Takeaways

Open-source models can approximate proprietary models for factuality evaluation, with Gemma (validation) and Olmo (generation) being the optimal combination.
Model selection matters: Olmo is excellent at generating facts (high BERTScore) but poor at validating them (high Error Rate), while Gemma is consistent at both.
Chat templates and system prompts are crucial for getting open models to follow the rigid formatting requirements of the FActScore pipeline.

📚 Prerequisite Knowledge

Prerequisites

Understanding of FActScore (Atomic Fact Generation/Validation)
Familiarity with RAG (Retrieval-Augmented Generation) for verification
Basic knowledge of Hugging Face Transformers library

Key Terms

AFG: Atomic Fact Generation—breaking down long sentences into short, indivisible factual claims (atomic facts)

AFV: Atomic Fact Validation—verifying whether an atomic fact is supported by a trusted knowledge source (e.g., Wikipedia)

FActScore: A metric measuring the percentage of atomic facts in a generated text that are supported by a knowledge source

BERTScore-F1: A semantic similarity metric using contextual embeddings to compare generated text against reference text

GTR: Generalizable T5-based Retriever—a dense retrieval model used to find relevant Wikipedia passages

System Prompt: Initial instructions given to a chat model to define its behavior (used here to guide open models to act like InstructGPT)

Chat Template: A formatting standard for converting conversation history into a single string for LLMs (e.g., wrapping user inputs in specific tokens)

Pearson correlation: A statistic measuring linear correlation between two sets of data (here, ranking agreement between original and open metrics)