Improving Model Factuality with Fine-grained Critique-based Evaluator

📝 Paper Summary

Hallucination detection, Factuality evaluation, RLHF / Preference optimization

FenCE is a fine-grained evaluator trained on public datasets augmented with tool-retrieved documents and textual critiques, which is then used to align generators to be more factual via critique-based revision and preference optimization.

Core Problem

Existing factuality evaluators rely on restricted sources (e.g., Wikipedia only) and output opaque binary scores, while generator training often forces models to hallucinate lesser-known facts they haven't memorized.

Why it matters:

Hallucination remains a persistent issue where LLMs blur the line between memorized facts and plausible-sounding errors
Current evaluator training data lacks diversity in evidence sources and interpretability in feedback
Standard preference training can hurt factuality by reinforcing the generation of obscure facts the model does not actually know

Concrete Example: When a generator claims something incorrect about a specific news event, a standard evaluator might just output '0.2 score' based on a single Wikipedia snippet. FenCE retrieves diverse news articles, generates a critique explaining *why* it's wrong (e.g., 'contradicted by CNN report'), and guides the generator to remove the claim if it's unknown to the model.

Key Novelty

Fine-grained Critique-based Evaluator (FenCE)

Augments public factuality datasets by using tools (Search, Knowledge Graph) to find diverse evidence and prompting strong LLMs to generate explanatory textual critiques alongside labels
Improves generator factuality by using FenCE to critique and revise responses, specifically filtering out 'unknown' facts to prevent forcing the model to hallucinate unmemorized information
Uses FenCE to score both original and revised responses to create preference pairs for DPO (Direct Preference Optimization) training

Architecture

Overview of the FenCE framework, split into (a) Evaluator Training and (b) Generator Training.

Evaluation Highlights

FenCE improves Llama-3-8B-Instruct's balanced accuracy on the LLM-AggreFact benchmark by 8.3 points (73.2 → 81.5), outperforming significantly larger models like Mistral-Large-123B and Claude-3 Opus
Fine-tuning Llama-2-7B-chat with FenCE feedback increases its FActScore factuality rate by 16.86 points (56.41 → 73.27), surpassing state-of-the-art methods like R-Tuning and PO-SO
On TruthfulQA, the FenCE-trained generator improves by 17.64 points over the base model (47.78 → 65.42), with a reported 3.99% margin over existing baselines

Breakthrough Assessment

8/10

Strong methodological contribution in data augmentation for evaluators and a sensible training recipe that avoids 'hallucination reinforcement' by filtering unknown facts. The resulting 8B evaluator outperforms much larger models such as Mistral-Large-123B and Claude-3 Opus.

⚙️ Technical Details

Problem Definition

Setting: Factuality evaluation mapping (claim, document) pairs to labels and critiques; Generator alignment mapping prompts to factual responses

Inputs: Claim c, Source Document d (for evaluator); Prompt x (for generator)

Outputs: Label l in {Supported, Contradictory, Unverified}, Critique r (for evaluator); Response y (for generator)
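
The evaluator's input and output can be pictured as small typed records. The sketch below is illustrative only: the class names, the prompt template, and the choice to emit the critique before the label are assumptions, not the paper's actual format.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # The three labels listed in the problem definition.
    SUPPORTED = "Supported"
    CONTRADICTORY = "Contradictory"
    UNVERIFIED = "Unverified"


@dataclass(frozen=True)
class EvaluatorInput:
    claim: str      # claim c
    document: str   # source document d


@dataclass(frozen=True)
class EvaluatorOutput:
    label: Label    # label l
    critique: str   # textual critique r


def to_sft_example(inp: EvaluatorInput, out: EvaluatorOutput) -> dict:
    """Serialize one (claim, document) -> (critique, label) training pair.

    The template strings are hypothetical placeholders.
    """
    prompt = f"Document: {inp.document}\nClaim: {inp.claim}\nJudge the claim."
    target = f"Critique: {out.critique}\nLabel: {out.label.value}"
    return {"prompt": prompt, "target": target}
```

Writing the critique first lets the label be conditioned on the explanation during decoding; whether FenCE orders its outputs this way is not stated in this summary.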

Pipeline Flow

Evaluator Training: Public Datasets → Augmentation (Tools + Critique Generation) → FenCE SFT
Generator Training: Prompt → Sample Responses → FenCE Critique → Revision (Filter Unknowns) → DPO
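
The generator-training flow above can be sketched as a loop that builds DPO preference pairs. This is a minimal sketch under stated assumptions: `sample`, `revise`, and `score` are hypothetical callables standing in for the generator, the FenCE-guided revision step, and FenCE scoring, and the "best versus worst candidate" pairing rule is an assumption rather than the paper's documented procedure.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for model calls.
Sampler = Callable[[str], List[str]]   # prompt -> sampled responses
Reviser = Callable[[str, str], str]    # (prompt, response) -> revised response
Scorer = Callable[[str, str], float]   # (prompt, response) -> factuality score


def build_preference_pairs(
    prompts: List[str], sample: Sampler, revise: Reviser, score: Scorer
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for DPO training."""
    pairs = []
    for x in prompts:
        originals = sample(x)
        # Score both the original responses and their revisions.
        candidates = originals + [revise(x, y) for y in originals]
        scored = [(score(x, y), y) for y in candidates]
        if not scored:
            continue
        best = max(scored, key=lambda t: t[0])
        worst = min(scored, key=lambda t: t[0])
        # Keep the pair only if the scores actually differ.
        if best[0] > worst[0]:
            pairs.append((x, best[1], worst[1]))
    return pairs
```

Prompts whose candidates all receive the same score yield no pair, since such a pair would carry no preference signal.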

System Modules

FenCE Evaluator

Judge claim factuality and generate textual critiques based on source documents

Model or implementation: Llama-3-8B-Instruct

Critique Generator (Data Augmentation)

Generate explanations for ground-truth labels in training data

Model or implementation: Llama-3-70B-chat

Tool-Augmenter (Data Augmentation)

Fetch diverse documents for claims in training data

Model or implementation: Llama-3-70B-chat + Bing Search / Wikipedia / Google KG

Generator (Target)

Generate factual responses to user prompts

Model or implementation: Llama-2-7B-chat or Llama-3-8B-chat

Novel Architectural Elements

Self-knowledge filtering logic during response revision: specifically prompting the generator 'Is this claim factual?' without external context to detect and remove 'unknown' facts before training
Dual-augmentation pipeline for evaluator training: simultaneously augmenting labels with critiques AND augmenting inputs with multi-tool retrieved documents
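
The self-knowledge filter described above can be sketched as a closed-book yes/no check on each claim. The prompt wording beyond 'Is this claim factual?', the yes/no parsing, and the `ask_generator` callable are all assumptions made for illustration.

```python
from typing import Callable, List


def self_knowledge_prompt(claim: str) -> str:
    # No retrieved document is included: the generator must rely on its own memory.
    return f"Is this claim factual? Answer Yes or No.\nClaim: {claim}"


def filter_unknown_claims(
    claims: List[str], ask_generator: Callable[[str], str]
) -> List[str]:
    """Keep only claims the generator itself affirms without external context.

    `ask_generator` is a hypothetical callable mapping a prompt to the
    generator's text answer.
    """
    kept = []
    for claim in claims:
        answer = ask_generator(self_knowledge_prompt(claim))
        if answer.strip().lower().startswith("yes"):
            kept.append(claim)
    return kept
```

Claims that fail the check are dropped from the revised response, so the preference data does not reward the generator for stating facts it cannot reproduce from memory.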

Modeling

Base Model: Llama-3-8B-Instruct (for Evaluator); Llama-2-7B-chat / Llama-3-8B-chat (for Generator)

Training Method: SFT followed by DPO

Objective Functions:

Purpose: Train evaluator to generate label and critique.

Formally: Standard conditional language modeling objective maximizing likelihood of (critique, label) given (claim, document).

Purpose: Align generator to prefer factual responses.

Formally: DPO loss L_DPO = -log sigma(beta * log(pi_theta(yw|x)/pi_ref(yw|x)) - beta * log(pi_theta(yl|x)/pi_ref(yl|x)))
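
For a single preference pair, the DPO loss above reduces to a few lines of arithmetic on sequence log-probabilities. The sketch below assumes those log-probabilities are already computed; `beta=0.1` is a placeholder value, since the summary reports no hyperparameters.

```python
import math


def dpo_loss(
    logp_w: float,      # log pi_theta(yw | x)
    logp_l: float,      # log pi_theta(yl | x)
    ref_logp_w: float,  # log pi_ref(yw | x)
    ref_logp_l: float,  # log pi_ref(yl | x)
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Per-pair DPO loss: -log sigma(beta * (log-ratio_w - log-ratio_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss falls below that as the policy raises the likelihood of the preferred response yw relative to the rejected response yl.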

Adaptation: Full fine-tuning

Training Data:

Evaluator Training: 10 public datasets (XSum, QAGS, FRANK, etc.) augmented with critiques and tool-retrieved docs
Generator Training: 20k prompts from MixInstruct; Responses revised by FenCE + Self-Knowledge Check

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. R-Tuning: FenCE edits/corrects responses rather than just refusing to answer
vs. PO-SO: FenCE uses a dedicated external evaluator (itself) rather than the generator's self-critique to score preference pairs, reducing self-bias
vs. Fact-Tuning: FenCE incorporates diverse tool-based evidence and explicitly filters 'unknown' facts to avoid hallucination reinforcement
vs. MiniCheck [not cited in paper]: FenCE generates natural language critiques alongside scores, whereas MiniCheck focuses on binary verification efficiency

Limitations

Depends on the quality of the Llama-3-70B teacher model for generating initial critiques and handling tool calls
Filtering 'unknown' facts relies on the generator's ability to self-diagnose its knowledge, which may not always be calibrated
Computational cost of retrieving documents and generating critiques for every training example is higher than simple scalar scoring

Reproducibility

No code release is reported. The evaluation uses public benchmarks (LLM-AggreFact, FActScore, TruthfulQA). Detailed prompt templates for augmentation and revision are likely in the paper's appendices, although the excerpt does not link them explicitly.

📊 Experiments & Results

Evaluation Setup

Factuality evaluation on LLM-AggreFact; Generator improvement on FActScore and TruthfulQA

Benchmarks:

LLM-AggreFact (Factuality Judgment; aggregation of 10 datasets)
FActScore (Biography Generation Factuality)
TruthfulQA (QA Truthfulness)

Metrics:

Balanced Accuracy (BAcc)
FActScore (Factuality Rate)
TruthfulQA % True
Statistical methodology: Not explicitly reported in the paper
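
Two of these metrics are simple enough to sketch directly. The functions below follow the definitions given under Key Terms: balanced accuracy as the mean of sensitivity and specificity, and a FActScore-style factuality rate as the fraction of atomic claims judged supported. The function names and the single-response simplification are assumptions; the benchmark's own scoring code may differ in detail.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Arithmetic mean of true positive rate and true negative rate."""
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (tp / pos + tn / neg)


def factuality_rate(claim_labels: Sequence[str]) -> float:
    """Fraction of a response's atomic claims labeled 'Supported'."""
    if not claim_labels:
        raise ValueError("no atomic claims to score")
    supported = sum(1 for label in claim_labels if label == "Supported")
    return supported / len(claim_labels)
```

A classifier that predicts the majority class on a 4-to-1 imbalanced set reaches 80% plain accuracy but only 0.5 balanced accuracy, which is why the latter is preferred for imbalanced factuality data.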

Key Results

Evaluator performance on LLM-AggreFact shows FenCE outperforms much larger models.

Benchmark	Metric	Baseline	This Paper	Δ
LLM-AggreFact	Balanced Accuracy (BAcc)	73.2	81.5	+8.3
LLM-AggreFact	Balanced Accuracy (BAcc)	80.4	81.5	+1.1
LLM-AggreFact	Balanced Accuracy (BAcc)	79.3	81.5	+2.2
LLM-AggreFact	Balanced Accuracy (BAcc)	78.6	81.5	+2.9

Generator improvement results demonstrating the effectiveness of FenCE-based training.

Benchmark	Metric	Baseline	This Paper	Δ
FActScore	Factuality Rate	56.41	73.27	+16.86
FActScore	Factuality Rate	64.44	73.27	+8.83
TruthfulQA	% True	47.78	65.42	+17.64

Experiment Figures

The iterative revision process for generator responses.

Main Takeaways

Augmenting evaluator training data with tool-retrieved documents and textual critiques significantly boosts judgment accuracy, allowing an 8B model to outperform far larger models such as Mistral-Large-123B and Claude-3 Opus.
The 'Self-Knowledge Check' (filtering out facts the model doesn't know) is crucial for factuality training; it prevents the common pitfall where RLHF reinforces hallucination of obscure details.
FenCE-based training generalizes well, showing improvements across both biography generation (FActScore) and QA truthfulness (TruthfulQA).

📚 Prerequisite Knowledge

Prerequisites

Instruction tuning / Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
RAG (Retrieval Augmented Generation) concepts

Key Terms

DPO: Direct Preference Optimization, an algorithm that optimizes language models to align with human preferences using pairs of preferred and rejected responses without a separate reward model

SFT: Supervised Fine-Tuning, i.e., training a model on a dataset of input-output pairs to learn a specific behavior or task

FenCE: Fine-grained Critique-based Evaluator, the proposed evaluator model that provides scores and textual critiques

Balanced Accuracy (BAcc): A metric calculated as the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), useful for imbalanced datasets

FActScore: A metric that decomposes a generation into atomic claims and verifies each claim against a knowledge source (like Wikipedia) to calculate a factuality percentage

LLM-AggreFact: A benchmark aggregating multiple factuality datasets covering tasks like summarization and QA