Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

📝 Paper Summary

Hallucination suppression, Factuality alignment

KLCF reduces hallucinations by aligning a model's generated text with its own pre-existing internal knowledge using a dual-reward system that penalizes fabrication and rewards recall of verified facts.

Core Problem

Existing RLHF methods rely on preference rewards that ignore the model's internal knowledge boundaries, encouraging models to fabricate facts they don't actually know to satisfy user queries (the 'hallucination tax').

Why it matters:

Standard RLHF often pushes models to generate plausible-sounding but factually incorrect content when the model lacks the underlying knowledge
Current factuality solutions like FActScore rely on slow external retrieval during training, making them computationally expensive and hard to scale for online RL
Existing methods often focus only on precision (avoiding errors), leading to overly conservative models that refuse to answer even when they possess the knowledge

Concrete Example: When asked a complex long-form question, a standard RLHF model might invent specific dates or names to make the answer look complete. In contrast, KLCF trains the model to output only facts that it was previously verified to 'know' (via a pre-computed checklist) and is confident about, reducing fabrication.

Key Novelty

Knowledge-Level Consistency Reinforcement Learning Framework (KLCF)

Dual-Fact Alignment: Simultaneously optimizes for 'Recall' (mentioning facts the model definitely knows) and 'Precision' (avoiding facts the model is unsure about)
Offline-to-Online Bridge: Instead of retrieving external data during RL training (which is slow), KLCF pre-computes a 'Checklist' of what the model knows and trains a distinct Truthfulness Reward model offline, making the online RL step purely internal and fast

Architecture

Comparison between KLCF (Left) and Previous Methods (Right). It illustrates the Offline Data Preparation phase feeding into the Online RL phase.

Evaluation Highlights

+3.4 to +10.0 improvement in F1 score on LongFact-Obj benchmark compared to RLHF baseline
Reduces hallucination rate by ~4-8 percentage points on LongFact-Obj compared to RLHF
Achieves comparable or better performance than retrieval-dependent methods (like FactTune-FS) while being significantly more efficient during training

Breakthrough Assessment

8/10

Strong conceptual contribution by decoupling factuality enforcement from expensive external retrieval during RL. The 'Checklist' approach effectively operationalizes the 'knowledge boundary' concept for practical training.

⚙️ Technical Details

Problem Definition

Setting: Long-form question answering where the goal is to maximize factuality without external retrieval during inference

Inputs: Natural language query q

Outputs: Structured response containing reasoning <think> and answer <answer>
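A minimal sketch of how such a structured response could be checked, assuming exactly one <think>...</think> block followed by one <answer>...</answer> block. The function name `format_reward` and the exact tag layout are illustrative assumptions, not taken from the paper.

```python
import re

# Assumed layout: one <think> block, then one <answer> block, nothing else.
_LAYOUT = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed tag layout, else 0.0."""
    # Each tag must appear exactly once, so repeated blocks are rejected.
    if any(response.count(tag) != 1 for tag in _TAGS):
        return 0.0
    return 1.0 if _LAYOUT.fullmatch(response) else 0.0
```

A binary check of this kind is cheap to run on every sampled response, which suits an online RL loop.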

Pipeline Flow

Offline Data Prep: Base Model Sampling → Claim Extraction → Verification → Checklist & Reward Data Construction
Online RL: Policy Model Generation → Dual Reward Calculation (Checklist + Truthfulness) → GRPO Update
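The offline stage above can be sketched roughly as follows. This is an illustration under stated assumptions: `extract_claims` and `verify` are hypothetical callables standing in for the claim extraction model and the Wikipedia-backed verifier, and the rule that only claims labeled Support enter the checklist is one plausible reading of the pipeline, not a confirmed detail. Balancing of the reward data is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class OfflineData:
    # query -> de-duplicated list of claims the verifier supported
    checklist: dict = field(default_factory=dict)
    # (claim, label) pairs for training the truthfulness reward model
    reward_examples: list = field(default_factory=list)


def build_offline_data(samples, extract_claims, verify):
    """Build the checklist and reward data from (query, response) samples.

    extract_claims(response) -> list of atomic claim strings (hypothetical).
    verify(claim) -> "Support" or "Refute" (hypothetical).
    """
    data = OfflineData()
    for query, response in samples:
        for claim in extract_claims(response):
            label = verify(claim)
            data.reward_examples.append((claim, label))
            if label == "Support":
                known = data.checklist.setdefault(query, [])
                if claim not in known:
                    known.append(claim)
    return data
```

Because all retrieval happens inside `verify`, nothing in the online loop needs to touch an external knowledge source.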

System Modules

Claim Extraction Model (Offline Data Preparation)

Parse model responses into atomic verifiable claims

Model or implementation: Not explicitly specified (lightweight model)

Verifier (Offline Data Preparation)

Check claims against Wikipedia to label them Support/Refute

Model or implementation: Qwen2.5-72B-Instruct

Truthfulness Reward Model

Predict probability P(True | claim) for generated claims during RL

Model or implementation: Trained on balanced dataset of Support/Refute claims

Policy Model

Generate long-form responses with reasoning steps

Model or implementation: Qwen2.5-7B-Instruct / Llama-3.1-8B-Instruct

Novel Architectural Elements

Dual-Fact Alignment Mechanism: Combining a static 'Checklist Reward' (recall) with a dynamic 'Truthfulness Reward' (precision) in the RL objective
External-Knowledge-Free RL Loop: Moving all retrieval and verification to the offline data prep phase, allowing the online RL loop to run without querying external databases

Modeling

Base Model: Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Encourage factual recall and precision against a pre-verified list.
Formally: Harmonic mean of Fact Recall (coverage of checklist) and Fact Precision (correctness against checklist).

Purpose: Penalize fabrication of new claims not in the checklist.
Formally: Average P(True) score from the Truthfulness Reward Model for all extracted claims.

Purpose: Maintain general quality and instruction following.
Formally: Reward from Skywork-Reward-V2-Llama-3.2-1B.

Purpose: Enforce reasoning format.
Formally: Binary reward for correct XML tag structure (<think>, <answer>).

Purpose: Prevent verbosity.
Formally: Piecewise penalty if length exceeds thresholds.
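As a rough illustration of the checklist-based objective listed above, the sketch below computes a harmonic mean of recall and precision. It assumes claims are matched to checklist items by exact string equality and that precision is the share of response claims found on the checklist; both are simplifications for illustration, and the paper's matching procedure may differ.

```python
def checklist_reward(response_claims, checklist):
    """Harmonic mean of recall and precision against a checklist.

    Exact string matching is an assumption made for this sketch.
    """
    response_set, checklist_set = set(response_claims), set(checklist)
    if not response_set or not checklist_set:
        return 0.0
    hits = len(response_set & checklist_set)
    if hits == 0:
        return 0.0
    recall = hits / len(checklist_set)    # coverage of the checklist
    precision = hits / len(response_set)  # share of claims on the checklist
    return 2 * recall * precision / (recall + precision)
```

With a four-item checklist and a two-claim response that hits one item, recall is 0.25, precision is 0.5, and the reward is 1/3. The harmonic mean keeps a response from scoring well by maximizing only one of the two.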

Training Data:

Prompts from ELI5, LongFact-Gen (regenerated via GPT-4), LongWiki-Gen
Wiki20250716 dump for verification
Balanced 2:1 Support/Refute dataset for Truthfulness Reward Model

Key Hyperparameters:

learning_rate: 5e-7 to 1e-6 (depending on model)
batch_size: 128 (global)
beta (KL penalty): 0.01 to 0.04
max_length: 2048 or 4096
kappa (Checklist weight): 0.25
lambda (Truthfulness weight): 0.25
mu (General reward weight): 0.5
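To make the three weights concrete, here is a minimal sketch that assumes the checklist, truthfulness, and general rewards are combined as a weighted sum. The additive form and the function name are assumptions; how the format reward and length penalty enter the total is not stated here, so they are left out.

```python
KAPPA = 0.25  # checklist weight
LAM = 0.25    # truthfulness weight ("lambda" is a reserved word in Python)
MU = 0.5      # general reward weight


def combined_reward(r_checklist: float, r_truth: float, r_general: float) -> float:
    """Weighted sum of the three reward terms (assumed additive form)."""
    return KAPPA * r_checklist + LAM * r_truth + MU * r_general
```

For example, component rewards of 0.4, 0.8, and 0.6 give 0.1 + 0.2 + 0.3 = 0.6. The weights sum to 1, so the two factuality terms together carry the same weight as the general reward.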

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. FactTune-FS: KLCF is external-knowledge-free during RL (faster) and optimizes recall, not just precision
vs. RLHF: KLCF adds specific factuality constraints based on the model's own knowledge boundary
vs. Self-RAG [not cited in paper]: Self-RAG generates retrieval tokens to control external knowledge use; KLCF focuses on aligning internal parametric knowledge without retrieval

Limitations

Relies on the quality of the offline verification step (using Qwen2.5-72B); errors there propagate to rewards
The construction of the 'Checklist' is static per query; it doesn't adapt if the model learns new facts during RL
Requires significant offline compute to generate samples and verify them against Wikipedia before RL begins

Reproducibility

Code: https://github.com/ki-ljl/KLCF

Models are available on HuggingFace. Uses Qwen2.5-72B-Instruct as a verifier and Skywork-Reward-V2 as a general reward model. Detailed prompts are provided in the appendix.

📊 Experiments & Results

Evaluation Setup

Long-form generation on diverse benchmarks

Benchmarks:

LongFact-Obj: long-form factuality generation (objective entities)
LongFact-Cpt: long-form factuality generation (conceptual topics)
Biographer: biography generation

Metrics:

F1 (FactScore-based harmonic mean of recall/precision)
Precision (FactScore)
Recall (FactScore)
Response Length
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on LongFact-Obj showing KLCF outperforms baselines in F1 (balancing precision and recall). Values are F1 scores using Qwen2.5-7B-Instruct backbone.
LongFact-Obj	F1	45.0	48.4	+3.4
LongFact-Obj	F1	46.1	48.4	+2.3
Results on Biographer benchmark using Llama-3.1-8B-Instruct backbone.
Biographer	F1	44.9	51.1	+6.2
Ablation study showing the contribution of each reward component (Checklist vs. Truthfulness).
LongFact-Obj	F1	47.7	48.4	+0.7
LongFact-Obj	F1	45.8	48.4	+2.6

Experiment Figures

Win-rate analysis of KLCF vs. RLHF and SFT on LongFact-Obj.

Main Takeaways

KLCF consistently improves F1 scores across multiple benchmarks (LongFact, Biographer) and base models (Qwen, Llama), showing robustness.
The 'Checklist Reward' is critical for improving Recall (coverage), preventing the model from becoming too conservative (a common issue in precision-only methods).
The framework effectively mitigates the 'hallucination tax' by aligning the model's output with its pre-existing internal knowledge rather than forcing it to hallucinate unknown facts.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Knowledge Distillation / Self-Knowledge concepts
Fact verification (fact checking) pipelines

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance
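The group normalization described for GRPO can be sketched as below. This covers only the advantage computation for one prompt's group of sampled outputs, not the full policy update. The use of the population standard deviation and the `eps` stabilizer are assumptions, since implementations vary on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

Outputs that score above their group's average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.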

Parametric Knowledge: Facts stored within the model's neural network weights during pre-training, as opposed to knowledge retrieved from external documents

Knowledge Boundary: The dividing line between what a model actually 'knows' (has stored in weights) and what it fabricates; KLCF aims to keep generation within this boundary

Checklist Reward: A reward signal calculated by comparing the generated text against a pre-computed list of facts the base model is known to possess

Truthfulness Reward: A reward signal from a trained classifier estimating the probability that atomic claims in the output are true
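A minimal sketch of the truthfulness reward as an average over claims, assuming a hypothetical scorer `p_true` that maps a claim to a probability in [0, 1]. Returning 0.0 for a response with no extracted claims is an assumption made here, not a documented choice.

```python
from typing import Callable, Sequence


def truthfulness_reward(claims: Sequence[str], p_true: Callable[[str], float]) -> float:
    """Average P(True) over the extracted claims.

    p_true is a stand-in for the trained truthfulness reward model.
    """
    if not claims:
        return 0.0  # assumed handling of responses with no claims
    return sum(p_true(claim) for claim in claims) / len(claims)
```

Because the score is an average, one low-probability claim pulls the reward down even when the remaining claims are well supported.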

Atomic Claim: A single, indivisible factual statement extracted from a longer sentence (e.g., 'Obama was born in Hawaii')

SFT: Supervised Fine-Tuning—training on labeled examples before RL

Recall: In this context, the percentage of pre-verified 'known' facts (from the checklist) that appear in the generated response

Precision: In this context, the percentage of generated claims that are factually correct

Hallucination Tax: The phenomenon where alignment techniques (like RLHF) degrade a model's factual accuracy by pressuring it to answer questions beyond its knowledge