KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

📝 Paper Summary

Reinforcement Learning for Factuality; Slow-thinking Models (Reasoning Models)

KnowRL mitigates hallucinations in slow-thinking models by integrating a factuality verification reward directly into the reinforcement learning process, encouraging models to verify intermediate reasoning steps against external knowledge.

Core Problem

Slow-thinking models trained with outcome-based RL often generate correct final answers via hallucinated or factually incorrect reasoning chains, reinforcing unreliable thinking patterns.

Why it matters:

Scaling reasoning models (like DeepSeek-R1-Distill) improves complex problem-solving but often exacerbates hallucinations on simple factual tasks due to a lack of process supervision
Outcome-based rewards ignore the factual validity of the intermediate 'thought' process, allowing models to memorize answers while fabricating the logic
Existing fixes like RAG or SFT are either costly to scale or disrupt the reasoning strategies learned during RL

Concrete Example: A model might answer a question correctly but justify it with a fabricated date or event in its chain-of-thought. Outcome-based RL rewards this 'lucky' success, reinforcing the habit of making up facts to reach a goal.

Key Novelty

Factuality-Supervised Group Relative Policy Optimization (GRPO)

Incorporates a fine-grained 'factuality reward' into the RL loop by decomposing reasoning traces into atomic facts and verifying them against a knowledge base
Rewards models not just for the correct answer, but for the proportion of verifiable facts in their reasoning chain, and rewards explicit refusals when knowledge is insufficient

Architecture

The KnowRL framework workflow: Fact-grounded data construction, RL training loop with factuality verification, and policy update.

Evaluation Highlights

Reduces Incorrect Rate on SimpleQA by 20.3 percentage points (from 78.00% to 57.67%) for DeepSeek-R1-Distill-Qwen-7B
Maintains strong reasoning ability on GPQA (improving from 37.37% to 42.42%) while significantly reducing hallucinations
Achieves consistent gains on ChineseSimpleQA (Incorrect Rate down 10.0 percentage points, from 68.33% to 58.33%), demonstrating transfer of boundary-aware behaviors beyond the English training knowledge base

Breakthrough Assessment

8/10

Effective method for fixing the specific 'reasoning-hallucination trade-off' in slow-thinking models. Directly addressing the RL reward signal for intermediate steps is a high-value contribution.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from verifiable feedback for open-domain QA and reasoning tasks

Inputs: Natural language prompt x

Outputs: Reasoning trace o_think and final answer o_answer

Pipeline Flow

Policy Rollout (Model generates group of CoT + Answer)
Atomic Fact Decomposition (Split CoT into facts)
Fact Verification (Check facts against Knowledge Base)
Reward Calculation (Combine factuality, correctness, format scores)
Policy Update (GRPO step)
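
The decomposition and verification steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `split_into_atomic_facts` (a naive sentence splitter), `toy_verifier` (a set lookup), and the `KNOWLEDGE` set are hypothetical stand-ins for the fact extractor, the NLI-based verifier, and the Wikipedia-derived knowledge base. Returning 0.0 for a trace with no facts is also an assumption.

```python
from typing import Callable, List, Set


def split_into_atomic_facts(reasoning: str) -> List[str]:
    """Hypothetical stand-in for the extractor: one sentence = one atomic fact."""
    return [s.strip() for s in reasoning.split(".") if s.strip()]


def factuality_reward(facts: List[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of facts the verifier marks as supported.

    Returning 0.0 for an empty fact list is an assumption of this sketch.
    """
    if not facts:
        return 0.0
    return sum(1 for fact in facts if is_supported(fact)) / len(facts)


# Toy knowledge base (hypothetical); a real system would retrieve passages.
KNOWLEDGE: Set[str] = {
    "Paris is the capital of France",
    "The Seine flows through Paris",
}


def toy_verifier(fact: str) -> bool:
    """Hypothetical verifier: exact lookup instead of an entailment model."""
    return fact in KNOWLEDGE


trace = (
    "Paris is the capital of France. "
    "The Seine flows through Paris. "
    "Paris was founded in 1900."
)
facts = split_into_atomic_facts(trace)
reward = factuality_reward(facts, toy_verifier)
print(facts, reward)
```

In this toy trace two of the three extracted statements are in the knowledge base, so the fabricated founding date lowers the reward even if the final answer were correct.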

System Modules

Policy Model

Generate reasoning traces and answers

Model or implementation: DeepSeek-R1-Distill-Qwen-7B or Skywork-OR1-7B-Preview

Atomic Fact Extractor (Verification)

Decompose reasoning trace o_think into atomic facts

Model or implementation: Not explicitly specified (implied rule-based or small model)

Verifier (Verification)

Verify if atomic facts are supported by retrieved knowledge

Model or implementation: DeBERTa-v3-base-mnli-fever-anli

Answer Evaluator

Judge correctness of final answer

Model or implementation: GPT-4o-mini

Novel Architectural Elements

Integration of fine-grained atomic fact verification directly into the GRPO reward loop (process supervision)
Composite reward structure combining intermediate factuality (process) with final answer correctness and format constraints
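
One way to picture the composite reward is a weighted sum of the three signals. The sketch below is an assumption throughout: the summary does not state how the components are combined, so the weighted-sum form, the `RewardWeights` defaults, and the `OUTCOME_SCORE` values (including the positive score for a refusal) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical outcome scores. The source only says refusals are rewarded
# rather than penalized; these exact numbers are NOT from the paper.
OUTCOME_SCORE = {"correct": 1.0, "refusal": 0.5, "incorrect": -1.0}


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for the three reward components."""
    factuality: float = 1.0
    correctness: float = 1.0
    fmt: float = 1.0


def composite_reward(
    r_fact: float,
    outcome: str,
    format_ok: bool,
    weights: Optional[RewardWeights] = None,
) -> float:
    """Weighted sum of factuality, correctness, and format scores (assumed form)."""
    w = weights or RewardWeights()
    r_correct = OUTCOME_SCORE[outcome]
    r_format = 1.0 if format_ok else 0.0
    return w.factuality * r_fact + w.correctness * r_correct + w.fmt * r_format


# A correct answer reached through mostly unsupported reasoning ...
lucky = composite_reward(r_fact=0.25, outcome="correct", format_ok=True)
# ... scores lower than the same answer with fully supported reasoning.
grounded = composite_reward(r_fact=1.0, outcome="correct", format_ok=True)
# With these assumed scores, a refusal beats a confident wrong answer.
honest_refusal = composite_reward(r_fact=1.0, outcome="refusal", format_ok=True)
confident_error = composite_reward(r_fact=1.0, outcome="incorrect", format_ok=True)
print(lucky, grounded, honest_refusal, confident_error)
```

The point of the design is visible in the comparison: under an outcome-only reward, `lucky` and `grounded` would receive the same score.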

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-7B and Skywork-OR1-7B-Preview

Training Method: Factuality-Supervised Group Relative Policy Optimization (GRPO)

Objective Functions:

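Purpose: Maximize the probability of trajectories with high relative advantages (better factuality/correctness).
Formally: J_GRPO(theta) = E[ min( rho * A_g, clip(rho, 1-eps, 1+eps) * A_g ) ]

Purpose: Prevent the model from deviating too far from the reference policy.
Formally: KL divergence penalty D_KL( pi_theta || pi_ref )

Purpose: Factuality reward based on supported facts.
Formally: r_fact(o) = (1/M) * sum_{j=1..M} v(f_j, K_x)

Here rho is the probability ratio between the current and old policy, A_g is the group-normalized advantage of sample g, M is the number of atomic facts in the reasoning trace, and v(f_j, K_x) is the verifier's judgment of fact f_j against the knowledge K_x retrieved for prompt x.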
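
The group-relative advantage and the clipped term in J_GRPO can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the advantage is taken as the reward standardized within the group (population standard deviation, with a small `eps` for stability), which is the usual GRPO formulation but is not spelled out in this summary; function names are illustrative.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.1) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for one prompt: two high-reward, two low-reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A large ratio on a positive advantage is capped at (1 + eps) * A.
capped_gain = clipped_term(ratio=1.5, advantage=1.0)
# A small ratio on a negative advantage is held at (1 - eps) * A.
capped_loss = clipped_term(ratio=0.5, advantage=-1.0)
print(advantages, capped_gain, capped_loss)
```

Because advantages are computed relative to the group, no separate value network is needed; a rollout is pushed up only if it scores better than its siblings for the same prompt.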

Adaptation: LoRA
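
As a reminder of what LoRA does, the sketch below applies a frozen weight matrix plus a scaled low-rank update, h = W x + (alpha / r) * B (A x). The summary does not give the rank, alpha, or target modules used in KnowRL, so all shapes and values here are toy assumptions, and the helpers are written in plain Python for self-containment.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Vector:
    """Frozen base projection plus scaled low-rank update.

    W: d_out x d_in (frozen), A: r x d_in (trainable), B: d_out x r (trainable).
    """
    r = len(A)
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weights
A = [[1.0, 1.0, 1.0]]                                     # rank r = 1
x = [1.0, 2.0, 3.0]

# With B initialized to zero, the adapted layer equals the base layer.
B_zero = [[0.0], [0.0], [0.0]]
out_initial = lora_forward(x, W, A, B_zero, alpha=2.0)

# After training moves B away from zero, the output shifts.
B_trained = [[1.0], [0.0], [0.0]]
out_trained = lora_forward(x, W, A, B_trained, alpha=2.0)

trainable = len(A) * len(A[0]) + len(B_zero) * len(B_zero[0])  # 6 vs 9 in W
print(out_initial, out_trained, trainable)
```

Only A and B would receive gradients; W stays fixed, which is what keeps the number of trained parameters small.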

Training Data:

Factual questions from NqOpen, WebQuestions, ComplexQuestions
Filtered for questions the model answers correctly on first attempt
Knowledge base constructed from Wikipedia entries linked to entities in questions
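
The filtering step can be sketched as a single pass over the candidate questions. This is illustrative only: `answer_fn` is a hypothetical callable standing in for one model attempt, and the normalized exact-match check is an assumption (the summary does not say how correctness was judged during data construction).

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def filter_first_attempt_correct(
    examples: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only examples whose single model attempt matches the gold answer."""
    kept = []
    for example in examples:
        prediction = answer_fn(example["question"])  # one attempt, no retries
        if normalize(prediction) == normalize(example["answer"]):
            kept.append(example)
    return kept


examples = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Australia?", "answer": "Canberra"},
]
# Hypothetical model outputs used in place of a real policy model.
toy_answers = {"Capital of France?": "paris", "Capital of Australia?": "Sydney"}
kept = filter_first_attempt_correct(examples, lambda q: toy_answers[q])
print(kept)
```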

Key Hyperparameters:

learning_rate: 1e-5 (DeepSeek) / 5e-6 (Skywork)
train_batch_size: 8
num_generations_per_prompt (G): 8
clip_epsilon: 0.1
kl_beta: 0.01
max_prompt_length: 512
max_completion_length: 1024
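
For reference, the reported hyperparameters can be collected in one place. The `KnowRLConfig` class name and field layout are hypothetical and do not correspond to the repository's actual configuration format; only the values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowRLConfig:
    """Hypothetical container for the hyperparameters reported in the summary."""
    learning_rate: float
    train_batch_size: int = 8
    num_generations_per_prompt: int = 8  # group size G
    clip_epsilon: float = 0.1
    kl_beta: float = 0.01
    max_prompt_length: int = 512
    max_completion_length: int = 1024

    @property
    def max_sequence_length(self) -> int:
        """Upper bound on prompt plus completion tokens per rollout."""
        return self.max_prompt_length + self.max_completion_length


deepseek_cfg = KnowRLConfig(learning_rate=1e-5)  # DeepSeek-R1-Distill-Qwen-7B
skywork_cfg = KnowRLConfig(learning_rate=5e-6)   # Skywork-OR1-7B-Preview
print(deepseek_cfg.max_sequence_length)
```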

Compute: 1x NVIDIA A800 GPU

Comparison to Prior Work

vs. DeepSeek-R1 (Outcome RL): KnowRL adds intermediate process supervision (factuality reward) rather than relying solely on final answer correctness
vs. FactTune-FS: KnowRL optimizes via online RL exploration rather than supervised fine-tuning on static filtered data
vs. RAG: KnowRL internalizes factuality boundaries into the model's policy rather than relying on retrieval at inference time
vs. Process Reward Models (PRMs) [not cited in paper]: KnowRL uses automated external knowledge verification for rewards instead of training a separate PRM on human annotations

Limitations

Dependency on retrieval and API calls (GPT-4o-mini) during training creates efficiency bottlenecks
Knowledge base coverage is limited to English Wikipedia, though some transfer to Chinese is observed
Does not explore performance ceiling of RL or scenarios where RL might be suboptimal compared to other methods

Reproducibility

Code: https://github.com/zjunlp/KnowRL

The code is publicly available at https://github.com/zjunlp/KnowRL. Training data are derived from public datasets (NqOpen, WebQuestions, ComplexQuestions). The external knowledge base is an English Wikipedia dump (20231101). Reward evaluation uses the commercial model GPT-4o-mini.

📊 Experiments & Results

Evaluation Setup

Evaluated on hallucination benchmarks (SimpleQA, TruthfulQA) and reasoning benchmarks (GPQA, AIME 2025)

Benchmarks:

SimpleQA (Factual Question Answering)
ChineseSimpleQA (Factual Question Answering (Chinese))
TruthfulQA (Truthfulness/Hallucination)
GPQA (Graduate-Level Science Reasoning)
AIME 2025 (Mathematical Reasoning)

Metrics:

Incorrect Rate
Refusal Rate
F1 score
PAQ (Precision on Answered Questions)
ROUGE/BLEU (for TruthfulQA)

Statistical methodology: Not explicitly reported in the paper

Key Results

All values are percentages; Δ is in percentage points.

Main results on SimpleQA showing reduction in hallucinations (Incorrect Rate) and improvement in precision (PAQ).
Benchmark	Metric	Baseline	This Paper	Δ
SimpleQA	Incorrect Rate	78.00	57.67	-20.33
SimpleQA	PAQ (Precision on Answered Questions)	24.62	44.64	+20.02
ChineseSimpleQA	Incorrect Rate	68.33	58.33	-10.00

Reasoning benchmarks showing that KnowRL preserves or improves complex reasoning capabilities.
Benchmark	Metric	Baseline	This Paper	Δ
GPQA	Accuracy	37.37	42.42	+5.05
AIME 2025	Accuracy	33.33	33.33	0.00

Ablation study on reward components. Here the Baseline column is the full KnowRL result (57.67) and the second value is an ablated reward configuration.
Benchmark	Metric	Baseline	This Paper	Δ
SimpleQA	Incorrect Rate	57.67	71.67	+14.00
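
The Δ column can be re-derived from the baseline and KnowRL values. The short check below recomputes the percentage-point differences for the main and reasoning rows, and also shows the relative change, which is not reported in the summary and is included only for orientation; the helper names are illustrative.

```python
from typing import List, Tuple

# (benchmark, metric, baseline, knowrl) copied from the main and reasoning rows.
ROWS: List[Tuple[str, str, float, float]] = [
    ("SimpleQA", "Incorrect Rate", 78.00, 57.67),
    ("SimpleQA", "PAQ", 24.62, 44.64),
    ("ChineseSimpleQA", "Incorrect Rate", 68.33, 58.33),
    ("GPQA", "Accuracy", 37.37, 42.42),
    ("AIME 2025", "Accuracy", 33.33, 33.33),
]


def delta_points(baseline: float, new: float) -> float:
    """Absolute change in percentage points."""
    return round(new - baseline, 2)


def relative_change(baseline: float, new: float) -> float:
    """Change relative to the baseline, in percent."""
    return round(100.0 * (new - baseline) / baseline, 1)


deltas = [delta_points(b, n) for _, _, b, n in ROWS]
for (bench, metric, b, n), d in zip(ROWS, deltas):
    print(f"{bench:16s} {metric:15s} {d:+.2f} pp ({relative_change(b, n):+.1f}% relative)")
```

Keeping the two measures apart matters when reading the headline numbers: a 20.33-point drop in SimpleQA Incorrect Rate is a relative reduction of roughly a quarter.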

Experiment Figures

Scatter plot of GPQA (Reasoning) vs SimpleQA (Factuality) performance for various DeepSeek models.

Training dynamics: Atomic fact accuracy, Correctness, Refusal Rate, and Incorrect Rate over training steps.

Main Takeaways

KnowRL effectively mitigates hallucinations in slow-thinking models by supervising the reasoning process, not just the outcome.
The method improves factual precision (PAQ) and boundary awareness (Refusal Rate) without degrading complex reasoning performance on math/science tasks.
Improvements generalize to languages outside the training knowledge base (ChineseSimpleQA), suggesting the model learns generalizable verification behaviors.
Rewarding appropriate refusals (saying 'I don't know') is a key component; penalizing refusals leads to higher hallucination rates.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (policy, reward, advantage)
Chain-of-Thought (CoT) reasoning
Knowledge Graphs / Fact Verification

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt to stabilize training

Slow-thinking models: LLMs trained to generate extended 'thought' processes (reasoning traces) before producing a final answer, similar to OpenAI o1 or DeepSeek-R1

Atomic facts: Small, indivisible statements extracted from a longer text that can be independently verified as true or false

FactScore: A metric that evaluates the factuality of long-form text by breaking it into atomic facts and checking what percentage are supported by a knowledge source

DeBERTa: Decoding-enhanced BERT with disentangled attention—a transformer model often used for natural language understanding tasks like entailment and verification

KL divergence: Kullback-Leibler divergence, an asymmetric measure of how much one probability distribution differs from another; used as a penalty in RL to keep the new policy from drifting too far from the reference policy

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to them

GPQA: Graduate-Level Google-Proof Q&A, a challenging multiple-choice benchmark of expert-written science questions (biology, physics, chemistry) that requires graduate-level reasoning

SimpleQA: A benchmark designed to measure the factual correctness of LLMs on short, factual questions

Entropy bonus: A term added to the loss function to encourage the model to maintain diversity in its outputs and prevent collapsing to a single repetitive response
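
To make the KL divergence entry above concrete, the sketch below computes D_KL(p || q) for two small discrete distributions and applies it as a penalty with the reported kl_beta of 0.01. The distributions and the reward value are toy numbers, and the exact discrete formula is used for illustration; RL implementations typically estimate this quantity from sampled token log-probabilities instead.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """D_KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


policy = [0.5, 0.5]      # toy current-policy distribution over two tokens
reference = [0.9, 0.1]   # toy reference-policy distribution

forward_kl = kl_divergence(policy, reference)
reverse_kl = kl_divergence(reference, policy)

kl_beta = 0.01           # value reported in the hyperparameter list
toy_reward = 1.0         # hypothetical reward for one rollout
penalized = toy_reward - kl_beta * forward_kl
print(forward_kl, reverse_kl, penalized)
```

The two directions give different values, which is why KL divergence is not a true distance metric.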