TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

📝 Paper Summary

Hallucination suppression, Reinforcement Learning from Human Feedback (RLHF)

TruthRL optimizes LLM truthfulness using reinforcement learning with a ternary reward system that explicitly incentivizes abstention over hallucination while rewarding correctness.

Core Problem

Existing methods optimize for accuracy, which incentivizes guessing and hallucinations when the model lacks knowledge, or use conservative fine-tuning that causes excessive abstention.

Why it matters:

In high-stakes domains (law, medicine), confident hallucinations are far more dangerous than admitting ignorance ('I don't know')
Standard accuracy-driven objectives (SFT, binary RL) implicitly favor guessing: abstaining earns no more than a wrong answer, so any guess with a nonzero chance of being correct has a higher expected reward than abstaining
Current methods struggle to balance the trade-off between answering correctly and recognizing knowledge boundaries, often swinging to extremes of fabrication or silence

Concrete Example: On the CRAG benchmark, vanilla SFT models achieve high accuracy but rarely abstain (near 0% uncertainty rate), leading to high hallucination on difficult questions. Conversely, R-Tuning requires complex data annotation and can become overly conservative.

Key Novelty

Truthfulness-Driven Ternary Reward Optimization

Introduces a ternary reward structure (+1 for correct, 0 for abstention, -1 for hallucination) that makes abstaining mathematically preferable to guessing when confidence is low
Uses GRPO (Group Relative Policy Optimization) to apply this reward, allowing the model to learn the relative advantage of admitting ignorance over risking a hallucination
Optimizes a composite 'Truthfulness' metric rather than raw accuracy, explicitly balancing correctness, uncertainty (abstention), and hallucination reduction
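
The incentive argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the paper: the function names, the probability p_correct (the chance that an attempted answer is right), and the binary scheme's specific values (-1 for both abstention and error) are assumptions made for illustration.

```python
# Ternary scheme as described in the summary.
TERNARY = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}
# Assumed binary scheme: anything other than a correct answer gets the same penalty.
BINARY = {"correct": 1.0, "abstain": -1.0, "hallucination": -1.0}


def expected_guess_reward(p_correct, scheme):
    """Expected reward of attempting an answer that is right with probability p_correct."""
    return p_correct * scheme["correct"] + (1.0 - p_correct) * scheme["hallucination"]


def prefers_abstention(p_correct, scheme):
    """True when abstaining yields a strictly higher reward than guessing in expectation."""
    return scheme["abstain"] > expected_guess_reward(p_correct, scheme)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.9):
        print(p, prefers_abstention(p, TERNARY), prefers_abstention(p, BINARY))
```

Under the ternary scheme the expected reward of guessing is 2p - 1, so abstention wins whenever p is below 0.5. Under the assumed binary scheme abstention never wins, which is the guessing incentive the paper sets out to remove.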

Architecture

Comparison of Accuracy-Driven vs. Truthfulness-Driven objectives. It illustrates how binary rewards (Accuracy) push models to guess, whereas ternary rewards (Truthfulness) create a safe 'Abstain' zone.

Evaluation Highlights

With retrieval, reduces the average hallucination rate by 28.9 points (47.7 to 18.8) and improves the average truthfulness score by 21.1 points (4.5 to 25.6) compared to vanilla RL on four knowledge-intensive benchmarks
On challenging CRAG subsets where baselines hallucinate nearly 100% of the time, TruthRL abstains 84.5% of the time with only 15.5% hallucinations
Outperforms both vanilla SFT and knowledge-enhanced baselines (RFT, R-Tuning) across Llama-3.1-8B and Qwen-2.5-7B models in both retrieval and non-retrieval settings

Breakthrough Assessment

8/10

Provides a principled, simple solution (ternary rewards) to a critical safety problem (hallucination) that outperforms complex alternatives. The shift from accuracy-driven to truthfulness-driven RL is a significant conceptual improvement.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering where the model must choose to answer or abstain based on internal knowledge confidence

Inputs: Natural language question x (potentially with retrieved context)

Outputs: Response y (either a direct answer or an abstention like 'I don't know')

Pipeline Flow

Policy Model (generates G responses per prompt)
Reward Function (evaluates responses as Correct, Abstain, or Hallucination)
GRPO Update (optimizes policy based on relative advantages of the G responses)

System Modules

Policy Model

Generate G candidate responses for input question x

Model or implementation: Llama-3.1-8B-Instruct or Qwen-2.5-7B-Instruct

Reward Function

Assign scalar rewards to each generated response based on ground truth correctness

Model or implementation: Deterministic rule based on correctness check

GRPO Optimizer

Update model weights using the group relative advantage

Model or implementation: Gradient Descent
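
The reward module can be sketched as a small deterministic function that labels each of the G responses and maps the label to a scalar. The summary only states that the rule is based on a correctness check, so everything specific below is an assumption: the abstention phrases, the normalized substring match used as the correctness check, and the names classify and score_group are hypothetical stand-ins.

```python
import re
import string

# Hypothetical abstention phrases; a real system would likely need a broader check.
ABSTAIN_PATTERNS = ("i don't know", "i do not know")

REWARD = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def classify(response, gold_answers):
    """Label a response as 'abstain', 'correct', or 'hallucination'."""
    norm = normalize(response)
    if any(normalize(p) in norm for p in ABSTAIN_PATTERNS):
        return "abstain"
    # Assumed correctness check: any gold answer appears in the response.
    if any(normalize(g) in norm for g in gold_answers):
        return "correct"
    return "hallucination"


def score_group(responses, gold_answers):
    """Ternary rewards for one group of sampled responses to the same question."""
    return [REWARD[classify(r, gold_answers)] for r in responses]


if __name__ == "__main__":
    group = ["The capital is Paris.", "I don't know.", "It is Lyon."]
    print(score_group(group, ["Paris"]))
```

The resulting list of rewards is what the GRPO step consumes; the policy sampling and the gradient update themselves are outside the scope of this sketch.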

Modeling

Base Model: Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected truthfulness score.

Formally: Maximize sum of importance-weighted advantages, subject to KL divergence constraints.

Purpose: Define the advantage for ternary rewards.

Formally: Advantage A_i = (r_i - mean(r_group)) / std(r_group), where r_i ∈ {1, 0, -1}.
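
The advantage formula above is easy to compute directly. The sketch below assumes the population standard deviation and adds a small eps to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero; both choices are assumptions, as is the binary reward vector used for comparison.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r_group)) / (std(r_group) + eps) for one group of rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Same four outcomes: correct, abstain, abstain, hallucination.
    ternary = group_advantages([1.0, 0.0, 0.0, -1.0])
    binary = group_advantages([1.0, -1.0, -1.0, -1.0])  # assumed binary scheme
    print("ternary:", ternary)
    print("binary: ", binary)
```

With ternary rewards the abstaining responses receive a higher advantage than the hallucinated one, so the update pushes probability mass away from hallucination and toward abstention. With the assumed binary rewards the two are indistinguishable.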

Training Data:

Trained on CRAG dataset
Evaluated on CRAG, NQ, HotpotQA, MuSiQue

Key Hyperparameters:

group_size_G: Not explicitly reported in the paper
beta: Not explicitly reported in the paper
epsilon: Not explicitly reported in the paper
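
For orientation, the three hyperparameters above enter the commonly used clipped GRPO objective roughly as follows. This is the general form of GRPO written at the sequence level for brevity (implementations typically apply the ratio per token); it is a sketch under that assumption, and the exact variant used in the paper is not given in this summary. Here G is the group size, epsilon the clipping range, beta the KL coefficient, pi_ref a frozen reference policy, and A_i the group-relative advantage defined under Objective Functions.

```latex
J(\theta) =
\mathbb{E}_{x,\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}
\left[
  \frac{1}{G} \sum_{i=1}^{G}
  \min\!\Big( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \Big)
  \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right)
\right],
\qquad
\rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}
```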

Compute: Not reported in the paper

Comparison to Prior Work

vs. Vanilla RL: TruthRL uses ternary rewards (0 for abstention) rather than binary rewards that penalize incorrect answers and abstentions alike, giving the model a 'safe' option to learn.
vs. R-Tuning: TruthRL optimizes via RL exploration rather than static SFT targets, preventing over-conservatism.
vs. RLHF (PPO): Uses GRPO for simpler optimization without a separate critic model (PPO is not evaluated as a direct baseline in the paper; the difference is architectural).

Limitations

Depends on accurate correctness evaluators (ground truth checking) during training.
Ternary reward weights (+1, 0, -1) are heuristic; sensitivity to these values is not fully explored.
Requires ground truth labels, making it less applicable to purely open-ended generation without reference answers.

Reproducibility

Prompt templates, retrieval setup, and reward definitions are described in text. Exact hyperparameters (learning rate, batch size, GRPO-specific beta and epsilon) are not explicitly detailed in the main text. Code URL is not provided.

📊 Experiments & Results

Evaluation Setup

Knowledge-intensive QA with and without retrieval

Benchmarks:

CRAG (RAG benchmark with various question types)
Natural Questions (NQ) (Open-domain QA)
HotpotQA (Multi-hop QA)
MuSiQue (Multi-hop QA)

Metrics:

Truthfulness Score (w1=1, w2=0, w3=1, i.e., Accuracy minus Hallucination Rate)
Hallucination Rate (Hall)
Accuracy (Acc)
Uncertainty Rate (Unc)
Statistical methodology: Not explicitly reported in the paper
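
The four metrics can be computed from per-question outcome labels as in the sketch below. The weights follow the values listed above (w1=1, w2=0, w3=1); the function name, the label strings, and the example counts are hypothetical and not taken from the paper.

```python
from collections import Counter


def truthfulness_report(outcomes, w1=1.0, w2=0.0, w3=1.0):
    """Return Acc, Unc, Hall (as percentages) and the weighted truthfulness score."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    n = len(outcomes)
    counts = Counter(outcomes)
    acc = 100.0 * counts["correct"] / n
    unc = 100.0 * counts["abstain"] / n
    hall = 100.0 * counts["hallucination"] / n
    return {
        "Acc": acc,
        "Unc": unc,
        "Hall": hall,
        "Truthfulness": w1 * acc + w2 * unc - w3 * hall,
    }


if __name__ == "__main__":
    # Made-up example: 4 correct, 4 abstentions, 2 hallucinations.
    labels = ["correct"] * 4 + ["abstain"] * 4 + ["hallucination"] * 2
    print(truthfulness_report(labels))
```

Because w2 is 0, abstentions neither raise nor lower the score directly; they help only by replacing answers that would otherwise count as hallucinations.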

Key Results

Without retrieval, TruthRL significantly reduces hallucinations and improves truthfulness compared to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Average (4 datasets)	Truthfulness Score	-50.3	10.5	+60.8
Average (4 datasets)	Hallucination Rate	75.2	20.5	-54.7

With retrieval, TruthRL maintains high accuracy while minimizing hallucinations, unlike binary RL which trades safety for slight accuracy gains.

Benchmark	Metric	Baseline	This Paper	Δ
Average (4 datasets)	Truthfulness Score	4.5	25.6	+21.1
Average (4 datasets)	Hallucination Rate	47.7	18.8	-28.9

Experiment Figures

Performance breakdown on CRAG (Retrieval setup) for full set vs. difficult subset.

Main Takeaways

TruthRL consistently outperforms vanilla SFT and RL baselines across all benchmarks, especially in truthfulness and hallucination reduction.
Simple ternary rewards generally outperform complex knowledge-enhanced or reasoning-enhanced reward designs.
The method is robust to 'hallucination-baiting' questions (e.g., comparison questions), where baselines frequently fail.
Unlike baselines that are either overly confident (SFT) or overly conservative (R-Tuning), TruthRL learns an effective boundary, abstaining primarily when it actually lacks knowledge.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO basics)
Large Language Models (SFT vs. RLHF)
Hallucination and Factuality metrics

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of sampled outputs for the same input, eliminating the need for a separate value function network

Ternary Reward: A three-valued reward signal (+1 correct, 0 abstain, -1 incorrect) designed to make abstention a safe middle ground between success and failure

SFT: Supervised Fine-Tuning—training a model to mimic ground-truth answers

R-Tuning: A baseline method that fine-tunes models on datasets where unanswerable questions are explicitly labeled with 'I don't know'

Truthfulness Score: A composite metric defined as w1*Accuracy + w2*Uncertainty - w3*Hallucination

RAG: Retrieval-Augmented Generation—providing external documents to the model to aid in answering questions

Hallucination: Plausible but factually incorrect statements generated by the model

Abstention: The model explicitly refusing to answer (e.g., 'I don't know') when uncertain

OOK: Out-of-Knowledge—questions where the model's internal parametric knowledge is insufficient to answer correctly