Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models

📝 Paper Summary

Hallucination suppression, Confidence calibration

Rewarding Doubt fine-tunes LLMs to express calibrated numerical confidence scores alongside their answers by using a reinforcement learning reward function based on the logarithmic scoring rule.

Core Problem

LLMs often generate hallucinations with high confidence, and existing methods either decouple confidence estimation from generation (external probes) or lack inherent calibration awareness (zero-shot prompting).

Why it matters:

In high-stakes fields like medical diagnosis or legal consultation, overconfident hallucinations can lead to dangerous misinformed decisions
Reliable confidence expression is essential for human-AI collaboration, allowing systems to flag uncertain outputs for human review
Current methods that rely on external probes or post-hoc analysis fail to give the model itself an intrinsic awareness of its own uncertainty

Concrete Example: A medical LLM might incorrectly diagnose a rare condition with '100%' confidence, misleading a doctor. A calibrated model using this method would recognize its internal uncertainty and output a low confidence score (e.g., '20%'), signaling the need for verification.

Key Novelty

Rewarding Doubt

Treats confidence estimation as a betting game where the model is rewarded for high confidence on correct answers and penalized for high confidence on incorrect ones
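Directly optimizes a strictly proper scoring rule (the logarithmic score) via PPO (Proximal Policy Optimization); because the rule is strictly proper, expected reward is maximized only when the stated confidence equals the true probability that the answer is correct, so the reward-optimal policy is perfectly calibrated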
Integrates confidence calibration seamlessly into the generation process, unlike prior works that use external classifiers or separate preference models
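
The calibration guarantee in the second point can be checked with a short derivation. This is a sketch, not the paper's proof: p (an assumed symbol) denotes the true probability that the answer is correct, and p_hat is the stated confidence.

```latex
\begin{align*}
\mathbb{E}[R \mid \hat{p}] &= p \log \hat{p} + (1 - p)\log(1 - \hat{p}), \\
\frac{\partial}{\partial \hat{p}}\,\mathbb{E}[R \mid \hat{p}] &= \frac{p}{\hat{p}} - \frac{1 - p}{1 - \hat{p}} = 0
  \iff \hat{p} = p, \\
\frac{\partial^2}{\partial \hat{p}^2}\,\mathbb{E}[R \mid \hat{p}] &= -\frac{p}{\hat{p}^2} - \frac{1 - p}{(1 - \hat{p})^2} < 0
  \quad \text{for } 0 < \hat{p} < 1 .
\end{align*}
```

The expected reward is strictly concave in p_hat, so p_hat = p is its unique maximizer: any over- or under-statement of confidence lowers the expected reward.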

Architecture

The reinforcement learning setup treats the LLM as an agent acting in a QA environment.

Evaluation Highlights

Reduces Expected Calibration Error (ECE) to 0.05 on TriviaQA, roughly half the ECE of the best baseline (Trained Probe)
Achieves 0.86 AUROC on TriviaQA, surpassing both zero-shot methods and DPO-based approaches like LACIE
Demonstrates strong generalization to unseen medical (MedQA) and commonsense (CommonsenseQA) tasks without further fine-tuning, maintaining low ECE

Breakthrough Assessment

8/10

Provides a theoretically grounded RL approach to calibration that outperforms existing probe-based and prompting methods. The generalization to unseen domains suggests it instills genuine uncertainty awareness.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering where the model produces an answer 'a' and a confidence score 'p_hat'

Inputs: Natural language question q

Outputs: Answer-confidence pair (a, p_hat)

Pipeline Flow

Answer Generation: Input Question → LLM → Answer (fixed)
Confidence Generation: Question + Answer → LLM → Confidence Score Token
Reward Calculation: Logarithmic Score based on Answer Correctness and Confidence
PPO Update: Optimize Confidence Generation Policy
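
The four steps above can be sketched as a single rollout function. This is a minimal illustration, not the authors' implementation: the callables `generate_answer`, `generate_confidence`, `is_correct`, and `reward_fn` are hypothetical stand-ins for the LLM, the correctness judge, and the reward, and the integer 0 to 10 confidence format read by `parse_confidence` is an assumption.

```python
import math
import re
from typing import Callable, Tuple


def parse_confidence(text: str, scale: int = 10) -> float:
    """Read the first integer in the model output and map it to [0, 1].

    The 0..scale integer format is an assumption made for this sketch.
    """
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no confidence value found in {text!r}")
    return min(int(match.group()), scale) / scale


def rollout(
    question: str,
    generate_answer: Callable[[str], str],
    generate_confidence: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
    reward_fn: Callable[[bool, float], float],
) -> Tuple[str, float, float]:
    # 1. Generate the answer; it is held fixed from here on.
    answer = generate_answer(question)
    # 2. Generate the confidence conditioned on question + answer.
    p_hat = parse_confidence(generate_confidence(question, answer))
    # 3. Score the confidence against the judged correctness.
    reward = reward_fn(is_correct(question, answer), p_hat)
    # 4. The reward would then drive the PPO update of the confidence policy.
    return answer, p_hat, reward


# Toy usage with stand-in callables instead of a real LLM.
answer, p_hat, reward = rollout(
    "What is the capital of France?",
    generate_answer=lambda q: "Paris",
    generate_confidence=lambda q, a: "Confidence: 9",
    is_correct=lambda q, a: a == "Paris",
    reward_fn=lambda correct, p: math.log(p if correct else 1.0 - p),
)
```

Passing the reward as a callable keeps the flow separate from the scoring rule; a real reward function would also need to guard against log(0) at confidences of exactly 0 or 1.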

System Modules

Answer Generator

Generates the factual answer to the question

Model or implementation: Meta-Llama-3-8B-Instruct (4-bit quantized)

Confidence Estimator

Predicts a numerical confidence score for the generated answer

Model or implementation: Meta-Llama-3-8B-Instruct (LoRA adapters)

Reward Function

Calculates reward based on answer correctness and confidence value

Model or implementation: Mathematical function (Logarithmic Rule)

Novel Architectural Elements

Integration of a proper scoring rule (logarithmic score) directly as an RL reward signal for token-level confidence generation
Two-step generation process during training where answer generation is frozen/separated to focus purely on calibrating the confidence policy

Modeling

Base Model: Meta-Llama-3-8B-Instruct

Training Method: PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward based on calibration accuracy.

Formally: R(a, p_hat, j) = j(a) * log(p_hat) + (1 - j(a)) * log(1 - p_hat), where j(a) is 1 if the answer a is judged correct and 0 otherwise
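
A minimal sketch of this reward in code, assuming the confidence is clipped to [eps, 1 - eps] so the logarithm stays finite. The function name `log_score_reward` and the value `eps = 1e-3` are illustrative assumptions; the summary only says a small positive constant is implied.

```python
import math


def log_score_reward(correct: bool, p_hat: float, eps: float = 1e-3) -> float:
    """Logarithmic scoring rule: log(p_hat) if correct, log(1 - p_hat) otherwise."""
    # Clip so that a confidence of exactly 0 or 1 does not produce log(0).
    p = min(max(p_hat, eps), 1.0 - eps)
    return math.log(p) if correct else math.log(1.0 - p)


# The penalty for a wrong answer grows sharply as stated confidence rises.
for p_hat in (0.2, 0.5, 0.9, 1.0):
    print(p_hat, round(log_score_reward(False, p_hat), 3))
```

The reward is always non-positive: the best achievable value is close to 0 (near-certain and right), and the worst is log(eps) (near-certain and wrong).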

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters only

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 8
epochs: 2 (Single-Answer)
reward_clipping_epsilon: Small positive constant (implied for log stability)
training_steps: 24,000 (Multiple-Answer)
reward_multiplier: 5 (Multiple-Answer setting)

Compute: One Nvidia A40 GPU; training takes approx. 7 days

Comparison to Prior Work

vs. LACIE: Optimizes a strictly proper scoring rule for factual calibration rather than a listener-speaker preference model
vs. Trained Probe: Teaches the model to verbalize confidence intrinsically rather than relying on an external post-hoc classifier
vs. RLKF: RLKF focuses on refusals for unknown questions, whereas Rewarding Doubt expresses graded, numerical confidence
vs. PPO-M/PPO-C (Leng et al.): Directly uses analytical scoring rules as rewards instead of training a separate reward model to predict alignment

Limitations

Computational cost is high compared to zero-shot methods (7 days on A40)
Depends on the quality of the correctness judging function (F1 score or exact match)
Focuses only on calibration of confidence, not improving the accuracy of the answers themselves
Requires ground truth labels for reward calculation, limiting application to supervised settings during training
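
The correctness judge named in the second limitation could look roughly like the following. This is a sketch using the common SQuAD-style answer normalization, which is an assumption here; the paper's exact normalization and any F1 threshold used to turn the score into a binary judgment are not given in this summary.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the reward depends entirely on this judgment, a verbose but correct answer that fails exact match would be scored as wrong, which is one way judge quality can distort the calibration signal.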

Reproducibility

Code: https://github.com/pasta99/RewardingDoubt

The paper specifies the base model (Llama-3-8B-Instruct), libraries (Unsloth AI), and hyperparameters. The datasets used (TriviaQA, CommonsenseQA, MedQA, QAMPARI) are public.

📊 Experiments & Results

Evaluation Setup

Question Answering with confidence estimation

Benchmarks:

TriviaQA: open-domain QA (single-answer)
QAMPARI: multiple-answer QA
CommonsenseQA: commonsense reasoning (generalization)
MedQA: medical QA (generalization)

Metrics:

Expected Calibration Error (ECE)
Area Under the Receiver Operating Characteristic Curve (AUROC)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on TriviaQA showing superior calibration compared to zero-shot and trained baselines.
TriviaQA	ECE	0.10	0.05	-0.05
TriviaQA	AUROC	0.85	0.86	+0.01
Generalization experiments on unseen datasets (MedQA, CommonsenseQA) without further fine-tuning.
MedQA	ECE	0.13	0.04	-0.09
CommonsenseQA	ECE	0.15	0.06	-0.09
Performance on Multiple-Answer task (QAMPARI).
QAMPARI	ECE	0.09	0.02	-0.07

Experiment Figures

Plots of the normalized reward function for correct (blue) and incorrect (orange) answers across confidence levels.

Calibration curves comparing Rewarding Doubt against baselines (Verbalize, Trained Probe) on TriviaQA.

Main Takeaways

Rewarding Doubt consistently achieves the lowest ECE across all tested datasets, indicating superior calibration.
The method generalizes well to unseen tasks (MedQA, CommonsenseQA) without additional training, suggesting the model learns a general concept of uncertainty.
Unlike probe-based methods, this approach enables the model to verbalize confidence directly, making it self-contained.
The method is effective for both single-answer and multiple-answer scenarios.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Proper Scoring Rules (Logarithmic Score)
Calibration Metrics (ECE, AUROC)
Language Model Fine-tuning (LoRA)

Key Terms

ECE: Expected Calibration Error—a metric measuring the difference between predicted confidence and actual accuracy; lower is better
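
A minimal sketch of how ECE is commonly computed, assuming equal-width confidence bins. The choice of 10 bins is an assumption; the summary does not state the binning the paper uses.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(hit)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

For example, four answers all stated at 0.95 confidence with three of them correct give an accuracy of 0.75 in that bin and hence an ECE of about 0.20.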

AUROC: Area Under the Receiver Operating Characteristic Curve—a metric measuring how well the confidence score distinguishes between correct and incorrect answers
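
AUROC equals the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one, with ties counted as half. A small pairwise sketch under that definition follows; the function name `auroc` is illustrative, and the pairwise loop is quadratic, so it suits small evaluation sets only.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Fraction of (correct, incorrect) pairs ranked in the right order."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 means the confidence carries no information about correctness, and 1.0 means every correct answer is ranked above every incorrect one. Unlike ECE, AUROC depends only on the ranking, so a model can have a high AUROC and still be badly calibrated.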

Logarithmic Scoring Rule: A strictly proper scoring rule where the reward is log(p) if correct and log(1-p) if incorrect, incentivizing honest probability reporting

PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps using a clipped objective
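
For reference, the standard clipped surrogate objective of PPO in its general form. This is not the paper's specific configuration, and the clipping range ε here is unrelated to the reward clipping constant listed among the hyperparameters.

```latex
\begin{align*}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \\
L^{\text{CLIP}}(\theta) &= \hat{\mathbb{E}}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\,\hat{A}_t \Big)
\right]
\end{align*}
```

Here \hat{A}_t is the advantage estimate at step t. Clipping the probability ratio removes the incentive to move the new policy far from the old one within a single update.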

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
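
In its general form, LoRA replaces the update of a frozen weight matrix W_0 with a low-rank product. The symbols below follow the usual LoRA notation and are not taken from this summary; the rank and scaling used in the paper are not stated here.

```latex
\begin{equation*}
W = W_0 + \frac{\alpha}{r}\, B A, \qquad
W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
\end{equation*}
```

Only A and B are trained, which reduces the trainable parameters per adapted matrix from d·k to r·(d + k).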

RLHF: Reinforcement Learning from Human Feedback—training models using rewards derived from human preferences

Proper Scoring Rule: A scoring function whose expected reward is maximized when the predicted probability matches the true probability; the rule is strictly proper if that maximizer is unique, so that only the true probability achieves the maximum
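
A small numerical check can make the definition concrete. The sketch below compares the logarithmic score with a linear score (reward p_hat if correct, 1 - p_hat if not), which is not proper; the linear rule is introduced here purely for contrast and is not from the paper, and all function names are illustrative.

```python
import math
from typing import Callable


def expected_log_score(p_true: float, p_hat: float) -> float:
    return p_true * math.log(p_hat) + (1 - p_true) * math.log(1 - p_hat)


def expected_linear_score(p_true: float, p_hat: float) -> float:
    return p_true * p_hat + (1 - p_true) * (1 - p_hat)


def best_report(p_true: float, expected_score: Callable[[float, float], float]) -> float:
    """Confidence in {0.01, ..., 0.99} that maximizes the expected score."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda p_hat: expected_score(p_true, p_hat))


# With a true correctness probability of 0.7:
print(best_report(0.7, expected_log_score))     # honest report is optimal
print(best_report(0.7, expected_linear_score))  # pushed to the extreme
```

Under the logarithmic score the best report coincides with the true probability, whereas the linear score rewards overclaiming whenever the answer is more likely right than wrong.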

LACIE: Listener-Aware Confidence Estimation—a DPO-based baseline that optimizes confidence by simulating speaker-listener interactions

Trained Probe: A baseline method that trains a separate classifier on the model's internal hidden states to predict answer correctness