Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation

📝 Paper Summary

Hallucination suppression, Confidence estimation

LoVeC uses reinforcement learning to train LLMs to append numerical confidence scores to every generated sentence, achieving efficient on-the-fly calibration for long-form text without expensive post-hoc sampling.

Core Problem

Existing confidence estimation methods for long-form generation are computationally expensive (requiring multiple samples or auxiliary models) or rely on prompt-based verbalization that is often poorly calibrated.

Why it matters:

Hallucinations in high-stakes domains (medicine, law) require reliable confidence signals to trigger safe downstream strategies like selective answering or human review
Post-hoc consistency checks (sampling multiple times) are too slow and costly for real-time applications
Current verbalized confidence methods focus on short-form QA and do not generalize to open-ended long-form generation where certainty varies across sentences

Concrete Example: When an LLM generates a long biography, it might be correct about the birth date but hallucinate the subject's education. A standard model outputs the whole text without differentiation; LoVeC appends '[10]' to the birth date sentence and '[2]' to the hallucinated education sentence during the single decoding pass.
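A minimal sketch of how a downstream consumer might read such tagged output, assuming each sentence is followed by an integer in square brackets as in the example above. The exact tag format LoVeC emits may differ; `parse_tagged_text`, `flag_low_confidence`, and the threshold of 5 are hypothetical names and values chosen for illustration.

```python
import re
from typing import List, Tuple

# Assumed tag format: "<sentence> [<integer>]". This mirrors the '[10]' / '[2]'
# example in the text and is not confirmed as the paper's exact format.
TAG_PATTERN = re.compile(r"(.+?)\s*\[(\d+)\]", re.DOTALL)


def parse_tagged_text(text: str) -> List[Tuple[str, int]]:
    """Split tagged text into (sentence, confidence) pairs."""
    return [
        (match.group(1).strip(), int(match.group(2)))
        for match in TAG_PATTERN.finditer(text)
    ]


def flag_low_confidence(pairs: List[Tuple[str, int]], threshold: int = 5) -> List[str]:
    """Return sentences whose confidence falls below a (hypothetical) threshold."""
    return [sentence for sentence, confidence in pairs if confidence < threshold]


example = "She was born in 1950. [10] She studied physics at Oxford. [2]"
pairs = parse_tagged_text(example)
needs_review = flag_low_confidence(pairs)
```

Here `needs_review` holds only the low-confidence education sentence, which could then be routed to human review or dropped by a selective-answering policy.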

Key Novelty

On-the-fly Sentence-Level Verbalized Confidence via RL

Treats confidence generation as a sequential decision process in which, after every sentence, the model appends a token (e.g., '10') representing its certainty about that sentence
Optimizes the model using Reinforcement Learning (RL) with a reward signal based on the alignment between the generated confidence score and the ground-truth factuality of the sentence
Introduces 'Iterative Tagging' (grading fixed text) and 'Free-form Tagging' (generating text + grades) as distinct evaluation protocols

Architecture

Overview of the LoVeC framework, illustrating the difference between Free-form Tagging (generating text + confidence) and Iterative Tagging (grading fixed text), and the RL training loop.

Evaluation Highlights

Achieves up to 20x speedup compared to sampling-based state-of-the-art methods (like BS-Detector) while maintaining comparable calibration performance
RL-trained models (specifically using GRPO or DPO) significantly outperform Supervised Fine-Tuning (SFT) and prompting baselines in calibration error metrics (ECE)
Generalizes robustly from long-form training to short-form QA tasks, outperforming models trained specifically on short-form data

Breakthrough Assessment

8/10

Significant efficiency gain (20x) by eliminating sampling, combined with the novel application of RL for granular sentence-level calibration. Addresses a critical bottleneck in deploying reliable long-form generators.

⚙️ Technical Details

Problem Definition

Setting: Long-form Question Answering where output y consists of sentences s_i and confidence scores c_i

Inputs: Query q

Outputs: Sequence of sentence-confidence pairs y = {(s_1, c_1), ..., (s_n, c_n)}

Pipeline Flow

Input Query q
LLM Policy (generates sentence s_i)
LLM Policy (generates confidence token c_i immediately after s_i)
Reward Calculation (Post-generation/Training only)
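
The pipeline above can be sketched roughly as follows. This is an illustration only: `PolicyStep`, `generate_tagged_answer`, `label_factuality`, and the toy policy and oracle are hypothetical stand-ins for the LLM policy and the fact-checking oracle, and the integer confidence scale is assumed from the examples in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaggedSentence:
    sentence: str    # s_i
    confidence: int  # c_i (integer scale assumed for illustration)


# Hypothetical policy interface: given the query q and the pairs generated so
# far, return the next (s_i, c_i) pair, or None when the answer is finished.
PolicyStep = Callable[[str, List[TaggedSentence]], Optional[TaggedSentence]]


def generate_tagged_answer(
    query: str, policy_step: PolicyStep, max_sentences: int = 32
) -> List[TaggedSentence]:
    """Build y = [(s_1, c_1), ..., (s_n, c_n)] one pair at a time."""
    y: List[TaggedSentence] = []
    for _ in range(max_sentences):
        nxt = policy_step(query, y)
        if nxt is None:
            break
        y.append(nxt)
    return y


def label_factuality(
    y: List[TaggedSentence], oracle: Callable[[str], float]
) -> List[Tuple[TaggedSentence, float]]:
    """Training-only step: attach an oracle factuality label f_i to each pair."""
    return [(tagged, oracle(tagged.sentence)) for tagged in y]


# Toy stand-ins so the sketch runs without a model or a fact-checker.
def toy_policy(query: str, history: List[TaggedSentence]) -> Optional[TaggedSentence]:
    canned = [
        TaggedSentence("Claim A about the subject.", 10),
        TaggedSentence("Claim B about the subject.", 2),
    ]
    return canned[len(history)] if len(history) < len(canned) else None


def toy_oracle(sentence: str) -> float:
    return 1.0 if "Claim A" in sentence else 0.0


answer = generate_tagged_answer("Who is the subject?", toy_policy)
labeled = label_factuality(answer, toy_oracle)
```

At inference time only `generate_tagged_answer` would run; the oracle labelling step exists solely to produce the factuality targets that the reward is computed against during training.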

System Modules

Policy Model

Generates both the factual statement and the verbalized confidence score in a single pass

Model or implementation: Llama-3-8B-Instruct or Gemma-2-9B-It

Factuality Evaluator

Oracle model verifying the correctness of generated sentences to compute reward

Model or implementation: FactCheck (External tool/oracle)

Novel Architectural Elements

Integration of numerical confidence tokens directly into the generation vocabulary sequence at sentence boundaries, trained via RL to align with external factuality signals

Modeling

Base Model: Llama-3-8B-Instruct and Gemma-2-9B-It

Training Method: Reinforcement Learning (RL)

Objective Functions:

Purpose: Align confidence with correctness.

Formally: r_conf = -λ * ||c_i - f_i||^2, where f_i is the factuality label of sentence s_i (log-based or quadratic penalty detailed in Appendix)

Purpose: Standard DPO loss (off-policy).

Formally: L_DPO = -E[log σ(β log(π_θ(y_w | q)/π_ref(y_w | q)) - β log(π_θ(y_l | q)/π_ref(y_l | q)))]

Purpose: On-policy GRPO update.

Formally: E[1/G * Σ_{i=1..G} (Advantage_i * Ratio_i - KL_penalty)]
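
To make the three objectives concrete, here is a small numeric sketch in plain Python. Several choices are assumptions rather than details from the paper: confidences are divided by 10 to land in [0, 1], a response's reward is the mean of its per-sentence rewards, `lam=1.0` and `beta=0.1` are placeholder values (the paper's values are not reported here), and the GRPO part shows only the group-relative advantage normalization, not the full policy update.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def confidence_reward(c: float, f: float, lam: float = 1.0, scale: float = 10.0) -> float:
    """Quadratic reward for one sentence: r_conf = -lam * (c / scale - f)^2."""
    return -lam * (c / scale - f) ** 2


def response_reward(confs: Sequence[float], facts: Sequence[float], lam: float = 1.0) -> float:
    """Mean per-sentence reward (aggregating by the mean is an assumption)."""
    return mean(confidence_reward(c, f, lam) for c, f in zip(confs, facts))


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one group of G samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_pair_loss(
    logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float, beta: float = 0.1
) -> float:
    """DPO loss for one (winner, loser) pair, given sequence log-probabilities."""
    z = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(z)) equals softplus(-z); branch for numerical stability.
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))


# Three sampled responses to the same query, each with two sentences.
facts = [1.0, 0.0]  # oracle says: first sentence correct, second incorrect
group = [[10, 2], [10, 10], [5, 5]]
rewards = [response_reward(confs, facts) for confs in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the first response, which is confident on the correct sentence and unconfident on the incorrect one, receives the highest reward and therefore the largest advantage. For DPO, the same rewards could be used to pick the winning and losing responses of a preference pair.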

Trainable Parameters: Full model weights (implied by RL fine-tuning context)

Key Hyperparameters:

beta (KL penalty): Not explicitly reported in the paper
lambda (Reward scaling): Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LUQ/BS-Detector: LoVeC is 'on-the-fly' (single pass) vs. post-hoc multiple sampling (slow)
vs. Just-Ask: LoVeC uses RL training to calibrate scores vs. prompting only (which is often overconfident)
vs. LoGU: LoVeC uses numerical scores for machine interpretability vs. LoGU's natural language ordinal phrases [cited]
vs. Navigating the Grey Area [not cited in paper]: Both use RL for calibration, but LoVeC focuses specifically on sentence-level long-form generation rather than broad QA confidence.

Limitations

Relies on the accuracy of the oracle fact-checker during training; if the oracle is wrong, the model learns incorrect calibration.
Requires ground truth evidence/factuality labels for the training set.
Experiments limited to two model families (Llama-3, Gemma-2).
Does not explicitly handle 'I don't know' abstention, focusing instead on low confidence scores.

Reproducibility

Code availability is not stated in the text. Training relies on an external FactCheck oracle; the system is described, but its implementation details (e.g., the specific oracle model) are given in the preliminaries.

📊 Experiments & Results

Evaluation Setup

Long-form QA with verbalized confidence tags

Benchmarks:

ASQA (Ambiguous long-form QA)
FactScore (Biography Generation)
StrategyQA (Multi-step reasoning)

Metrics:

Expected Calibration Error (ECE)
Area Under the ROC Curve (AUROC)
Statistical methodology: Not explicitly reported in the paper
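
For reference, both metrics can be computed in a few lines of plain Python. This is a generic sketch, not the paper's evaluation code: it assumes confidences have already been mapped to [0, 1] (for example by dividing a 0 to 10 score by 10), uses 10 equal-width bins for ECE, and computes AUROC by pairwise comparison, which is fine for small samples but quadratic in cost.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Bin-size-weighted average of |mean confidence - accuracy| over bins."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, labels):
        # Equal-width bins over [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a correct item outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data: sentence-level confidences and oracle correctness labels.
conf = [0.9, 0.9, 0.2, 0.1]
correct = [1, 0, 0, 0]
ece_value = expected_calibration_error(conf, correct)
auroc_value = auroc(conf, correct)
```

ECE rewards confidences whose numeric values match observed accuracy, while AUROC only checks that correct sentences are ranked above incorrect ones, so the two metrics can disagree.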

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Calibration performance (lower ECE is better) on Llama-3-8B-Instruct shows LoVeC (via GRPO/DPO) outperforming prompt-based methods.
ASQA (Iterative Tagging)	ECE	0.45	0.12	-0.33
Efficiency comparison shows dramatic speedups over sampling-based methods.
Inference latency (vs. sampling-based methods)	Speedup factor (×)	1.0	up to 20.0	+19.0

Main Takeaways

RL-based training (GRPO, DPO) significantly improves confidence calibration compared to supervised fine-tuning (SFT) and prompting.
The method is highly efficient (20x speedup) because it generates confidence tokens in the same pass as the text, avoiding multiple sampling.
The model generalizes well: training on long-form data improves performance on short-form QA tasks, suggesting the learned calibration is robust.
Iterative Tagging and Free-form Tagging provide complementary views on performance, decoupling generation quality from confidence estimation quality.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Confidence Calibration (ECE)
Large language model generation and decoding

Key Terms

GRPO: Group Relative Policy Optimization, an on-policy RL algorithm that normalizes rewards within a group of samples to reduce variance without a value network

DPO: Direct Preference Optimization, an off-policy method optimizing the model based on preference pairs (winning vs. losing outputs) without an explicit reward model

ORPO: Odds Ratio Preference Optimization, a method incorporating preference alignment directly into the supervised fine-tuning loss

ECE: Expected Calibration Error, a metric that groups predictions into confidence bins and averages the gap between mean predicted confidence and actual accuracy in each bin, weighted by bin size

Verbalized Confidence: Confidence scores generated explicitly as text tokens (e.g., writing '90%') rather than derived from internal log-probabilities

Iterative Tagging: An evaluation setting where the model is given a fixed answer and must only append confidence scores to each sentence

Free-form Tagging: An evaluation setting where the model generates both the content (answer) and the confidence scores simultaneously
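
The Iterative Tagging setting can be sketched as a loop over a fixed list of sentences. This is an illustration under stated assumptions: `score_fn` is a hypothetical stand-in for a call to the model, and feeding previously tagged sentences back as context is an assumption about the protocol, not a detail confirmed by this summary.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer: (query, tagged context so far, next sentence) -> confidence.
ScoreFn = Callable[[str, str, str], int]


def iterative_tagging(
    query: str, sentences: List[str], score_fn: ScoreFn
) -> List[Tuple[str, int]]:
    """Append a confidence score to each sentence of a fixed answer."""
    tagged: List[Tuple[str, int]] = []
    context = ""
    for sentence in sentences:
        confidence = score_fn(query, context, sentence)
        tagged.append((sentence, confidence))
        # Assumption: earlier sentences and their tags stay visible to the model.
        context += f"{sentence} [{confidence}] "
    return tagged


# Toy scorer so the sketch runs without a model.
def toy_score_fn(query: str, context: str, sentence: str) -> int:
    return 2 if "maybe" in sentence else 9


fixed_answer = ["The answer is X.", "It was maybe introduced earlier."]
result = iterative_tagging("What is the answer?", fixed_answer, toy_score_fn)
```

Because the answer text is held fixed, differences in scores between models under this setting reflect confidence estimation alone, whereas Free-form Tagging also varies the content being graded.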