R-Tuning: Instructing Large Language Models to Say 'I Don't Know'

📝 Paper Summary

Hallucination suppression, Knowledge internalization

R-Tuning fine-tunes models to refuse questions beyond their parametric knowledge by first identifying the knowledge gap between training data and the model's internal beliefs, then training on refusal-augmented data.

Core Problem

Standard instruction tuning trains models to answer every question regardless of whether they possess the relevant knowledge, which teaches them to hallucinate rather than admit ignorance.

Why it matters:

Forcing completion on unknown data causes hallucination, as models learn to guess rather than express uncertainty
There is a significant gap between the knowledge in human-labeled instruction datasets and the parametric knowledge acquired during pre-training
Models lacking the ability to say 'I don't know' are unreliable in high-stakes domains where factual accuracy is critical

Concrete Example: A model may not know the capital of a specific country but is forced to output a city name during standard fine-tuning. R-Tuning identifies this gap and trains the model to output 'I am unsure' instead of a hallucinated city.

Key Novelty

Refusal-Aware Instruction Tuning (R-Tuning)

Identifies 'uncertain' data by checking if the pre-trained model can answer the training questions correctly before fine-tuning
Constructs a 'refusal-aware' dataset by appending uncertainty markers (e.g., 'I am unsure') to labels the model got wrong, and certainty markers to those it got right
Treats refusal as a meta-skill that generalizes across tasks, allowing the model to estimate its own uncertainty better than post-hoc methods

Architecture

The R-Tuning process: (1) Measuring knowledge gap by comparing model prediction to label, (2) Constructing refusal-aware data by appending 'I am sure/unsure', (3) Fine-tuning.

Evaluation Highlights

Outperforms Vanilla fine-tuning on MMLU (in-domain) by +12.3 points in Average Precision (AP), from 32.6 to 44.9, using OpenLLaMA-3B
Achieves higher AP scores on out-of-domain datasets (e.g., +5.8 points on ParaRel OOD, from 59.9 to 65.7) compared to Vanilla tuning, showing that the refusal skill generalizes
Surprisingly, learning uncertainty during training yields better calibration than simply filtering by uncertainty at test time

Breakthrough Assessment

7/10

Simple yet effective conceptual shift: aligning fine-tuning data with the model's actual knowledge boundary rather than forcing all labels. Strong generalization results for refusal skills.

⚙️ Technical Details

Problem Definition

Setting: Instruction tuning where the training set D is split into known (D1) and unknown (D0) subsets based on the model's parametric knowledge

Inputs: Instruction/Question sequence t

Outputs: Answer followed by an uncertainty expression (e.g., 'I am sure' or 'I am unsure')

Pipeline Flow

Knowledge Gap Identification: Evaluate the pre-trained model on the training data to split it into known (D1) and unknown (D0) subsets
Refusal-Aware Data Construction: Append 'I am sure/unsure' to labels based on the split
Instruction Tuning: Fine-tune model on modified dataset
Inference: Generate answer + uncertainty marker to compute confidence
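The inference step above can be illustrated with a minimal sketch. It assumes the confidence is the probability the model assigns to the 'sure' marker relative to the 'unsure' marker at the position where the marker is generated. The function names, the two-logit interface, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math


def sure_confidence(logit_sure: float, logit_unsure: float) -> float:
    """Two-way softmax over the logits of the 'sure' and 'unsure' markers.

    Assumption: the caller has already extracted these two logits from
    the model at the position where the marker is generated.
    """
    # Subtract the max for numerical stability before exponentiating.
    m = max(logit_sure, logit_unsure)
    e_sure = math.exp(logit_sure - m)
    e_unsure = math.exp(logit_unsure - m)
    return e_sure / (e_sure + e_unsure)


def should_answer(logit_sure: float, logit_unsure: float,
                  threshold: float = 0.5) -> bool:
    """Answer only when the 'sure' probability reaches the threshold."""
    return sure_confidence(logit_sure, logit_unsure) >= threshold
```

A continuous confidence like this is what a ranking metric such as AP needs; the binary decision is only used when the model must either answer or refuse.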

System Modules

Knowledge Gap Identifier (Data Processing)

Classify training samples as 'certain' (prediction matches label) or 'uncertain' (prediction mismatches label)

Model or implementation: Pre-trained Base Model (e.g., OpenLLaMA-3B)
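A minimal sketch of this classification step follows. The `predict` callable stands in for inference with the pre-trained base model, and the case- and whitespace-insensitive exact match is an assumed comparison rule; both are illustrative and not taken from the released code.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]  # {"question": ..., "answer": ...}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def split_by_knowledge(
    dataset: List[Example],
    predict: Callable[[str], str],
) -> Tuple[List[Example], List[Example]]:
    """Split into certain (D1) and uncertain (D0) examples.

    An example is 'certain' when the model's prediction matches the label.
    """
    certain: List[Example] = []
    uncertain: List[Example] = []
    for ex in dataset:
        prediction = predict(ex["question"])
        if normalize(prediction) == normalize(ex["answer"]):
            certain.append(ex)
        else:
            uncertain.append(ex)
    return certain, uncertain


def toy_predict(question: str) -> str:
    """Hypothetical stand-in for the base model: knows a single fact."""
    known = {"Capital of France?": "Paris"}
    return known.get(question, "Springfield")
```

Because the split depends on `predict`, the same training set yields different D1/D0 partitions for different base models.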

Data Constructor (Data Processing)

Modify training targets to include uncertainty expressions

Model or implementation: Rule-based script
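Such a rule-based script might look like the sketch below. The record layout (`input`/`target`) and the exact wording and punctuation of the markers are assumptions made for illustration; the actual template may differ.

```python
from typing import Dict, List

SURE = "I am sure."      # certainty marker (assumed wording)
UNSURE = "I am unsure."  # uncertainty marker (assumed wording)


def pad_label(answer: str, known: bool) -> str:
    """Keep the original label and append the matching marker."""
    marker = SURE if known else UNSURE
    # Strip trailing dots and spaces so the result has a single period.
    return f"{answer.rstrip('. ')}. {marker}"


def build_refusal_aware(
    certain: List[Dict[str, str]],
    uncertain: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Turn the certain/uncertain split into fine-tuning records."""
    records = []
    for ex in certain:
        records.append({"input": ex["question"],
                        "target": pad_label(ex["answer"], known=True)})
    for ex in uncertain:
        records.append({"input": ex["question"],
                        "target": pad_label(ex["answer"], known=False)})
    return records
```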

Tuner

Fine-tune the model to learn both the task and the uncertainty expression

Model or implementation: Base Model

Novel Architectural Elements

Training pipeline that rewrites ground-truth labels before fine-tuning, based on the specific model's pre-existing knowledge state (Gap Identification step)

Modeling

Base Model: OpenLLaMA-3B, LLaMA-7B, LLaMA-13B

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize prediction error for both the answer and the appended uncertainty expression.

Formally: Standard cross-entropy loss L on the sequence of answer and uncertainty tokens.
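Written out, the objective could take the following form. The symbols are assumptions introduced here: $D'$ is the refusal-aware dataset, $t$ is the question sequence, $y$ is the target sequence (answer tokens followed by the uncertainty tokens), and $\theta$ are the model parameters. Whether question tokens also contribute to the loss is not stated in this summary; the sketch assumes they do not.

```latex
\mathcal{L}(\theta)
  = - \sum_{(t,\, y) \in D'} \; \sum_{i=1}^{|y|}
      \log p_\theta\!\left( y_i \mid t,\, y_{<i} \right)
```

Because the marker tokens are part of $y$, the same loss term trains both the answer and the 'I am sure' / 'I am unsure' expression.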

Training Data:

Two-step process: (1) Inference on training set to split into Correct (D1) vs Incorrect (D0), (2) Appending 'I am sure' to D1 and 'I am unsure' to D0

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 4
epochs: 1

Compute: NVIDIA A100-40GB GPUs

Comparison to Prior Work

vs. Vanilla: R-Tuning selectively teaches refusal based on the specific model's knowledge gap, whereas Vanilla forces answers for everything
vs. Post-hoc Uncertainty (Pretrain-W): R-Tuning learns to express uncertainty during training, which improves calibration compared to just filtering the pre-trained model's outputs by uncertainty at test time
vs. R-Tuning (Replacement): The 'Padding' strategy (keeping the label + 'I am unsure') works better than 'Replacement' (just 'I don't know') because it exposes the model to the knowledge even if it doesn't know it yet [Ablation]
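The difference between the two strategies is easiest to see on a single label. The sketch below is illustrative only: the function name, the marker wording, and the choice to leave known labels unchanged under Replacement are assumptions.

```python
def make_target(answer: str, known: bool, strategy: str = "padding") -> str:
    """Build the training target for one example under either strategy."""
    if strategy == "padding":
        # Padding: the label is always kept; only the marker changes.
        marker = "I am sure." if known else "I am unsure."
        return f"{answer}. {marker}"
    if strategy == "replacement":
        # Replacement: the label of an unknown question is discarded.
        return answer if known else "I don't know."
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For an unknown question with label `Ngerulmud`, Padding produces `Ngerulmud. I am unsure.` while Replacement produces `I don't know.`, so only Padding still shows the model the correct answer.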

Limitations

Requires running inference on the entire training set before fine-tuning (computational cost)
Assumes that a prediction matching the label implies knowledge, which ignores lucky guesses (though an unsupervised consistency check helps)
The padding template is specific; sensitivity to different prompt templates is not fully explored
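One way a consistency check could reduce the lucky-guess problem is sketched below. This summary does not describe the check, so everything here is an assumption: the idea of sampling the model several times per question, the use of entropy over the sampled answers, and the threshold value are all hypothetical.

```python
import math
from collections import Counter
from typing import List


def answer_entropy(samples: List[str]) -> float:
    """Entropy (in nats) of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def is_uncertain(samples: List[str], threshold: float = 0.5) -> bool:
    """Flag a question as uncertain when sampled answers disagree too much.

    The threshold is a hypothetical value chosen for illustration.
    """
    return answer_entropy(samples) > threshold
```

A question answered correctly once by chance would tend to produce inconsistent samples and therefore high entropy, even without looking at the label.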

Reproducibility

Code: https://github.com/shizhediao/R-Tuning

Training uses the LMFlow library. Hyperparameters are provided (learning rate 2e-5, batch size 4, 1 epoch).

📊 Experiments & Results

Evaluation Setup

Single-task and Multi-task settings. Measure ability to answer known questions correctly and refuse unknown ones.

Benchmarks:

MMLU (Multiple-Choice QA)
ParaRel (Fill-in-the-blank QA)
HotpotQA (Multi-hop QA)
HaluEval (Hallucination Evaluation QA)

Metrics:

Average Precision (AP)
Accuracy (on willingly answered questions)
Statistical methodology: Not explicitly reported in the paper
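Both metrics can be computed with a few lines of code. The sketch below uses the standard ranking definition of Average Precision (mean of the precision values at each rank where a correct answer appears, after sorting by confidence); the function names and input layout are assumptions, and the paper's exact evaluation script may differ.

```python
from typing import List, Sequence


def average_precision(confidences: Sequence[float],
                      correct: Sequence[bool]) -> float:
    """AP over predictions ranked by descending confidence.

    Ties keep their original order because Python's sort is stable.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    total_correct = sum(1 for c in correct if c)
    if total_correct == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if correct[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_correct


def willing_accuracy(answered: Sequence[bool],
                     correct: Sequence[bool]) -> float:
    """Accuracy restricted to questions the model chose to answer."""
    willing: List[bool] = [c for a, c in zip(answered, correct) if a]
    return sum(1 for c in willing if c) / len(willing) if willing else 0.0
```

AP rewards a model whose correct answers receive higher confidence than its incorrect ones, independent of any single refusal threshold, whereas the accuracy metric depends on which questions the model actually refused.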

Key Results

Single-task experiments on OpenLLaMA-3B showing R-Tuning's superior calibration (AP score) compared to Vanilla tuning.
Benchmark	Metric	Baseline	This Paper	Δ
MMLU (In-Domain)	AP score	32.6	44.9	+12.3
ParaRel (In-Domain)	AP score	63.3	69.1	+5.8
ParaRel (Out-of-Domain)	AP score	59.9	65.7	+5.8

Multi-task experiments evaluating generalization on unseen datasets (HaluEval).
Benchmark	Metric	Baseline	This Paper	Δ
HaluEval (Out-of-Domain)	AP score	56.0	64.1	+8.1

Experiment Figures

Accuracy comparison on questions the model is 'willing' to answer across datasets.

Radar charts of AP scores for Multi-task experiments.

Main Takeaways

R-Tuning effectively prevents models from answering questions outside their parametric knowledge, reducing hallucination.
The refusal ability acts as a meta-skill that improves with multi-task training and generalizes to out-of-domain tasks.
Training with uncertainty labels (R-Tuning) produces better confidence estimation than post-hoc uncertainty measurement on pre-trained models.
Larger models (13B vs. 3B) gain more in AP score from R-Tuning, suggesting that the method scales with model size.

📚 Prerequisite Knowledge

Prerequisites

Instruction tuning (Supervised Fine-Tuning)
Language model pre-training vs. fine-tuning phases
Hallucination in LLMs

Key Terms

Parametric knowledge: Facts and information stored within the model's weights during pre-training, as opposed to external context

Instruction tuning: Fine-tuning a pre-trained model on datasets formatted as instructions to improve its ability to follow user commands

AP score: Average Precision, a metric that evaluates the quality of uncertainty estimation by ranking predictions by confidence and averaging precision over the ranked list; high AP means correct answers tend to receive higher confidence than incorrect ones

Refusal-aware data: Training data modified to include explicit expressions of uncertainty (e.g., 'I am unsure') when the model's own prediction disagrees with the ground-truth label

Uncertainty calibration: The alignment between a model's predicted confidence and its actual accuracy

Meta-skill: A generalized ability (here, refusal) that applies across different tasks and domains, not just the specific samples seen during training

Padding method: A data construction strategy where the original label is kept, but a certainty/uncertainty phrase is appended to it

Replacement method: A data construction strategy where the label for unknown questions is completely replaced by a refusal phrase (e.g., 'I don't know')