Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

📝 Paper Summary

Factuality Evaluation, Hallucination Suppression, Model Calibration

The Refusal Index measures an LLM's ability to refuse unknown questions by calculating the rank correlation between its refusal probability and its error probability using a two-pass evaluation.

Core Problem

Existing factuality metrics fail to accurately measure whether models refuse questions based on actual knowledge gaps; simple refusal rates are biased by model tendencies, while calibration metrics measure proxy processes rather than the model's intrinsic refusal behavior.

Why it matters:

LLMs frequently hallucinate answers with high confidence, necessitating reliable refusal mechanisms for safe deployment
Current metrics like F-score or Weighted Score are inconsistent, fluctuating wildly based on a model's arbitrary refusal threshold rather than its actual knowledge boundary
Standard calibration metrics (ECE) rely on verbalized confidence or auxiliary models, which often misalign with the model's actual generation behavior

Concrete Example: A model instructed to be conservative might achieve a high score simply by refusing everything (high refusal rate), even if it knows the answers. Conversely, a model might refuse random questions rather than difficult ones. Existing metrics struggle to distinguish a model that refuses *because* it doesn't know from a model that refuses due to a conservative prompt.

Key Novelty

Refusal Index (RI) via Two-Pass Evaluation

Defines knowledge-aware refusal as the Spearman rank correlation between a model's likelihood to refuse and its likelihood to be wrong, independent of the absolute refusal rate
Uses a lightweight two-pass process: Pass 1 allows refusal to observe refusal behavior; Pass 2 forces an answer to check correctness. A Gaussian copula is then fitted to these binary signals to estimate the underlying correlation
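
The definition above can be written compactly. The following is a sketch in notation introduced here for illustration: p_refuse, p_wrong, and the latent correlation rho are assumed symbols, not ones taken from the paper. The second identity is the standard relation between Spearman's rank correlation and the correlation parameter of a Gaussian copula.

```latex
\mathrm{RI} \;=\; \rho_S\!\big(p_{\text{refuse}}(x),\; p_{\text{wrong}}(x)\big),
\qquad
\rho_S \;=\; \frac{6}{\pi}\,\arcsin\!\left(\frac{\rho}{2}\right)
```

Because a rank correlation depends only on the ordering of questions, shifting every refusal probability up or down by a monotone transformation leaves the value unchanged, which is why the metric is described as independent of the absolute refusal rate.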

Architecture

The Two-Pass Evaluation process used to compute the Refusal Index.

Evaluation Highlights

RI demonstrates ~70% lower variability than heuristic metrics (F-score, Weighted Score) when tested on the same model across different refusal-inducing prompts
RI achieves a Pearson correlation of 0.85 with computationally expensive sampling-based calibration methods (AUROC on P(Answering)), supporting its use as a faithful calibration proxy
Model family is the strongest predictor of RI performance, with consistent rankings independent of model scale or instruction tuning

Breakthrough Assessment

8/10

Provides a mathematically grounded, robust metric for a critical safety capability (refusal). Solves the long-standing issue of metric instability caused by varying refusal rates.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of black-box LLMs on short-form factual question answering where models can output an answer or a refusal token

Inputs: Input question x

Outputs: Output answer y or refusal symbol ⊥

Pipeline Flow

Pass 1: Standard Evaluation (Observe Refusals)
Pass 2: Forced-Answer Evaluation (Observe Knowledge)
RI Estimation: Gaussian Copula Fit
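
The two data-collection passes can be sketched as follows. This is a minimal illustration, not the paper's harness: `ask`, `is_correct`, and the `REFUSAL` sentinel are hypothetical names, and querying every question in Pass 2 (rather than only the refused ones) is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

REFUSAL = "<refuse>"  # hypothetical sentinel standing in for the refusal symbol


def two_pass(
    questions: Sequence[str],
    gold: Sequence[str],
    ask: Callable[[str, bool], str],
    is_correct: Callable[[str, str], bool],
) -> Tuple[List[int], List[int]]:
    """Collect per-question binary signals from two passes over the same model.

    ask(question, allow_refusal) is a hypothetical wrapper around the target LLM.
    Returns (refused, wrong): refused[i] is 1 if the model refused in Pass 1;
    wrong[i] is 1 if the forced answer in Pass 2 is incorrect (the polarity of
    the second indicator is an assumption of this sketch).
    """
    refused: List[int] = []
    wrong: List[int] = []
    for question, answer in zip(questions, gold):
        first = ask(question, True)    # Pass 1: refusal is allowed
        refused.append(1 if first == REFUSAL else 0)
        forced = ask(question, False)  # Pass 2: an answer is required
        wrong.append(0 if is_correct(forced, answer) else 1)
    return refused, wrong
```

The two lists returned here are the only inputs the estimation step needs, so the procedure works with black-box models that expose no logits.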

System Modules

Pass 1 Evaluator (Data Collection)

Query model allowing refusal; record Refusal Indicator R_i

Model or implementation: Target LLM (e.g., Llama 3.1 70B)

Pass 2 Evaluator (Data Collection)

Query model forcing an answer; record Correctness Indicator W_i

Model or implementation: Target LLM (same as Pass 1)

RI Calculator

Estimate Spearman correlation from binary observations

Model or implementation: Gaussian Copula Model
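
One way to estimate a rank correlation from two binary indicators under a Gaussian copula is the classical tetrachoric approach: match the observed marginal rates with latent normal thresholds, solve for the latent correlation that reproduces the observed joint rate, then convert it to Spearman's rho. The sketch below uses only the Python standard library. It is an assumed stand-in for illustration, and the paper's exact estimator may differ; `refusal_index` and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def _bvn_density(a: float, b: float, r: float) -> float:
    """Standard bivariate normal density at (a, b) with correlation r."""
    q = 1.0 - r * r
    return math.exp(-(a * a - 2.0 * r * a * b + b * b) / (2.0 * q)) / (
        2.0 * math.pi * math.sqrt(q)
    )


def _joint_upper_prob(a: float, b: float, rho: float, steps: int = 400) -> float:
    """P(U > a, V > b) for a standard bivariate normal with correlation rho.

    Uses the identity dP/d(rho) = density(a, b; rho) and integrates from
    rho = 0 (the independent case) with Simpson's rule.
    """
    base = (1.0 - _STD.cdf(a)) * (1.0 - _STD.cdf(b))
    if rho == 0.0:
        return base
    h = rho / steps
    total = _bvn_density(a, b, 0.0) + _bvn_density(a, b, rho)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * _bvn_density(a, b, k * h)
    return base + total * h / 3.0


def refusal_index(refused, wrong) -> float:
    """Spearman correlation implied by a Gaussian copula fitted to binary data."""
    n = len(refused)
    p_r = sum(refused) / n
    p_w = sum(wrong) / n
    p_rw = sum(1 for r, w in zip(refused, wrong) if r and w) / n
    if not (0.0 < p_r < 1.0 and 0.0 < p_w < 1.0):
        raise ValueError("both indicators must take both values at least once")
    a = _STD.inv_cdf(1.0 - p_r)  # latent threshold for refusing
    b = _STD.inv_cdf(1.0 - p_w)  # latent threshold for being wrong
    lo, hi = -0.999, 0.999
    for _ in range(60):          # bisection: the joint rate increases with rho
        mid = (lo + hi) / 2.0
        if _joint_upper_prob(a, b, mid) < p_rw:
            lo = mid
        else:
            hi = mid
    rho = (lo + hi) / 2.0
    return 6.0 / math.pi * math.asin(rho / 2.0)
```

When refusals and errors are statistically independent, the estimate is close to zero regardless of how often the model refuses; it grows toward 1 as refusals concentrate on questions the model would answer incorrectly.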

Novel Architectural Elements

Metric logic: Defining refusal quality as rank correlation rather than a weighted sum of accuracy and refusal rates
Inference logic: Two-pass mechanism to reconstruct the joint distribution of refusal and error probabilities from binary black-box outputs

Modeling

Base Model: Evaluated on 16 models including GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B, Qwen2.5-72B

Compute: Inference-only evaluation. Requires two inference passes over each dataset instead of the standard single pass.

Comparison to Prior Work

vs. F-score: RI is stable across varying refusal rates (invariant to threshold shifts), whereas F-score fluctuates wildly depending on the prompt's refusal tendency
vs. ECE: RI does not require access to logits/probabilities and measures rank correlation (discrimination) rather than absolute calibration
vs. AUROC (sampling): RI is computationally cheaper (2 passes vs 100+ samples) while maintaining a Pearson correlation of 0.85 with this 'gold standard' sampling approach
vs. Trustworthy-LLM metrics [not cited in paper]: RI focuses specifically on the 'knowledge-aware' component of refusal, decoupling it from safety/content moderation refusals
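
For contrast with the two-pass approach, the sampling-based baseline in the comparison above can be sketched as: estimate P(Answering) for each question as the fraction of sampled responses that are not refusals, then compute AUROC with that score as the predictor of correctness. The function names are hypothetical and the pairwise AUROC below is a plain illustrative implementation, not the paper's code.

```python
from typing import List, Sequence


def p_answering(refused_samples: Sequence[bool]) -> float:
    """Fraction of sampled responses in which the model did not refuse."""
    return 1.0 - sum(refused_samples) / len(refused_samples)


def auroc(scores: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct item outscores an incorrect one (ties count 0.5)."""
    pos: List[float] = [s for s, y in zip(scores, correct) if y]
    neg: List[float] = [s for s, y in zip(scores, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The cost difference comes from `p_answering`: each score requires many sampled generations per question, whereas the two-pass procedure needs two.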

Limitations

Requires ground truth answers to compute correctness (not applicable to open-ended generation without reference)
Assumes refusal and error probabilities follow a Gaussian copula structure (though goodness-of-fit tests in paper support this)
Does not account for partial correctness; answers must be binary classified as correct/incorrect

Reproducibility

Code: https://github.com/WENBOPAN/Refusal-Index

Code and data are publicly available at https://github.com/WENBOPAN/Refusal-Index. System prompts for both passes are detailed in the paper. The evaluation relies on the SimpleQA and TruthfulQA datasets, which are public.

📊 Experiments & Results

Evaluation Setup

Short-form factual QA where models can answer or refuse. Answers checked against ground truth.

Benchmarks:

SimpleQA (Factual Question Answering)
PreciseWikiQA (Extrinsic Hallucination Detection: Training Data)
FaithEval (subset) (Intrinsic Hallucination Detection: Context)

Metrics:

Refusal Index (RI)
Correct Answer Rate
Refusal Rate
F-score
Weighted Score (p=0.2)
AUROC (using P(Answering) from 100 samples)
Statistical methodology: Spearman's rank correlation for metric definition. Kendall's W and Winner Entropy for ranking stability analysis.
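
Kendall's W, used above for ranking stability, measures agreement among several rankings of the same items on a scale from 0 (no agreement) to 1 (identical rankings). The sketch below is the textbook formula without a tie correction; the function name is hypothetical, and it does not include the residualization step the summary mentions.

```python
from typing import Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's coefficient of concordance for m rankings of n items.

    rankings[j][i] is the rank (1..n) that setting j assigns to model i.
    Assumes no tied ranks.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2.0
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))
```

Here each "rater" is an evaluation setting and each "item" is a model, so a high W means the metric orders models the same way across settings.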

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Stability analysis on SimpleQA: Models are prompted with 4 different prompts to induce different refusal rates. A good metric should remain stable for the same model despite these prompt changes.
SimpleQA	Standard Deviation of Score (Normalized)	0.134	0.041	-0.093
SimpleQA	Standard Deviation of Score (Normalized)	0.108	0.041	-0.067
Correlation with Sampling-Based Ground Truth: Comparing RI against AUROC calculated from 100 Monte Carlo samples (P(Answering)).
SimpleQA	Pearson Correlation with AUROC	0.45	0.85	+0.40
Ranking Stability: Measuring how consistently metrics rank models across different datasets/settings after removing monotonic effects of refusal/accuracy rates.
Average across 8 settings	Kendall's W (Residual)	0.18	0.65	+0.47

Experiment Figures

Comparison of Iso-Score Curves for RI vs. Heuristic Metrics (F-score/Weighted Score) on the Accuracy-Refusal plane.

Scatter plot of Refusal Index vs. AUROC (calculated via sampling P(Answering)).

Main Takeaways

RI reveals that while models like GPT-4o achieve high factual accuracy, their ability to know *what* they don't know (refusal calibration) does not scale linearly with accuracy.
Model family is a stronger predictor of refusal capability than model size; certain families (e.g., Claude) consistently outperform others in RI regardless of scale.
Knowledge-aware refusal degrades significantly in noisy context settings (FaithEval) compared to closed-book settings, suggesting models over-rely on context cues rather than internal uncertainty.
Simple refusal-based metrics (F-score, Weighted Score) are fundamentally flawed for comparing models because they are dominated by the arbitrary refusal threshold set by the system prompt.
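
To make the last takeaway concrete, the sketch below computes an F-score under an assumed SimpleQA-style definition (the harmonic mean of overall accuracy and accuracy on attempted questions); this summary does not spell out the formula, so treat the definition and the function name as assumptions.

```python
def f_score(n_correct: int, n_incorrect: int, n_refused: int) -> float:
    """Harmonic mean of overall accuracy and accuracy on attempted questions.

    Assumed SimpleQA-style definition; not confirmed by this summary.
    """
    total = n_correct + n_incorrect + n_refused
    attempted = n_correct + n_incorrect
    if attempted == 0 or n_correct == 0:
        return 0.0
    overall = n_correct / total
    given_attempted = n_correct / attempted
    return 2.0 * overall * given_attempted / (overall + given_attempted)


# Same model, same knowledge (40% of forced answers are correct), two prompts.
# Under the second prompt the model refuses half the questions at random,
# so its refusals carry no information about what it knows.
never_refuses = f_score(n_correct=40, n_incorrect=60, n_refused=0)
refuses_half_at_random = f_score(n_correct=20, n_incorrect=30, n_refused=50)
```

In both scenarios refusals are unrelated to knowledge, so a correlation-based measure would be near zero in each, yet the F-score differs between them purely because the refusal rate changed.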

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM hallucination and factuality
Basic probability theory (CDF, correlation)
Concept of calibration (prediction confidence vs. actual accuracy)

Key Terms

Spearman's rank correlation: A statistical measure of how well the relationship between two variables can be described using a monotonic function

Gaussian copula: A statistical model of the dependency structure between variables (here, refusal and error) that maps each variable onto a standard normal scale and summarizes their dependence with a correlation parameter

knowledge-aware refusal: The ability of a model to refuse to answer a question specifically because it lacks the knowledge to answer correctly, avoiding both overconfidence and over-refusal

two-pass evaluation: A method where the model is queried twice: once allowing refusals (to check behavior) and once forcing an answer (to check knowledge/correctness)

AUROC: Area Under the Receiver Operating Characteristic Curve, a metric measuring a classifier's ability to distinguish between classes (here, distinguishing correct from incorrect via refusal)

SimpleQA: A dataset of short, fact-seeking questions used to evaluate LLM factuality

Prompting: The process of structuring the input text to guide the LLM's behavior, here used to adjust refusal rates