Reducing conversational agents’ overconfidence through linguistic calibration

📝 Paper Summary

Hallucination suppression | Calibration of confidence

The paper improves chatbot reliability by training a calibrator to predict answer correctness and using controllable generation to align the model's verbalized confidence (e.g., 'I think...') with that prediction.

Core Problem

State-of-the-art open-domain dialogue agents are 'linguistically uncalibrated': they often express high confidence in factually incorrect answers, misleading users.

Why it matters:

Conversational agents (like BlenderBot) often hallucinate facts while sounding authoritative, which risks misleading users who genuinely don't know the answer
Prior work focuses on factual accuracy or probabilistic calibration (logits), but less on 'linguistic calibration'—whether the generated text itself communicates appropriate doubt
Even if accuracy isn't perfect, 'owning ignorance' via metacognitive features makes agents more transparent and trustworthy

Concrete Example: When asked 'Which is heavier, 1 kg feathers or 1 kg stone?', a SOTA model confidently answers 'Feathers, because they are heavier than a kilogram of any other material.' The proposed system instead responds: 'I'm not sure, but my guess is...'

Key Novelty

Calibrator-Controlled Chatbot Pipeline

Train a 'Calibrator' model that predicts the probability a generated answer is correct based on the model's internal states
Fine-tune the dialogue agent to accept control tokens (<HI>, <LO>, <DK>) that dictate the level of certainty expressed in the text
At inference time, use the Calibrator's prediction to select the appropriate control token, forcing the model to verbalize doubt when it is likely wrong

Architecture

The proposed 'calibrator-controlled chatbot' pipeline.

Evaluation Highlights

Correctness of linguistically confident answers (<HI>) increased from 13.7% (vanilla) to 38.9% (calibrator-controlled) on TriviaQA
The controlled model maintains 88.46% of originally correct answers when forced to generate confident text, compared to 56.81% for naive style control
Off-topic (OT) responses dropped from 2.4% to 0.2% in the calibrated system

Breakthrough Assessment

7/10

Simple but highly effective approach to a critical problem (overconfidence). While it doesn't solve hallucination, it makes systems significantly safer by aligning language with likely accuracy.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering within a dialogue context

Inputs: Natural language question/dialogue history

Outputs: Natural language response with appropriate linguistic confidence markers

Pipeline Flow

Vanilla Generation (Generate initial answer)
Calibrator (Predict P(correct))
Token Selection (Map P(correct) to <DK>/<LO>/<HI>)
Controlled Refinement (Regenerate answer with control token)
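
The four steps above can be sketched as a small orchestration function. This is a minimal illustration, not the authors' code: `select_control_token`, `calibrated_reply`, and the three callables are hypothetical names, and the comparison operators around the 0.0 and 0.375 thresholds (taken from the Limitations section) are an assumption.

```python
from typing import Callable

# Thresholds reported in the Limitations section (0.0 / 0.375).
# How they are compared against P(correct) is an assumption of this sketch.
DK_THRESHOLD = 0.0
LO_THRESHOLD = 0.375


def select_control_token(p_correct: float) -> str:
    """Map the calibrator's P(correct) to a confidence control token."""
    if p_correct < DK_THRESHOLD:
        return "<DK>"  # unreachable for a probability when the threshold is 0.0
    if p_correct < LO_THRESHOLD:
        return "<LO>"
    return "<HI>"


def calibrated_reply(
    question: str,
    generate: Callable[[str], str],
    calibrate: Callable[[str, str], float],
    regenerate: Callable[[str, str, str], str],
) -> str:
    """Run generate -> calibrate -> select token -> regenerate."""
    r_init = generate(question)                  # 1. vanilla generation
    p_correct = calibrate(question, r_init)      # 2. calibrator predicts P(correct)
    token = select_control_token(p_correct)      # 3. token selection
    return regenerate(question, r_init, token)   # 4. controlled refinement
```

With a DK threshold of 0.0 and a strict comparison, the `<DK>` branch can never fire for a valid probability, which matches the behavior noted under Limitations.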

System Modules

Vanilla Generator

Generate a candidate response to the user's input

Model or implementation: BlenderBot 2.7B (BST)

Calibrator

Predict the probability that the initial response r_init is factually correct

Model or implementation: MLP classifier on top of pooled encoder/decoder states
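
As a rough illustration of an MLP over pooled states, the sketch below mean-pools encoder and decoder hidden states, concatenates them, and applies one hidden layer followed by a sigmoid. The pooling method, the single hidden layer, the ReLU activation, and all function names are assumptions; the summary does not specify them.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(states: List[Vector]) -> Vector:
    """Average per-token hidden-state vectors into a single vector."""
    dim = len(states[0])
    return [sum(vec[i] for vec in states) / len(states) for i in range(dim)]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def calibrator_forward(enc_states: List[Vector], dec_states: List[Vector],
                       w1: List[Vector], b1: Vector,
                       w2: Vector, b2: float) -> float:
    """Return P(correct) from pooled encoder and decoder states."""
    features = mean_pool(enc_states) + mean_pool(dec_states)  # concatenation
    hidden = [max(0.0, h) for h in linear(features, w1, b1)]  # ReLU (assumed)
    logit = linear(hidden, [w2], [b2])[0]
    return 1.0 / (1.0 + math.exp(-logit))                     # sigmoid
```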

Controlled Generator

Regenerate the response to match the target linguistic confidence level while preserving content

Model or implementation: BlenderBot 2.7B (fine-tuned)

Novel Architectural Elements

Two-stage pipeline: Assessing correctness of a generated answer *first*, then rewriting it with a specific linguistic confidence token
Simultaneous control of confidence (<HI>, <LO>) and content consistency (<SAME>) to prevent the model from changing its answer when changing its tone

Modeling

Base Model: BlenderBot 2.7B (BST)

Training Method: Supervised Fine-Tuning (Control Codes)

Objective Functions:

Purpose: Train calibrator to detect correctness.

Formally: Binary Cross Entropy on correctness labels.

Purpose: Train controlled generator to respect style tokens.

Formally: Standard Cross Entropy Loss with prepended control tokens.
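
For reference, both objectives can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the training code used in the paper; the function names and the clipping constant `eps` are illustrative choices.

```python
import math
from typing import Sequence


def binary_cross_entropy(p_pred: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean BCE between predicted P(correct) and 0/1 correctness labels."""
    total = 0.0
    for p, y in zip(p_pred, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def sequence_cross_entropy(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    Each entry is the probability the generator assigned to the correct
    next token, given the control-token-prefixed context.
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)
```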

Trainable Parameters: Full fine-tuning of the 2.7B parameter model

Training Data:

Calibrator: 50,000 TriviaQA examples (vanilla model responses + match-based correctness labels)
Controlled Generator Stage 1: 25,000 TriviaQA examples + BERT-based confidence labels
Controlled Generator Stage 2: 25,000 TriviaQA examples + <SAME>/<DIFF> tokens to enforce content preservation
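
To make the two fine-tuning stages concrete, the sketch below builds a control-conditioned input string: a confidence token alone for Stage 1, and a confidence token plus a content token for Stage 2. The helper name `build_controlled_input`, the space separator, and the token order are assumptions for illustration; only the token names come from the summary.

```python
from typing import Optional

CONFIDENCE_TOKENS = {"<HI>", "<LO>", "<DK>"}
CONTENT_TOKENS = {"<SAME>", "<DIFF>"}


def build_controlled_input(context: str, confidence: str,
                           content: Optional[str] = None) -> str:
    """Prepend control tokens to the dialogue context.

    Stage 1 uses only a confidence token; Stage 2 adds <SAME>/<DIFF>
    to signal whether the answer content should be preserved.
    """
    if confidence not in CONFIDENCE_TOKENS:
        raise ValueError(f"unknown confidence token: {confidence}")
    prefix = [confidence]
    if content is not None:
        if content not in CONTENT_TOKENS:
            raise ValueError(f"unknown content token: {content}")
        prefix.append(content)
    return " ".join(prefix + [context])


# Stage 1 style input, then Stage 2 style input
stage1 = build_controlled_input("Who wrote Hamlet?", "<LO>")
stage2 = build_controlled_input("Who wrote Hamlet?", "<HI>", "<SAME>")
```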

Key Hyperparameters:

batch_size: 128
epochs: 4
learning_rate: 7e-6
dropout: 0.2

Compute: Not reported in the paper

Comparison to Prior Work

vs. Kamath et al.: Focuses on *linguistic* expression of doubt in dialogue (chat) rather than just abstaining in QA
vs. Jiang et al.: Proposes a remediation pipeline (calibrator + controlled generation) rather than just analysis
vs. RAG approaches [not cited in paper]: Does not use retrieval/external knowledge; focuses on calibrating closed-book knowledge
vs. Logit-based calibration (Temperature scaling): Controls the *generated text* surface form, not just the probability distribution of tokens

Limitations

With the chosen calibrator thresholds (0.0 and 0.375), the pipeline never selects <DK> ('I don't know'); it only ever emits <LO> or <HI>
Overall accuracy remains low (~5%); the method aligns confidence but does not improve factual knowledge
Evaluation is limited to TriviaQA (factoid questions) within a dialogue setup; calibration in fully open-ended conversation is not evaluated
Requires a multi-step pipeline (generate -> calibrate -> regenerate), which increases inference latency

Reproducibility

Code: https://parl.ai/projects/metacognition/

Data are released at https://parl.ai/projects/metacognition/. The code is part of the ParlAI framework. Annotations for TriviaQA responses are provided. GPU hours are not reported.

📊 Experiments & Results

Evaluation Setup

Closed-book QA on TriviaQA using dialogue models

Benchmarks:

TriviaQA (Closed-book Factoid QA)

Metrics:

Linguistic Accuracy (Correctness of answers marked <HI>)
Overall Accuracy
Expected Calibration Error (ECE)
Human-annotated linguistic confidence vs. correctness

Statistical methodology: Paired permutation test for significance

Key Results

Calibration performance metrics comparing the vanilla model to the proposed calibrator-controlled pipeline.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA (Test)	Correctness of confident (<HI>) answers (%)	13.7	38.9	+25.2
TriviaQA (Test)	Overall Accuracy (%)	4.8	5.1	+0.3
TriviaQA (Test)	Answers generated confidently (<HI>) (%)	29.45	1.8	-27.65
TriviaQA (Test)	Expected Calibration Error (ECE)	Not reported in the paper	0.018	Not reported in the paper

Main Takeaways

State-of-the-art chatbots (BlenderBot) are severely overconfident, with only 13.7% of their 'confident' answers being factually correct
Internal representations (hidden states) of the chatbot contain enough signal to predict correctness (ECE 0.018) even when the generation itself is hallucinated
Controlling linguistic style requires conditioning on the *original* answer (<SAME> token); otherwise, changing confidence often changes the factual content
The proposed pipeline effectively reduces 'off-topic' answers (from 2.4% to 0.2%) by forcing the model to address the question with uncertainty rather than dodging it

📚 Prerequisite Knowledge

Prerequisites

Understanding of Seq2Seq Transformer models
Familiarity with calibration (probabilistic vs. linguistic)
Basics of controlled text generation

Key Terms

linguistic calibration: The alignment between a model's verbalized expression of doubt/confidence (e.g., 'I'm sure') and the actual likelihood of the answer being correct

probabilistic calibration: The alignment between a model's numerical confidence scores (logits/probabilities) and empirical correctness

metacognition: The model's ability to assess its own knowledge state (knowing what it doesn't know)

BST 2.7B: BlenderBot 2.7B—a state-of-the-art open-domain dialogue model fine-tuned on Blended Skill Talk tasks

closed-book QA: Answering questions without access to external documents or search, relying solely on parameters

control tokens: Special tokens prepended to the input to guide generation style (e.g., <HI> for high confidence)

ECE: Expected Calibration Error—a metric measuring the weighted average difference between predicted confidence and actual accuracy across bins
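
A minimal sketch of how ECE is commonly computed, assuming equal-width confidence bins. The number of bins and the binning scheme used in the paper are not stated in this summary, so `n_bins=10` is an illustrative default, and the function name is hypothetical.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted average |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Two predictions at 0.9 (both right) and two at 0.1 (both wrong):
# each bin is off by 0.1, so the weighted gap is 0.1.
example = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

A perfectly calibrated predictor has an ECE of 0; lower values mean predicted confidence tracks observed accuracy more closely.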