Fine-Tuning Language Models to Know What They Know

📝 Paper Summary

Metacognition, Hallucination suppression

ESMA uses Evolution Strategies to align an LLM's explicit knowledge claims with its actual ability to answer correctly, optimizing a joint reward for accuracy and faithful self-assessment.

Core Problem

LLMs lack a reliable dependency between their internal knowledge and explicit reports; they may parrot memorized answers without knowing them, or claim knowledge/ignorance inconsistently with their actual performance.

Why it matters:

Humans rely on a shared internal memory for both answering and reporting knowledge state, but this link is underexplored in LLMs
Existing research focuses on utility (controlling refusal/hallucination) rather than the fundamental dependency between intrinsic knowledge and how that knowledge is deployed
Current alignment methods (SFT/RLHF) may encourage superficial pattern matching rather than genuine epistemic reporting

Concrete Example: When asked 'Do you know X?', an LLM might answer 'Yes' and provide a wrong answer, or answer 'No' but actually be capable of answering correctly, showing a disconnect between its meta-assessment and its capabilities.

Key Novelty

Evolution Strategy for Metacognitive Alignment (ESMA)

Utilizes a dual-prompt framework (Direct Question + Meta Question) to measure metacognitive sensitivity (d' type2)
Applies Evolution Strategies (ES) instead of backpropagation to optimize a non-differentiable joint reward that requires coherence between two independent inference passes (answering and self-evaluating)
Reinforces the binding between internal knowledge and output behavior, generalizing to new languages and prompt formats without explicit training on them

Architecture

The ESMA optimization process using Evolution Strategies.

Evaluation Highlights

Qwen2.5 3B with ESMA achieves d' type2 of 1.02, surpassing even closed-source models like GPT-5.2 and Claude Sonnet 4.5
Increases metacognitive alignment (d' type2) from 0.20 to 0.63 using only the top 10% of high-magnitude parameter updates
Achieves AUC ~0.75 in continuous confidence analysis after fine-tuning, reaching the high metacognition regime across all model scales
Zero-shot generalization to 'I don't know' prompting yields significant alignment gains (e.g., 59.88% to 78.07% for the 3B model)

Breakthrough Assessment

8/10

Significant methodology for fundamental epistemic alignment. Demonstrates that ES can optimize holistic behavioral consistency where gradient methods fail, with strong generalization across unobserved settings.

⚙️ Technical Details

Problem Definition

Setting: Metacognitive alignment where a model must consistently answer a query and accurately report whether it knows the answer

Inputs: A fact-based question q

Outputs: A direct answer a and a meta-assessment m (Yes/No)

Pipeline Flow

Perturbation Generation (Add Gaussian noise to weights)
Inference (Run Direct Question and Meta Question)
Reward Calculation (Compare coherence of Direct and Meta outputs)
Parameter Update (Weighted average of perturbations based on reward)
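The four pipeline steps above can be sketched as a generic Evolution Strategies update. This is a minimal illustration on a flat NumPy parameter vector, not the paper's implementation: the function name `es_step`, the reward standardization, and the hyperparameters `pop_size`, `sigma`, and `lr` are assumptions chosen for the example, and `reward_fn` stands in for the Inference and Reward Calculation steps.

```python
import numpy as np

def es_step(theta, reward_fn, rng, pop_size=8, sigma=0.1, lr=0.05):
    """One generic ES update that needs only reward evaluations (no gradients)."""
    # Step 1: perturbation generation, one Gaussian noise vector per candidate.
    noise = rng.standard_normal((pop_size, theta.size))
    # Steps 2-3: run each perturbed candidate and score it.
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Standardize rewards so the step size does not depend on the reward scale
    # (an assumption of this sketch, common in ES implementations).
    spread = rewards.std()
    if spread == 0.0:
        return theta.copy()  # all candidates scored the same: no update signal
    weights = (rewards - rewards.mean()) / spread
    # Step 4: parameter update, a reward-weighted average of the perturbations.
    return theta + lr / (pop_size * sigma) * (weights @ noise)
```

Because `reward_fn` is treated as a black box, it may itself consist of several independent inference passes, which is the property the method relies on.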

System Modules

Candidate Generator

Create population of models with perturbed weights

Model or implementation: Base LLM (e.g., Qwen2.5, Llama3.2)

Evaluator (Direct)

Assess factual correctness of the answer

Model or implementation: Perturbed LLM

Evaluator (Meta)

Assess model's self-reported knowledge

Model or implementation: Perturbed LLM

Parameter Updater

Update model weights using ES rule

Model or implementation: N/A (Optimization Algorithm)

Novel Architectural Elements

Use of Evolution Strategies to optimize a reward function defined across two independent inference passes (Direct and Meta), a setup that breaks the gradient flow required for backpropagation

Modeling

Base Model: Qwen2.5 (0.5B, 1.5B, 3B, 7B), Llama3.2 (1B, 3B), Gemma 3 Instruct (1B, 4B)

Training Method: Evolution Strategies (ES)

Objective Functions:

Purpose: Maximize joint probability of correct answer and accurate self-knowledge reporting.

Formally: R = C + A if C=1 else A (where C=correctness, A=alignment)
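As a worked reading of the reward above, the sketch below assumes that C and A are both binary, with A = 1 when the Yes/No meta-assessment matches the correctness of the direct answer (following the Raw Alignment definition in this summary). The function name `joint_reward` is hypothetical. Under that binary assumption the conditional form gives the same values as C + A.

```python
def joint_reward(answer_correct: bool, meta_says_yes: bool) -> int:
    """Joint reward R for one question, assuming binary C and A."""
    c = int(answer_correct)                    # C: direct answer is correct
    a = int(meta_says_yes == answer_correct)   # A: meta-answer matches correctness
    return c + a if c == 1 else a

# The four possible outcomes under these assumptions:
#   correct + "Yes" -> 2    correct + "No" -> 1
#   wrong   + "Yes" -> 0    wrong   + "No" -> 1
```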

Adaptation: Full-parameter fine-tuning via ES

Trainable Parameters: All parameters (though analysis shows sparse updates are effective)

Training Data:

TriviaQA (fact-based questions with short keyword answers)

Compute: Requires only forward passes (no gradient computation), enabling parallel evaluation of population candidates

Comparison to Prior Work

vs. RLHF/Backpropagation: ESMA optimizes across discontinuous contexts (Direct vs. Meta questions) where gradients cannot flow
vs. Utility-focused Metacognition (e.g., hallucination reduction): ESMA focuses on the intrinsic dependency ('knowing what you know') rather than just suppressing wrong answers
vs. Calibration methods (e.g., Kadavath et al., 2022): ESMA optimizes the model parameters to improve the separability of internal representations, rather than only calibrating the output probabilities

Limitations

ES is computationally expensive due to the need for a population of candidates
Evaluation primarily focused on short-answer factoid QA (TriviaQA)
Does not explicitly address long-form generation or reasoning tasks

Reproducibility

Prompt templates and hyperparameters are referenced in Appendix D (not included in snippet). Code URL not explicitly provided in the text. Evaluation relies on TriviaQA and FictionalQA.

📊 Experiments & Results

Evaluation Setup

Dual-prompt evaluation (Direct Question for answer, Meta Question for knowledge check) on TriviaQA
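A minimal sketch of the dual-prompt evaluation for a single question follows. The two templates are hypothetical placeholders (the paper's actual prompts are in its Appendix D, which is not reproduced here), `ask` is an assumed callable that sends one prompt to the model in a fresh context and returns its text, and the case-insensitive substring match is an assumed stand-in for the paper's answer scoring.

```python
from typing import Callable, Dict

# Hypothetical templates, for illustration only.
DIRECT_TEMPLATE = "Answer briefly. Question: {q}"
META_TEMPLATE = "Do you know the answer to this question? Reply Yes or No. Question: {q}"

def dual_prompt_eval(ask: Callable[[str], str], question: str, gold: str) -> Dict[str, bool]:
    """Query the model twice in independent contexts and compare the results."""
    direct = ask(DIRECT_TEMPLATE.format(q=question))   # Direct Question pass
    meta = ask(META_TEMPLATE.format(q=question))       # Meta Question pass
    correct = gold.lower() in direct.lower()           # keyword match on a short answer
    says_yes = meta.strip().lower().startswith("yes")
    return {"correct": correct, "says_yes": says_yes, "aligned": correct == says_yes}
```

Keeping the two calls separate means the meta-answer cannot simply read off the direct answer from the same context.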

Benchmarks:

TriviaQA (Fact-based QA)
FictionalQA (QA on non-existent events; June 2025 release)

Metrics:

d' type2 (metacognitive sensitivity)
Raw Alignment
Accuracy
Yes Ratio
Yes Failure Ratio (YFR)
No Failure Ratio (NFR)
Type 2 AUCROC

Statistical methodology: Not explicitly reported in the paper
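The discrete metrics listed above can be computed from two parallel lists of booleans, as in the sketch below. Several choices here are assumptions rather than the paper's stated definitions: d' type2 is taken as z(hit rate) minus z(false-alarm rate) with "Yes on a correct answer" as a hit, a log-linear correction keeps the rates away from 0 and 1, YFR is normalized by the number of "Yes" responses, and NFR by the number of "No" responses.

```python
from statistics import NormalDist

def metacognition_metrics(correct, says_yes):
    """Compute alignment metrics from per-question booleans."""
    n = len(correct)
    if n == 0 or n != len(says_yes):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(c and y for c, y in zip(correct, says_yes))                # Yes and correct
    false_alarms = sum((not c) and y for c, y in zip(correct, says_yes))  # Yes and wrong
    n_correct = sum(correct)
    n_wrong = n - n_correct
    n_yes = hits + false_alarms
    n_no = n - n_yes
    # Log-linear correction (assumed) so z-scores stay finite at rates of 0 or 1.
    hit_rate = (hits + 0.5) / (n_correct + 1)
    fa_rate = (false_alarms + 0.5) / (n_wrong + 1)
    z = NormalDist().inv_cdf
    return {
        "d_prime_type2": z(hit_rate) - z(fa_rate),
        "raw_alignment": sum(c == y for c, y in zip(correct, says_yes)) / n,
        "accuracy": n_correct / n,
        "yes_ratio": n_yes / n,
        "yfr": false_alarms / n_yes if n_yes else 0.0,     # wrong among "Yes"
        "nfr": (n_correct - hits) / n_no if n_no else 0.0,  # correct among "No"
    }
```

A model that always says "Yes" can score high raw alignment when its accuracy is high, yet its d' type2 stays near zero, which is why the two metrics are reported separately.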

Key Results

ESMA significantly improves metacognitive sensitivity (d') across all open-source models, often surpassing closed-source baselines.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	d' type2	0.20	1.02	+0.82
TriviaQA	d' type2	0.85	1.02	+0.17
TriviaQA	d' type2	0.15	0.76	+0.61
TriviaQA (IDK Prompt)	IDK Alignment	59.88	78.07	+18.19
FictionalQA	d' type2	Not reported in the paper	0.65	Not reported in the paper

Experiment Figures

ROC curves for Type 2 performance (metacognitive sensitivity) using continuous confidence scores before and after ESMA.

Density plots of confidence scores for correct vs. incorrect answers.

Analysis of parameter patching ratio vs. performance improvement.

Main Takeaways

Closed-source models have high raw alignment driven by high accuracy, but lower d' type2, indicating that their meta-answers are less effective at discriminating correct from incorrect answers.
ESMA creates a fundamental shift in internal representations, polarizing confidence distributions for correct vs. incorrect answers (separating them into clusters near 0.0 and 1.0 probability).
Improvements generalize to untrained settings: 'I don't know' prompts, cross-lingual tasks (Appendix), and newly learned fictional knowledge.
Parameter efficiency analysis reveals that the top 10% of high-magnitude updates drive ~80% of the metacognitive improvement.
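The parameter patching idea in the last takeaway can be illustrated with a small NumPy sketch: keep only the largest-magnitude weight changes and revert the rest to the base model. The function name `patch_top_updates` and the global (rather than per-layer) threshold are assumptions of this example; the summary does not say how the paper selects the top 10%.

```python
import numpy as np

def patch_top_updates(base, tuned, fraction=0.10):
    """Apply only the top `fraction` of updates, ranked by absolute magnitude."""
    delta = tuned - base
    k = max(1, int(round(fraction * delta.size)))
    # Magnitude of the k-th largest update serves as the cut-off.
    threshold = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= threshold   # ties at the cut-off are all kept
    return base + np.where(mask, delta, 0.0)
```

Evaluating the patched weights at several values of `fraction` would trace out a patching-ratio versus performance curve like the one described in the figures.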

📚 Prerequisite Knowledge

Prerequisites

Signal Detection Theory (d-prime metric)
Evolution Strategies (ES) for optimization
LLM fine-tuning concepts

Key Terms

d' type2: A metric from signal detection theory measuring the distance between internal confidence distributions for correct vs. incorrect judgments; quantifies metacognitive sensitivity independent of response bias

Evolution Strategies (ES): A gradient-free optimization technique that refines parameters by evaluating a population of perturbed candidates, allowing optimization of non-differentiable, holistic reward functions

Dual-prompt method: An evaluation framework using Direct Questions (factual answer) and Meta Questions (self-evaluation) in independent contexts to measure knowledge-behavior consistency

ESMA: Evolution Strategy for Metacognitive Alignment—the proposed method to bind internal knowledge to explicit behavior

Type 2 AUCROC: Area Under the Receiver Operating Characteristic Curve for Type 2 tasks (confidence judgments), measuring the ability to distinguish correct from incorrect responses based on confidence

Raw Alignment: Simple accuracy of the meta-response matching the correctness of the direct answer (susceptible to bias)

YFR: Yes Failure Ratio—proportion of times the model claims to know the answer but gets it wrong

NFR: No Failure Ratio—proportion of times the model claims not to know but could have answered correctly
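For the continuous-confidence case, the Type 2 AUCROC defined above can be computed without any plotting library through its rank interpretation: the probability that a randomly chosen correct answer receives higher confidence than a randomly chosen incorrect one, with ties counted as half. This is a generic sketch; the function name `type2_auroc` is hypothetical, and how the paper extracts confidence scores is not specified in this summary.

```python
def type2_auroc(confidence, correct):
    """Type 2 AUROC via pairwise comparison of correct vs. incorrect answers."""
    pos = [c for c, ok in zip(confidence, correct) if ok]       # confidence on correct answers
    neg = [c for c, ok in zip(confidence, correct) if not ok]   # confidence on incorrect answers
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    # Count pairs where the correct answer has higher confidence; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 means confidence carries no information about correctness, and 1.0 means every correct answer is rated above every incorrect one.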