CATTO: Balancing Preferences and Confidence in Language Models

📝 Paper Summary

LLM Alignment Confidence Calibration

CATTO integrates a differentiable, token-level calibration loss directly into preference optimization (like DPO) to prevent the confidence drift that typically occurs during alignment.

Core Problem

Preference alignment methods like DPO optimize relative likelihoods but leave absolute probability scales unconstrained, causing models to become severely miscalibrated (overconfident in wrong answers, underconfident in right ones).

Why it matters:

LLMs are increasingly deployed in decision-making settings (medical, legal) where reliable confidence estimates are critical for safety and trust.
Post-hoc calibration methods (like temperature scaling) do not persist after further training, and existing training-time methods do not survive the logit drift caused by preference optimization.
Current alignment techniques break the link between a model's predictive probability and its actual correctness frequency.

Concrete Example: After DPO alignment, a model might predict an incorrect answer with 99% confidence because the optimization pushed logits to extreme values to satisfy preference pairs, whereas a well-calibrated model would assign it a low probability reflecting its true uncertainty.
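The claim that DPO constrains only relative likelihoods can be checked numerically. The sketch below is an illustration, not the paper's experiment: it evaluates the standard per-pair DPO loss on hypothetical sequence log-probabilities, then shifts both policy log-probabilities by the same constant. The loss does not change, so nothing in the objective pins down the absolute probability scale. All numeric values are made up for the example.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_pos, ref_neg, beta=0.1):
    """Per-pair DPO loss computed from sequence log-probabilities."""
    margin = beta * ((logp_pos - ref_pos) - (logp_neg - ref_neg))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities for one preference pair (illustrative values).
base = dpo_loss(logp_pos=-12.0, logp_neg=-15.0, ref_pos=-13.0, ref_neg=-14.0)

# Raise both policy log-probabilities by the same amount. The model is now far
# more confident in absolute terms, yet the DPO loss is identical.
shifted = dpo_loss(logp_pos=-7.0, logp_neg=-10.0, ref_pos=-13.0, ref_neg=-14.0)

print(base, shifted)
```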

Key Novelty

Calibration Aware Token-level Training Objective (CATTO)

Introduces a differentiable surrogate for Expected Calibration Error (ECE) that operates per-token, allowing calibration to be optimized via gradient descent.
Combines this calibration loss linearly with the standard DPO objective, constraining absolute confidence levels while simultaneously optimizing for human preferences.
Uses a margin-based correctness surrogate (difference between ground truth and best incorrect token) to provide a smooth training signal for calibration.
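A minimal sketch of how a margin-based correctness surrogate could feed a per-token calibration loss. The summary does not give the exact form of the surrogate, so several pieces here are assumptions: confidence is taken as the maximum softmax probability, the margin is squashed through a sigmoid with a hypothetical sharpness parameter `tau`, and the function name `token_calibration_loss` is invented for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_calibration_loss(logits, target, tau=0.1):
    """Return (loss, confidence, soft_correctness) for a single token position."""
    probs = softmax(logits)
    confidence = max(probs)  # assumption: confidence = top-1 probability
    best_wrong = max(p for i, p in enumerate(probs) if i != target)
    margin = probs[target] - best_wrong  # ground truth vs. best incorrect token
    # Assumption: a sigmoid turns the margin into a smooth correctness signal in (0, 1).
    soft_correct = 1.0 / (1.0 + math.exp(-margin / tau))
    return abs(confidence - soft_correct), confidence, soft_correct

# Same confident prediction, scored against a matching and a non-matching target.
loss_right, _, _ = token_calibration_loss([4.0, 0.0, 0.0], target=0)
loss_wrong, _, _ = token_calibration_loss([4.0, 0.0, 0.0], target=1)
print(loss_right, loss_wrong)
```

Under these assumptions, a confident prediction that matches the target incurs a small penalty, while the same confidence placed on a wrong token incurs a large one.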

Architecture

Conceptual illustration of miscalibration in DPO vs. CATTO.

Evaluation Highlights

Reduces Expected Calibration Error (ECE) by 2.22-7.61 percentage points compared to standard DPO on in-distribution benchmarks.
Outperforms the strongest calibration baseline (RCFT) by 0.22-1.24 percentage points in ECE while being 18-39x cheaper to train.
Maintains or improves downstream accuracy (+3.16 percentage points on average) across five datasets, unlike other calibration methods that often trade off accuracy.

Breakthrough Assessment

8/10

Offers a principled, theoretically grounded solution to the known problem of miscalibration in RLHF/DPO. The method is efficient (no extra parameters) and effective, addressing a critical safety/reliability gap.

⚙️ Technical Details

Problem Definition

Setting: Aligning Language Models to preferences while maintaining probabilistic calibration.

Inputs: Input prompt x and candidate responses (preferred y+, dispreferred y-)

Outputs: Aligned model policy π_θ that produces both high-quality responses and calibrated confidence scores.

Pipeline Flow

Input Processing (Tokenization)
LLM Forward Pass (Logits)
Loss Calculation (DPO + CATTO)
Optimization (Gradient Update)

System Modules

LLM Backbone

Predicts next-token logits given context.

Model or implementation: Llama-3-8B-Instruct (and other variants)

CATTO Loss Module (Optimization)

Calculates differentiable calibration error.

Model or implementation: Mathematical Function (Eq. 6)

DPO Loss Module (Optimization)

Calculates preference alignment loss.

Model or implementation: Standard DPO Formulation

Novel Architectural Elements

Integration of a per-token differentiable calibration objective directly into the preference optimization loop.

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: DPO + CATTO (Joint Optimization)

Objective Functions:

Purpose: Align model to preferences.

Formally: L_DPO = -E[log σ(β * log(π_θ(y+|x)/π_ref(y+|x)) - β * log(π_θ(y-|x)/π_ref(y-|x)))]

Purpose: Calibrate confidence to correctness.

Formally: L_Cal = |c_θ(x_t) - z_tilde(x_t)|, where c_θ is the model's confidence at token x_t and z_tilde is the differentiable correctness surrogate.

Purpose: Joint Training.

Formally: L_Total = L_DPO + λ * (L_Cal(y+) + L_Cal(y-))
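The joint objective can be sketched for a single preference pair as follows. This is a hedged illustration rather than the released implementation: the per-token calibration losses are passed in as plain lists, and averaging them over tokens is an assumption, since the summary does not state how token-level terms are aggregated. The function names `dpo_term` and `catto_total` are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_term(logp_pos, logp_neg, ref_pos, ref_neg, beta=0.1):
    """-log sigmoid(beta * (log-ratio of y+ minus log-ratio of y-))."""
    margin = beta * ((logp_pos - ref_pos) - (logp_neg - ref_neg))
    return softplus(-margin)

def catto_total(logp_pos, logp_neg, ref_pos, ref_neg,
                cal_pos, cal_neg, beta=0.1, lam=0.5):
    """L_Total = L_DPO + lambda * (L_Cal(y+) + L_Cal(y-))."""
    # Assumption: per-token calibration losses are averaged within each response.
    mean = lambda xs: sum(xs) / len(xs)
    cal = mean(cal_pos) + mean(cal_neg)
    return dpo_term(logp_pos, logp_neg, ref_pos, ref_neg, beta) + lam * cal

# Illustrative values only: sequence log-probs and per-token calibration losses.
dpo_only = dpo_term(-12.0, -15.0, -13.0, -14.0)
total = catto_total(-12.0, -15.0, -13.0, -14.0,
                    cal_pos=[0.2, 0.4], cal_neg=[0.1])
print(dpo_only, total)
```

The defaults `beta=0.1` and `lam=0.5` mirror the hyperparameters listed in this summary.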

Adaptation: Not stated explicitly; full fine-tuning is implied, as is typical for DPO papers, though LoRA may be used depending on the exact experimental setup.

Trainable Parameters: All model parameters

Training Data:

Preference pairs (x, y+, y-) from standard datasets (e.g., UltraFeedback, HelpSteer)

Key Hyperparameters:

beta: 0.1 (DPO)
lambda: 0.5 (Calibration weight)
learning_rate: 5e-7
batch_size: 64
warmup_steps: 150

Compute: Comparable to standard DPO training, with negligible overhead; 18-39x cheaper than RCFT.

Comparison to Prior Work

vs. RCFT: CATTO is single-stage (joint training) and token-level, whereas RCFT is two-stage and bin-level.
vs. Label Smoothing: CATTO targets calibration dynamically based on correctness probability, whereas Label Smoothing applies a static penalty.
vs. PPO-cal [not cited in paper]: PPO-cal incorporates calibration into PPO; CATTO does so for DPO, avoiding the complexity of PPO.

Limitations

Depends on the quality of the correctness surrogate; approximations may not perfectly reflect true correctness.
Currently evaluated primarily on multiple-choice and short-form QA; applicability to open-ended generation is less direct.
Requires balancing the lambda hyperparameter between preference satisfaction and calibration.

Reproducibility

Code: https://github.com/copenlu/catto

Hyperparameters are detailed in the paper appendices.

📊 Experiments & Results

Evaluation Setup

Evaluation on diverse benchmarks for calibration (ECE) and utility (Accuracy).

Benchmarks:

MMLU (Knowledge QA)
TruthfulQA (Truthfulness)
ARC-Challenge (Reasoning)
HellaSwag (Commonsense Reasoning)
GSM8K (Math Reasoning)

Metrics:

Expected Calibration Error (ECE)
Accuracy
Brier Score
Statistical methodology: Not explicitly reported in the paper
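For reference, the two calibration metrics listed above can be computed as in the sketch below. It uses the common equal-width binning form of ECE and the binary form of the Brier score; the paper's exact binning scheme and number of bins are not given in this summary, so `n_bins=10` is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

# Illustrative data: four predictions at 0.9 confidence, three of them correct.
conf = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 1, 0]
print(expected_calibration_error(conf, hits), brier_score(conf, hits))
```

Lower is better for both metrics; a perfectly calibrated model that is 90% confident would be right 90% of the time and have zero ECE in that bin.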

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
In-distribution calibration results showing CATTO reduces ECE compared to baselines.
Average across 5 datasets	ECE (%)	12.45 (DPO)	4.84	-7.61
Average across 5 datasets	ECE (%)	6.08 (RCFT)	4.84	-1.24
Accuracy results showing CATTO maintains or improves performance.
Average across 5 datasets	Accuracy (%)	58.12	61.28	+3.16
Out-of-distribution generalization results.
Average OOD datasets	ECE (%)	15.67	5.23	-10.44

Main Takeaways

CATTO significantly reduces calibration error (ECE) both in-distribution and out-of-distribution compared to standard DPO.
Unlike many calibration techniques that degrade accuracy, CATTO maintains or slightly improves downstream task accuracy.
The method is computationally efficient, avoiding the costly multi-stage training of methods like RCFT.
Confidence@k scaling demonstrates that better calibration directly translates to better test-time decision making (reranking).

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Confidence Calibration (ECE)
Logits and Softmax Probabilities
Gradient Descent Optimization

Key Terms

DPO: Direct Preference Optimization, an algorithm for aligning language models to human preferences by solving for the optimal policy in closed form, avoiding a separate reward model.

ECE: Expected Calibration Error, a metric that groups predictions into confidence bins and measures the weighted average absolute gap between each bin's mean confidence and its actual accuracy.

Confidence Drift: The phenomenon where a model's probability estimates shift away from true correctness probabilities during training (e.g., becoming overconfident).

Logits: The raw, unnormalized scores output by the final layer of a neural network before applying softmax.

Temperature Scaling: A post-hoc calibration technique that divides logits by a scalar value T to adjust the entropy of the output distribution.

RCFT: Regularized Calibration-Aware Finetuning—a baseline method that applies calibration as a separate supervised fine-tuning phase after alignment.

Probability Margin: The difference in predicted probability between the correct token and the highest-probability incorrect token.

Bayes-optimal: A decision rule that minimizes the expected loss (or maximizes expected utility) given the true posterior probability distribution.
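The Temperature Scaling entry above can be made concrete with a short sketch. Dividing the logits by T > 1 flattens the distribution and lowers the top probability without changing which token ranks first. The logit values and the choice T = 2 are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature T."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)

# The argmax is preserved, but the top probability shrinks as T grows.
print(max(sharp), max(soft))
```

Because T is a single scalar fitted after training, it has to be refit whenever the logits shift, which is the persistence issue raised in the Core Problem section.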