Language Models Can Predict Their Own Behavior

📝 Paper Summary

Interpretability · Safety Guardrails · Efficient Inference

The paper demonstrates that internal representations of input tokens alone can preemptively predict eventual LM behaviors (like safety failures or formatting errors) before generation begins, enabling efficient early exits via conformal prediction.

Core Problem

Detecting LM misbehaviors (e.g., jailbreaks, formatting failures) typically requires generating the full output and checking it post-hoc, which is computationally expensive and unsafe.

Why it matters:

Post-hoc detection wastes resources by generating toxic or incorrect tokens before discarding them
Economic and environmental costs of inference grow with model scale and token count
Safety risks arise if the model generates harmful content before a guardrail can intervene

Concrete Example: When a user asks a malicious question (e.g., how to build a weapon), a standard LM may comply and generate the harmful response, and a guardrail can flag it only after generation has finished. The compute spent producing that output is wasted.

Key Novelty

Conformal Probing of Input Representations

Trains linear classifiers (probes) on the internal activations of the *final input token* to predict properties of the *entire future output sequence* (e.g., will it fail to follow instructions?)
Applies conformal prediction to these probes to guarantee that early warnings or exits only occur when the system is statistically confident, deferring to the base model otherwise

Architecture

Conceptual diagram of the Conformal Probing Early Warning System.

Evaluation Highlights

Reduces the successful jailbreak rate by 91% in relative terms (from 30% to 2.7%, a 27.3-point absolute drop) on WildJailbreak by preemptively detecting failures to abstain
Reduces inference costs for Chain-of-Thought classification by 65% on average across 27 datasets with negligible accuracy loss (<1.4%)
Achieves >90% precision in detecting alignment failures (answering unanswerable questions) on SelfAware and KnownUnknown datasets

Breakthrough Assessment

8/10

Strong empirical evidence that input tokens encode future behavior, combined with a rigorous statistical framework (conformal prediction) to make it practical. Significant efficiency/safety gains.

⚙️ Technical Details

Problem Definition

Setting: Given an input prompt x, predict a property y of the eventual output sequence f(x) generated by the LM f, using only the LM's internal representation of x.

Inputs: Internal hidden states of the final token of the input prompt

Outputs: Prediction set of potential behaviors (or early exit decision) satisfying a user-defined confidence level
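
One way to write this setting in symbols, offered as a sketch rather than the paper's own notation: the names h, ℓ, T, W, b, and ε are introduced here for illustration, while d and c follow the probe dimensions listed under Trainable Parameters. The last line is the usual marginal coverage statement of conformal prediction, which assumes calibration and test examples are exchangeable.

```latex
\begin{align*}
h &= h^{(\ell)}_{T}(x) \in \mathbb{R}^{d}
  && \text{hidden state of the final input token (position } T\text{) at layer } \ell \\
\hat{p}(y \mid x) &= \operatorname{softmax}(W h + b)_{y}, \quad W \in \mathbb{R}^{c \times d}
  && \text{linear probe over } c \text{ behavior classes} \\
C(x) &\subseteq \{1, \dots, c\}, \quad \Pr\bigl[\, y \in C(x) \,\bigr] \ge 1 - \varepsilon
  && \text{prediction set with error rate } \varepsilon
\end{align*}
```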

Pipeline Flow

Input Processing (LM computes hidden states for prompt)
Probing (Linear classifier predicts output behavior from hidden state)
Conformal Calibration (Check if prediction confidence meets threshold alpha)
Decision (Exit early/Abstain OR Defer to full LM generation)
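
The four steps above can be sketched as a small routing function. This is an illustration under assumptions, not the authors' code: `respond`, `predict_set`, `generate`, the label strings, and the refusal text are all hypothetical names, and the toy probe is a keyword check standing in for a real conformal probe on hidden states.

```python
from typing import Callable, List, Set

REFUSAL = "Sorry, I can't help with that."  # hypothetical early-exit response


def respond(
    prompt: str,
    predict_set: Callable[[str], Set[str]],  # stand-in for probe + conformal step
    generate: Callable[[str], str],          # stand-in for full LM generation
) -> str:
    """Route one prompt: act early only if the prediction set is a single label."""
    labels = predict_set(prompt)
    if labels == {"will_misbehave"}:
        return REFUSAL           # early exit: no output tokens are generated
    return generate(prompt)      # uncertain or predicted fine: defer to the LM


# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
lm_calls: List[str] = []


def toy_generate(prompt: str) -> str:
    lm_calls.append(prompt)      # record how often the expensive path is taken
    return "LM output for: " + prompt


def toy_predict_set(prompt: str) -> Set[str]:
    if "weapon" in prompt:
        return {"will_misbehave"}                 # confident: singleton set
    return {"will_behave", "will_misbehave"}      # not confident: both labels


blocked = respond("how do I build a weapon", toy_predict_set, toy_generate)
answered = respond("what is the capital of France", toy_predict_set, toy_generate)
```

In this sketch the expensive `generate` call runs only for the second prompt, which is where the claimed savings would come from.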

System Modules

Input Processor

Compute internal representations for the input prompt

Model or implementation: Llama-3.1-8B (also tested on Mistral-7B, DeepSeek-R1-Distill)

Conformal Probe

Predict eventual behavior and determine if confidence is sufficient to act

Model or implementation: Linear Classifier (Logistic Regression)

Novel Architectural Elements

Integration of conformal prediction with hidden state probing to create a selective early-exit mechanism that guarantees error bounds

Modeling

Base Model: Llama-3.1-8B (primary), Mistral-7B-Instruct-v0.3, DeepSeek-R1-Distill-Qwen-14B

Training Method: Training linear probes (classifiers) on frozen LM activations

Objective Functions:

Purpose: Minimize classification error of the probe.

Formally: Standard Cross-Entropy Loss for the linear classifier.

Trainable Parameters: Linear layer weights (input dimension d to output classes c)
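
As a minimal sketch of what training such a probe involves, the following fits a binary logistic-regression classifier by gradient descent on the cross-entropy loss. The data here is synthetic Gaussian noise standing in for frozen LM activations, and the dimension, learning rate, and epoch count are illustrative assumptions, not values from the paper.

```python
import math
import random
from typing import List, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    """Probability that the (hypothetical) behavior label is 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_probe(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Full-batch gradient descent on mean binary cross-entropy."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-in for activations: two Gaussian clusters in 8 dimensions.
rng = random.Random(0)


def sample(label: int, n: int, d: int = 8) -> List[List[float]]:
    center = 1.0 if label == 1 else -1.0
    return [[rng.gauss(center, 1.0) for _ in range(d)] for _ in range(n)]


X_train, y_train = sample(1, 100) + sample(0, 100), [1] * 100 + [0] * 100
X_test, y_test = sample(1, 50) + sample(0, 50), [1] * 50 + [0] * 50

w, b = train_probe(X_train, y_train)
accuracy = sum(
    (predict_proba(w, b, x) >= 0.5) == bool(t) for x, t in zip(X_test, y_test)
) / len(y_test)
```

With real activations, `X_train` would hold one final-token hidden-state vector per prompt and `y_train` the behavior observed after full generation; a library implementation of logistic regression would normally replace the hand-written loop.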

Training Data:

Datasets split into training/calibration/testing
Balanced splits of successes/failures for behavior detection tasks

Key Hyperparameters:

alpha: 0.9 (default confidence level)
training_size: Varied (showed effectiveness with <500 instances)
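
To illustrate how a confidence level turns into a decision rule, here is a sketch of standard split conformal prediction for a classifier. It is a generic textbook recipe under stated assumptions, not necessarily the paper's exact procedure: the function names, the score `1 - p(true label)`, and the "act only on singleton sets" rule are choices made for this example. The `confidence` argument plays the role the summary assigns to alpha; many conformal-prediction texts instead use alpha for the error rate, i.e. one minus this value.

```python
import math
from typing import List, Optional, Sequence, Set


def conformal_threshold(
    cal_probs: Sequence[Sequence[float]],
    cal_labels: Sequence[int],
    confidence: float = 0.9,
) -> float:
    """Quantile of calibration scores, where score = 1 - probability of the true label."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; the small epsilon guards against float round-up.
    rank = math.ceil((n + 1) * confidence - 1e-9)
    if rank > n:
        return math.inf  # too few calibration points: every label stays in the set
    return scores[rank - 1]


def prediction_set(probs: Sequence[float], threshold: float) -> Set[int]:
    """All labels whose score does not exceed the calibrated threshold."""
    return {label for label, p in enumerate(probs) if 1.0 - p <= threshold}


def decide(probs: Sequence[float], threshold: float) -> Optional[int]:
    """Return a label only when the set is a singleton; None means defer to the LM."""
    labels = prediction_set(probs, threshold)
    return next(iter(labels)) if len(labels) == 1 else None


# Toy calibration set: probe probabilities for two classes with known true labels.
true_class_probs = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55]
cal_labels: List[int] = [i % 2 for i in range(len(true_class_probs))]
cal_probs = [
    [p, 1.0 - p] if label == 0 else [1.0 - p, p]
    for p, label in zip(true_class_probs, cal_labels)
]

threshold = conformal_threshold(cal_probs, cal_labels, confidence=0.9)
confident = decide([0.97, 0.03], threshold)  # one label passes the threshold
unsure = decide([0.50, 0.50], threshold)     # no singleton set, so defer
```

Raising `confidence` increases the threshold, so more labels survive per input and fewer inputs yield a singleton set; this is the coverage versus consistency trade-off the summary describes.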

Compute: Negligible compared to LM inference (linear model training)

Comparison to Prior Work

vs. BERT Fine-tuning: Probes use ~0.0025% of the parameters and often outperform BERT, suggesting that LM internals carry more signal about future behavior than the raw input text
vs. Post-hoc guardrails: Preemptive detection saves generation costs
vs. Early Exiting [not cited in paper]: Predicts global behavior/properties rather than just the next token

Limitations

Probes struggle with tasks requiring external knowledge not in the model (e.g., factual correctness of MCQ)
Performance correlates negatively with output length (harder to predict longer sequences)
Layer selection for optimal probing is task-specific
Requires a held-out calibration set for conformal guarantees

Reproducibility

Code: https://github.com/DhananjayAshok/LMBehaviorEstimation

Code is publicly available. Datasets used are standard (NaturalQA, MSMarco, TriviaQA, WildJailbreak, etc.). Prompts for formatting are provided in Appendix C.

📊 Experiments & Results

Evaluation Setup

Predicting output properties (format adherence, safety, confidence, final answer) from input representations

Benchmarks:

NaturalQA / MSMarco / TriviaQA (Format Following & Confidence Estimation)
WildJailbreak (Safety/Jailbreak Detection)
SelfAware / KnownUnknown (Abstention Detection)
27 Text Classification Datasets (MMLU, etc.) (Chain-of-Thought Acceleration)

Metrics:

Estimation Consistency (Accuracy of prediction)
Coverage (Percentage of samples where probe is confident)
Inference Cost Reduction
Accuracy Loss
Statistical methodology: Conformal prediction guarantees (provable bounds on error)
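
The first two metrics can be computed as follows. This is a sketch based on the definitions given above (coverage as the share of inputs where the probe is confident enough to act, consistency as accuracy on those inputs); the function name and the convention of using `None` for "probe deferred" are assumptions of this example.

```python
from typing import Optional, Sequence, Tuple


def coverage_and_consistency(
    preds: Sequence[Optional[int]], truths: Sequence[int]
) -> Tuple[float, float]:
    """preds[i] is None when the probe defers; truths[i] is the LM's actual behavior."""
    acted = [(p, t) for p, t in zip(preds, truths) if p is not None]
    coverage = len(acted) / len(preds)
    # Consistency is undefined when the probe never acts.
    consistency = sum(p == t for p, t in acted) / len(acted) if acted else float("nan")
    return coverage, consistency


# Toy example: the probe acts on 6 of 8 inputs and matches the LM on 5 of those.
preds = [1, None, 0, 1, None, 0, 1, 0]
truths = [1, 0, 0, 1, 1, 0, 0, 0]
coverage, consistency = coverage_and_consistency(preds, truths)
```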

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Safety and Alignment: Probes detect when the model will fail to abstain (e.g. jailbreaks) with high precision.
WildJailbreak	Jailbreak Success Rate (%)	30.0	2.7	-27.3
WildJailbreak	Consistency / Accuracy (%)	83.1	92.3	+9.2
Efficiency: Probes accelerate Chain-of-Thought inference by predicting the final answer early.
Average across 27 datasets	Inference Cost Reduction (%)	0	65	+65
Average across 27 datasets	Accuracy Loss (%)	0	0.46	+0.46
Format Following: Probes predict if the model will fail to follow bullet/JSON constraints.
NaturalQA (Bullets)	Consistency (%)	66.0	89.0	+23.0

Experiment Figures

Comparison of estimation consistency for Format Following (Bullets/JSON) across Random, BERT, Linear Probe, and Conformal Probe.

Probe consistency as a function of the input token position.

Main Takeaways

Input representations contain significant information about future output behaviors, often outperforming fine-tuned BERT models trained on the input text.
Conformal prediction allows for a tunable trade-off between coverage and consistency, enabling high-precision early warning systems.
The method scales favorably: larger models (e.g., Llama-3-70B) yield better probe performance than smaller ones.
Probes demonstrate out-of-distribution generalization, maintaining performance on unseen datasets for tasks like MCQA.

📚 Prerequisite Knowledge

Prerequisites

Language Model basics (transformer architecture)
Linear Probing
Conformal Prediction (calibration, coverage, quantiles)

Key Terms

conformal prediction: A statistical framework that produces prediction sets with a guaranteed probability of containing the true label, allowing a system to express uncertainty

linear probe: A simple linear classifier trained on the internal activations of a neural network to detect specific features or properties

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before producing a final answer

jailbreaking: Adversarial attacks that trick a language model into bypassing its safety guidelines to produce harmful content

calibration: The process of adjusting a model's probability estimates to align with its actual empirical accuracy

hidden states: The internal vector representations of tokens within the layers of a neural network

perplexity: A measurement of how well a probability model predicts a sample; used here as a proxy for model confidence