Evaluation Setup
Task: predict output properties (format adherence, safety, confidence, final answer) from the model's internal representations of the input, before any generation.
Benchmarks:
- NaturalQA / MSMarco / TriviaQA (Format Following & Confidence Estimation)
- WildJailbreak (Safety/Jailbreak Detection)
- SelfAware / KnownUnknown (Abstention Detection)
- 27 Text Classification Datasets (MMLU, etc.) (Chain-of-Thought Acceleration)
Metrics:
- Estimation Consistency (Accuracy of prediction)
- Coverage (percentage of samples on which the probe is confident enough to commit to a prediction)
- Inference Cost Reduction
- Accuracy Loss
- Statistical methodology: Conformal prediction guarantees (provable bounds on error)
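The consistency/coverage metrics above can be made concrete with a minimal sketch: a linear probe trained on frozen input hidden states to predict a binary output property (e.g., "will the model follow the format constraint?"). The hidden states here are synthetic stand-ins, and the confidence threshold `tau` is an illustrative choice, not the paper's procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for input-prompt hidden states (d = 64 here; a real model
# would give a much larger dimension).
d = 64
w_true = rng.normal(size=d)
X = rng.normal(size=(2000, d))
y = (X @ w_true + rng.normal(scale=2.0, size=2000) > 0).astype(int)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Linear probe on the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Coverage = fraction of samples where the probe is confident enough
# to answer; Estimation Consistency = accuracy on those samples.
conf = probe.predict_proba(X_test).max(axis=1)
tau = 0.8                      # confidence threshold (tunable)
answered = conf >= tau
coverage = answered.mean()
consistency = (probe.predict(X_test)[answered] == y_test[answered]).mean()
print(f"coverage={coverage:.2f} consistency={consistency:.2f}")
```

Raising `tau` trades coverage for consistency: the probe answers on fewer samples, but more accurately on those it does answer.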
Key Results
Safety and Alignment: Probes detect when the model will fail to abstain (e.g., on jailbreaks) with high precision.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WildJailbreak | Jailbreak Success Rate | 30.0 | 2.7 | -27.3 |
| WildJailbreak | Consistency (Accuracy) | 83.1 | 92.3 | +9.2 |

Efficiency: Probes accelerate Chain-of-Thought inference by predicting the final answer early.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 27 datasets | Inference Cost Reduction | 0 | 65 | -65 |
| Average across 27 datasets | Accuracy Loss | 0 | 0.46 | +0.46 |

Format Following: Probes predict whether the model will fail to follow bullet/JSON constraints.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| NaturalQA (Bullets) | Consistency | 66.0 | 89.0 | +23.0 |
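The efficiency result corresponds to a probe-gated early exit: answer directly from the probe when it is confident, and fall back to full chain-of-thought decoding otherwise. The toy sketch below uses hypothetical stand-ins (`probe_predict`, `run_full_cot`, the token costs) rather than the paper's actual pipeline.

```python
COT_TOKENS = 200   # assumed cost of a full CoT trace
PROBE_TOKENS = 0   # the probe reads cached input states; essentially free

def probe_predict(x):
    """Stand-in probe: returns (predicted_answer, confidence)."""
    return x["likely_answer"], x["probe_conf"]

def run_full_cot(x):
    """Stand-in for full chain-of-thought decoding."""
    return x["true_answer"]

def answer(x, tau=0.9):
    pred, conf = probe_predict(x)
    if conf >= tau:
        return pred, PROBE_TOKENS      # early exit: skip CoT entirely
    return run_full_cot(x), COT_TOKENS

examples = [
    {"likely_answer": "B", "true_answer": "B", "probe_conf": 0.97},
    {"likely_answer": "A", "true_answer": "C", "probe_conf": 0.55},
]
preds, costs = zip(*(answer(x) for x in examples))
saving = 1 - sum(costs) / (COT_TOKENS * len(examples))
print(preds, f"cost reduction: {saving:.0%}")
```

The reported ~65% cost reduction at only 0.46 accuracy loss corresponds to the probe being confident (and almost always right) on the majority of inputs.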
Main Takeaways
- Input representations contain significant information about future output behaviors, often outperforming fine-tuned BERT models trained on the input text.
- Conformal prediction allows for a tunable trade-off between coverage and consistency, enabling high-precision early warning systems.
- The method scales favorably: larger models (e.g., Llama-3-70B) yield better probe performance than smaller ones.
- Probes demonstrate out-of-distribution generalization, maintaining performance on unseen datasets for tasks like MCQA.
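The conformal trade-off in the takeaways can be sketched with standard split conformal prediction on top of probe scores: calibrate a nonconformity threshold on held-out data, emit a prediction set per test point, and have the probe "answer" only when the set is a singleton. The probe probabilities below are synthetic; only the conformal recipe itself (quantile of calibration scores at level `1 - alpha`) is standard.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary "probe": informative but noisy class-1 probabilities.
def sample(n):
    y = rng.integers(0, 2, size=n)
    p1 = np.clip(0.5 + (y - 0.5) * rng.uniform(0.0, 0.9, size=n)
                 + rng.normal(scale=0.05, size=n), 0.01, 0.99)
    return p1, y

p1_cal, y_cal = sample(2000)
p1_test, y_test = sample(2000)

alpha = 0.1  # tolerated miscoverage; lowering alpha shrinks coverage but raises consistency

# Split conformal: nonconformity = 1 - probability of the true class,
# thresholded at the finite-sample-corrected (1 - alpha) quantile.
scores = np.where(y_cal == 1, 1 - p1_cal, p1_cal)
n = len(scores)
qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction set: classes whose nonconformity falls below the threshold.
in_set_1 = (1 - p1_test) <= qhat
in_set_0 = p1_test <= qhat
answered = in_set_1 ^ in_set_0            # singleton set -> probe answers
pred = in_set_1[answered].astype(int)
coverage = answered.mean()
consistency = (pred == y_test[answered]).mean()
print(f"coverage={coverage:.2f} consistency={consistency:.2f}")
```

The conformal guarantee bounds overall miscoverage by `alpha` regardless of the data distribution, so wrong answers among the "answered" samples stay correspondingly rare; sweeping `alpha` traces out the coverage/consistency curve.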