Evaluation Setup
Task: binary classification of whether a generated answer is correct (error detection)
Benchmarks:
- TriviaQA (Knowledge Retrieval QA)
- HotpotQA (Multi-hop QA)
- Natural Questions (Open-domain QA)
- IMDB (Sentiment Analysis)
Metrics:
- AUC (Area Under ROC Curve)
- Statistical methodology: Not explicitly reported in the paper
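
As an illustration of the AUC metric listed above, the following minimal sketch computes it as the probability that a randomly chosen correct answer receives a higher detector score than a randomly chosen incorrect one, with ties counted as half. The function name `roc_auc`, the label convention (1 = correct, 0 = incorrect), and the toy scores are assumptions made for this example; they are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Pairwise (rank-based) AUC: the fraction of positive/negative pairs
    in which the positive example has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = answer was correct, 0 = answer was wrong.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to a detector that ranks correct and incorrect answers no better than chance, which is the reference point for reading the values in the results table.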
Key Results
Comparison of error detection methods using the exact answer token location. Probing consistently outperforms logit-based and prompting baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| TriviaQA (Mistral-7b-Instruct) | AUC | 0.72 | 0.83 | +0.11 |
| HotpotQA (Mistral-7b-Instruct) | AUC | 0.62 | 0.74 | +0.12 |
| Natural Questions (Mistral-7b-Instruct) | AUC | 0.63 | 0.75 | +0.12 |

Generalization experiments showing that probes fail when transferred between dissimilar tasks (e.g., TriviaQA to IMDB).

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Train: TriviaQA / Test: IMDB | AUC | 0.78 | 0.58 | -0.20 |
Main Takeaways
- Truthfulness information is highly localized in 'exact answer' tokens; probing these tokens yields higher error-detection AUC than probing random or final tokens.
- Internal representations encode 'skill-specific' truthfulness; detectors generalize well between similar tasks (e.g., TriviaQA to HotpotQA) but fail across distinct skills (e.g., QA to Sentiment).
- LLMs exhibit a 'knowing-saying' gap: in some cases, internal probes classify the *correct* answer with high confidence, even when the model actually generates an *incorrect* answer.
- Internal states can distinguish between error types (e.g., 'Consistent Error' vs. 'Intermittent Error'), offering more granular diagnostics than confidence scores.
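
The first takeaway, probing the hidden state at the exact answer token rather than at another position, can be sketched roughly as follows. This is a toy illustration under loud assumptions: the "hidden states" are synthetic Gaussian vectors in which only the answer-token position depends on the label (so the outcome is baked in by construction and is not evidence for the paper's claim), the probe is a plain logistic regression trained by gradient descent, and every name here (`make_example`, `train_probe`, `features_at`) is hypothetical rather than taken from the paper's code.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of examples where the thresholded probe output matches the label."""
    hits = 0
    for x, y in zip(features, labels):
        pred = 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
        hits += int(pred == y)
    return hits / len(labels)


def make_example(rng, dim=4, n_tokens=5, answer_index=2):
    """ASSUMPTION: synthetic stand-in for per-token hidden states.
    Only the state at answer_index is shifted according to the label."""
    label = rng.randint(0, 1)  # 1 = correct answer, 0 = incorrect
    states = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(n_tokens)]
    shift = 1.0 if label == 1 else -1.0
    states[answer_index] = [v + shift for v in states[answer_index]]
    return states, label, answer_index


def features_at(examples, pick):
    """Select one token's state per example; pick(states, answer_index) -> position."""
    return [states[pick(states, idx)] for states, _, idx in examples]


rng = random.Random(0)
train = [make_example(rng) for _ in range(200)]
test = [make_example(rng) for _ in range(100)]
y_train = [y for _, y, _ in train]
y_test = [y for _, y, _ in test]

at_answer = lambda states, idx: idx            # exact answer token
at_last = lambda states, idx: len(states) - 1  # final token

w_a, b_a = train_probe(features_at(train, at_answer), y_train)
w_l, b_l = train_probe(features_at(train, at_last), y_train)
acc_answer = accuracy(w_a, b_a, features_at(test, at_answer), y_test)
acc_last = accuracy(w_l, b_l, features_at(test, at_last), y_test)
```

With a real model, the synthetic `states` would be replaced by activations extracted from a chosen layer, and locating the exact answer token would itself require a separate extraction step that this sketch does not cover.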