| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison of error detection methods using the exact answer token location. Probing consistently outperforms logit-based and prompting baselines. | ||||
| TriviaQA (Mistral-7b-Instruct) | AUC | 0.72 | 0.83 | +0.11 |
| HotpotQA (Mistral-7b-Instruct) | AUC | 0.62 | 0.74 | +0.12 |
| Natural Questions (Mistral-7b-Instruct) | AUC | 0.63 | 0.75 | +0.12 |
| Generalization experiments showing that probes fail when transferred between dissimilar tasks (e.g., TriviaQA to IMDB). | ||||
| Train: TriviaQA / Test: IMDB | AUC | 0.78 | 0.58 | -0.20 |