| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance on instruction-tuned LLMs (Vicuna, WizardLM, LLaMA-2-chat): SAR consistently outperforms the baselines on TriviaQA and SciQ. | ||||
| TriviaQA | AUROC | 0.630 | 0.749 | +0.119 |
| SciQ | AUROC | 0.675 | 0.741 | +0.066 |
| TriviaQA | AUROC | 0.634 | 0.744 | +0.110 |
| TriviaQA | AUROC | 0.622 | 0.704 | +0.082 |
| CoQA | AUROC | 0.723 | 0.748 | +0.025 |
| Medical Domain Evaluation (MedQA, MedMCQA) showing robustness in specialized domains. | ||||
| MedMCQA | AUROC | 0.685 | 0.717 | +0.032 |
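
The Δ column is the "This Paper" AUROC minus the "Baseline" AUROC. As a quick sanity check, the sketch below recomputes it from the values in the table. The names `ROWS` and `delta` are illustrative assumptions, not identifiers from the paper, and the rows are listed in table order because the table does not say which model each repeated benchmark row belongs to.

```python
# Rows copied from the table above: (benchmark, baseline AUROC, reported AUROC, reported delta).
# Names here are hypothetical; the table does not label rows by model.
ROWS = [
    ("TriviaQA", 0.630, 0.749, 0.119),
    ("SciQ",     0.675, 0.741, 0.066),
    ("TriviaQA", 0.634, 0.744, 0.110),
    ("TriviaQA", 0.622, 0.704, 0.082),
    ("CoQA",     0.723, 0.748, 0.025),
    ("MedMCQA",  0.685, 0.717, 0.032),
]


def delta(baseline: float, reported: float) -> float:
    """Absolute AUROC improvement, rounded to the table's three decimals."""
    return round(reported - baseline, 3)


# Recompute every delta and compare it with the value printed in the table.
recomputed = [delta(base, new) for _, base, new, _ in ROWS]
mismatches = [
    (name, got, want)
    for (name, _, _, want), got in zip(ROWS, recomputed)
    if abs(got - want) > 1e-9
]

for (name, base, new, _), d in zip(ROWS, recomputed):
    print(f"{name:<9} {base:.3f} -> {new:.3f}  delta = {d:+.3f}")
```

Rounding to three decimals avoids spurious floating-point differences when comparing against the printed values; an empty `mismatches` list means every Δ entry agrees with its row.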