Evaluation Setup
Task: single-hop open-domain QA, evaluated on both answer accuracy and refusal capability
Benchmarks:
- TriviaQA: factual QA (in-domain)
- Natural Questions (NQ): factual QA (out-of-domain)
- PopQA: long-tail factual QA (out-of-domain)
- TruthfulQA: factuality/hallucination (out-of-domain)
Metrics:
- S_aware (Awareness Score)
- R_k (Recall of Knowns): fraction of known questions that the model answers correctly
- R_unk (Recall of Unknowns): fraction of unknown questions that the model refuses to answer
- Statistical methodology: Not explicitly reported in the paper
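To make the metric definitions concrete, here is a minimal Python sketch that computes R_k and R_unk from per-question outcomes. The `Example` record and its field names are hypothetical. The summary does not give the formula for S_aware, so the unweighted mean used in `awareness_score` is an assumption, not the paper's confirmed definition.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One evaluated question (hypothetical record layout)."""
    known: bool    # True if the question is labeled as known to the model
    correct: bool  # True if the model's answer matches the reference
    refused: bool  # True if the model declined to answer


def recall_known(examples: Sequence[Example]) -> float:
    """R_k: fraction of known questions answered correctly."""
    known = [e for e in examples if e.known]
    if not known:
        return 0.0
    return sum(e.correct and not e.refused for e in known) / len(known)


def recall_unknown(examples: Sequence[Example]) -> float:
    """R_unk: fraction of unknown questions that were refused."""
    unknown = [e for e in examples if not e.known]
    if not unknown:
        return 0.0
    return sum(e.refused for e in unknown) / len(unknown)


def awareness_score(examples: Sequence[Example]) -> float:
    """ASSUMPTION: S_aware taken as the unweighted mean of R_k and R_unk."""
    return 0.5 * (recall_known(examples) + recall_unknown(examples))
```

How questions get their known/unknown label is left outside this sketch; the functions only assume that each record already carries one.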
Key Results
CoKE significantly improves knowledge boundary awareness (S_aware) compared to the base Llama-3-8B-Instruct model across various datasets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TriviaQA | S_aware | 0.589 | 0.852 | +0.263 |
| TruthfulQA | S_aware | 0.542 | 0.678 | +0.136 |

Ablation studies reveal that 'Min-Prob' (minimum token probability) is the most effective signal for estimating model confidence compared to First-Token or Product probabilities.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TriviaQA | Correlation/Performance (Qualitative) | Lower | Higher | Positive |
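The three confidence signals compared in the ablation can be sketched from a list of per-token probabilities for a generated answer. This is an illustrative Python sketch, not the paper's implementation: the function name `confidence_signals` and the dictionary keys are hypothetical, and the input is assumed to be the probability the model assigned to each token it generated.

```python
import math
from typing import Dict, Sequence


def confidence_signals(token_probs: Sequence[float]) -> Dict[str, float]:
    """Return three sequence-level confidence estimates for one answer.

    token_probs: probability assigned to each generated token, in order
    (assumed input format for this sketch).
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return {
        # Looks only at the first generated token.
        "first_token": token_probs[0],
        # Joint probability of the sequence; shrinks as answers get longer.
        "product": math.prod(token_probs),
        # The least confident token in the answer.
        "min_prob": min(token_probs),
    }
```

For the toy input `[0.9, 0.2, 0.95]`, the first-token signal sees only 0.9 and misses the uncertain middle token, the product is about 0.171, and Min-Prob is 0.2. For ten tokens at 0.9 each, Min-Prob stays at 0.9 while the product falls below 0.35, which shows how the product penalizes length in a way Min-Prob does not.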
Main Takeaways
- Min-Prob (the minimum token probability) is a better proxy for sequence-level confidence than the product of token probabilities or the first-token probability
- Consistency regularization across different prompts (Prior, Direct, Posterior) helps the model internalize the concept of 'knowing' rather than just overfitting to specific phrasings
- The method effectively converts 'Unknown Unknowns' (hallucinations) into 'Known Unknowns' (refusals) without degrading performance on 'Known Knowns'
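The quadrant terminology in the last takeaway can be made explicit with a small Python sketch that labels each outcome by whether the model knows the answer and whether it refused. The function names are hypothetical, and the fourth label, `unknown_known` (refusing a question the model could have answered), is an assumed name for over-refusal that the summary does not itself use.

```python
from collections import Counter
from typing import Iterable, Tuple


def quadrant(known: bool, refused: bool) -> str:
    """Map one outcome to a knowledge quadrant.

    known:   the model actually has the knowledge to answer
    refused: the model declined to answer
    """
    if known:
        # "unknown_known" is an assumed label for over-refusal.
        return "unknown_known" if refused else "known_known"
    # Answering without the knowledge is treated as a hallucination.
    return "known_unknown" if refused else "unknown_unknown"


def quadrant_counts(outcomes: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many (known, refused) outcomes fall in each quadrant."""
    return Counter(quadrant(known, refused) for known, refused in outcomes)
```

Comparing `quadrant_counts` before and after training on the same questions would show the shift the takeaway describes: fewer `unknown_unknown` outcomes, more `known_unknown` outcomes, and a stable `known_known` count.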