Evaluation Setup
Zero-shot and few-shot QA on 18K questions across 4 domains (Movie, Book, Academics, Open/DBpedia)
Benchmarks:
- Head-to-Tail (Factual Question Answering) [New]
Metrics:
- Accuracy (A_LM): Share of answers judged correct by ChatGPT
- Hallucination Rate (H_LM): Share of answers that are incorrect, not counting 'unsure' responses
- Missing Rate (M): Share of answers that are 'unsure' or empty
- Exact Match (EM)
- Token F1
- ROUGE-L
- Statistical methodology: Not explicitly reported in the paper
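The metric definitions above can be made concrete with a small sketch. It assumes that each answer falls into exactly one of three classes (correct, hallucinated, missing), so that the three rates sum to 100%. The record layout, the `normalize` helper, and the whitespace tokenization are illustrative assumptions, not the paper's evaluation code; the `judged_correct` flag stands in for the ChatGPT verdict.

```python
from collections import Counter

# Assumption: an answer counts as "missing" if it is empty or says "unsure".
MISSING_ANSWERS = {"", "unsure"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a simplified normalization)."""
    return " ".join(text.lower().split())


def rates(records):
    """records: list of (answer_text, judged_correct) pairs.

    Returns accuracy (A), hallucination rate (H), and missing rate (M)
    as percentages.
    """
    n = len(records)
    missing = sum(1 for ans, _ in records if normalize(ans) in MISSING_ANSWERS)
    correct = sum(
        1 for ans, ok in records if ok and normalize(ans) not in MISSING_ANSWERS
    )
    hallucinated = n - missing - correct  # incorrect, excluding 'unsure'
    return {
        "A": 100 * correct / n,
        "H": 100 * hallucinated / n,
        "M": 100 * missing / n,
    }


def exact_match(pred: str, gold: str) -> bool:
    """Exact Match after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 over whitespace tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


records = [("Paris", True), ("London", False), ("unsure", False), ("", False)]
print(rates(records))
print(token_f1("the matrix", "the matrix reloaded"))
```

Deriving H as the remainder keeps the three rates consistent by construction, which is convenient when comparing a conservative model (high M) with one that answers everything.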
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Head-to-Tail (Overall) | Accuracy (A_LM) | 16.8 | 31.0 | +14.2 |
| Head-to-Tail (Overall) | Accuracy (A_LM) | 26.3 | 31.0 | +4.7 |
| Performance breakdown by popularity bucket shows consistent degradation from Head to Tail for GPT-4. | | | | |
| Head-to-Tail (Open Domain) | Accuracy (A_LM) | 17.4 | 47.6 | +30.2 |
| Head-to-Tail (Open Domain) | Accuracy (A_LM) | 34.6 | 47.6 | +13.0 |
| Comparison of Super-Head (Top 10%) vs General Head entities. | | | | |
| Head-to-Tail (Top 10% Head) | Accuracy (A_LM) | 41.6 | 46.2 | +4.6 |
| Impact of model size on factuality (LLaMA comparison). | | | | |
| Head-to-Tail | Accuracy (A_LM) | 5.3 | 14.7 | +9.4 |
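The Δ column appears to be the "This Paper" value minus the "Baseline" value, in percentage points. A short check under that assumption, using only the numbers from the table (the tuple layout and the `delta` helper are illustrative):

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table.
rows = [
    ("Head-to-Tail (Overall)", 16.8, 31.0, 14.2),
    ("Head-to-Tail (Overall)", 26.3, 31.0, 4.7),
    ("Head-to-Tail (Open Domain)", 17.4, 47.6, 30.2),
    ("Head-to-Tail (Open Domain)", 34.6, 47.6, 13.0),
    ("Head-to-Tail (Top 10% Head)", 41.6, 46.2, 4.6),
    ("Head-to-Tail", 5.3, 14.7, 9.4),
]


def delta(baseline: float, result: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(result - baseline, 1)


for name, baseline, result, reported in rows:
    status = "ok" if delta(baseline, result) == reported else "MISMATCH"
    print(f"{name}: {delta(baseline, result):+.1f} ({status})")
```

All deltas are absolute differences in accuracy, not relative improvements.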
Main Takeaways
- Factuality follows a Power Law: Accuracy drops consistently from Head to Torso to Tail across all models tested.
- Even 'Head' knowledge is shaky: GPT-4 answers only ~48% of open-domain questions about popular entities correctly.
- Instruction Tuning effect: Instruction-tuned models (Vicuna) tend to be more conservative (higher Missing Rate) than base models (LLaMA), but do not necessarily have higher factual accuracy.
- Domain sensitivity: Performance is highest in popular domains like Movies and lowest in niche domains like Academics.
- Prompt robustness: Allowing models to answer 'unsure' substantially reduces hallucination rates compared to forcing an answer.
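The Head/Torso/Tail breakdown referred to in the takeaways can be sketched as follows. This is a minimal illustration, not the benchmark's construction code: the cumulative-popularity cut points of 1/3 and 2/3, the popularity scores, and the function names are all assumptions made for the example.

```python
from collections import Counter


def bucket_entities(popularity, cuts=(1 / 3, 2 / 3)):
    """Assign each entity to head, torso, or tail.

    popularity: dict mapping entity name to a popularity score.
    Entities are ranked by score; an entity's bucket is decided by the
    cumulative share of popularity accumulated before it (assumed rule).
    """
    total = sum(popularity.values())
    ranked = sorted(popularity.items(), key=lambda kv: kv[1], reverse=True)
    buckets, running = {}, 0.0
    for name, score in ranked:
        share = running / total
        if share < cuts[0]:
            buckets[name] = "head"
        elif share < cuts[1]:
            buckets[name] = "torso"
        else:
            buckets[name] = "tail"
        running += score
    return buckets


def accuracy_by_bucket(results, buckets):
    """results: list of (entity, judged_correct) pairs; returns % per bucket."""
    totals, hits = Counter(), Counter()
    for entity, ok in results:
        bucket = buckets[entity]
        totals[bucket] += 1
        hits[bucket] += int(ok)
    return {b: 100 * hits[b] / totals[b] for b in totals}


# Hypothetical popularity scores (e.g. page views), for illustration only.
popularity = {"a": 50, "b": 20, "c": 10, "d": 10, "e": 5, "f": 5}
buckets = bucket_entities(popularity)
results = [("a", True), ("a", True), ("b", True), ("b", False),
           ("c", False), ("d", False)]
print(buckets)
print(accuracy_by_bucket(results, buckets))
```

With a long-tailed popularity distribution, a handful of entities fills the head bucket while most entities land in the tail, which is why per-bucket accuracy is more informative than a single overall number.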