Evaluation Setup
Models are instruction-tuned and then evaluated on hallucination and helpfulness benchmarks.
Benchmarks:
- TruthfulQA (Truthful question answering)
- FactScore (Factuality evaluation in biography generation)
- HaRa (Hallucination rate evaluation across various tasks)
- QAMPARI (Retrieval-augmented generation QA)
- SelfCheckGPT (Hallucination detection)
- MIMIC-CXR (Clinical report generation)
Metrics:
- MC1 (TruthfulQA)
- Truthfulness (LLM-judge)
- Helpfulness (LLM-judge)
- Hallucination Rate
- Statistical methodology: Not explicitly reported in the paper
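
As a rough illustration of two of the metrics above, the sketch below computes an MC1-style score and a hallucination rate. MC1 on TruthfulQA is single-answer multiple choice: a question counts as correct when the model assigns its highest score to the one true answer. The function names, the input layout (one list of per-choice log-probabilities per question), and the definition of hallucination rate as the percentage of responses a judge flags are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def mc1_accuracy(choice_logprobs: Sequence[Sequence[float]],
                 correct_index: Sequence[int]) -> float:
    """Percentage of questions whose top-scoring choice is the correct one.

    choice_logprobs[i][j] is the (hypothetical) model log-probability of
    answer choice j for question i; correct_index[i] is the gold choice.
    """
    if not choice_logprobs or len(choice_logprobs) != len(correct_index):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for scores, gold in zip(choice_logprobs, correct_index):
        # Pick the index of the highest-scoring answer choice.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == gold)
    return 100.0 * hits / len(choice_logprobs)


def hallucination_rate(flags: Sequence[bool]) -> float:
    """Percentage of responses flagged as hallucinated (assumed definition)."""
    if not flags:
        raise ValueError("flags must be non-empty")
    return 100.0 * sum(flags) / len(flags)
```

A higher MC1 is better, while a lower hallucination rate is better, so the two need to be read with opposite signs when comparing systems.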
Key Results
KCA variants consistently improve truthfulness (MC1) on TruthfulQA across different model backbones compared to the Vanilla SFT baseline.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TruthfulQA | MC1 | 27.6 | 32.4 | +4.8 |
| TruthfulQA | MC1 | 31.9 | 39.5 | +7.6 |
| FactScore | FactScore % | 44.6 | 50.5 | +5.9 |
| HaRa | Hallucination Rate (lower is better) | 39.7 | 32.0 | -7.7 |
| QAMPARI (RAG) | Recall-5 | 38.5 | 46.2 | +7.7 |

Helpfulness evaluation (using GPT-4 judge) shows that KCA strategies maintain competitive helpfulness while reducing hallucinations.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaEval | Win Rate vs Davinci003 | 78.4 | 78.8 | +0.4 |
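
The Δ values reported above are plain differences (This Paper minus Baseline), so a negative Δ is an improvement only for lower-is-better metrics such as Hallucination Rate. A small sketch that recomputes the column and checks the direction of each change follows; the tuple layout and the `higher_is_better` flag are assumptions added for illustration.

```python
# (benchmark, metric, baseline, this_paper, higher_is_better)
ROWS = [
    ("TruthfulQA", "MC1", 27.6, 32.4, True),
    ("TruthfulQA", "MC1", 31.9, 39.5, True),
    ("FactScore", "FactScore %", 44.6, 50.5, True),
    ("HaRa", "Hallucination Rate", 39.7, 32.0, False),
    ("QAMPARI (RAG)", "Recall-5", 38.5, 46.2, True),
    ("AlpacaEval", "Win Rate vs Davinci003", 78.4, 78.8, True),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute change, rounded to one decimal like the reported scores."""
    return round(ours - baseline, 1)


def is_improvement(baseline: float, ours: float, higher_is_better: bool) -> bool:
    """True when the change moves in the metric's preferred direction."""
    return ours > baseline if higher_is_better else ours < baseline


if __name__ == "__main__":
    for name, metric, base, ours, hib in ROWS:
        print(f"{name:15s} {metric:24s} {delta(base, ours):+.1f} "
              f"improved={is_improvement(base, ours, hib)}")
```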
Main Takeaways
- Mitigating knowledge inconsistency reduces hallucinations across the reported tasks (for example, the HaRa hallucination rate falls from 39.7 to 32.0).
- Refusal Tuning is the most effective strategy for pure hallucination reduction but may be conservative.
- Open-book Tuning is best for maintaining helpfulness while still improving factuality, particularly useful for RAG tasks.
- Discard Tuning offers a middle ground, but it shrinks the training set, which might reduce data diversity.
- The method scales effectively across different model sizes (7B to 13B) and architectures (Llama, Mistral, Pythia).
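
The three strategies named above can be pictured as different transformations of the instruction-tuning set. The sketch below assumes each strategy does roughly what its name suggests: Open-book Tuning adds reference knowledge to the prompt, Discard Tuning drops the affected examples, and Refusal Tuning replaces the response with a refusal. The `Example` fields, the `inconsistent` flag, the prompt template, and the refusal wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    instruction: str
    response: str
    inconsistent: bool            # hypothetical flag: knowledge-inconsistent example
    reference: Optional[str] = None  # hypothetical supporting knowledge, if any


# Placeholder wording, chosen only for this illustration.
REFUSAL_TEXT = "I do not have enough reliable knowledge to answer this."


def open_book_tuning(data: List[Example]) -> List[Example]:
    """Keep every example; prepend reference knowledge to flagged instructions."""
    out = []
    for ex in data:
        if ex.inconsistent and ex.reference:
            prompt = f"Reference: {ex.reference}\n\n{ex.instruction}"
            ex = replace(ex, instruction=prompt)
        out.append(ex)
    return out


def discard_tuning(data: List[Example]) -> List[Example]:
    """Drop flagged examples, which shrinks the training set."""
    return [ex for ex in data if not ex.inconsistent]


def refusal_tuning(data: List[Example]) -> List[Example]:
    """Keep every example; flagged ones are trained toward a refusal."""
    return [replace(ex, response=REFUSAL_TEXT) if ex.inconsistent else ex
            for ex in data]
```

Under these assumptions the trade-offs in the takeaways are visible in the code: only `discard_tuning` changes the dataset size, and only `refusal_tuning` changes what the model is trained to answer.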