Evaluation Setup
Evaluate response factuality and quality on diverse downstream tasks.
Benchmarks:
- TruthfulQA (Question Answering, measuring truthfulness)
- XSum (Text Summarization)
- Wizard of Wikipedia (WoW) (Knowledge-grounded Dialogue)
- HaluEval (Hallucination Evaluation)
Metrics:
- ROUGE-L
- BERTScore
- Factuality metrics (implied; the specific metrics are not detailed in the source)
- Statistical methodology: Not explicitly reported in the paper
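For readers unfamiliar with ROUGE-L, the following is a minimal sketch of the metric: it scores a candidate against a reference by the length of their longest common subsequence (LCS), reported as precision, recall, and F1. The function names `lcs_length` and `rouge_l` are illustrative, and the lowercased whitespace tokenization and balanced F1 are assumptions; the paper's exact ROUGE configuration (tokenizer, stemming, F-measure weighting) is not stated here, so scores from this sketch may differ from reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over one row at a time to keep memory at O(len(b)).
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L precision, recall, and F1 for a single reference/candidate pair.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    if lcs == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_l("the cat sat on the mat", "the cat lay on the mat")
    print(scores)
```

In the usage example, the LCS is "the cat on the mat" (5 tokens) against two 6-token sequences, so precision, recall, and F1 all equal 5/6. Note that ROUGE-L measures surface overlap only; a fluent but factually wrong summary can still score well, which is why it is paired with BERTScore and factuality metrics.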
Main Takeaways
- CDT improves performance on all evaluated tasks (summarization, QA, dialogue) compared to the base model and other decoding interventions.
- The Mixture-of-Experts strategy is crucial for handling different types of hallucinations (e.g., intrinsic vs. faithful) that appear in different tasks.
- The adversarial training mechanism for the truthful comparator effectively prevents overfitting and enhances robustness.
- The framework maintains generation fluency while improving factuality, unlike some penalty-based methods that degrade coherence.