Evaluation Setup
Zero-shot factuality evaluation across diverse tasks
Benchmarks:
- TruthfulQA (Multiple-choice factuality)
- FACTOR (Factuality evaluation on News/Wiki)
- StrategyQA (Reasoning / Question Answering)
- GSM8K (Chain-of-Thought Reasoning)
Metrics:
- Accuracy (Acc)
- Factual Accuracy
- Statistical methodology: Not explicitly reported in the paper
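
As an illustration of the Accuracy metric on a multiple-choice benchmark, the sketch below picks the highest-scoring option per question and reports the fraction that match the gold answer. It is a minimal sketch, not the paper's evaluation harness: the function name `multiple_choice_accuracy`, the use of per-option log-likelihoods as scores, and the example numbers are all assumptions made for illustration.

```python
from typing import Sequence


def multiple_choice_accuracy(
    option_scores: Sequence[Sequence[float]],
    gold: Sequence[int],
) -> float:
    """Fraction of questions whose highest-scoring option is the gold option.

    option_scores[i][j] is a hypothetical score (for example a log-likelihood)
    assigned by the model to option j of question i.
    """
    if len(option_scores) != len(gold):
        raise ValueError("option_scores and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")

    correct = 0
    for scores, answer in zip(option_scores, gold):
        # Index of the best-scoring option; ties resolve to the first one.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == answer)
    return correct / len(gold)


# Made-up scores for three questions (illustrative only, not paper results).
scores = [[-1.2, -0.4, -2.0], [-0.3, -0.9], [-1.5, -1.1, -0.7, -2.2]]
gold = [1, 0, 3]
acc = multiple_choice_accuracy(scores, gold)
```

Here the first two questions are answered correctly and the third is not, so `acc` is 2/3.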
Main Takeaways
- SLED consistently improves factual accuracy over both standard decoding and DoLa across model sizes from 1B to 45B.
- The method is robust to the choice of layers, unlike DoLa, which is sensitive to the size of the candidate layer set.
- The 'soft estimation' strategy (using a target distribution) outperforms 'hard estimation' (picking a single token), suggesting that preserving uncertainty is beneficial.
- SLED is complementary to other decoding strategies and can be combined with them for further gains.
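
The difference between the soft and hard estimation targets mentioned above can be sketched as follows. This is an illustration only, under stated assumptions: the helper names (`hard_target`, `soft_target`, `entropy`) and the example logits are hypothetical, and the code shows only how the two kinds of target differ, not how SLED computes or applies its estimate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target(probs):
    """Hard estimation: one-hot vector on the single most likely token."""
    top = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


def soft_target(probs):
    """Soft estimation: keep the full distribution as the target."""
    return list(probs)


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical estimate over a 3-token vocabulary with two close candidates.
estimate = softmax([2.0, 1.8, -1.0])
hard = hard_target(estimate)
soft = soft_target(estimate)
```

With two near-tied candidates, the hard target discards the runner-up entirely (its entropy is zero), while the soft target keeps the relative weight of both, which is the sense in which it preserves uncertainty.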