Evaluation Setup
Multi-step reasoning tasks across arithmetic, commonsense, and logic domains.
Benchmarks:
- GSM8K (Arithmetic Reasoning)
- StrategyQA (Commonsense Reasoning)
- AQuA (Arithmetic Reasoning)
- Date Understanding (Commonsense Reasoning, BigBench)
- Object Tracking (Logical Reasoning, BigBench)
Metrics:
- Accuracy
- Token Consumption
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- LeCo consistently improves reasoning accuracy across varying model sizes (7B to GPT-4) and tasks.
- The logit-based confidence metric effectively identifies error steps (approximately 65% detection rate), supporting the hypothesis that models are uncertain at the steps where they hallucinate.
- Unlike standard self-correction, which often increases token usage significantly, LeCo reduces consumption by truncating at the suspected error step and reusing the correct prefix.
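The prefix-reuse idea in the takeaways above can be sketched as follows. This is a simplified stand-in, not the paper's exact method: here step confidence is just the mean token log-probability (the paper combines several logit-derived signals), and the function names, example steps, and log-prob values are all hypothetical.

```python
def step_confidence(token_logprobs):
    # Mean token log-probability as a simple per-step confidence proxy
    # (a simplified stand-in for LeCo's logit-based score).
    return sum(token_logprobs) / len(token_logprobs)

def truncate_at_least_confident(steps, step_logprobs):
    # Keep only the steps before the least-confident one, so the model
    # regenerates from the suspected error step onward instead of
    # rewriting the whole chain of thought.
    confidences = [step_confidence(lp) for lp in step_logprobs]
    worst = min(range(len(confidences)), key=confidences.__getitem__)
    return steps[:worst]  # reused prefix; steps[worst:] are regenerated

# Hypothetical example: three reasoning steps with per-token log-probs.
steps = ["Step 1: there are 3 boxes of 4 apples",
         "Step 2: 3 * 4 = 13",
         "Step 3: the answer is 13"]
logprobs = [[-0.1, -0.2], [-1.5, -2.0], [-0.3, -0.4]]
print(truncate_at_least_confident(steps, logprobs))
```

Because only the suffix after the low-confidence step is regenerated, the per-round token cost stays below that of resampling the full solution, which is the source of the savings noted above.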