Evaluation Setup
Goals: validate the correlation between Perplexity and Accuracy, then use it to refine CoT demonstrations and fine-tuning data.
Benchmarks:
- DeepMind Mathematics Dataset: mathematical reasoning tasks (Linear Equations, Derivatives, Time Difference)
Metrics:
- Perplexity (PPL)
- Accuracy
- Token Count (Efficiency)
- Statistical methodology: Pearson correlation coefficient used to measure the relationship between Perplexity and Accuracy.
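The two core measurements above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: perplexity is the exponentiated negative mean token log-probability of a reasoning chain, and the Pearson coefficient is computed over hypothetical per-demonstration (PPL, accuracy) pairs.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log-probability) over the tokens of a generated
    chain (natural-log probabilities assumed)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical measurements: PPL of each CoT demonstration and the task
# accuracy obtained with it (illustrative numbers, not from the study).
ppls = [1.8, 2.1, 2.5, 3.0, 3.6]
accs = [0.92, 0.88, 0.81, 0.74, 0.65]
r = pearson_r(ppls, accs)  # strongly negative, matching the reported trend
```

A real analysis would also report the p-value (e.g., via `scipy.stats.pearsonr`) to establish the statistical significance the takeaways refer to.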
Main Takeaways
- A statistically significant negative correlation exists between Perplexity (PPL) and Prediction Accuracy across multiple math tasks (Linear Equations, Derivatives, Time Difference).
- Perplexity computed by one model (LLaMA3-8B) strongly correlates with accuracy evaluated by another (GPT-4o-mini), suggesting transferability of the metric.
- The importance of specific reasoning steps varies by model; smaller models (LLaMA3-8B) are more sensitive to step removal than larger models.
- Merging steps is essential for maintaining coherence when 'unimportant' steps contain intermediate values required for subsequent calculations.