
Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu, Xinyu Zhang, Xiao Zhang, Yong Liu
Gaoling School of Artificial Intelligence, Renmin University of China; Huawei Poisson Lab
Conference on Empirical Methods in Natural Language Processing (2025)
Reasoning Benchmark

📝 Paper Summary

In-Context Learning (ICL), Chain-of-Thought (CoT) Prompting
For recent strong LLMs, traditional Few-shot Chain-of-Thought exemplars do not improve reasoning over Zero-shot CoT; they serve mainly to align output formats rather than to guide the model's logic.
Core Problem
Prior research establishing Few-shot CoT as superior to Zero-shot CoT relied on weaker models and flawed evaluation metrics that penalized Zero-shot outputs.
Why it matters:
  • Blindly using Few-shot CoT increases token costs and latency without actual performance gains for modern models
  • Existing evaluation benchmarks significantly underestimate the reasoning capabilities of open-source models due to rigid answer extraction logic
  • Understanding whether models actually 'learn' from context is crucial for designing effective prompting strategies for advanced systems like Qwen2.5
Concrete Example: When a model solves a math problem with Zero-shot CoT, it often formats the answer as '\boxed{42}'. Standard evaluators look for the last number in the output; they may parse '42' correctly, but often fail when the format is unexpected or trailing text contains other numbers, scoring the response 0 despite correct reasoning. Adding exemplars forces the model to output 'The answer is 42', which evaluators parse reliably, creating the illusion that the exemplars improved reasoning.
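The parsing failure is easy to reproduce. A minimal Python sketch (the regexes below are illustrative, not the paper's actual evaluation code):

```python
import re

def naive_extract(output):
    """Rigid extraction used by many evaluators: take the last number in the text."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", output)
    return nums[-1] if nums else None

def boxed_aware_extract(output):
    """Corrected extraction: prefer the contents of \\boxed{...}, else fall back."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", output)
    if boxed:
        return boxed[-1].strip()
    return naive_extract(output)

# A Zero-shot CoT response whose trailing text contains another number:
zero_shot = r"6 boxes of 7 apples make 6 * 7 = 42, so the answer is \boxed{42}. (See step 3.)"
few_shot = "The answer is 42"

print(naive_extract(zero_shot))        # '3'  -- grabs the stray trailing number
print(boxed_aware_extract(zero_shot))  # '42' -- reads the intended answer
print(naive_extract(few_shot))         # '42' -- exemplar-style output parses fine
```

The exemplar-style output scores correctly even under the naive parser, which is exactly the formatting effect the paper argues was mistaken for a reasoning gain.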
Key Novelty
Zero-shot CoT Dominance for Strong Models
  • Identifies a systematic evaluation bias where Zero-shot CoT is penalized for using '\boxed{}' formatting rather than the format expected by standard parsers
  • Demonstrates that for strong models (e.g., Qwen2.5), adding CoT exemplars (Few-shot) offers no reasoning gain over Zero-shot once evaluation is corrected
  • Reveals via attention analysis that strong models effectively ignore the reasoning content of exemplars, attending primarily to the problem statement
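The attention finding can be made concrete with a small sketch: aggregate attention weights over prompt positions and measure how much mass falls on each prompt segment. The numbers here are a toy profile, not the paper's measurements:

```python
import numpy as np

def attention_mass(attn, segments):
    """Fraction of attention mass placed on each prompt segment.

    `attn` is a 1-D array of attention weights over prompt positions (already
    aggregated over heads/layers); `segments` maps segment names to
    (start, end) position ranges in the prompt."""
    total = attn.sum()
    return {name: float(attn[s:e].sum() / total) for name, (s, e) in segments.items()}

# Toy prompt: 80 exemplar positions, 20 problem-statement positions.
attn = np.concatenate([np.full(80, 0.001), np.full(20, 0.046)])
segments = {"exemplars": (0, 80), "problem": (80, 100)}
mass = attention_mass(attn, segments)
# A strong model, per the paper's analysis, behaves like this toy profile:
# most attention mass lands on the problem, little on the exemplars,
# even though the exemplars occupy most of the prompt.
```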
Evaluation Highlights
  • Zero-shot CoT (with corrected evaluation) matches or outperforms Few-shot CoT across GSM8K and MATH for strong models like Qwen2.5-72B
  • Complexity-based exemplar retrieval yields only marginal gains (~0.2%) over Zero-shot for specific configurations, which is attributed to variance
  • Models retain performance even when 50% of the tokens in the CoT exemplars are replaced with noise, indicating they do not rely on the exemplars' reasoning content
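The noise experiment can be sketched in a few lines; whitespace tokens and a toy filler vocabulary stand in for the paper's actual tokenizer and noise distribution:

```python
import random

def perturb_exemplar(exemplar, noise_ratio=0.5, vocab=None, seed=0):
    """Replace a fraction of an exemplar's tokens with random filler tokens.

    Whitespace tokenization and the tiny vocabulary are simplifications;
    the paper perturbs model tokens."""
    rng = random.Random(seed)
    vocab = vocab or ["lorem", "ipsum", "dolor", "sit", "amet"]
    tokens = exemplar.split()
    n_noise = int(len(tokens) * noise_ratio)
    for i in rng.sample(range(len(tokens)), n_noise):
        tokens[i] = rng.choice(vocab)
    return " ".join(tokens)

exemplar = "Natalia sold 48 clips in April and half as many in May so 48 + 24 = 72"
noisy = perturb_exemplar(exemplar)
# If few-shot accuracy with such noisy exemplars matches the clean setting,
# the model is not reading the exemplars' reasoning content.
```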
Breakthrough Assessment
7/10
Challenges a fundamental dogma of prompt engineering (that Few-shot CoT > Zero-shot). While not a new architecture, the finding simplifies pipeline design and corrects widespread evaluation flaws.