Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

📝 Paper Summary

In-Context Learning (ICL), Chain-of-Thought (CoT) Prompting

For recent strong LLMs, traditional Few-shot Chain-of-Thought exemplars do not improve reasoning capabilities compared to Zero-shot CoT, serving primarily to align output formats rather than guide logic.

Core Problem

Prior research establishing Few-shot CoT as superior to Zero-shot CoT relied on weaker models and flawed evaluation metrics that penalized Zero-shot outputs.

Why it matters:

Blindly using Few-shot CoT increases token costs and latency without actual performance gains for modern models
Existing evaluation benchmarks significantly underestimate the reasoning capabilities of open-source models due to rigid answer extraction logic
Understanding whether models actually 'learn' from context is crucial for designing effective prompting strategies for advanced systems like Qwen2.5

Concrete Example: When a model solves a math problem using Zero-shot CoT, it often formats the final answer as '\boxed{42}'. A standard evaluator that expects a phrase such as 'The answer is 42', or that simply takes the last number in the output, can fail when the answer is boxed or surrounded by extra text, so the response scores 0 despite correct reasoning. Adding exemplars steers the model toward 'The answer is 42', which the evaluator parses correctly, creating the illusion that exemplars improved reasoning.
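The parsing failure in this example can be illustrated with a minimal sketch. The regular expression and the function name `naive_extract` are assumptions made for illustration; they are not the evaluator used in the paper or in OpenCompass.

```python
import re

# Hypothetical rigid parser: it only accepts the phrase "The answer is <number>".
ANSWER_PATTERN = re.compile(r"The answer is\s*\$?(-?\d+(?:\.\d+)?)")

def naive_extract(output: str):
    """Return the number following 'The answer is', or None if the phrase is absent."""
    match = ANSWER_PATTERN.search(output)
    return match.group(1) if match else None

zero_shot_output = "6 * 7 = 42, so the result is \\boxed{42}."
few_shot_output = "6 * 7 = 42. The answer is 42."

print(naive_extract(zero_shot_output))  # None: correct reasoning, scored as wrong
print(naive_extract(few_shot_output))   # 42
```

Both outputs contain the same correct answer; only the surface format decides whether the parser finds it.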

Key Novelty

Zero-shot CoT Dominance for Strong Models

Identifies a systematic evaluation bias where Zero-shot CoT is penalized for using '\boxed{}' formatting rather than the format expected by standard parsers
Demonstrates that for strong models (e.g., Qwen2.5), adding CoT exemplars (Few-shot) offers no reasoning gain over Zero-shot once evaluation is corrected
Reveals via attention analysis that strong models effectively ignore the reasoning content of exemplars, attending primarily to the problem statement

Evaluation Highlights

Zero-shot CoT (with corrected evaluation) matches or outperforms Few-shot CoT across GSM8K and MATH for strong models like Qwen2.5-72B
Complexity-based exemplar retrieval yields only marginal gains (~0.2%) over Zero-shot for specific configurations, which is attributed to variance
Models largely retain performance even when 50% of the tokens in the CoT exemplars are replaced with noise, suggesting they do not rely on the exemplar content

Breakthrough Assessment

7/10

Challenges a fundamental dogma of prompt engineering (that Few-shot CoT outperforms Zero-shot CoT). While not a new architecture, the finding simplifies pipeline design and corrects widespread evaluation flaws.

⚙️ Technical Details

Problem Definition

Setting: Mathematical reasoning using Large Language Models via prompting strategies

Inputs: A math problem statement q

Outputs: A step-by-step reasoning path followed by a final answer

Pipeline Flow

Prompt Construction (Zero-shot or Few-shot)
Inference (LLM Generation)
Post-processing (Format Correction)
Evaluation (Answer Extraction)

System Modules

Prompt Constructor

Assembles the input prompt. For Zero-shot, adds instructions. For Few-shot, retrieves and appends k exemplars.

Model or implementation: Rule-based
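A minimal sketch of such a rule-based constructor follows. The `Question:`/`Answer:` layout, the function name `build_prompt`, and the choice to omit the instruction in the Few-shot branch are assumptions for illustration; the paper's exact templates are not reproduced here.

```python
from typing import Sequence, Tuple

# Assumed Zero-shot trigger phrase (the classic Zero-shot CoT instruction).
ZERO_SHOT_INSTRUCTION = "Let's think step by step."

def build_prompt(question: str, exemplars: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a Zero-shot prompt (no exemplars) or a Few-shot prompt (k exemplars)."""
    if not exemplars:
        # Zero-shot: the question plus an instruction that triggers reasoning.
        return f"Question: {question}\nAnswer: {ZERO_SHOT_INSTRUCTION}"
    # Few-shot: k worked exemplars placed before the target question.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

exemplar = ("What is 2 + 3?", "2 + 3 = 5. The answer is 5.")
print(build_prompt("What is 6 * 7?"))
print(build_prompt("What is 6 * 7?", [exemplar]))
```

Exemplar retrieval (for example, complexity-based selection) is left out; the sketch assumes the k exemplars have already been chosen.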

LLM Inference

Generates the reasoning path and final answer

Model or implementation: Target LLM (e.g., Qwen2.5-72B-Instruct)

Format Corrector

Parses the output to handle '\boxed{}' or other formats correctly before checking correctness

Model or implementation: Rule-based script
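One way such a rule-based script could work is sketched below. The function names, the brace-matching approach, and the last-number fallback are assumptions for illustration, not the paper's actual implementation.

```python
import re
from typing import Optional

def extract_boxed(output: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = output.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    for ch in output[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content).strip()
        content.append(ch)
    return None  # braces never closed

def extract_answer(output: str) -> Optional[str]:
    """Prefer a boxed answer; otherwise fall back to the last number in the text."""
    boxed = extract_boxed(output)
    if boxed is not None:
        return boxed
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(output: str, reference: str) -> bool:
    """Exact string match between the extracted answer and the reference."""
    answer = extract_answer(output)
    return answer is not None and answer == reference.strip()

print(is_correct("6 * 7 = 42, so the result is \\boxed{42}.", "42"))
print(extract_answer("Hence the value is \\boxed{\\frac{1}{2}}."))
```

Exact string matching is a simplification: a real evaluator for MATH would also need to treat equivalent expressions (such as `0.5` and `\frac{1}{2}`) as equal.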

Modeling

Base Model: Qwen2.5 series (0.5B, 1.5B, 7B, 14B, 72B), LLaMA3 series (8B, 70B), Gemma2 (9B), Mistral-8B

Compute: Experiments run using vLLM backend. Specific GPU hardware not reported in the paper.

Comparison to Prior Work

vs. Complexity-based/VoteK: The paper finds these sophisticated selection methods do not outperform simple Zero-shot CoT for strong models, contradicting prior findings on weaker models
vs. Standard Evaluation: The paper introduces a format-aware evaluation that extracts answers from '\boxed{}', revealing higher Zero-shot performance than typically reported

Limitations

Study is limited to mathematical reasoning tasks (GSM8K, MATH); findings may not generalize to other domains like commonsense reasoning
Analysis focuses primarily on open-source models; proprietary black-box models (e.g., GPT-4) are not extensively tested
Does not explore the impact of CoT on extremely long-context tasks

Reproducibility

Code: https://github.com/small-xiangcheng/opencompass/tree/my-changes

Code is publicly available on GitHub. Experiments use fixed random seed 42 and greedy decoding for deterministic results. Full prompt templates and exemplar construction details are described in the paper.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on standard benchmarks

Benchmarks:

GSM8K (Grade school math word problems)
MATH (Challenging mathematics problems)

Metrics:

Accuracy
Statistical methodology: A fixed seed (42) with greedy decoding is used to ensure deterministic results; means and standard deviations are not reported.

Key Results

Retrieval-based few-shot methods show negligible improvement over Zero-shot baselines for strong models.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy gain	Not reported in the paper	Not reported in the paper	+0.2%

Experiment Figures

Attention visualization of Qwen2.5-7B on GSM8K under Few-shot settings

Ablation study on exemplar noise (replacing tokens in exemplars with random ones)
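The noise ablation can be sketched as follows. Whitespace tokenization, the toy noise vocabulary, and the function name `corrupt_tokens` are assumptions for illustration; the paper's procedure presumably operates on the model's own tokens.

```python
import random
from typing import List

def corrupt_tokens(tokens: List[str], vocab: List[str],
                   ratio: float = 0.5, seed: int = 42) -> List[str]:
    """Replace a fixed fraction of tokens with tokens drawn at random from vocab."""
    rng = random.Random(seed)  # local RNG so the corruption is reproducible
    n_replace = int(len(tokens) * ratio)
    positions = rng.sample(range(len(tokens)), n_replace)
    noisy = list(tokens)
    for pos in positions:
        # A drawn token may coincide with the original if vocab overlaps the text.
        noisy[pos] = rng.choice(vocab)
    return noisy

exemplar = "Tom has 3 apples and buys 4 more so he has 7 apples".split()
noise_vocab = ["lorem", "ipsum", "dolor", "sit", "amet"]  # hypothetical noise tokens
print(" ".join(corrupt_tokens(exemplar, noise_vocab)))
```

Only the exemplars would be corrupted this way; the target question is left intact so that any accuracy change can be attributed to the exemplar content.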

Main Takeaways

Correcting evaluation bias (extracting answers from '\boxed{}') dramatically improves Zero-shot scores, showing that previous gaps between Zero-shot and Few-shot were largely artifacts of parsing failures.
For strong models (Qwen2.5, LLaMA-3), adding traditional CoT exemplars yields no significant reasoning improvement over Zero-shot CoT.
Enhanced exemplars generated by superior models (DeepSeek-R1, Qwen2.5-Max) also fail to improve performance, as strong models tend to ignore the exemplar content.
Ablation studies with noisy exemplars (50% token replacement) show minimal performance drops, indicating that models do not rely on the informational content of the exemplars.
Weak or older models (LLaMA-2-7B, Qwen-7B) do benefit from Few-shot CoT, indicating that the ineffectiveness of CoT exemplars is specific to recent, highly capable models.

📚 Prerequisite Knowledge

Prerequisites

Understanding of In-Context Learning (ICL)
Familiarity with Chain-of-Thought (CoT) prompting
Knowledge of Zero-shot vs. Few-shot settings

Key Terms

ICL: In-Context Learning—the ability of a model to perform a task by observing examples in the prompt without parameter updates

CoT: Chain-of-Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps before the final answer

Zero-shot CoT: Triggering reasoning by simply appending 'Let's think step by step' without providing example problems

Few-shot CoT: Providing k example problems with their complete reasoning steps (exemplars) in the prompt before the target question

Greedy decoding: A decoding strategy where the model always selects the highest-probability next token (equivalent to sampling at temperature 0), so repeated runs on the same prompt produce the same output
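As a small illustration of a single greedy step, the sketch below picks the most probable token from a next-token distribution. The toy distribution and the function name `greedy_next_token` are hypothetical.

```python
from typing import Dict

def greedy_next_token(next_token_probs: Dict[str, float]) -> str:
    """Pick the token with the highest probability; no randomness is involved."""
    return max(next_token_probs, key=next_token_probs.get)

# Toy next-token distribution after the prefix "6 * 7 = "
probs = {"42": 0.90, "41": 0.06, "43": 0.04}
print(greedy_next_token(probs))  # 42
```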

Exemplars: Input-output pairs provided in the prompt to demonstrate how to solve a task

OpenCompass: An open-source library for evaluating large language models