Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors

📝 Paper Summary

Mechanistic Interpretability, Prompt Engineering

This study reveals that while Chain-of-Thought reasoning heavily relies on pretrained priors, providing sufficient exemplars can override these priors (even with noise), and prompt engineering can induce reasoning-intensive "slow thinking" in smaller models.

Core Problem

The underlying mechanism of Chain-of-Thought (CoT) reasoning is unclear: it is debated whether models actually learn new reasoning skills from prompts (In-Context Learning) or merely retrieve latent knowledge (Pretrained Priors).

Why it matters:

Current perspectives are conflicting: some argue models only imitate formats without learning logic, while others suggest they can learn novel mappings
Understanding this balance is crucial for designing prompts that effectively leverage model capabilities without introducing hallucinations or instability
Disentangling ICL from priors helps explain when and why CoT fails (e.g., under misleading prompts) or succeeds (e.g., reasoning emergence)

Concrete Example: In a Coin Flip task (binary outcome), if a prompt contains exemplars with incorrect 'flipped' logic, does the model stick to its pretrained knowledge of coin physics, or does it copy the error? This paper shows 8B models *do* eventually copy the error and flip labels given enough noisy examples.

Key Novelty

Dual-Process Analysis of CoT (ICL vs. Priors)

Conducts fine-grained lexical analysis (counting verbs and structure words) to show that models mimic reasoning *structure* from prompts while relying on priors for *content*
Uses 'False-Answer' and 'False-Rationale' stress tests to quantify the exact point where ICL signals override pretrained semantic priors
Demonstrates that 'slow thinking' (extended reasoning chains) can be induced in standard LLMs purely through prompt engineering, without architectural changes

Architecture

Conceptual diagram illustrating the dual interaction between In-Context Learning (ICL) and Pretrained Priors during Chain-of-Thought reasoning.

Evaluation Highlights

In open-domain tasks (GSM8K, Date Understanding), providing 40 noisy/incorrect exemplars causes reasoning accuracy to drop by nearly half, showing ICL eventually overpowers priors.
Contradicting earlier claims that label flipping requires very large models (66B+), smaller models (8B) *can* learn to systematically flip labels in closed-domain tasks (Coin Flip) when provided with sufficient counterfactual CoT exemplars.
Reasoning performance correlates with the density of 'reasoning verbs' in the output; optimal verb counts exist, beyond which performance degrades.

Breakthrough Assessment

7/10

Provides strong empirical evidence resolving conflicts about CoT mechanisms (imitation vs. learning) and successfully demonstrates induced 'slow thinking' via prompting, though it relies on analysis of existing models rather than a new architecture.

⚙️ Technical Details

Problem Definition

Setting: Few-shot Chain-of-Thought prompting where a model generates a response (r, a) conditioned on N exemplars (q_i, r_i, a_i) and a target question q.

Inputs: Target question q and N exemplars (question, rationale, answer)

Outputs: Generated rationale r and answer a
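
The setting above can be sketched as a small prompt builder. This is a minimal illustration, not the paper's actual template: the `Exemplar` class, the `build_cot_prompt` function, and the "Q: / A: / The answer is" layout are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One (question, rationale, answer) triple; hypothetical container."""
    question: str
    rationale: str
    answer: str


def build_cot_prompt(exemplars: List[Exemplar], target_question: str) -> str:
    """Concatenate N exemplars, then the target question with an open answer slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The last block ends at "A:" so the model generates the rationale r and answer a.
    blocks.append(f"Q: {target_question}\nA:")
    return "\n\n".join(blocks)


demo = [
    Exemplar(
        question="A coin is heads up. Ann flips it. Is it still heads up?",
        rationale="The coin was flipped once, an odd number of times, so it is now tails up.",
        answer="no",
    )
]
prompt = build_cot_prompt(demo, "A coin is heads up. Bob does not flip it. Is it still heads up?")
print(prompt)
```

The same builder would cover the Standard, Noisy, and Long-CoT variants, since those differ only in the exemplars passed in.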

Pipeline Flow

Prompt Construction (Standard, Noisy, or Long-CoT)
Inference (Greedy Decoding)
Output Analysis (Lexical Parsing or Accuracy Check)
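
The accuracy-check step of this pipeline can be sketched as follows. The summary does not describe how answers are parsed, so the "answer is ..." pattern and the helper names `extract_answer` and `accuracy` are assumptions made for illustration.

```python
import re
from typing import List, Optional

# Assumed convention: generations end with "The answer is <number|yes|no>".
ANSWER_PATTERN = re.compile(
    r"answer is\s*\$?(-?[\d,]*\.?\d+|yes|no)\b", re.IGNORECASE
)


def extract_answer(generation: str) -> Optional[str]:
    """Return the last stated answer, normalized, or None if none is found."""
    matches = ANSWER_PATTERN.findall(generation)
    if not matches:
        return None
    return matches[-1].replace(",", "").lower()


def accuracy(generations: List[str], golds: List[str]) -> float:
    """Fraction of generations whose extracted answer equals the gold answer."""
    if len(generations) != len(golds):
        raise ValueError("generations and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(
        extract_answer(g) == gold.lower() for g, gold in zip(generations, golds)
    )
    return correct / len(golds)


# Hypothetical model outputs, written by hand for this example.
outputs = [
    "She sells 9 eggs at $2 each, so 9 * 2 = 18. The answer is 18.",
    "The coin was flipped twice. The answer is yes.",
    "I am not sure.",
]
print(accuracy(outputs, ["18", "no", "3"]))
```

Taking the last match rather than the first is a design choice: a long rationale may mention intermediate "answers" before the final one.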

System Modules

Prompt Constructor

Generates prompts with varying noise levels (False-Answer/Rationale) or length (Long CoT)

Model or implementation: Distilled from GPT-4 or DeepSeek-R1 (for Long CoT)
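
As a rough sketch of what the noisy-prompt variants could look like for a binary task such as Coin Flip: False-Answer keeps each rationale and flips the label, while False-Rationale keeps each label and replaces the rationale. The paper's exact corruption procedure is not given in this summary; rotating rationales between exemplars is an assumed stand-in for "logically flawed or noisy" reasoning, and all names below are hypothetical.

```python
from typing import Dict, List

Exemplar = Dict[str, str]  # keys: "question", "rationale", "answer"

FLIP = {"yes": "no", "no": "yes"}  # binary label flip for yes/no tasks


def make_false_answer(exemplars: List[Exemplar]) -> List[Exemplar]:
    """Keep each rationale unchanged but flip its final answer."""
    return [{**ex, "answer": FLIP[ex["answer"]]} for ex in exemplars]


def make_false_rationale(exemplars: List[Exemplar]) -> List[Exemplar]:
    """Keep each answer but pair it with another exemplar's rationale (assumption)."""
    if len(exemplars) < 2:
        raise ValueError("need at least two exemplars to mismatch rationales")
    rationales = [ex["rationale"] for ex in exemplars]
    rotated = rationales[1:] + rationales[:1]  # shift every rationale by one slot
    return [{**ex, "rationale": r} for ex, r in zip(exemplars, rotated)]


clean = [
    {"question": "Heads up, flipped once. Still heads up?",
     "rationale": "One flip is odd, so the coin is tails up.", "answer": "no"},
    {"question": "Heads up, flipped twice. Still heads up?",
     "rationale": "Two flips is even, so the coin is heads up.", "answer": "yes"},
]
false_answer = make_false_answer(clean)
false_rationale = make_false_rationale(clean)
```

Both functions return new lists and leave the clean exemplars untouched, so the same pool can be reused across noise levels.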

Reasoning Engine

Generates reasoning chain and final answer

Model or implementation: LLaMA3.1-8B, Gemma2-9B/27B, or Qwen2.5-32B

Novel Architectural Elements

No architectural changes; novel contribution is the analytical framework (Lexical Analysis of Rationales) and the 'Slow Thinking' prompting strategy derived from R1/QwQ models.

Modeling

Base Model: LLaMA3.1-8B, Gemma2-9B, Gemma2-27B, Qwen2.5-32B

Comparison to Prior Work

vs. Min et al. (2022): This paper finds that label correctness matters little in the low-shot regime but substantially in the high-shot regime (40+ exemplars), where noisy labels clearly degrade performance.
vs. Wei et al. (2022): Investigates the *mechanism* (lexical structure vs. content) rather than just performance.
vs. DeepSeek-R1 (used as a source rather than cited as a baseline in the paper's methodology section): Uses standard models to emulate R1's 'slow thinking' via prompting rather than RL training.

Limitations

Analysis relies on greedy decoding, ignoring potential diversity from sampling
Long CoT experiments show performance degradation if the chain becomes *too* long for the model's capacity
Limited to arithmetic, commonsense, and symbolic reasoning tasks; implies but does not prove generalization to code or creative writing
Quantitative results for 'Slow Thinking' (Table 1) are described only as trends; the text body does not discuss the raw numbers

Reproducibility

Prompt strategies (False-Answer/Rationale construction) and lexical analysis methods (NLTK-based) are described. Code is not provided. Pretrained models used are public (HuggingFace).

📊 Experiments & Results

Evaluation Setup

Few-shot and Zero-shot prompting on standard reasoning benchmarks

Benchmarks:

GSM8K (Arithmetic Reasoning)
MATH-500 (Mathematical Reasoning)
Date Understanding (Commonsense Reasoning)
Coin Flip (Symbolic/Probabilistic Reasoning)
Last Letters Concatenation (Symbolic Reasoning)

Metrics:

Accuracy
Token generation probability (confidence)
Number of reasoning verbs (lexical analysis)
Statistical methodology: Not explicitly reported in the paper
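
The verb-count metric can be illustrated with a dependency-free sketch. The paper's analysis is described as NLTK-based; here a fixed word list replaces part-of-speech tagging, so the sets `REASONING_VERBS` and `STRUCTURE_WORDS` and the function `lexical_profile` are assumptions for illustration only. There is no lemmatization, so inflected forms such as "multiplied" are not counted.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical word lists; the paper's actual categories are not reproduced here.
REASONING_VERBS = {
    "calculate", "add", "subtract", "multiply", "divide",
    "compare", "assume", "conclude", "check", "verify",
}
STRUCTURE_WORDS = {"first", "then", "next", "so", "therefore", "because", "finally"}


def lexical_profile(rationale: str) -> Dict[str, int]:
    """Count reasoning verbs and structure words in a generated rationale."""
    tokens = re.findall(r"[a-z]+", rationale.lower())  # alphabetic tokens only
    counts = Counter(tokens)
    return {
        "reasoning_verbs": sum(counts[w] for w in REASONING_VERBS),
        "structure_words": sum(counts[w] for w in STRUCTURE_WORDS),
        "tokens": len(tokens),
    }


profile = lexical_profile("First, add 3 and 4. Then multiply by 2, so the answer is 14.")
print(profile)
```

Comparing such profiles across Zero-shot, CoT, and Task-Agnostic CoT outputs is one way to see whether the model copies structure from the exemplars.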

Key Results

Experiments with 'False-Answer' and 'False-Rationale' prompts show that while models are robust to small amounts of noise (relying on priors), large amounts of noise (40-shot) cause ICL to override priors, degrading performance.
Lexical analysis reveals that models adopt the 'reasoning structure' (verbs/structure words) from exemplars even when the content is task-agnostic.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	High (implied from trend)	~50% of baseline	-50% relative (approx.)
GSM8K	Verb Count	High verb count	Lower, optimized verb count	Reduced

Experiment Figures

Lexical analysis of generated output (structure words, feature words, verbs) across different prompt types (Zero-shot, CoT, Task-Agnostic CoT).

Accuracy trends on Coin Flip, Date, and GSM8K as the number of noisy exemplars (False-Answer or False-Rationale) increases from few-shot to 40-shot.

Token generation probabilities (confidence) over time steps for CoT vs. False-Answer/Rationale CoT.

Main Takeaways

Models learn the *format* and *structure* of reasoning (verbs, transition words) from In-Context Learning (ICL), but heavily rely on *pretrained priors* for the actual semantic content.
The 'Priors vs. ICL' balance is quantity-dependent: with few exemplars (<5), priors dominate (high robustness to noise); with many exemplars (40 or more), ICL signals dominate (susceptibility to noise/errors).
Prompt engineering can induce 'slow thinking' (longer reasoning chains) in standard models, improving downstream performance, though there is an optimal length limit relative to model size.
Even small models (8B) can learn to flip labels/reasoning logic if provided with enough counterfactual exemplars, refuting claims that this capability requires massive scaling (66B+).

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
In-Context Learning (ICL)
Language Model decoding strategies (Greedy decoding)
Basic NLP lexical analysis (verbs, entities)

Key Terms

Chain-of-Thought (CoT): A prompting technique where the model is encouraged to generate intermediate reasoning steps (rationales) before the final answer

In-Context Learning (ICL): The ability of a model to learn a task from a few examples provided in the prompt at inference time, without weight updates

Pretrained Priors: Knowledge and patterns encoded in the model's weights during pre-training, often reflecting common sense or factual world knowledge

Slow Thinking: Also known as test-time scaling; a mode where the model generates longer, more detailed reasoning chains to improve accuracy on complex tasks

False-Answer CoT: A robustness test where prompt exemplars have correct reasoning but incorrectly flipped final answers

False-Rationale CoT: A robustness test where prompt exemplars have correct final answers but logically flawed or noisy reasoning steps

Greedy Decoding: A decoding strategy where the model selects the highest-probability token at each step

Task-agnostic CoT: Using reasoning exemplars from a completely different domain (e.g., using Sports examples to prompt for Math) to test if the model just copies format
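
To make the Greedy Decoding definition and the token-probability (confidence) metric concrete, here is a toy sketch. The `toy_model` lookup table and its probabilities are invented for illustration and do not come from the paper; a real setup would query an LLM for next-token probabilities instead.

```python
import math
from typing import Callable, Dict, List, Tuple

NextTokenFn = Callable[[List[str]], Dict[str, float]]


def greedy_decode(
    next_token_probs: NextTokenFn,
    prefix: List[str],
    eos: str = "<eos>",
    max_steps: int = 20,
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Pick the highest-probability token at each step and record its probability."""
    tokens = list(prefix)
    trace: List[Tuple[str, float]] = []
    for _ in range(max_steps):
        probs = next_token_probs(tokens)
        token, p = max(probs.items(), key=lambda kv: kv[1])  # argmax over the vocabulary
        trace.append((token, p))
        if token == eos:
            break
        tokens.append(token)
    return tokens[len(prefix):], trace


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution keyed on the last token only."""
    table = {
        "A:": {"The": 0.7, "So": 0.3},
        "The": {"answer": 0.9, "coin": 0.1},
        "answer": {"is": 0.95, "was": 0.05},
        "is": {"no": 0.6, "yes": 0.4},
        "no": {"<eos>": 1.0},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


generated, trace = greedy_decode(toy_model, ["A:"])
# Mean log-probability of the chosen tokens: one simple per-sequence confidence summary.
mean_logprob = sum(math.log(p) for _, p in trace) / len(trace)
print(generated, round(mean_logprob, 3))
```

The per-step probabilities in `trace` correspond to the kind of confidence-over-time-steps curve the paper plots for CoT versus False-Answer/Rationale CoT.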