Enhancing Chain of Thought Prompting in Large Language Models via Reasoning Patterns

📝 Paper Summary

Chain of Thought (CoT) Prompting
In-Context Learning (ICL)

Pattern-CoT enhances reasoning by selecting demonstrations based on diverse logical reasoning patterns (operational sequences) rather than semantic similarity, mitigating noise and improving interpretability.

Core Problem

Existing unsupervised CoT methods select demonstrations based on semantic similarity of questions, which introduces irrelevant noise and obscures the actual logical steps needed for reasoning.

Why it matters:

Semantic similarity often fails to capture the underlying logical structure required to solve complex reasoning tasks (e.g., math problems)
Current selection methods lack interpretability, making it difficult to understand why specific demonstrations work or fail
Choosing the number of demonstrations ($k$) by heuristic is inefficient: adding more examples does not always help, and too few can miss key reasoning patterns

Concrete Example: In a case study, Auto-CoT selects demonstrations based on question semantics (e.g., 'buying apples'), but the logic required is subtraction. Because the semantic match is superficial, the model gets distracted by irrelevant context and produces a wrong answer. Pattern-CoT selects a demonstration with the correct 'subtraction pattern' regardless of topic, leading to the correct result.

Key Novelty

Pattern-CoT (Pattern-based Chain of Thought)

Extracts 'reasoning patterns' (sequences of operation tokens like +, -, *) from rationales instead of embedding the full text
Clusters these patterns to identify distinct logical strategies available in the data, ensuring the demonstration set covers diverse reasoning types
Determines the optimal number of demonstrations dynamically based on the number of unique operation types found in the task

Architecture

The overall workflow of Pattern-CoT. It illustrates the transition from Rationales -> Pattern Discovery -> Pattern Clustering -> Demonstration Selection -> Final Inference.

Evaluation Highlights

Outperforms Auto-CoT by +3.9% on MultiArith and +6.0% on SVAMP using LLaMA-2-7B
Improves on Auto-CoT by +7.0% on Coin-Flip, where operation patterns are not explicitly defined but are discovered by an LLM
Demonstrates robustness across model sizes, improving LLaMA-2-13B performance on AddSub by +2.5% over Auto-CoT

Breakthrough Assessment

7/10

Offers a logical, interpretable shift from semantic to pattern-based selection in CoT. While the gains are consistent, the method relies on identifying 'operations' which may be harder for non-algorithmic tasks.

⚙️ Technical Details

Problem Definition

Setting: Few-shot Chain-of-Thought prompting for reasoning tasks

Inputs: A test question $Q$ and a pool of unlabeled training questions/rationales

Outputs: A prompt containing selected demonstrations followed by the test question, enabling the LLM to generate the final answer
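
A minimal sketch of how such a prompt might be assembled. The `Q:`/`A:` template, the trailing trigger phrase, and the name `build_prompt` are illustrative assumptions, not the paper's exact format.

```python
def build_prompt(demonstrations: list[tuple[str, str]], test_question: str) -> str:
    """Concatenate (question, rationale) demonstrations, then append the test question."""
    # Each demonstration is rendered as a question followed by its rationale (assumed template).
    blocks = [f"Q: {question}\nA: {rationale}" for question, rationale in demonstrations]
    # The test question is left open so the LLM continues with its own reasoning.
    blocks.append(f"Q: {test_question}\nA: Let's think step by step.")
    return "\n\n".join(blocks)


demos = [
    ("Tom has 5 apples and eats 2. How many are left?", "5 - 2 = 3. The answer is 3."),
    ("A box holds 4 rows of 6 eggs. How many eggs?", "4 * 6 = 24. The answer is 24."),
]
prompt = build_prompt(demos, "Ann had 9 pens and gave away 4. How many remain?")
print(prompt)
```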

Pipeline Flow

Rationale Generation
Pattern Discovery
Pattern Clustering & Selection
Inference

System Modules

Rationale Generator

Generate reasoning steps for the training set using Zero-Shot-CoT if rationales are missing

Model or implementation: LLaMA-2-7B (or similar LLM)
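
A hedged sketch of this step. The `llm` argument is a hypothetical callable (prompt in, text out) standing in for whatever model client is used, and the `Q:`/`A:` template is an assumption; only the trigger phrase comes from Zero-Shot-CoT itself.

```python
from typing import Callable

ZERO_SHOT_TRIGGER = "Let's think step by step."


def generate_rationales(questions: list[str], llm: Callable[[str], str]) -> dict[str, str]:
    """Ask the model for a step-by-step rationale for every question that lacks one."""
    rationales = {}
    for question in questions:
        # Assumed template: the trigger phrase opens the answer so the model keeps reasoning.
        zero_shot_prompt = f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"
        rationales[question] = llm(zero_shot_prompt).strip()
    return rationales


def fake_llm(prompt_text: str) -> str:
    """Toy stand-in for a real model call, used only to make the sketch runnable."""
    return " There are 3 + 4 = 7 birds. The answer is 7. "


generated = generate_rationales(["3 birds are joined by 4 more. How many birds?"], fake_llm)
```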

Pattern Extractor

Identify specific operation tokens (e.g., +, -, or custom tokens) in rationales to form reasoning patterns

Model or implementation: Rule-based or LLM-guided (GPT-4)
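
The rule-based variant could look roughly like the sketch below. The operator set, the regular expression, and the name `extract_pattern` are assumptions for illustration; the LLM-guided variant for tasks without explicit operators is not shown.

```python
import re

# Assumed rule: an operator counts only when it sits between two digits,
# so hyphens inside ordinary words are ignored.
OPERATION_RE = re.compile(r"\d\s*([+\-*/])\s*(?=\d)")


def extract_pattern(rationale: str) -> list[str]:
    """Return the operation tokens of a rationale in the order they appear."""
    return OPERATION_RE.findall(rationale)


example_pattern = extract_pattern(
    "She had 12 - 5 = 7 well-known apples, then packed 7 * 3 = 21 boxes."
)
print(example_pattern)  # ['-', '*']
```

A rule this simple would also treat a date such as 2020-01-05 as two subtractions, so a real extractor would likely need task-specific filtering.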

Demonstration Selector

Cluster patterns to find diverse reasoning types and select representative examples

Model or implementation: Sentence-BERT (encoder) + k-means (clustering)
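
A self-contained sketch of the clustering idea. To stay dependency-free it swaps in operation-count vectors for the Sentence-BERT encoder and a plain Lloyd's k-means with deterministic initialisation; both substitutions, and all names below, are assumptions rather than the paper's implementation.

```python
import math
from collections import Counter

OPS = ("+", "-", "*", "/")  # assumed operator vocabulary


def encode(pattern):
    """Stand-in encoder: one count per operation type (the source uses Sentence-BERT)."""
    counts = Counter(pattern)
    return [float(counts[op]) for op in OPS]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, initialised with the first k distinct points."""
    centroids = []
    for point in points:
        if point not in centroids:
            centroids.append(list(point))
        if len(centroids) == k:
            break
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its members; empty clusters stay where they are.
        centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return centroids


def select_demonstrations(examples, k):
    """examples: (question, rationale, pattern) triples. Pick the example nearest each centroid."""
    vectors = [encode(pattern) for _, _, pattern in examples]
    selected = []
    for centroid in kmeans(vectors, k):
        best = min(range(len(examples)), key=lambda i: math.dist(vectors[i], centroid))
        if examples[best] not in selected:
            selected.append(examples[best])
    return selected


pool = [
    ("q1", "2 + 3 = 5", ["+"]),
    ("q2", "2 + 3 + 4 = 9", ["+", "+"]),
    ("q3", "9 - 4 = 5, 5 * 2 = 10", ["-", "*"]),
    ("q4", "9 - 4 = 5, 5 * 2 * 3 = 30", ["-", "*", "*"]),
]
chosen = select_demonstrations(pool, k=2)
```

On this toy pool the two clusters separate addition-only rationales from subtract-then-multiply ones, so one demonstration of each kind is chosen regardless of question wording.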

Inference Engine

Generate final answers using the constructed CoT prompt

Model or implementation: LLaMA-2-7B / 13B / GPT-3.5-turbo / Qwen-7B

Novel Architectural Elements

Pattern-based clustering: Clusters examples based on the sequence of logical operations (reasoning patterns) rather than the semantic embedding of the question text
Adaptive k-determination: Dynamically calculates the number of demonstrations ($k$) based on the diversity of discovered operation tokens ($k = \lceil N_{\text{ops}} / 2 \rceil + \lceil \log(\text{sample size}) \rceil$)
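
The adaptive-$k$ rule can be evaluated directly, as in the sketch below. The summary does not state the base of the logarithm, so the natural-log default here is an assumption, exposed as the hypothetical `log_base` parameter.

```python
import math


def adaptive_k(num_operation_types: int, sample_size: int, log_base: float = math.e) -> int:
    """k = ceil(N_ops / 2) + ceil(log(sample size)); the log base is an assumption."""
    half_ops = math.ceil(num_operation_types / 2)
    log_term = math.ceil(math.log(sample_size, log_base))
    return half_ops + log_term


print(adaptive_k(4, 600))               # 9: ceil(4/2) = 2, ceil(ln 600) = 7
print(adaptive_k(4, 600, log_base=10))  # 5: ceil(4/2) = 2, ceil(log10 600) = 3
```

The gap between the two results shows how much the choice of base matters, so it should be checked against the paper or the released code before reuse.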

Modeling

Base Model: LLaMA-2-7B and LLaMA-2-13B (primary evaluation); GPT-3.5-turbo and Qwen-7B (scalability checks)

Comparison to Prior Work

vs. Auto-CoT: Auto-CoT relies on question semantics which introduces noise; Pattern-CoT relies on extracted reasoning paths (operations), improving logical relevance.
vs. Random-CoT: Random selection misses diverse patterns; Pattern-CoT explicitly maximizes pattern diversity via clustering.
vs. Complexity-based CoT [not cited in paper]: Complexity methods select hard examples (many steps); Pattern-CoT selects diverse *types* of operations, regardless of length.

Limitations

Dependency on accurate extraction of operation tokens; if the task has no clear 'operations' (e.g., creative writing), pattern extraction is difficult.
Requires an initial pass to generate rationales (Zero-Shot) if they don't exist, which adds computational cost.
Selection is unsupervised, so the quality of the generated rationales still bounds performance: demonstrations may be incorrect, though the paper argues that the pattern matters more than correctness.

Reproducibility

Code: https://github.com/Magicat128/Pattern-CoT

📊 Experiments & Results

Evaluation Setup

Few-shot CoT reasoning on 8 diverse datasets

Benchmarks:

MultiArith (Arithmetic Reasoning)
GSM8K (Grade School Math)
AddSub (Arithmetic: Addition/Subtraction)
AQuA-RAT (Algebraic Reasoning, Multiple Choice)
SingleEq (Single Equation Math)
SVAMP (Math Word Problems)
Coin-Flip (Symbolic Reasoning)
BIG-bench Date Understanding (Logical Reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Pattern-CoT consistently outperforms baselines on LLaMA-2-7B across various reasoning tasks.
Benchmark	Metric	Baseline	This Paper	Δ
MultiArith	Accuracy	90.8	94.7	+3.9
SVAMP	Accuracy	63.7	69.7	+6.0
Coin-Flip	Accuracy	46.4	53.4	+7.0
GSM8K	Accuracy	39.4	40.9	+1.5

Ablation study on demonstration subsets shows that covering the full set of operations is crucial.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	38.5	40.9	+2.4
MultiArith	Accuracy	89.3	95.5	+6.2

Experiment Figures

Perturbation-based attribution analysis comparing Auto-CoT and Pattern-CoT.

Impact of different reasoning pattern subsets (Basic vs. Full operations) on GSM8K and AQuA accuracy.

Main Takeaways

Pattern-CoT outperforms semantic-based selection (Auto-CoT) consistently, especially on tasks with clear operational logic (arithmetic, symbolic).
Using an adaptive number of demonstrations ($k$) based on operation counts further improves performance on complex tasks like GSM8K and AQuA.
Pattern-CoT remains effective even when selected demonstrations contain incorrect answers, which suggests that the *pattern* of reasoning matters more than the correctness of the example.
Attribution analysis (perturbation-based) shows that Pattern-CoT helps the model attend to relevant logical tokens rather than getting distracted by irrelevant semantic context.
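
For intuition, perturbation-based attribution can be sketched as leave-one-out occlusion: remove one token at a time and record how much a score drops. The paper's exact procedure is not described here, so this generic variant, the name `occlusion_attribution`, and the toy `score_fn` are all assumptions.

```python
from typing import Callable, Sequence


def occlusion_attribution(
    tokens: Sequence[str], score_fn: Callable[[Sequence[str]], float]
) -> list[float]:
    """Attribution of token i = score(all tokens) - score(all tokens except i)."""
    base_score = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        perturbed = list(tokens[:i]) + list(tokens[i + 1:])
        attributions.append(base_score - score_fn(perturbed))
    return attributions


def toy_score(tokens: Sequence[str]) -> float:
    """Toy scorer that only rewards operator tokens; a real one would query the LLM."""
    return float(sum(token in {"+", "-", "*", "/"} for token in tokens))


attribution = occlusion_attribution(["5", "apples", "-", "2"], toy_score)
print(attribution)  # [0.0, 0.0, 1.0, 0.0]
```

With a real model, `score_fn` would typically return the probability or log-likelihood of the correct answer given the perturbed prompt.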

📚 Prerequisite Knowledge

Prerequisites

Chain of Thought (CoT) Prompting
In-Context Learning (ICL)
Clustering algorithms (k-means)
Sentence embeddings (Sentence-BERT)

Key Terms

CoT: Chain of Thought, a prompting technique in which models generate intermediate reasoning steps before the final answer

Reasoning Patterns: The sequence of logical operations (e.g., 'addition -> multiplication') used to arrive at a solution, extracted from the rationale

Rationale: The intermediate reasoning text generated by an LLM that explains how to get from the question to the answer

Zero-Shot-CoT: Prompting the model with just 'Let's think step by step' without providing specific examples

Auto-CoT: A baseline method that clusters questions by semantic similarity to select diverse demonstrations automatically

Attribution Analysis: A method to visualize which parts of the input (tokens) contributed most to the model's output generation