Yoonjeong Park, Hyunjin Kim, Chanyeol Choi, Junseong Kim, Jy-yong Sohn
Yonsei University,
Linq
arXiv (2024)
Reasoning · Benchmark
📝 Paper Summary
Prompt Engineering · Chain-of-Thought (CoT)
CoT-Sep improves LLM reasoning by inserting simple text separators between few-shot exemplars in prompts, effectively chunking information to reduce cognitive overload.
Core Problem
Standard Chain-of-Thought (CoT) prompts pack few-shot exemplars into dense blocks of text, causing 'cognitive overload' for LLMs and making it difficult to distinguish and process individual reasoning steps.
Why it matters:
Densely formatted prompts limit the model's ability to analyze information efficiently, mimicking human limitations in processing unchunked data.
Existing methods to improve CoT often require expensive iterative calls or complex external modules, whereas formatting changes are computationally free.
Optimizing the structural presentation of prompts is a low-resource way to unlock latent reasoning capabilities in existing models.
Concrete Example: In a standard CoT prompt, the answer to Question 1 runs immediately into Question 2 without a visual break. This can confuse the model (e.g., mixing the previous answer into the next question), whereas CoT-Sep inserts '###' or '\n\n\n' to clearly demarcate where one example ends and the next begins.
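As a minimal illustration of this difference (the exemplar text and variable names here are made up, not taken from the paper), the two prompt styles differ only in how exemplars are joined:

```python
# Two few-shot exemplars (question + worked solution), illustrative only.
exemplars = [
    "Q: Tom has 3 apples and buys 2 more. How many does he have?\n"
    "A: He starts with 3 and adds 2, so 3 + 2 = 5. The answer is 5.",
    "Q: A book costs $4 and a pen costs $1. What is the total?\n"
    "A: 4 + 1 = 5. The answer is 5.",
]

# Vanilla CoT: exemplars run together with only a single newline.
vanilla_prompt = "\n".join(exemplars)

# CoT-Sep: an explicit separator ('###' here) demarcates each exemplar.
sep_prompt = "\n###\n".join(exemplars) + "\n###\n"
```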
Key Novelty
CoT-Sep (Separated Chain-of-Thought)
Strategically inserts text separators (like newlines, hashes, or HTML tags) at the end of each few-shot exemplar in the prompt.
Leverages the psychological concept of 'chunking' to help LLMs segment information into manageable portions, enhancing comprehension and reasoning accuracy.
Architecture
Conceptual comparison between Vanilla CoT (densely structured) and CoT-Sep (structured with separators).
Evaluation Highlights
+5.1% accuracy improvement on GSM8K (math reasoning) using GPT-4-Turbo with TripleSkip separators compared to vanilla CoT.
+2.8% accuracy improvement on AQuA (complex math) using GPT-3.5-Turbo with TripleSkip separators.
Consistently outperforms vanilla CoT across LLaMA-2-7B, GPT-3.5, and GPT-4, particularly on more challenging datasets like AQuA.
Breakthrough Assessment
4/10
A simple but effective prompting heuristic. While not a fundamental architectural shift, it offers significant performance gains (up to 5.1%) with zero computational overhead, highlighting the importance of prompt formatting.
⚙️ Technical Details
Problem Definition
Setting: Few-shot in-context learning for complex reasoning tasks (arithmetic and commonsense)
Inputs: A natural language prompt containing k exemplars (question + step-by-step solution) followed by a target question
Outputs: A step-by-step reasoning chain and final answer for the target question
Pipeline Flow
Prompt Construction (Exemplars + Separators)
Inference (LLM Generation)
System Modules
Prompt Constructor
Assembles the few-shot prompt by appending a specific separator string after the answer of each exemplar
Model or implementation: Rule-based formatting
Reasoning Engine
Generates the reasoning path and final answer based on the formatted prompt
Model or implementation: LLM (e.g., GPT-4, LLaMA-2)
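The Prompt Constructor module above amounts to rule-based string assembly; a sketch under my own naming assumptions (`build_cot_sep_prompt` is hypothetical, and the LLM call itself is omitted since the paper treats the model as a black box):

```python
def build_cot_sep_prompt(exemplars, target_question, separator="\n\n\n"):
    """Append the separator after each exemplar's answer, then add the target.

    separator="\n\n\n" corresponds to the paper's TripleSkip;
    "###" would correspond to TripleHash.
    """
    parts = [exemplar + separator for exemplar in exemplars]
    parts.append("Q: " + target_question + "\nA:")
    return "".join(parts)
```

The assembled prompt is then passed unchanged to the reasoning engine (e.g., a single GPT-4 or LLaMA-2 generation call); no extra sampling or verification passes are required.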
Novel Architectural Elements
Structured formatting of in-context exemplars using explicit separator tokens (architectural in the sense of prompt structure design)
Modeling
Base Model: Evaluated on GPT-3.5-Turbo-0613, GPT-4-0613, GPT-4-0125-preview (GPT-4-Turbo), and LLaMA-2-7B
Comparison to Prior Work
vs. Vanilla CoT: Adds structural separators between exemplars to improve readability and chunking.
vs. Complex CoT variants (e.g., Self-Consistency): CoT-Sep is a formatting intervention that requires no multiple sampling or external verifiers, making it computationally cheaper.
vs. Structured Prompting [not cited in paper]: CoT-Sep focuses specifically on the delimiter between few-shot examples rather than internal structure of the reasoning itself.
Limitations
Effectiveness of specific separators varies by model and task; no single separator is universally optimal.
Performance gains are higher on challenging tasks (like AQuA) and marginal on easier ones.
Requires careful placement of separators; placing them within sentences (rather than between exemplars) degrades performance.
Study limited to arithmetic and commonsense reasoning benchmarks.
Tasks: Few-shot prompting on arithmetic and commonsense reasoning tasks
Benchmarks:
GSM8K (Grade School Math Reasoning)
AQuA (Algebra Question Answering, complex math)
CSQA (Commonsense Question Answering)
Metrics:
Accuracy
Statistical methodology: Reported statistics of accuracy values over 3 trials
Experiment Figures
Visualization of separator placement strategies: 'Unit: Exemplar' vs 'Unit: Sentence'.
Main Takeaways
Adding separators (CoT-Sep) consistently improves performance over vanilla CoT, with gains of up to 5.1% on GPT-4-Turbo (GSM8K).
TripleSkip (\n\n\n) is generally the most effective separator, though Heterogeneous CoT-Sep (cycling different separators) also outperforms vanilla CoT, offering a robust default.
Placement matters: Separators must be placed at the end of exemplars (Unit: Exemplar). Placing them between sentences within an exemplar (Unit: Sentence) harms performance by breaking the logical flow.
The method is most beneficial for harder tasks (e.g., AQuA) where the baseline model struggles, supporting the hypothesis that chunking aids in complex cognitive processing.
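The Heterogeneous CoT-Sep variant, which cycles through different separators across exemplars, might be sketched as follows (the exact separator pool and the helper name are assumptions for illustration, not taken from the paper):

```python
from itertools import cycle

# Hypothetical separator pool: TripleSkip, TripleHash, and an HTML-style tag.
SEPARATORS = ["\n\n\n", "\n###\n", "\n<br>\n"]

def build_heterogeneous_prompt(exemplars, target_question):
    """Cycle through the separator pool, appending one after each exemplar."""
    seps = cycle(SEPARATORS)
    parts = [exemplar + next(seps) for exemplar in exemplars]
    parts.append("Q: " + target_question + "\nA:")
    return "".join(parts)
```

With four exemplars, the fourth wraps back to the first separator in the pool, so the cycling is what makes the prompt "heterogeneous" rather than any single separator choice.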
📚 Prerequisite Knowledge
Prerequisites
Understanding of Large Language Models (LLMs)
Familiarity with Chain-of-Thought (CoT) prompting
Basic concept of In-Context Learning (ICL)
Key Terms
CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer.
ICL: In-Context Learning—the ability of a model to learn a task from a few examples provided in the prompt without parameter updates.
Exemplar: A single example (input-output pair) included in the prompt to demonstrate the task to the model.
TripleSkip: A specific separator consisting of three newline characters (\n\n\n).
TripleHash: A specific separator consisting of three hash symbols (###).
Heterogeneous CoT-Sep: A variant of the method where different types of separators are cycled through after distinct exemplars within the same prompt.