Yoonjeong Park, Hyunjin Kim, Chanyeol Choi, Junseong Kim, Jy-yong Sohn
Yonsei University,
Linq
arXiv (2024)
Reasoning · Benchmark
📝 Paper Summary
Prompt Engineering · Chain-of-Thought (CoT)
CoT-Sep improves LLM reasoning by inserting simple text separators between few-shot exemplars in prompts, effectively chunking information to reduce cognitive overload.
Core Problem
Standard Chain-of-Thought (CoT) prompts pack few-shot exemplars into dense blocks of text, causing 'cognitive overload' for LLMs and making it difficult to distinguish and process individual reasoning steps.
Why it matters:
Densely formatted prompts limit the model's ability to analyze information efficiently, mimicking human limitations in processing unchunked data.
Existing methods to improve CoT often require expensive iterative calls or complex external modules, whereas formatting changes are computationally free.
Optimizing the structural presentation of prompts is a low-resource way to unlock latent reasoning capabilities in existing models.
Concrete Example: In a standard CoT prompt, the answer to Question 1 runs immediately into Question 2 without a visual break. This can confuse the model (e.g., mixing the previous answer into the next question), whereas CoT-Sep inserts '###' or '\n\n\n' to clearly demarcate where one example ends and the next begins.
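As a minimal illustration of this difference (the exemplar text and variable names here are made up, not taken from the paper), the two prompt styles differ only in how exemplars are joined:

```python
# Two few-shot exemplars (question + worked solution), illustrative only.
exemplars = [
    "Q: Tom has 3 apples and buys 2 more. How many does he have?\n"
    "A: He starts with 3 and adds 2, so 3 + 2 = 5. The answer is 5.",
    "Q: A book costs $4 and a pen costs $1. What is the total?\n"
    "A: 4 + 1 = 5. The answer is 5.",
]

# Vanilla CoT: exemplars run together with only a single newline.
vanilla_prompt = "\n".join(exemplars)

# CoT-Sep: an explicit separator ('###' here) demarcates each exemplar.
sep_prompt = "\n###\n".join(exemplars) + "\n###\n"
```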
Key Novelty
CoT-Sep (Separated Chain-of-Thought)
Strategically inserts text separators (like newlines, hashes, or HTML tags) at the end of each few-shot exemplar in the prompt.
Leverages the psychological concept of 'chunking' to help LLMs segment information into manageable portions, enhancing comprehension and reasoning accuracy.
Architecture
Conceptual comparison between Vanilla CoT (densely structured) and CoT-Sep (structured with separators).
Evaluation Highlights
+5.1% accuracy improvement on GSM8K (math reasoning) using GPT-4-Turbo with TripleSkip separators compared to vanilla CoT.
+2.8% accuracy improvement on AQuA (complex math) using GPT-3.5-Turbo with TripleSkip separators.
Consistently outperforms vanilla CoT across LLaMA-2-7B, GPT-3.5, and GPT-4, particularly on more challenging datasets like AQuA.
Breakthrough Assessment
4/10
A simple but effective prompting heuristic. While not a fundamental architectural shift, it offers significant performance gains (up to 5.1%) with zero computational overhead, highlighting the importance of prompt formatting.
⚙️ Technical Details
Problem Definition
Setting: Few-shot in-context learning for complex reasoning tasks (arithmetic and commonsense)
Inputs: A natural language prompt containing k exemplars (question + step-by-step solution) followed by a target question
Outputs: A step-by-step reasoning chain and final answer for the target question
Pipeline Flow
Prompt Construction (Exemplars + Separators)
Inference (LLM Generation)
System Modules
Prompt Constructor
Assembles the few-shot prompt by appending a specific separator string after the answer of each exemplar
Model or implementation: Rule-based formatting
Reasoning Engine
Generates the reasoning path and final answer based on the formatted prompt
Model or implementation: LLM (e.g., GPT-4, LLaMA-2)
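The Prompt Constructor module above amounts to rule-based string assembly; a sketch under my own naming assumptions (`build_cot_sep_prompt` is hypothetical, and the LLM call itself is omitted since the paper treats the model as a black box):

```python
def build_cot_sep_prompt(exemplars, target_question, separator="\n\n\n"):
    """Append the separator after each exemplar's answer, then add the target.

    separator="\n\n\n" corresponds to the paper's TripleSkip;
    "###" would correspond to TripleHash.
    """
    parts = [exemplar + separator for exemplar in exemplars]
    parts.append("Q: " + target_question + "\nA:")
    return "".join(parts)
```

The assembled prompt is then passed unchanged to the reasoning engine (e.g., a single GPT-4 or LLaMA-2 generation call); no extra sampling or verification passes are required.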
Novel Architectural Elements
Structured formatting of in-context exemplars using explicit separator tokens (architectural in the sense of prompt structure design)
Modeling
Base Model: Evaluated on GPT-3.5-Turbo-0613, GPT-4-0613, GPT-4-0125-preview (GPT-4-Turbo), and LLaMA-2-7B
Comparison to Prior Work
vs. Vanilla CoT: Adds structural separators between exemplars to improve readability and chunking.
vs. Complex CoT variants (e.g., Self-Consistency): CoT-Sep is a formatting intervention that requires no multiple sampling or external verifiers, making it computationally cheaper.
vs. Structured Prompting [not cited in paper]: CoT-Sep focuses specifically on the delimiter between few-shot examples rather than internal structure of the reasoning itself.
Limitations
Effectiveness of specific separators varies by model and task; no single separator is universally optimal.
Performance gains are higher on challenging tasks (like AQuA) and marginal on easier ones.
Requires careful placement of separators; placing them within sentences (rather than between exemplars) degrades performance.
Study limited to arithmetic and commonsense reasoning benchmarks.
Tasks: Few-shot prompting on arithmetic and commonsense reasoning tasks
Benchmarks:
GSM8K (Grade School Math Reasoning)
AQuA (Algebra Question Answering, complex math)
CSQA (Commonsense Question Answering)
Metrics:
Accuracy
Statistical methodology: Reported statistics of accuracy values over 3 trials
Experiment Figures
Visualization of separator placement strategies: 'Unit: Exemplar' vs 'Unit: Sentence'.
Main Takeaways
Adding separators (CoT-Sep) consistently improves performance over vanilla CoT, with gains of up to 5.1% on GPT-4-Turbo (GSM8K).
TripleSkip (\n\n\n) is generally the most effective separator, though Heterogeneous CoT-Sep (cycling different separators) also outperforms vanilla CoT, offering a robust default.
Placement matters: Separators must be placed at the end of exemplars (Unit: Exemplar). Placing them between sentences within an exemplar (Unit: Sentence) harms performance by breaking the logical flow.
The method is most beneficial for harder tasks (e.g., AQuA) where the baseline model struggles, supporting the hypothesis that chunking aids in complex cognitive processing.
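The Heterogeneous CoT-Sep variant, which cycles through different separators across exemplars, might be sketched as follows (the exact separator pool and the helper name are assumptions for illustration, not taken from the paper):

```python
from itertools import cycle

# Hypothetical separator pool: TripleSkip, TripleHash, and an HTML-style tag.
SEPARATORS = ["\n\n\n", "\n###\n", "\n<br>\n"]

def build_heterogeneous_prompt(exemplars, target_question):
    """Cycle through the separator pool, appending one after each exemplar."""
    seps = cycle(SEPARATORS)
    parts = [exemplar + next(seps) for exemplar in exemplars]
    parts.append("Q: " + target_question + "\nA:")
    return "".join(parts)
```

With four exemplars, the fourth wraps back to the first separator in the pool, so the cycling is what makes the prompt "heterogeneous" rather than any single separator choice.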
📚 Prerequisite Knowledge
Prerequisites
Understanding of Large Language Models (LLMs)
Familiarity with Chain-of-Thought (CoT) prompting
Basic concept of In-Context Learning (ICL)
Key Terms
CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer.
ICL: In-Context Learning—the ability of a model to learn a task from a few examples provided in the prompt without parameter updates.
Exemplar: A single example (input-output pair) included in the prompt to demonstrate the task to the model.
TripleSkip: A specific separator consisting of three newline characters (\n\n\n).
TripleHash: A specific separator consisting of three hash symbols (###).
Heterogeneous CoT-Sep: A variant of the method where different types of separators are cycled through after distinct exemplars within the same prompt.