Contrastive Chain-of-Thought Prompting

📝 Paper Summary

Prompt Engineering · Chain-of-Thought Reasoning · In-Context Learning

Contrastive Chain-of-Thought enhances LLM reasoning by including both valid and automatically generated invalid reasoning demonstrations in the prompt, teaching models to avoid common logic errors.

Core Problem

Conventional Chain-of-Thought (CoT) prompting shows only correct examples, so it never tells the model which mistakes to avoid. Prior work also finds that models are often robust to invalid CoT demonstrations, which suggests they may not be attending to the reasoning process at all.

Why it matters:

Mistakes in intermediate reasoning steps can compound, leading to incorrect final answers and hallucinations
Prior work shows LLMs pay little attention to the validity of reasoning chains, undermining the trustworthiness of the generated explanations
Standard prompts do not tell the model which logical faults to avoid, so they miss the way humans learn from negative examples

Concrete Example: In a math problem about dentist visits, a standard CoT prompt might only show the correct subtraction. A model might then hallucinate a different number. Contrastive CoT explicitly shows a 'Wrong Explanation' where the subtraction is done incorrectly (e.g., subtracting from the wrong total), teaching the model to distinguish correct from incorrect operations.

Key Novelty

Contrastive Chain-of-Thought (Contrastive CoT)

Augments few-shot prompts with 'contrastive' pairs: a correct reasoning chain followed by an incorrect one for the same question
Introduces an automatic method to generate negative demonstrations by extracting entities (bridging objects) from valid chains and shuffling them to create incoherent reasoning

Architecture

Figure: a comparison of the prompt structure for Contrastive Chain-of-Thought versus conventional methods

Evaluation Highlights

+16.0 percentage points accuracy on Bamboogle (factual QA) with GPT-3.5-Turbo over conventional Chain-of-Thought (40.8 to 56.8)
+9.8 percentage points accuracy on GSM8K (arithmetic reasoning) with GPT-3.5-Turbo over conventional Chain-of-Thought (69.2 to 79.0)
When combined with Self-Consistency, the gains grow further (e.g., +17.6 percentage points on Bamboogle over CoT-SC)

Breakthrough Assessment

7/10

Simple yet highly effective prompting strategy that generalizes well across tasks. The automatic generation of negative examples makes it practical and scalable without manual annotation.

⚙️ Technical Details

Problem Definition

Setting: Few-shot in-context learning for reasoning tasks

Inputs: A query Q and a set of demonstration examples D containing questions, correct rationales, correct answers, incorrect rationales, and incorrect answers

Outputs: A generated reasoning chain T and final answer A

Pipeline Flow

Demonstration Construction: Generate negative rationales from valid ones via object shuffling
Prompt Assembly: Format prompt with (Question, Correct Rationale, Correct Answer, Incorrect Rationale, Incorrect Answer)
Inference: Query LLM with contrastive prompt
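
The Prompt Assembly step above can be sketched as follows. This is a minimal illustration, not the authors' code: the `ContrastiveDemo` class and the `format_demo` and `build_prompt` helpers are hypothetical names, and the "Answer" and "Wrong Answer" labels are assumptions. Only the "Explanation" and "Wrong Explanation" labels come from the summary itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContrastiveDemo:
    """One few-shot demonstration with a valid and an invalid rationale."""
    question: str
    rationale: str        # valid reasoning chain (T+)
    answer: str           # correct answer (A+)
    wrong_rationale: str  # invalid reasoning chain (T-)
    wrong_answer: str     # incorrect answer (A-)


def format_demo(demo: ContrastiveDemo) -> str:
    """Render one demonstration; the field labels are assumed, not official."""
    return (
        f"Question: {demo.question}\n"
        f"Explanation: {demo.rationale}\n"
        f"Answer: {demo.answer}\n"
        f"Wrong Explanation: {demo.wrong_rationale}\n"
        f"Wrong Answer: {demo.wrong_answer}"
    )


def build_prompt(demos: List[ContrastiveDemo], query: str) -> str:
    """Join the demonstrations and leave the query's explanation open."""
    blocks = [format_demo(d) for d in demos]
    blocks.append(f"Question: {query}\nExplanation:")
    return "\n\n".join(blocks)
```

Ending the prompt with an open "Explanation:" line is one way to nudge the model toward producing a valid chain first; whether the original prompts end exactly this way is not stated in the summary.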

System Modules

Negative Rationale Generator

Automatically creates invalid reasoning examples from valid ones

Model or implementation: SpaCy (en_core_web_trf) for NER
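
A rough sketch of the object-shuffling idea is shown below. It is a simplification under stated assumptions: a regular expression for numbers stands in for the spaCy NER step, so only numeric objects are shuffled, and `shuffle_objects` is a hypothetical helper name. With spaCy, the spans to permute would instead come from the entities in `doc.ents`.

```python
import random
import re

# Assumption: a number regex replaces the NER model used in the paper.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def shuffle_objects(rationale: str, seed: int = 0) -> str:
    """Return the rationale with its numeric objects permuted in place."""
    objects = NUMBER.findall(rationale)
    if len(set(objects)) < 2:
        # Fewer than two distinct objects: no permutation can change the text.
        return rationale
    rng = random.Random(seed)
    shuffled = objects[:]
    # Re-shuffle until the order differs, so the result is actually invalid.
    while shuffled == objects:
        rng.shuffle(shuffled)
    replacements = iter(shuffled)
    # Substitute each matched number with the next one from the shuffled list.
    return NUMBER.sub(lambda _match: next(replacements), rationale)
```

The surrounding wording is left untouched, so the output keeps the language of the valid chain while the objects no longer fit the steps, which matches the 'Incoherent Objects' error type described in this summary.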

LLM Inference

Generates the final answer using the contrastive demonstrations

Model or implementation: GPT-3.5-Turbo (0301)

Novel Architectural Elements

Contrastive Prompt Structure: Prompt schema explicitly separates 'Explanation' (valid) and 'Wrong Explanation' (invalid) within the same few-shot example

Modeling

Base Model: GPT-3.5-Turbo (version 0301)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Chain-of-Thought: CoT uses only valid demonstrations (Q, T, A). Contrastive CoT uses both valid and invalid (Q, T+, A+, T-, A-).
vs. Standard Prompting: Standard prompting uses (Q, A) without reasoning. Contrastive CoT includes reasoning.
vs. Least-to-Most Prompting [not cited in paper]: Least-to-Most decomposes problems into sub-questions; Contrastive CoT focuses on validating reasoning steps via negative examples.

Limitations

Relies on existing valid reasoning chains to generate negative ones (requires CoT annotations)
The automatic shuffling method specifically targets 'Incoherent Objects' errors; may not cover other logic error types like calculation errors or irrelevant steps
Experiments limited to GPT-3.5-Turbo; impact on other model families or sizes not explored in detail

Reproducibility

Code: https://github.com/DAMO-NLP-SG/contrastive-cot

Prompt format and logic for generating negative examples are clearly described. Code is publicly available. Experiments use a closed-source API model (GPT-3.5-Turbo), which may change over time.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting (4-shot) on arithmetic and factual reasoning datasets

Benchmarks:

GSM8K (Arithmetic Reasoning)
Bamboogle (Factual QA)
StrategyQA (Factual QA)
SVAMP (Arithmetic Reasoning)
GSM-Hard (Arithmetic Reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results comparing Contrastive CoT against standard Prompting and conventional Chain-of-Thought (CoT) using GPT-3.5-Turbo.

Benchmark	Metric	Baseline (CoT)	Contrastive CoT	Δ (points)
GSM8K	Accuracy (%)	69.2	79.0	+9.8
Bamboogle	Accuracy (%)	40.8	56.8	+16.0
StrategyQA	Accuracy (%)	55.8	66.2	+10.4
SVAMP	Accuracy (%)	67.2	81.6	+14.4

Results when combining prompting methods with Self-Consistency (SC), a decoding strategy that takes the majority vote of multiple outputs.

Benchmark	Metric	Baseline (CoT-SC)	Contrastive CoT-SC	Δ (points)
GSM8K	Accuracy (%)	71.0	86.2	+15.2
Bamboogle	Accuracy (%)	40.8	58.4	+17.6
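
The Δ values above are absolute differences in accuracy, not relative improvements. The short sketch below recomputes both from the reported numbers; the `gains` helper is a hypothetical name introduced only for this illustration.

```python
from typing import Tuple

# (baseline CoT accuracy, Contrastive CoT accuracy) copied from the main results.
RESULTS = {
    "GSM8K": (69.2, 79.0),
    "Bamboogle": (40.8, 56.8),
    "StrategyQA": (55.8, 66.2),
    "SVAMP": (67.2, 81.6),
}


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (baseline, ours) in RESULTS.items():
    absolute, relative = gains(baseline, ours)
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

For Bamboogle, the +16.0-point gain over a 40.8 baseline corresponds to a relative improvement of roughly 39%.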

Main Takeaways

Contrastive CoT consistently outperforms conventional CoT across all evaluated arithmetic and factual reasoning benchmarks
The method combines well with Self-Consistency: its margin over the CoT baseline grows when multiple decoding paths are sampled (e.g., +15.2 vs. +9.8 points on GSM8K)
Preliminary studies showed that 'Incoherent Objects' (shuffling entities/numbers) was the most effective type of invalid demonstration for improving reasoning, outperforming other error types like irrelevant language
The approach generalizes well, suggesting that teaching models what *not* to do is a powerful signal for reasoning tasks
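
The Self-Consistency step mentioned in the takeaways reduces to a majority vote over the final answers of several sampled reasoning paths. A minimal sketch follows, assuming the answers have already been extracted as strings; `majority_vote` is a hypothetical helper, and the whitespace and case normalization is an illustrative choice rather than something the paper specifies.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled reasoning paths."""
    # Normalize lightly so that "18" and " 18 " count as the same answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no answers to vote on")
    # most_common(1) returns a list holding one (answer, count) pair.
    return Counter(normalized).most_common(1)[0][0]
```

In practice each element of `answers` would come from a separate sampled completion of the same contrastive prompt.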

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with In-Context Learning (ICL)
Basic knowledge of Self-Consistency decoding

Key Terms

Chain-of-Thought (CoT): A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

Contrastive Learning: A learning paradigm where the model learns to distinguish between positive (correct) and negative (incorrect) samples

Bridging objects: Symbolic items (like numbers, equations, or entity names) that the model traverses or manipulates to reach a solution

Self-Consistency: A decoding strategy that samples multiple reasoning paths and selects the most consistent answer (majority vote)

Incoherent Objects: A type of reasoning error where the numbers or entities used in the steps do not match the problem context or logical flow

Rationale: The sequence of intermediate reasoning steps generated by the model