Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?

📝 Paper Summary

In-Context Learning (ICL) | Robustness of LLMs | Chain-of-Thought (CoT) Prompting

The paper investigates LLM vulnerability to irrelevant or inaccurate reasoning steps in prompts and proposes a contrastive denoising method that uses a single clean example to rectify noisy rationales.

Core Problem

Large Language Models (LLMs) are highly vulnerable to 'noisy rationales': demonstrations containing irrelevant or inaccurate reasoning steps. Such rationales significantly degrade performance even when the final answers in the examples are correct.

Why it matters:

Noisy rationales are common in real-world data sources like crowd-sourced platforms, dialogue systems, and machine-generated data, making robustness critical for practical deployment.
Existing research focuses on noisy questions or answers, leaving the specific impact of flawed intermediate reasoning steps (Noisy-R) under-explored.
Standard robustness methods like self-consistency and self-correction fail to effectively handle noisy rationales, sometimes performing worse than base models.

Concrete Example: In a math problem using base-9 addition, a prompt might include a 'factually inaccurate' thought like '5+5=10' (incorrect in base-9, where 5+5=11). Alternatively, a family relationship problem might include 'irrelevant thoughts' about genetic overlap. Such noise causes GPT-3.5's accuracy to drop by up to 40.4 percentage points.
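To make the base-9 example concrete, the short sketch below adds two numerals in an arbitrary base. The helper names `to_base` and `add_in_base` are illustrative and do not come from the paper.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a numeral in the given base (2..36)."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numerals written in `base` and return the sum in the same base."""
    return to_base(int(a, base) + int(b, base), base)


print(add_in_base("5", "5", 9))   # 11  (the correct base-9 thought)
print(add_in_base("5", "5", 10))  # 10  (the 'inaccurate' thought is just base-10 arithmetic)
```

The inaccurate thought is plausible precisely because it is correct in the more familiar base-10 setting.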

Key Novelty

Contrastive Denoising with noisy Chain-of-Thought (CD-CoT)

Leverages a single clean demonstration to 'denoise' multiple noisy examples by asking the LLM to contrast the noisy rationales against the clean one.
Follows an exploration-exploitation principle: first rephrasing and selecting high-quality rationales (explicit denoising), then generating diverse reasoning paths and voting on the final answer.

Architecture

The CD-CoT framework pipeline illustrating the four steps: Rationale Rephrasing, Rationale Selecting, Rationale Exploring, and Answer Voting.

Evaluation Highlights

GPT-3.5 accuracy drops significantly with inaccurate thoughts (up to -40.4%) compared to clean rationales, highlighting intrinsic vulnerability.
CD-CoT achieves an average accuracy improvement of 17.8% over the base model across three reasoning domains (Math, Symbolic, Commonsense).
CD-CoT outperforms robustness baselines such as SC (Self-Consistency) and SD (Self-Denoise), showing stronger denoising capabilities with minimal supervision.

Breakthrough Assessment

7/10

Identifies a distinct, under-explored failure mode (noisy rationales vs. noisy questions) and provides both a benchmark (NoRa) and a practical solution (CD-CoT). The reliance on a clean example is a constraint but realistic.

⚙️ Technical Details

Problem Definition

Setting: Few-shot Chain-of-Thought prompting where demonstration rationales contain noise (irrelevant or inaccurate steps) but questions and final answers are correct.

Inputs: Test question x_test and a set of noisy few-shot examples S_n (containing questions, noisy rationales, and answers) plus one clean example.

Outputs: Predicted answer y_test
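The setting above can be pictured as a small data structure plus a prompt builder. This is a minimal sketch: the `Demonstration` class, the `build_prompt` function, and the `Q:`/`A:` layout are assumptions made for illustration, not the paper's actual prompt format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Demonstration:
    question: str
    rationale: str  # may contain irrelevant or inaccurate thoughts
    answer: str     # assumed correct in the Noisy-R setting


def build_prompt(demos: Sequence[Demonstration], x_test: str) -> str:
    """Concatenate few-shot demonstrations and append the test question."""
    blocks = [f"Q: {d.question}\nA: {d.rationale} Answer: {d.answer}" for d in demos]
    blocks.append(f"Q: {x_test}\nA:")
    return "\n\n".join(blocks)


# A noisy demonstration: the second sentence is an irrelevant thought.
noisy = Demonstration(
    question="In base-9, what is 5+5?",
    rationale="5+5=11 in base-9. A nonagon has nine sides.",
    answer="11",
)
print(build_prompt([noisy], "In base-9, what is 6+6?"))
```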

Pipeline Flow

Input: Noisy examples + 1 Clean example
Step 1: Rationale Rephrasing (LLM generates new rationales for noisy examples using the clean one as a guide)
Step 2: Rationale Selecting (LLM selects the best rationale from original vs. rephrased versions)
Step 3: Rationale Exploring (Generate multiple reasoning paths using the selected, denoised rationales)
Step 4: Answer Voting (Aggregating answers from diverse paths)
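The four steps above can be sketched as a single function. This is a control-flow sketch under stated assumptions, not the authors' implementation: `llm` is a hypothetical callable (prompt in, completion out), the prompt strings are placeholders, the A/B selection format is invented, and including the clean example in the final prompt is an assumption. `toy_llm` is a deterministic stand-in used only so the example runs.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Demo:
    question: str
    rationale: str
    answer: str


def cd_cot(llm: Callable[[str], str], noisy: List[Demo], clean: Demo,
           x_test: str, n_paths: int = 5) -> str:
    denoised = []
    for d in noisy:
        # Step 1: rephrase the noisy rationale, using the clean one as a guide.
        rephrased = llm(f"REPHRASE\nclean: {clean.rationale}\nnoisy: {d.rationale}")
        # Step 2: let the model pick between the original (A) and rephrased (B).
        pick = llm(f"SELECT\nA: {d.rationale}\nB: {rephrased}")
        chosen = rephrased if pick.strip().upper().startswith("B") else d.rationale
        denoised.append(replace(d, rationale=chosen))
    # Step 3: explore several reasoning paths with the denoised prompt.
    shots = "\n\n".join(
        f"Q: {d.question}\nA: {d.rationale} Answer: {d.answer}"
        for d in [clean, *denoised]
    )
    answers = [llm(f"SOLVE\n{shots}\n\nQ: {x_test}\nA:") for _ in range(n_paths)]
    # Step 4: majority vote over the final answers.
    return Counter(a.strip() for a in answers).most_common(1)[0][0]


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model; exercises the control flow only."""
    if prompt.startswith("REPHRASE"):
        return "5+5=11 in base-9."
    if prompt.startswith("SELECT"):
        return "B"
    return "13"


clean = Demo("In base-9, what is 4+6?", "4+6=11 in base-9.", "11")
noisy = [Demo("In base-9, what is 5+5?", "5+5=10. 5+5=11 in base-9.", "11")]
print(cd_cot(toy_llm, noisy, clean, "In base-9, what is 6+6?"))  # 13
```

With a real model, Step 3 would need sampling with non-zero temperature so that the explored paths actually differ; the stub returns the same answer every time.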

System Modules

Rationale Rephrasing (Denoising)

Generate corrected versions of the noisy rationales by contrasting them with the single clean example

Model or implementation: GPT-3.5-turbo-0613 (Base LLM)

Rationale Selecting (Denoising)

Choose the most coherent/accurate rationale between the original noisy one and the rephrased one

Model or implementation: GPT-3.5-turbo-0613 (Base LLM)

Rationale Exploring & Voting

Apply standard self-consistency (generating multiple paths and voting) using the cleaned prompt

Model or implementation: GPT-3.5-turbo-0613 (Base LLM)

Novel Architectural Elements

Contrastive Denoising Framework: Explicitly uses a 'clean vs. noisy' comparison step to rectify prompt examples before inference, rather than relying on implicit model robustness.
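One way to picture the explicit 'clean vs. noisy' comparison is as a prompt that shows both rationales side by side. The wording below is entirely hypothetical and written for illustration; the function name `build_contrastive_prompt` and the instructions in the template are assumptions, not the templates used in the paper.

```python
def build_contrastive_prompt(clean_question: str, clean_rationale: str,
                             noisy_question: str, noisy_rationale: str,
                             answer: str) -> str:
    """Assemble a rephrasing prompt that contrasts a noisy rationale with a clean one."""
    return (
        "Below is a reference example with a clean, step-by-step rationale.\n"
        f"Question: {clean_question}\nRationale: {clean_rationale}\n\n"
        "The next example has the correct final answer, but its rationale may "
        "contain irrelevant or inaccurate steps.\n"
        f"Question: {noisy_question}\nRationale: {noisy_rationale}\n"
        f"Answer: {answer}\n\n"
        "Compare it with the reference and rewrite the rationale so that it keeps "
        "only relevant, correct steps and still reaches the same answer.\n"
        "Rewritten rationale:"
    )


print(build_contrastive_prompt(
    "In base-9, what is 4+6?", "4+6=11 in base-9.",
    "In base-9, what is 5+5?", "5+5=10. 5+5=11 in base-9.", "11",
))
```

Keeping the known-correct answer in the prompt gives the model a fixed target, so the rewrite can change the reasoning without changing the label.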

Modeling

Base Model: GPT-3.5-turbo-0613

Comparison to Prior Work

vs. SC/SD: CD-CoT explicitly repairs the *demonstrations* using a reference clean example, whereas SC only aggregates outputs and SD attempts to denoise without a reference standard.
vs. ISC/SP: CD-CoT focuses on correcting the *input rationales* (prompt engineering) rather than correcting the *output* iteratively.
vs. Auto-CoT [not cited in paper]: Auto-CoT generates rationales zero-shot to build prompts; CD-CoT assumes we have prompts but they are noisy/flawed.

Limitations

Requires at least one clean CoT demonstration (gold standard), which may not always be available.
Performance depends on the LLM's intrinsic ability to distinguish clean vs. noisy logic during the selection phase.
Involves multiple inference calls (rephrasing, selecting, exploring), increasing computational cost compared to standard prompting.
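The cost limitation can be made concrete with a back-of-the-envelope call count. The sketch assumes one selection call per noisy example and that voting is done locally without a model call; the parameter names and default values are illustrative, not figures from the paper.

```python
def cd_cot_calls(num_noisy: int, rephrases_per_example: int = 1,
                 num_paths: int = 5) -> int:
    """Estimate LLM calls for one test question under the stated assumptions."""
    rephrase_calls = num_noisy * rephrases_per_example  # Step 1
    select_calls = num_noisy                            # Step 2 (assumed one per example)
    explore_calls = num_paths                           # Step 3
    return rephrase_calls + select_calls + explore_calls  # Step 4 voting is local


print(cd_cot_calls(3))  # 11, versus a single call for standard few-shot prompting
```

If the same demonstrations are reused across test questions, the rephrasing and selecting calls could in principle be cached, leaving only the exploration calls per question.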

Reproducibility

Code: https://github.com/tmlr-group/NoisyRationales

The code is publicly available at the repository above. The NoRa dataset is constructed from existing public datasets (Base Calculation, SCAN, CLUTRR). The prompt templates are not described here and are presumably included in the repository.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting on reasoning tasks with injected noise (irrelevant or inaccurate thoughts) in the examples.

Benchmarks:

NoRa-Math: arithmetic reasoning (base-9 and base-11 addition) [New]
NoRa-Symbolic: symbolic instruction following (derived from the SCAN dataset) [New]
NoRa-Commonsense: family relationship reasoning (derived from the CLUTRR dataset) [New]

Metrics:

Accuracy (%)
Statistical methodology: Reported averages over 300 questions per task, repeated 5 times.
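The reporting scheme (accuracy per run, then averaged over repeated runs) might be computed as follows. The function names are illustrative, and the per-run values in the usage lines are invented placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Percentage of exact-match predictions for a single run."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)


def summarize(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across repeated runs (needs >= 2 runs)."""
    return mean(run_accuracies), stdev(run_accuracies)


# Placeholder per-run accuracies for five repetitions (made-up values).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sd = summarize(runs)
print(f"{avg:.1f} +/- {sd:.2f}")
```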

Key Results

Vulnerability analysis: GPT-3.5 accuracy drops sharply when rationales contain noise.

Benchmark	Metric	Clean rationales	Noisy rationales	Δ
NoRa-Math (Base-9)	Accuracy (%)	63.0	22.6	-40.4
NoRa-Commonsense (Hard)	Accuracy (%)	53.3	33.5	-19.8

Main method performance: CD-CoT compared to the base model and the Self-Consistency (SC) baseline.

Benchmark	Metric	Base model	CD-CoT	Δ
Average across all tasks	Accuracy (%)	Not reported in the paper	Not reported in the paper	+17.8
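The Δ values in the vulnerability rows are differences in percentage points, which is not the same as a relative drop. The sketch below computes both, reading the first accuracy value in each row as clean-rationale accuracy and the second as noisy-rationale accuracy, which is how those rows appear to be laid out. The function name `drop` is illustrative.

```python
from typing import Tuple


def drop(clean: float, noisy: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = round(noisy - clean, 1)
    relative = round(100.0 * (noisy - clean) / clean, 1)
    return absolute, relative


print(drop(63.0, 22.6))  # (-40.4, -64.1)  NoRa-Math (Base-9)
print(drop(53.3, 33.5))  # (-19.8, -37.1)  NoRa-Commonsense (Hard)
```

On the base-9 task, a 40.4-point drop therefore corresponds to losing roughly 64% of the clean-rationale accuracy.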

Experiment Figures

Bar charts comparing accuracy of Base model, SC (Self-Consistency), and CD-CoT across three datasets (Math, Symbolic, Commonsense) under clean, irrelevant, and inaccurate noise conditions.

Main Takeaways

LLMs are intrinsically vulnerable to noisy rationales: inaccurate thoughts (factual errors) are significantly more damaging than irrelevant thoughts.
Standard robustness methods like Self-Correction and Self-Denoise (SD) are largely ineffective against noisy rationales, often performing worse than the base model because they disrupt valid logic or fail without external feedback.
The effect of temperature is non-monotonic: lower temperatures generally help, but accuracy shows multiple peaks, suggesting complex behavior under noise.
The proposed CD-CoT effectively utilizes a single clean example to filter noise, validating the 'contrastive' hypothesis for rationale denoising.

📚 Prerequisite Knowledge

Prerequisites

In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting
Understanding of LLM temperature and sampling
Basic concepts of self-consistency and self-correction in LLMs

Key Terms

Noisy Rationales (Noisy-R): Intermediate reasoning steps in few-shot examples that are either factually inaccurate or contextually irrelevant, despite the final answer being correct.

CD-CoT: Contrastive Denoising with noisy Chain-of-Thought—the proposed method that uses one clean example to filter and correct noisy rationales.

Irrelevant thoughts: Reasoning steps that are factually correct but unhelpful for the specific question (e.g., reciting biological facts during a logic puzzle).

Inaccurate thoughts: Reasoning steps containing factual errors (e.g., calculation errors) but leading to the correct final label in the example.

Self-consistency (SC): A technique where the model generates multiple reasoning paths and selects the most consistent answer via voting.

ICL: In-Context Learning—the ability of models to learn from a few examples provided in the prompt without parameter updates.

NoRa: The benchmark dataset constructed in this paper, containing 26,391 questions across Math, Symbolic, and Commonsense domains with controlled noise ratios.
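To illustrate how a noisy rationale with a controlled noise ratio might be built from a clean one, the sketch below inserts a noisy thought after each clean thought with a fixed probability. This procedure, the function name `inject_noise`, and the example noise pool are assumptions for illustration; the benchmark's actual construction may differ.

```python
import random
from typing import List, Sequence


def inject_noise(thoughts: Sequence[str], noise_pool: Sequence[str],
                 ratio: float, seed: int = 0) -> List[str]:
    """After each clean thought, insert a noisy thought with probability `ratio`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the output is reproducible
    out: List[str] = []
    for thought in thoughts:
        out.append(thought)
        if noise_pool and rng.random() < ratio:
            out.append(rng.choice(noise_pool))
    return out


clean_thoughts = ["In base-9, the digits are 0-8.", "5+5=11 in base-9."]
irrelevant = ["A nonagon has nine sides."]  # true but unhelpful
inaccurate = ["5+5=10."]                    # wrong in base-9

print(inject_noise(clean_thoughts, irrelevant, ratio=0.5))
print(inject_noise(clean_thoughts, inaccurate, ratio=1.0))
```

The clean thoughts and the final answer are left untouched, which matches the definition above: only the intermediate reasoning is corrupted.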