RCoT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought

📝 Paper Summary

Hallucination suppression, Verification

RCoT improves arithmetic reasoning by asking LLMs to reconstruct the original problem from their generated solution, comparing the reconstruction to the original to detect and correct factual errors.

Core Problem

Large Language Models often hallucinate conditions, overlook constraints, or misinterpret questions during arithmetic Chain-of-Thought reasoning, leading to incorrect answers despite plausible logic.

Why it matters:

Existing self-verification methods usually provide only coarse-grained feedback (e.g., 'answer is wrong') without explaining why, failing to guide specific corrections
Factual inconsistencies in reasoning steps render LLMs unreliable for complex problem-solving where precision is critical
LLMs struggle to maintain consistency between problem conditions and reasoning steps, often hallucinating numbers or constraints not present in the input

Concrete Example: In a problem stating a meeting is 'tomorrow, 10/16/1924' (so today is 10/15), ChatGPT overlooks 'tomorrow' and calculates based on 10/16 being today. A standard checker might just say 'wrong', but RCoT explicitly flags: 'You overlooked the condition that the meeting is tomorrow.'
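
The date arithmetic behind this example can be checked directly. The snippet below is a minimal sketch that only uses the date given above; the variable names are illustrative and not taken from the paper.

```python
from datetime import date, timedelta

# The problem states the meeting is "tomorrow, 10/16/1924".
meeting = date(1924, 10, 16)

# Correct reading: the meeting is tomorrow, so today is one day earlier.
today_correct = meeting - timedelta(days=1)

# Faulty reading: "tomorrow" is overlooked and the meeting date is taken as today.
today_overlooked = meeting

print(today_correct.strftime("%m/%d/%Y"))     # 10/15/1924
print(today_overlooked.strftime("%m/%d/%Y"))  # 10/16/1924
```

Every later step that builds on "today" inherits the one-day offset, which is why feedback naming the overlooked condition is more useful than a bare "wrong".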

Key Novelty

Reverse Chain-of-Thought (RCoT)

Ask the LLM to reconstruct the problem statement based solely on its generated solution
Decompose both the original and reconstructed problems into structured lists of conditions and compare them item-by-item
Generate fine-grained textual feedback identifying specific hallucinations or overlooked conditions to guide the LLM in revising its answer

Architecture

The four-step RCoT framework: Reconstruction, Decomposition, Comparison, and Revision.

Evaluation Highlights

+4.1% accuracy gain on AQuA dataset (ChatGPT) compared to standard Chain-of-Thought
+5.0% accuracy gain on Date dataset (ChatGPT) compared to standard Chain-of-Thought
Outperforms Self-Consistency on GSM8K (82.0% vs 81.6%) using significantly fewer inference trials (1 vs 30)

Breakthrough Assessment

7/10

Novel approach to self-correction via problem reconstruction. Strong results on hard arithmetic tasks with lower compute than voting methods, but limited to arithmetic/logic domains so far.

⚙️ Technical Details

Problem Definition

Setting: Arithmetic reasoning where an LLM generates a step-by-step solution c for a problem Q

Inputs: Natural language arithmetic problem Q

Outputs: Revised solution c_revised and final answer

Pipeline Flow

Generator (Produce initial solution)
Reconstructor (Reconstruct problem from solution)
Comparator (Decompose and compare original vs. reconstructed problems)
Revisor (Generate new solution based on feedback)
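
The four stages above can be sketched as a single loop. This is a hedged illustration rather than the authors' implementation: the `LLM` callable, the prompt wording, the `CONSISTENT` sentinel, the `max_rounds` parameter, and the question text in the demo are all assumptions made for this example. A scripted fake model stands in for ChatGPT so the sketch runs without any API.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def rcot(problem: str, llm: LLM, max_rounds: int = 1) -> str:
    """Sketch of Generator -> Reconstructor -> Comparator -> Revisor."""
    # Generator: produce the initial step-by-step solution.
    solution = llm(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        # Reconstructor: rebuild the problem from the solution alone.
        reconstructed = llm(
            "Reconstruct the problem from this solution alone:\n" + solution
        )
        # Comparator: decompose both problems into conditions and compare them.
        feedback = llm(
            "List the conditions of each problem, compare them one by one, "
            "and report overlooked or hallucinated conditions. "
            "Reply CONSISTENT if there are none.\n"
            f"Original:\n{problem}\nReconstructed:\n{reconstructed}"
        )
        if feedback.strip() == "CONSISTENT":
            break
        # Revisor: regenerate the solution using the fine-grained feedback.
        solution = llm(
            f"Problem:\n{problem}\nPrevious solution:\n{solution}\n"
            f"Feedback:\n{feedback}\nRevise the solution."
        )
    return solution


def make_scripted_llm(replies: List[str]) -> Tuple[LLM, List[str]]:
    """Fake LLM that replays canned replies in order and records prompts."""
    remaining = iter(replies)
    prompts: List[str] = []

    def llm(prompt: str) -> str:
        prompts.append(prompt)
        return next(remaining)

    return llm, prompts


llm, prompts = make_scripted_llm([
    "Today is 10/16/1924.",                                        # Generator
    "The meeting is today, 10/16/1924. What is today's date?",     # Reconstructor
    "You overlooked the condition that the meeting is tomorrow.",  # Comparator
    "Today is 10/15/1924.",                                        # Revisor
])
final = rcot("The meeting is tomorrow, 10/16/1924. What is today's date?", llm)
print(final)         # Today is 10/15/1924.
print(len(prompts))  # 4
```

When the comparator reports no discrepancy, the loop exits early and the initial solution is returned unchanged, so a consistent answer costs three model calls in this sketch instead of four.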

System Modules

Generator

Generate initial step-by-step solution to the problem

Model or implementation: ChatGPT or LLaMA-13B-Chat

Reconstructor (Inconsistency Detection)

Reconstruct the problem statement based strictly on the generated solution

Model or implementation: ChatGPT or LLaMA-13B-Chat

Comparator (Inconsistency Detection)

Decompose both problems into condition lists and compare them to find discrepancies

Model or implementation: ChatGPT or LLaMA-13B-Chat

Revisor

Revise the solution using the fine-grained feedback

Model or implementation: ChatGPT or LLaMA-13B-Chat

Novel Architectural Elements

Fine-grained comparison via problem decomposition: breaking unstructured problem text into discrete condition lists for element-wise verification
Reverse reconstruction loop: validating reasoning by checking if the solution implies the original problem
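
To make the element-wise comparison concrete, here is a toy sketch that compares two condition lists and emits feedback per item. In RCoT the LLM itself judges whether conditions match; the normalized exact-string matching below is a deliberately simple stand-in, and the function name and feedback wording are assumptions for this example.

```python
from typing import List


def compare_conditions(original: List[str], reconstructed: List[str]) -> List[str]:
    """Return one feedback line per condition that appears in only one list."""

    def norm(text: str) -> str:
        # Case- and whitespace-insensitive key; a stand-in for semantic matching.
        return " ".join(text.lower().split())

    orig = {norm(c): c for c in original}
    recon = {norm(c): c for c in reconstructed}

    feedback = []
    for key, cond in orig.items():
        if key not in recon:  # in the problem, missing from the solution
            feedback.append(f"You overlooked the condition: {cond}")
    for key, cond in recon.items():
        if key not in orig:  # in the solution, absent from the problem
            feedback.append(f"You hallucinated the condition: {cond}")
    return feedback


original = ["The meeting is on 10/16/1924", "The meeting is tomorrow"]
reconstructed = ["The meeting is on 10/16/1924", "The meeting is today"]
for line in compare_conditions(original, reconstructed):
    print(line)
# You overlooked the condition: The meeting is tomorrow
# You hallucinated the condition: The meeting is today
```

Comparing item by item is what lets the feedback name a specific condition, instead of reporting only that the two problem statements differ.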

Modeling

Base Model: ChatGPT (closed-source) and LLaMA-13B-Chat

Comparison to Prior Work

vs. Self-Verification: RCoT provides fine-grained text feedback on *which* condition was violated, rather than just a binary label
vs. Self-Refine: RCoT explicitly detects factual errors via reconstruction, whereas Self-Refine relies on the model's generic ability to critique itself
vs. Self-Consistency: RCoT focuses on fixing the reasoning chain itself rather than statistical aggregation, requiring fewer samples
vs. ROSCOE [not cited in paper]: RCoT is a prompting strategy for correction, while ROSCOE is a metric suite for evaluating step-by-step reasoning quality

Limitations

Depends on the LLM's ability to accurately reconstruct and decompose problems; failure in these steps leads to bad feedback
Higher inference cost (latency) due to multiple steps (reconstruction, decomposition, comparison, revision) compared to standard CoT
Performance drops on multiple-choice tasks (AQuA, Date) when combined with Self-Consistency because models can guess answers with wrong logic
Currently focused on arithmetic reasoning; applicability to open-ended generation or non-logical tasks is untested

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot Chain-of-Thought arithmetic reasoning

Benchmarks:

GSM8K (Math word problems)
AQuA (Multiple-choice algebra word problems)
SVAMP (Math word problems with varying difficulty)
AddSub (Addition and subtraction problems)
ASDiv (Diverse math word problems)
Date (Date understanding and reasoning)
SingleEq (Single equation math problems)

Metrics:

Accuracy
Statistical methodology: Reported average accuracy with standard deviation across three test sub-sets (256 inputs each)
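
As a rough illustration of this reporting scheme, the sketch below averages per-subset accuracy over three subsets of 256 inputs. The correct-answer counts are hypothetical placeholders, not numbers from the paper, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator was used.

```python
from statistics import mean, stdev

SUBSET_SIZE = 256  # inputs per test sub-set, as stated above

# Hypothetical counts of correct answers on each of the three sub-sets.
correct_counts = [210, 205, 214]

# Per-subset accuracy in percent.
accuracies = [100 * c / SUBSET_SIZE for c in correct_counts]

avg = mean(accuracies)
spread = stdev(accuracies)  # sample standard deviation (assumed)

print(f"accuracy: {avg:.1f} +/- {spread:.1f}")
```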

Key Results

Zero-shot performance comparison using ChatGPT backbone. RCoT consistently improves over standard CoT and Double-Check baselines.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	79.0	82.0	+3.0
AQuA	Accuracy	51.3	55.5	+4.2
Date	Accuracy	66.7	71.7	+5.0
SVAMP	Accuracy	76.7	79.6	+2.9

Comparison with Self-Consistency (SC) and Self-Refine. RCoT outperforms SC on GSM8K with fewer samples.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	81.6	82.0	+0.4
GSM8K	Accuracy	80.7	82.0	+1.3

Main Takeaways

Fine-grained feedback is critical: removing reasons from the feedback drops performance below the baseline (e.g., -2.7% on GSM8K), showing that generic 'double check' prompts can be harmful.
Manual feedback upper bound: With human-written fine-grained feedback, ChatGPT reaches 94.6% on GSM8K, suggesting the correction mechanism is strong if detection is accurate.
Decomposition is essential: Coarse-grained comparison between original and reconstructed problems fails to detect errors; decomposing into condition lists is necessary for performance.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Basic understanding of the factual inconsistency types in LLM reasoning (condition hallucination, condition overlooking, question misinterpretation)

Key Terms

RCoT: Reverse Chain-of-Thought, a method that detects reasoning errors by asking the model to reconstruct the problem from its solution and comparing it to the original

Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

Self-Consistency: A decoding strategy that samples multiple reasoning paths and selects the most consistent answer via majority vote

Self-Refine: An iterative framework where an LLM critiques and improves its own output

Hallucination: The generation of content (conditions or numbers) that is not supported by the input source

Condition Overlooking: Failing to incorporate a specific constraint or number from the problem statement into the reasoning process

Condition Hallucination: Inventing constraints or numbers in the reasoning process that do not exist in the original problem

Question Misinterpretation: Answering a different question than what was asked (e.g., calculating duration instead of end time)
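
For contrast with RCoT's single revised chain, the majority vote in the Self-Consistency entry above can be sketched in a few lines. The sampled answers are hypothetical; a real run would draw them from repeated LLM calls (30 in the comparison reported here).

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled reasoning paths.
sampled = ["18", "18", "20", "18", "17"]
print(majority_vote(sampled))  # 18
```

Voting only aggregates final answers and leaves each faulty chain as it is, whereas RCoT tries to repair the chain itself.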