
Cumulative Reasoning with Large Language Models

Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao
Institute for Interdisciplinary Information Sciences, Tsinghua University, Shanghai Qi Zhi Institute
arXiv (2023)
Reasoning Agent Benchmark

📝 Paper Summary

LLM Reasoning Frameworks Neuro-symbolic Reasoning
Cumulative Reasoning enhances LLM problem-solving by orchestrating Proposer, Verifier, and Reporter agents to iteratively build a Directed Acyclic Graph of validated reasoning steps.
Core Problem
LLMs struggle with complex multi-step reasoning because they primarily operate in a linear, 'System 1' intuitive mode, lacking mechanisms to systematically verify intermediate steps or leverage a growing context of proven facts.
Why it matters:
  • Linear Chain-of-Thought (CoT) suffers from error propagation; a single mistake invalidates the entire subsequent chain.
  • Tree-of-Thought (ToT) explores branches but does not explicitly accumulate and reuse all verified knowledge across different branches, leading to redundancy.
  • Current methods often conflate generation and verification, preventing rigorous error-checking necessary for tasks like math and logic.
Concrete Example: In the 'Game of 24' (making 24 from 4 numbers), CoT might hallucinate an intermediate calculation (e.g., '8+8=18') and continue linearly to a wrong answer. Cumulative Reasoning would propose '8+8=16', verify it via code/logic, store it in a graph, and then use that '16' as a trusted premise for the next step.
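The verification step in this example can be done programmatically rather than by the LLM itself. A minimal sketch of such a checker (hypothetical helper, not the authors' code) that re-computes a proposed arithmetic step like '8+8=16':

```python
import operator

# Map operator symbols to their arithmetic functions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def verify_step(proposition: str) -> bool:
    """Re-check a proposed step of the form 'a<op>b=c' by recomputing it."""
    lhs, _, rhs = proposition.partition("=")
    for sym, fn in OPS.items():
        i = lhs.find(sym, 1)  # start at index 1 so a leading minus sign is not taken as the operator
        if i != -1:
            a, b = float(lhs[:i]), float(lhs[i + 1:])
            return abs(fn(a, b) - float(rhs)) < 1e-9
    return False

# The hallucinated step is rejected; the correct one is admitted.
print(verify_step("8+8=18"))  # False
print(verify_step("8+8=16"))  # True
```

Only steps that pass such a check would be stored in the graph and made available as premises for later proposals.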
Key Novelty
Cumulative Reasoning (CR) Framework
  • Decomposes reasoning into three roles: Proposer (suggests steps), Verifier (checks validity), and Reporter (decides when to finalize), emulating human deliberative thought.
  • Maintains a Directed Acyclic Graph (DAG) of *all* validated propositions, allowing the model to combine any previously verified facts to derive new ones, rather than following a single linear chain.
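The interplay of the three roles and the growing DAG can be sketched on a toy arithmetic domain (a simplified stand-in for LLM calls, not the authors' implementation): the Proposer enumerates sums of already-verified numbers, the Verifier re-checks each proposed equation before admitting it, and the Reporter halts once a target value has been derived.

```python
from itertools import combinations

def cumulative_reason(premises, target, max_rounds=10):
    """Derive `target` by accumulating verified sums of `premises` in a DAG."""
    dag = {n: () for n in premises}  # value -> parent values (DAG provenance)
    for _ in range(max_rounds):
        # Proposer: suggest combining any two previously verified facts.
        proposals = [(a, b, a + b) for a, b in combinations(sorted(dag), 2)]
        for a, b, c in proposals:
            # Verifier: re-check the arithmetic before admitting the step.
            if a + b == c and c not in dag:
                dag[c] = (a, b)  # record how c was derived
            # Reporter: finalize once the target is a verified node.
            if target in dag:
                return dag
    return dag

dag = cumulative_reason({3, 5}, 24)
print(24 in dag)        # True: 24 was derived
print(dag[24])          # its DAG parents, e.g. (11, 13)
```

Unlike a linear chain, any verified node (not just the most recent one) can serve as a premise for the next proposal, which is what lets CR reuse facts across what would otherwise be separate branches.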
Evaluation Highlights
  • +24% accuracy improvement on the 'Game of 24' task compared to Tree-of-Thought (ToT), achieving 98% accuracy while visiting ~75% fewer states.
  • Achieved 98.04% accuracy on the curated FOLIO wiki logic dataset using GPT-4, reducing the error rate significantly compared to CoT-SC (96.09%).
  • +43% relative improvement on the hardest Level 5 MATH problems compared to Complex CoT (32.1% vs 22.4%), and 72.2% overall accuracy on MATH when integrated with a code environment.
Breakthrough Assessment
8/10
CR proposes a significant structural shift from linear chains/trees to cumulative graphs (DAGs) with distinct verification roles. The empirical gains on hard reasoning tasks (Game of 24, MATH) are substantial.