Evaluation Setup
The method is evaluated on logical inference, arithmetic search, and competitive mathematics problems.
Benchmarks:
- FOLIO wiki (First-order logic inference)
- Game of 24 (Arithmetic search / Constraint satisfaction)
- MATH (mathematical reasoning: algebra, geometry, etc.)
- AutoTNLI (Tabular Natural Language Inference)
Metrics:
- Accuracy
- Number of visited states (Efficiency)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Results on logic tasks (FOLIO) show CR outperforming CoT methods, especially when data is curated.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FOLIO-wiki | Accuracy | 85.02 | 87.45 | +2.43 |
| FOLIO-wiki-curated | Accuracy | 96.09 | 98.04 | +1.95 |

Game of 24 results demonstrate superior search efficiency and success rate compared to Tree-of-Thought.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Game of 24 | Accuracy | 74 | 98 | +24 |
| Game of 24 | # Visited States | 61.72 | 14.86 | -46.86 |

MATH benchmark results show CR enhances mathematical reasoning, particularly when combined with a code environment.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH | Overall Accuracy | 53.80 | 58.00 | +4.20 |
| MATH (Level 5) | Accuracy | 22.4 | 32.1 | +9.7 |
| MATH | Overall Accuracy (w/ Code) | 61.6 | 72.2 | +10.6 |
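To make the "# Visited States" metric concrete, here is a toy exhaustive solver for Game of 24 that counts every intermediate state it explores. This is an illustrative sketch only: the function name, counting scheme, and search order are assumptions of this summary, not the paper's implementation (the paper's numbers come from LLM-guided search, not brute force).

```python
from itertools import permutations
from fractions import Fraction

def solve_24(nums):
    """Exhaustive Game of 24 search; returns (found, states_visited).

    Repeatedly combines two numbers with +, -, *, / until one value
    remains; every recursive state is counted as "visited".
    """
    visited = 0

    def search(vals):
        nonlocal visited
        visited += 1
        if len(vals) == 1:
            return vals[0] == 24
        # Ordered pairs cover both a-b / b-a and a/b / b/a.
        for i, j in permutations(range(len(vals)), 2):
            rest = [v for k, v in enumerate(vals) if k not in (i, j)]
            a, b = vals[i], vals[j]
            candidates = [a + b, a - b, a * b]
            if b != 0:
                candidates.append(a / b)  # exact rational arithmetic
            if any(search(rest + [c]) for c in candidates):
                return True
        return False

    return search([Fraction(n) for n in nums]), visited
```

A baseline like this visits far more states than the reported CR average of 14.86, which is the point of the efficiency comparison.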
Main Takeaways
- Decomposition of reasoning into Proposer/Verifier/Reporter roles significantly improves performance over monolithic generation.
- The cumulative DAG structure is more efficient than tree search (ToT), achieving higher accuracy with fewer visited states on search-intensive tasks.
- Integration with external verifiers (e.g., Python code) drastically boosts performance on math tasks, surpassing previous code-aided methods like PAL and ToRA.
- Ablation studies confirm that both the Verifier role and the cumulative context are essential for the performance gains.
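The Proposer/Verifier/Reporter decomposition over a cumulative context can be sketched as a simple loop. The version below is a deterministic, rule-based stand-in (all names here are hypothetical, and the logic is toy modus ponens); in the paper each role is played by an LLM call with a different prompt.

```python
def cumulative_reasoning(premises, rules, goal, max_steps=10):
    """Toy sketch of the CR loop: Proposer suggests a new fact,
    Verifier checks it against the cumulative context, Reporter
    stops once the goal is derived.

    `rules` is a list of (antecedents, consequent) pairs.
    """
    context = set(premises)  # cumulative set of verified facts
    for _ in range(max_steps):
        if goal in context:  # Reporter: goal reached
            return True
        # Proposer: pick a rule whose antecedents all hold
        proposal = next(
            (c for ants, c in rules
             if c not in context and all(a in context for a in ants)),
            None,
        )
        if proposal is None:  # nothing new can be derived
            return False
        # Verifier: independently re-check the derivation
        verified = any(
            c == proposal and all(a in context for a in ants)
            for ants, c in rules
        )
        if verified:
            context.add(proposal)  # grow the cumulative context
    return goal in context
```

The cumulative context plays the role of the DAG's verified nodes: every accepted proposition stays available to all later steps, unlike a tree search that backtracks and discards branches.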