Evaluation Setup
Task: logical reasoning over multiple-choice or True/False/Unknown questions, using symbolic representations.
Benchmarks:
- PrOntoQA: synthetic logical reasoning (FOL)
- ProofWriter: synthetic logical reasoning (FOL)
- FOLIO: natural-language logical reasoning (FOL)
- LogicalDeduction: constraint optimization (CO)
- AR-LSAT: analytical reasoning from the LSAT (CO)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
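Accuracy here is the usual exact-match rate over the benchmark's answer labels. A minimal sketch (the example labels below are hypothetical, not taken from any dataset):

```python
def accuracy(predictions, gold):
    """Percentage of predictions exactly matching the gold answers
    (e.g. 'True'/'False'/'Unknown' or multiple-choice letters)."""
    assert len(predictions) == len(gold), "mismatched lengths"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Hypothetical illustration on four True/False/Unknown items:
preds = ["True", "Unknown", "False", "True"]
gold  = ["True", "Unknown", "True", "True"]
print(accuracy(preds, gold))  # 75.0
```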
Key Results
Comparative performance on First-Order Logic (FOL) datasets shows SymbCoT generally surpassing both pure CoT and external-solver methods (Logic-LM).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FOLIO | Accuracy | 78.92 | 83.33 | +4.41 |
| ProofWriter | Accuracy | 79.66 | 82.50 | +2.84 |
| PrOntoQA | Accuracy | 98.79 | 99.60 | +0.81 |

Performance on Constraint Optimization (CO) datasets demonstrates SymbCoT's flexibility across different symbolic forms.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LogicalDeduction | Accuracy | 87.63 | 93.00 | +5.37 |
| AR-LSAT | Accuracy | 43.04 | 43.91 | +0.87 |

Ablation studies reveal the critical role of the Planner and Solver modules.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ProofWriter | Accuracy | 52.70 | 82.50 | +29.80 |
Main Takeaways
- SymbCoT consistently outperforms the CoT and Logic-LM baselines across all 5 datasets, with larger gains on more complex reasoning tasks (greater reasoning depth).
- The fully LLM-based approach is robust to symbolic syntax errors, achieving 100% execution rate on AR-LSAT where external solvers failed 32.6% of the time.
- Using a Verifier eliminates 'unfaithful' reasoning (correct answer derived from wrong logic), which occurred in 6% of CoT cases.
- The 'Planner' and 'Solver' modules are the most impactful components, contributing ~10.4% improvement on average.
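The modules named above (Translator, Planner, Solver, Verifier) form a fully LLM-based pipeline. The following is an illustrative sketch only, under the assumption that each stage is a separate prompted LLM call; `llm()` and all prompt strings are hypothetical placeholders, not the paper's implementation:

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"<response to: {prompt[:40]}...>"

def symbcot(problem: str) -> str:
    """Illustrative fully LLM-based symbolic CoT pipeline
    (no external solver; every stage is an LLM call)."""
    symbolic = llm(f"Translate into symbolic form: {problem}")      # Translator
    plan = llm(f"Draft a step-by-step derivation plan: {symbolic}") # Planner
    answer = llm(f"Execute the plan symbolically: {plan}")          # Solver
    verdict = llm(f"Verify each step and the final answer: {answer}")  # Verifier
    return verdict
```

Because every stage is an LLM call, a malformed symbolic expression degrades gracefully instead of crashing an external solver, which is consistent with the 100% execution rate reported on AR-LSAT.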