Evaluation Setup
Task: logical reasoning over multiple-choice or True/False/Unknown questions, using symbolic representations.
Benchmarks:
- PrOntoQA: synthetic logical reasoning (FOL)
- ProofWriter: synthetic logical reasoning (FOL)
- FOLIO: natural-language logical reasoning (FOL)
- LogicalDeduction: constraint optimization (CO)
- AR-LSAT: analytical reasoning from the LSAT (CO)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
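Accuracy here is the usual exact-match rate over the benchmark's answer labels. A minimal sketch (the example labels below are hypothetical, not taken from any dataset):

```python
def accuracy(predictions, gold):
    """Percentage of predictions exactly matching the gold answers
    (e.g. 'True'/'False'/'Unknown' or multiple-choice letters)."""
    assert len(predictions) == len(gold), "mismatched lengths"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Hypothetical illustration on four True/False/Unknown items:
preds = ["True", "Unknown", "False", "True"]
gold  = ["True", "Unknown", "True", "True"]
print(accuracy(preds, gold))  # 75.0
```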
Key Results
Comparative performance on First-Order Logic (FOL) datasets shows SymbCoT generally surpassing both pure CoT and external-solver methods (Logic-LM).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FOLIO | Accuracy | 78.92 | 83.33 | +4.41 |
| ProofWriter | Accuracy | 79.66 | 82.50 | +2.84 |
| PrOntoQA | Accuracy | 98.79 | 99.60 | +0.81 |

Performance on Constraint Optimization (CO) datasets demonstrates SymbCoT's flexibility across different symbolic forms.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LogicalDeduction | Accuracy | 87.63 | 93.00 | +5.37 |
| AR-LSAT | Accuracy | 43.04 | 43.91 | +0.87 |

Ablation studies reveal the critical role of the Planner and Solver modules.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ProofWriter | Accuracy | 52.70 | 82.50 | +29.80 |
Main Takeaways
- SymbCoT consistently outperforms the CoT and Logic-LM baselines across all 5 datasets, with larger gains on more complex reasoning tasks (greater reasoning depth).
- The fully LLM-based approach is robust to symbolic syntax errors, achieving 100% execution rate on AR-LSAT where external solvers failed 32.6% of the time.
- Using a Verifier eliminates 'unfaithful' reasoning (correct answer derived from wrong logic), which occurred in 6% of CoT cases.
- The 'Planner' and 'Solver' modules are the most impactful components, contributing ~10.4% improvement on average.
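The modules named above (Translator, Planner, Solver, Verifier) form a fully LLM-based pipeline. The following is an illustrative sketch only, under the assumption that each stage is a separate prompted LLM call; `llm()` and all prompt strings are hypothetical placeholders, not the paper's implementation:

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"<response to: {prompt[:40]}...>"

def symbcot(problem: str) -> str:
    """Illustrative fully LLM-based symbolic CoT pipeline
    (no external solver; every stage is an LLM call)."""
    symbolic = llm(f"Translate into symbolic form: {problem}")      # Translator
    plan = llm(f"Draft a step-by-step derivation plan: {symbolic}") # Planner
    answer = llm(f"Execute the plan symbolically: {plan}")          # Solver
    verdict = llm(f"Verify each step and the final answer: {answer}")  # Verifier
    return verdict
```

Because every stage is an LLM call, a malformed symbolic expression degrades gracefully instead of crashing an external solver, which is consistent with the 100% execution rate reported on AR-LSAT.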