Evaluation Setup
The method is evaluated on logical inference, arithmetic search, and competitive mathematics problems.
Benchmarks:
- FOLIO wiki (First-order logic inference)
- Game of 24 (Arithmetic search / Constraint satisfaction)
- MATH (mathematical reasoning: algebra, geometry, etc.)
- AutoTNLI (Tabular Natural Language Inference)
Metrics:
- Accuracy
- Number of visited states (Efficiency)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Results on logic tasks (FOLIO) show CR outperforming CoT methods, especially when data is curated.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FOLIO-wiki | Accuracy | 85.02 | 87.45 | +2.43 |
| FOLIO-wiki-curated | Accuracy | 96.09 | 98.04 | +1.95 |

Game of 24 results demonstrate superior search efficiency and success rate compared to Tree-of-Thought.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Game of 24 | Accuracy | 74 | 98 | +24 |
| Game of 24 | # Visited States | 61.72 | 14.86 | -46.86 |

MATH benchmark results show CR enhances mathematical reasoning, particularly when combined with a code environment.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH | Overall Accuracy | 53.80 | 58.00 | +4.20 |
| MATH (Level 5) | Accuracy | 22.4 | 32.1 | +9.7 |
| MATH | Overall Accuracy (w/ Code) | 61.6 | 72.2 | +10.6 |
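To make the "# Visited States" metric concrete, here is a toy exhaustive solver for Game of 24 that counts every intermediate state it explores. This is an illustrative sketch only: the function name, counting scheme, and search order are assumptions of this summary, not the paper's implementation (the paper's numbers come from LLM-guided search, not brute force).

```python
from itertools import permutations
from fractions import Fraction

def solve_24(nums):
    """Exhaustive Game of 24 search; returns (found, states_visited).

    Repeatedly combines two numbers with +, -, *, / until one value
    remains; every recursive state is counted as "visited".
    """
    visited = 0

    def search(vals):
        nonlocal visited
        visited += 1
        if len(vals) == 1:
            return vals[0] == 24
        # Ordered pairs cover both a-b / b-a and a/b / b/a.
        for i, j in permutations(range(len(vals)), 2):
            rest = [v for k, v in enumerate(vals) if k not in (i, j)]
            a, b = vals[i], vals[j]
            candidates = [a + b, a - b, a * b]
            if b != 0:
                candidates.append(a / b)  # exact rational arithmetic
            if any(search(rest + [c]) for c in candidates):
                return True
        return False

    return search([Fraction(n) for n in nums]), visited
```

A baseline like this visits far more states than the reported CR average of 14.86, which is the point of the efficiency comparison.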
Main Takeaways
- Decomposition of reasoning into Proposer/Verifier/Reporter roles significantly improves performance over monolithic generation.
- The cumulative DAG structure is more efficient than tree search (ToT), achieving higher accuracy with fewer visited states on search-intensive tasks.
- Integration with external verifiers (e.g., Python code) drastically boosts performance on math tasks, surpassing previous code-aided methods like PAL and ToRA.
- Ablation studies confirm that both the Verifier role and the cumulative context are essential for the performance gains.
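The Proposer/Verifier/Reporter decomposition over a cumulative context can be sketched as a simple loop. The version below is a deterministic, rule-based stand-in (all names here are hypothetical, and the logic is toy modus ponens); in the paper each role is played by an LLM call with a different prompt.

```python
def cumulative_reasoning(premises, rules, goal, max_steps=10):
    """Toy sketch of the CR loop: Proposer suggests a new fact,
    Verifier checks it against the cumulative context, Reporter
    stops once the goal is derived.

    `rules` is a list of (antecedents, consequent) pairs.
    """
    context = set(premises)  # cumulative set of verified facts
    for _ in range(max_steps):
        if goal in context:  # Reporter: goal reached
            return True
        # Proposer: pick a rule whose antecedents all hold
        proposal = next(
            (c for ants, c in rules
             if c not in context and all(a in context for a in ants)),
            None,
        )
        if proposal is None:  # nothing new can be derived
            return False
        # Verifier: independently re-check the derivation
        verified = any(
            c == proposal and all(a in context for a in ants)
            for ants, c in rules
        )
        if verified:
            context.add(proposal)  # grow the cumulative context
    return goal in context
```

The cumulative context plays the role of the DAG's verified nodes: every accepted proposition stays available to all later steps, unlike a tree search that backtracks and discards branches.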