Evaluation Setup
The evaluation covers zero-shot and few-shot Chain-of-Thought (CoT) reasoning across three domains: Arithmetic, Commonsense, and Symbolic reasoning.
Benchmarks:
- GSM8K (Arithmetic Reasoning)
- MultiArith (Arithmetic Reasoning)
- AddSub (Arithmetic Reasoning)
- SingleEq (Arithmetic Reasoning)
- CSQA (Commonsense Reasoning)
- StrategyQA (Commonsense Reasoning)
- Last Letter Concatenation (Symbolic Reasoning)
- Coin Flip (Symbolic Reasoning)
Metrics:
- Accuracy (%)
- Statistical methodology: Not explicitly reported in the paper
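Since accuracy is the only reported metric, a minimal sketch of how it is computed (function name and exact-match comparison are illustrative assumptions, not the paper's code):

```python
def accuracy(predictions, gold):
    """Fraction of exact-match answers, reported as a percentage."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Toy example: 46 correct answers out of 100 problems -> 46.0%
print(accuracy([1] * 46 + [0] * 54, [1] * 100))  # 46.0
```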
Key Results
Results on Arithmetic Reasoning tasks, showing improvements over the Zero-Shot-CoT baseline using the proposed Read-and-Control framework:

| Benchmark | Metric | Baseline | This Paper | Δ |
|------------|----------|----------|------------|-------|
| MultiArith | Accuracy | 78.0 | 88.2 | +10.2 |
| GSM8K | Accuracy | 40.0 | 46.0 | +6.0 |
| AddSub | Accuracy | 69.1 | 76.4 | +7.3 |
| SingleEq | Accuracy | 83.6 | 88.7 | +5.1 |

Results on Commonsense Reasoning tasks using LLaMA-2-13B-chat:

| Benchmark | Metric | Baseline | This Paper | Δ |
|------------|----------|----------|------------|------|
| CSQA | Accuracy | 64.6 | 68.8 | +4.2 |
| StrategyQA | Accuracy | 63.2 | 66.8 | +3.6 |

Results on Symbolic Reasoning tasks using LLaMA-2-7B-chat:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-------------|----------|----------|------------|------|
| Last Letter | Accuracy | 21.6 | 30.0 | +8.4 |
| Coin Flip | Accuracy | 65.6 | 73.6 | +8.0 |
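The Δ column is simply the absolute improvement in accuracy points over the baseline; a short sketch reproducing it from the scores reported above (values copied from the tables):

```python
# (baseline accuracy, this paper's accuracy) per benchmark, as reported above
results = {
    "MultiArith": (78.0, 88.2),
    "GSM8K": (40.0, 46.0),
    "AddSub": (69.1, 76.4),
    "SingleEq": (83.6, 88.7),
    "CSQA": (64.6, 68.8),
    "StrategyQA": (63.2, 66.8),
    "Last Letter": (21.6, 30.0),
    "Coin Flip": (65.6, 73.6),
}
for name, (baseline, ours) in results.items():
    print(f"{name}: +{ours - baseline:.1f}")
```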
Main Takeaways
- The Hopfieldian view framework consistently improves CoT accuracy across arithmetic, commonsense, and symbolic reasoning tasks.
- The 'Read' operation effectively localizes reasoning errors (as visualized in qualitative examples), identifying where the model deviates from the 'reasoning' concept.
- The 'Control' operation shows that steering the representation along the identified concept vector can correct reasoning paths without any retraining.
- Improvements are robust across different model sizes (7B and 13B) and task types.
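The 'Control' idea, steering a hidden representation toward a concept direction at inference time, can be sketched roughly as follows. The function name, the `alpha` strength parameter, and the additive update are illustrative assumptions, not the paper's exact intervention:

```python
import numpy as np

def control_step(hidden, concept_vector, alpha=0.1):
    """Nudge a hidden state toward a unit-norm 'reasoning' concept direction.

    hidden: (d,) hidden-state vector at some layer (illustrative).
    concept_vector: (d,) direction identified by the 'Read' step.
    alpha: steering strength (hypothetical hyperparameter).
    """
    direction = concept_vector / np.linalg.norm(concept_vector)
    # Shift the representation along the concept direction; model weights
    # are untouched, so this is an inference-time intervention only.
    return hidden + alpha * direction

# Toy usage: steering increases alignment with the concept direction.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
c = rng.normal(size=8)
steered = control_step(h, c, alpha=0.5)
print(np.dot(steered, c) > np.dot(h, c))  # True
```

Because the update touches only activations, not weights, it matches the takeaway above that reasoning paths can be corrected without retraining.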