Evaluation Setup
Math word problem solving on the GSM8K dataset
Benchmarks:
- GSM8K (Grade-school math word problems)
Metrics:
- Accuracy (percentage of correct answers)
- Statistical methodology: Not explicitly reported in the paper
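As a minimal sketch of how exact-match accuracy is commonly scored on GSM8K, the snippet below compares the final number in each model output against the reference answer. GSM8K reference solutions end with a `#### <answer>` line. Everything else here is an assumption for illustration and is not taken from the paper: the helper names (`extract_gold`, `extract_prediction`, `accuracy`) are hypothetical, and the rule that the last number in the output is the model's answer is a simplification.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def _to_float(text):
    return float(text.replace(",", ""))

def extract_gold(reference):
    """GSM8K reference solutions end with '#### <final answer>'."""
    return _to_float(reference.split("####")[-1].strip())

def extract_prediction(output):
    """Assumption: the last number in the model output is its final answer."""
    matches = NUMBER.findall(output)
    return _to_float(matches[-1]) if matches else None

def accuracy(outputs, references):
    """Percentage of problems whose predicted number equals the gold number."""
    correct = 0
    for out, ref in zip(outputs, references):
        pred = extract_prediction(out)
        if pred is not None and abs(pred - extract_gold(ref)) < 1e-6:
            correct += 1
    return 100.0 * correct / len(references)
```

Outputs with no number at all count as incorrect rather than raising an error, which keeps the score defined even when the model refuses or rambles.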
Key Results
Main experiments on GSM8K comparing baseline GPT-3.5 against ARM-RAG variants.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Accuracy (%) | 73.2 | 75.3 | +2.1 |
| GSM8K | Accuracy (%) | 73.2 | 77.4 | +4.2 |
Main Takeaways
- Retrieving relevant reasoning chains improves performance over standard prompting, but naive retrieval often fetches superficially similar (same topic) rather than structurally similar (same math logic) problems
- Obfuscating the query (masking nouns/names) forces the dense retriever to focus slightly more on structure, yielding better demonstrations and higher accuracy
- The model's 'upper bound' capability is high (91.9% with multi-attempt voting), which suggests that the main bottleneck is selecting the right context or strategy
- Strong negative prompting (providing incorrect answers as context) has little detrimental effect, whereas strong positive prompting (providing the answer) helps significantly
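The query obfuscation described in the takeaways can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a regex that masks capitalized words stands in for name detection (nouns are not masked here, and a real system would more likely use a POS tagger or NER), and a bag-of-words cosine similarity stands in for the dense retriever. The names `obfuscate`, `retrieve`, `FUNCTION_WORDS`, and the `[MASK]` token are all hypothetical.

```python
import math
import re
from collections import Counter

MASK = "[MASK]"
# Capitalized sentence starters that should NOT be treated as names (crude, incomplete).
FUNCTION_WORDS = {"a", "an", "the", "if", "how", "what", "when", "each",
                  "every", "he", "she", "they", "it", "there", "in", "on",
                  "at", "after", "for"}

def obfuscate(question):
    """Replace capitalized words (a rough proxy for names) with a mask token."""
    def mask(match):
        word = match.group(0)
        return word if word.lower() in FUNCTION_WORDS else MASK
    return re.sub(r"\b[A-Z][a-z]+\b", mask, question)

def bow(text):
    """Bag-of-words vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, memory, k=1):
    """Return the k stored questions most similar to the obfuscated query."""
    q = bow(obfuscate(query))
    ranked = sorted(memory, key=lambda m: cosine(q, bow(obfuscate(m))),
                    reverse=True)
    return ranked[:k]
```

Because both the query and the stored questions are masked before comparison, shared names no longer contribute topic-specific overlap; every masked name maps to the same token, so similarity depends more on the remaining wording of the problem.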