Evaluation Setup
Task: sequence generation requiring multi-step reasoning
Benchmarks:
- Chain-of-Parities: synthetic reasoning (contextual blind cliff walk) [New]
- GSM8k: grade-school math
- MATH: advanced math problems
- GSM8k-Base-7: symbolic shift (numeric format change) [New]
- GSM8k-Tensor-2: long-horizon reasoning (concatenated problems) [New]
Metrics:
- Success Rate (Reward)
- Pass@1
- Statistical methodology: not explicitly reported in the paper
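The summary does not say how Pass@1 is computed. One common convention is the unbiased pass@k estimator averaged over problems; the sketch below uses that convention with hypothetical per-problem counts (the function name and the sample data are assumptions, not from the paper).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    passes. For k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (attempts, correct) counts per problem -- illustrative only.
results = [(10, 3), (10, 0), (10, 10), (10, 7)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

Averaging per-problem pass@1 rather than pooling all samples keeps each problem equally weighted regardless of how many attempts it received.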
Key Results
Results on the synthetic Chain-of-Parities task demonstrate the separation between AdaBack and the baselines. Exact percentages are not reported in this excerpt; the text explicitly contrasts obtaining "substantial reward" with outright failure.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Chain-of-Parities (L=16) | Learning outcome | Fails to obtain meaningful rewards | Reliably solves the task | Success vs. failure |
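The excerpt does not include the exact Chain-of-Parities specification. As a hypothetical sketch, a prefix-parity task captures the key properties the summary describes: each step depends on all earlier steps, and reward is granted only for a fully correct sequence. All names below are illustrative assumptions, not the paper's code.

```python
import random

def make_chain_of_parities(L: int, seed=None):
    """Hypothetical instance generator for a Chain-of-Parities-style
    task: the input is L random bits; the target at step i is the
    parity of bits 1..i, so every step depends on all earlier ones."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(L)]
    parity, targets = 0, []
    for b in bits:
        parity ^= b
        targets.append(parity)
    return bits, targets

def sparse_reward(prediction, targets) -> int:
    """All-or-nothing reward: 1 only if every step is correct."""
    return int(prediction == targets)
```

Under this reward, a model that gets any single step wrong receives exactly the same signal as one that gets every step wrong, which is what makes naive RL exploration fail at L=16.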
Main Takeaways
- Separation Result: There exists a class of problems (like Chain-of-Parities) where SFT fails (due to sample complexity) and RL fails (due to sparse rewards), but AdaBack succeeds.
- Backtracking works by converting a joint search over n steps (success probability p^n) into n independent sub-searches (success probability p each), which lets gradient-based learning trace dependencies backwards one step at a time.
- AdaBack generalizes better to symbolic shifts (Base-7) and longer horizons (Tensor-2) compared to standard baselines, suggesting it learns robust reasoning operators rather than memorizing surface patterns.
- The method is less effective when the model is already highly competent (e.g., strong instruct-tuned models), as the 'guided exploration' benefit becomes redundant.
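The search-decomposition takeaway can be made concrete with a back-of-the-envelope calculation (a sketch with hypothetical helper names, not the paper's analysis): if each step is guessed correctly with probability p, the expected number of attempts for the joint search is 1/p^n, versus n/p when steps are learned one at a time.

```python
def expected_attempts_joint(p: float, n: int) -> float:
    """Geometric expectation for guessing all n steps at once: 1 / p**n."""
    return 1.0 / p**n

def expected_attempts_stepwise(p: float, n: int) -> float:
    """Solving one step at a time, each a geometric trial with success
    probability p, costs n / p attempts in expectation."""
    return n / p

# With p = 0.5 and n = 16 (matching the L=16 task above), the joint
# search needs ~65,536 expected attempts; the stepwise curriculum, ~32.
joint = expected_attempts_joint(0.5, 16)
stepwise = expected_attempts_stepwise(0.5, 16)
```

The gap is exponential in n, which is the quantitative content of the separation claim: any method that must hit the full sequence by chance pays the p^n price, while a backtracking curriculum pays only the linear one.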