| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| ReasoningBank improves success rates across different LLM backbones on WebArena compared to memory-free and baseline memory methods. | ||||
| WebArena | Success Rate | 39.8 | 48.1 | +8.3 |
| WebArena | Success Rate | 46.2 | 48.1 | +1.9 |
| SWE-Bench-Verified | Success Rate | 30.8 | 34.6 | +3.8 |
| Efficiency gains: ReasoningBank reduces the number of steps required to complete tasks. | ||||
| WebArena | Avg Steps | 11.5 | 9.9 | -1.6 |
| MaTTS scaling experiments show that memory enhances the effectiveness of test-time scaling. | ||||
| WebArena-Shopping | Success Rate (BoN) | 40.6 | 55.1 | +14.5 |
| WebArena-Shopping | Success Rate | 52.4 | 55.1 | +2.7 |
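
The Δ column is the "This Paper" value minus the "Baseline" value, so it is positive when success rate improves and negative when the step count drops. The short sketch below recomputes each Δ from the rows above as a consistency check. The names `ROWS`, `delta`, and `mismatches` are illustrative only and do not come from the paper; rounding to one decimal place is an assumption chosen to match the precision shown in the table.

```python
# Rows transcribed from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("WebArena", "Success Rate", 39.8, 48.1, 8.3),
    ("WebArena", "Success Rate", 46.2, 48.1, 1.9),
    ("SWE-Bench-Verified", "Success Rate", 30.8, 34.6, 3.8),
    ("WebArena", "Avg Steps", 11.5, 9.9, -1.6),
    ("WebArena-Shopping", "Success Rate (BoN)", 40.6, 55.1, 14.5),
    ("WebArena-Shopping", "Success Rate", 52.4, 55.1, 2.7),
]


def delta(baseline: float, value: float) -> float:
    """Return value minus baseline, rounded to the table's one-decimal precision."""
    return round(value - baseline, 1)


# Collect any row whose recomputed delta disagrees with the reported one.
mismatches = [row for row in ROWS if delta(row[2], row[3]) != row[4]]

for benchmark, metric, baseline, value, _reported in ROWS:
    print(f"{benchmark:20s} {metric:20s} {delta(baseline, value):+.1f}")
```

With the values as transcribed, `mismatches` is empty, meaning every reported Δ agrees with the difference of its two columns.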