| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Offline adaptation results showing ACE's ability to construct superior system prompts compared to optimization baselines. | ||||
| AppWorld | TGC (Task Goal Completion) | 44.6 | 56.5 | +11.9 |
| FiNER | Accuracy | 78.4 | 89.3 | +10.9 |
| Online adaptation results demonstrating the benefit of evolving memory during test time. | ||||
| AppWorld | TGC (Task Goal Completion) | 51.8 | 59.4 | +7.6 |
| AppWorld (Test-Challenge) | TGC (Task Goal Completion) | 50.0 | 58.4 | +8.4 |
| Efficiency metrics showing ACE is faster and cheaper due to delta updates. | ||||
| AppWorld (Offline) | Adaptation latency reduction relative to baseline (%) | 0 | 82.3 | +82.3 |
| FiNER (Online) | Adaptation latency reduction relative to baseline (%) | 0 | 91.5 | +91.5 |
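
The Δ column for the accuracy and TGC rows is the difference "This Paper" minus "Baseline". As a quick sanity check, the minimal sketch below recomputes those differences from the values in the table. The names `rows`, `delta`, and `mismatches` are illustrative only and do not come from the paper. The latency rows are left out because they already report a reduction relative to the baseline.

```python
# Values copied from the table above: (benchmark, baseline, this paper, reported delta).
# All identifiers here are hypothetical, chosen only for this check.
rows = [
    ("AppWorld, offline", 44.6, 56.5, 11.9),
    ("FiNER, offline", 78.4, 89.3, 10.9),
    ("AppWorld, online", 51.8, 59.4, 7.6),
    ("AppWorld (Test-Challenge), online", 50.0, 58.4, 8.4),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement in points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [name for name, base, ours, reported in rows
              if delta(base, ours) != reported]

for name, base, ours, reported in rows:
    print(f"{name}: recomputed {delta(base, ours):+.1f}, reported {reported:+.1f}")

print("mismatches:", mismatches)
```

Rounding to one decimal before comparing avoids spurious mismatches from floating-point subtraction.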