| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Symbolic Manipulation (Last-Letter-Concatenation) results showing generalization to sequences longer than those seen in prompts. | ||||
| Last-Letter-Concatenation | Accuracy | 31.8 | 74.0 | +42.2 |
| Last-Letter-Concatenation | Accuracy | 84.2 | 94.0 | +9.8 |
| Compositional Generalization (SCAN) results on the challenging Length Split. | ||||
| SCAN (Length Split) | Accuracy | 16.2 | 99.7 | +83.5 |
| Math Reasoning results showing improvements on difficult multi-step problems. | ||||
| GSM8K | Accuracy | 39.07 | 45.23 | +6.16 |
| DROP (Football subset) | Accuracy | 59.56 | 73.42 | +13.86 |
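
The last column appears to be the absolute gain, that is, the "This Paper" score minus the "Baseline" score. The sketch below rechecks that arithmetic on the rows as printed. The names `ROWS`, `delta`, and `check_rows` are illustrative and not from the paper, and the reading of the column as a simple difference is an assumption that happens to match every row.

```python
# Rows copied from the table: (benchmark, baseline, this_paper, reported_gain).
# The table does not say how the two Last-Letter-Concatenation rows differ,
# so they are kept here in their original order without extra labels.
ROWS = [
    ("Last-Letter-Concatenation", 31.8, 74.0, 42.2),
    ("Last-Letter-Concatenation", 84.2, 94.0, 9.8),
    ("SCAN (Length Split)", 16.2, 99.7, 83.5),
    ("GSM8K", 39.07, 45.23, 6.16),
    ("DROP (Football subset)", 59.56, 73.42, 13.86),
]


def delta(baseline, this_paper):
    """Absolute gain (assumed: This Paper minus Baseline), rounded to 2 decimals."""
    return round(this_paper - baseline, 2)


def check_rows(rows, tol=1e-6):
    """Return (benchmark, recomputed_gain, matches_reported) for each row."""
    results = []
    for name, baseline, this_paper, reported in rows:
        gain = delta(baseline, this_paper)
        results.append((name, gain, abs(gain - reported) < tol))
    return results


if __name__ == "__main__":
    for name, gain, ok in check_rows(ROWS):
        status = "matches" if ok else "differs from"
        print(f"{name}: recomputed gain {gain:+.2f} {status} the reported value")
```

Rounding to two decimals mirrors the precision used in the table and avoids spurious floating-point mismatches when comparing against the reported values.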