| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| BABILong results show that LM2 outperforms the baseline on long-context reasoning across context lengths. | ||||
| BABILong (0K context) | Accuracy | 76.4 | 92.5 | +16.1 |
| BABILong (4K context) | Accuracy | 48.4 | 55.9 | +7.5 |
| BABILong (Average across tasks) | Relative Improvement (%) | 0 | 37.1 | +37.1 |
| MMLU results show that the memory module improves general capabilities rather than degrading them. | ||||
| MMLU (Average) | Accuracy | 28.0 | 29.4 | +1.4 |
| MMLU (Humanities) | Accuracy | 26.9 | 30.4 | +3.5 |
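
The Δ column is the absolute difference between the "This Paper" and "Baseline" scores, in accuracy points. As a minimal sketch, the snippet below recomputes it for the accuracy rows transcribed from the table. The helper names `absolute_delta` and `relative_gain_pct` are hypothetical and introduced only for illustration. The per-row relative gain it prints is a different quantity from the "Relative Improvement" row, which is averaged across tasks and cannot be recomputed from the values shown here.

```python
# Accuracy rows transcribed from the table: (benchmark, baseline, this_paper, reported_delta).
ROWS = [
    ("BABILong (0K context)", 76.4, 92.5, 16.1),
    ("BABILong (4K context)", 48.4, 55.9, 7.5),
    ("MMLU (Average)", 28.0, 29.4, 1.4),
    ("MMLU (Humanities)", 26.9, 30.4, 3.5),
]


def absolute_delta(baseline, ours):
    """Absolute gain in accuracy points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


def relative_gain_pct(baseline, ours):
    """Gain relative to the baseline score, in percent (hypothetical helper)."""
    return round(100.0 * (ours - baseline) / baseline, 1)


for name, baseline, ours, reported in ROWS:
    delta = absolute_delta(baseline, ours)
    status = "ok" if delta == reported else "MISMATCH"
    rel = relative_gain_pct(baseline, ours)
    print(f"{name}: delta={delta:+.1f} ({status}), relative={rel:+.1f}%")
```

Rounding to one decimal before comparing avoids spurious mismatches from floating-point subtraction.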