Evaluation Setup
Evaluation across 9 benchmarks in 5 domains (web search, embodied action, math, science, coding)
Benchmarks include:
- ALFWorld (Embodied decision making)
- TriviaQA (Web search / QA)
- KodCode (Coding)
- GSM8K (Math reasoning)
- GPQA (Scientific reasoning)
Metrics:
- Success Rate / Accuracy (%)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Main results on Qwen3-8B showing MemGen surpassing both parametric and retrieval baselines. | | | | |
| ALFWorld | Success Rate | 85.60 | 90.60 | +5.00 |
| KodCode | Success Rate | 72.90 | 76.16 | +3.26 |
| PopQA | Accuracy | 40.33 | 62.30 | +21.97 |
| Results on smaller model (SmolLM3-3B) demonstrating significant gains where baselines struggle. | | | | |
| ALFWorld | Success Rate | 18.96 | 63.60 | +44.64 |
| TriviaQA | Accuracy | 46.20 | 79.30 | +33.10 |
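The Δ column is an absolute difference in percentage points, which can hide how large a gain is relative to a weak baseline. The sketch below recomputes Δ from the scores in the table and adds a relative gain. The relative figure is derived here for illustration and is not reported in the source; the names `RESULTS`, `gains`, and `gain_table` are hypothetical.

```python
# Scores copied from the Key Results table: (model, benchmark, baseline, MemGen).
# The table does not say which baseline method each score belongs to.
RESULTS = [
    ("Qwen3-8B", "ALFWorld", 85.60, 90.60),
    ("Qwen3-8B", "KodCode", 72.90, 76.16),
    ("Qwen3-8B", "PopQA", 40.33, 62.30),
    ("SmolLM3-3B", "ALFWorld", 18.96, 63.60),
    ("SmolLM3-3B", "TriviaQA", 46.20, 79.30),
]


def gains(baseline, ours):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    # Relative gain is an added illustration, not a number from the paper.
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


def gain_table(results=RESULTS):
    """Build one row per benchmark: (model, benchmark, absolute, relative)."""
    return [(model, bench, *gains(base, ours)) for model, bench, base, ours in results]


if __name__ == "__main__":
    for model, bench, absolute, relative in gain_table():
        print(f"{model:11s} {bench:9s} +{absolute:.2f} pp  (+{relative:.1f}% relative)")
```

Read this way, the SmolLM3-3B ALFWorld result stands out: the baseline starts below 19%, so the same absolute scale that shows +5.00 points on Qwen3-8B shows a far larger proportional change on the smaller model.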
Main Takeaways
- MemGen consistently outperforms retrieval-based methods (ExpeL, AWM), especially on reasoning-intensive tasks where static retrieval fails
- Emergent memory hierarchy: Post-hoc analysis reveals latent tokens specialize into planning, procedural, and working memory functions without explicit supervision
- Cross-domain generalization: Training on one domain (e.g., Math) improves performance on others (e.g., Science, Code), unlike SFT, which often degrades performance on unseen domains
- Continual learning: MemGen retains performance on earlier tasks (e.g., AQuA) better than SFT after sequential training on new tasks