| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Memory Classification results demonstrate that specialized smaller models (BERT) outperform general-purpose LLMs in categorizing memory types. | ||||
| PerLTQA | Accuracy | 0.4025 | 0.9634 | +0.5609 |
| PerLTQA | Accuracy | 0.3340 | 0.9634 | +0.6294 |
| Memory Retrieval experiments show the effectiveness of different retrievers on the PerLTQA dataset. | ||||
| PerLTQA | Recall@10 | 0.584 | 0.767 | +0.183 |
| PerLTQA | Recall@10 | 0.627 | 0.767 | +0.140 |
| Memory Synthesis results compare LLM performance when gold-standard (Oracle) memories are provided. | ||||
| PerLTQA | Correctness (GPT-4 score, 1–5 scale) | 4.17 | 4.82 | +0.65 |
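
The Δ column above appears to be the "This Paper" value minus the "Baseline" value. As a minimal sketch under that assumption, the following recomputes each difference from the table's own numbers. The names `rows` and `delta`, and the per-row decimal precision, are illustrative choices and do not come from the paper.

```python
# Recompute the Δ column, assuming Δ = This Paper - Baseline.
# Each tuple: (metric, baseline, this_paper, decimals shown in the table).
rows = [
    ("Accuracy", 0.4025, 0.9634, 4),
    ("Accuracy", 0.3340, 0.9634, 4),
    ("Recall@10", 0.584, 0.767, 3),
    ("Recall@10", 0.627, 0.767, 3),
    ("Correctness", 4.17, 4.82, 2),
]


def delta(baseline: float, this_paper: float, decimals: int) -> float:
    """Return the improvement over the baseline, rounded to the table's precision."""
    return round(this_paper - baseline, decimals)


deltas = [delta(b, p, d) for _, b, p, d in rows]

for (metric, baseline, this_paper, decimals), d in zip(rows, deltas):
    # The explicit "+" sign mirrors how the table reports improvements.
    print(f"{metric}: {baseline} -> {this_paper} (Δ = {d:+.{decimals}f})")
```

Rounding to the precision shown in each row avoids spurious floating-point digits when comparing against the reported values.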