Evaluation Setup
Train on 100 sequential conversations; test on held-out yes/no questions about the conversation content.
Benchmarks:
- Memory Accuracy (binary yes/no classification on conversation history) [New]
- MMLU (General Knowledge)
- HellaSwag (Commonsense Reasoning)
Metrics:
- Accuracy (%)
- Statistical methodology: Not explicitly reported in the paper
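As an illustration of the accuracy metric on yes/no questions, here is a minimal sketch. The answer-normalization rule (taking the first word of the reply) and the function names are assumptions for illustration, not the paper's evaluation code.

```python
def normalize_answer(text: str) -> str:
    """Map a free-form reply to 'yes', 'no', or 'unknown' (assumed rule: first word)."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    first = words[0] if words else ""
    return first if first in ("yes", "no") else "unknown"


def accuracy_percent(predictions: list[str], labels: list[str]) -> float:
    """Percentage of replies whose normalized answer matches the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(normalize_answer(p) == g for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical replies and gold labels, for illustration only.
preds = ["Yes.", "no", "Yes, we did", "maybe"]
gold = ["yes", "no", "no", "yes"]
print(accuracy_percent(preds, gold))  # 2 of 4 replies match
```

Replies that are neither "yes" nor "no" count as wrong under this rule; a different evaluation could score them differently.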
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison of PLUM against RAG baselines on memory retention accuracy over 100 conversations. |
| Memory Accuracy (100 conversations) | Accuracy | 83.5 | 81.5 | -2.0 |
| Memory Accuracy (100 conversations) | Accuracy | 81.5 | 81.5 | 0.0 |
| Evaluation on general benchmarks to check for catastrophic forgetting of general capabilities. |
| MMLU (5-shot) | Accuracy | 65.65 | 64.93 | -0.72 |
| ARC (Challenge, 25-shot) | Accuracy | 59.39 | 58.45 | -0.94 |
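The Δ column reads as "This Paper minus Baseline", in percentage points. The short check below recomputes it from the reported values; the variable and function names are illustrative.

```python
# (benchmark, baseline accuracy, this-paper accuracy), copied from the table.
rows = [
    ("Memory Accuracy (100 conversations)", 83.5, 81.5),
    ("Memory Accuracy (100 conversations)", 81.5, 81.5),
    ("MMLU (5-shot)", 65.65, 64.93),
    ("ARC (Challenge, 25-shot)", 59.39, 58.45),
]


def delta_points(baseline: float, this_paper: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(this_paper - baseline, 2)


for name, baseline, ours in rows:
    print(f"{name}: {delta_points(baseline, ours):+.2f}")
```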
Main Takeaways
- PLUM offers a viable parametric alternative to RAG for conversation history: its memory accuracy is within 2 percentage points of the RAG baselines (81.5 vs. 83.5 and 81.5), without external storage
- Negative samples (questions about what was *not* discussed) are critical; without them, the model defaults to answering 'yes' to everything
- Weighted cross-entropy loss is essential to force the model to learn the specific memory content rather than just the instruction format
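The last two takeaways can be made concrete with a small sketch. The summary does not give the paper's weighting scheme, so the weights, probabilities, and function names below are assumptions chosen only to show the effect: up-weighting the answer token makes an uncertain answer dominate the loss, and a positives-only question set cannot distinguish a model that always says "yes" from one that has learned the content.

```python
import math


def weighted_cross_entropy(target_probs: list[float], weights: list[float]) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    target_probs[i] is the probability the model assigns to the correct
    token at position i; weights[i] is that position's loss weight.
    """
    if len(target_probs) != len(weights):
        raise ValueError("target_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    nll = [-math.log(p) for p in target_probs]
    return sum(w * l for w, l in zip(weights, nll)) / total_weight


def always_yes_accuracy(labels: list[str]) -> float:
    """Accuracy (%) of a degenerate model that answers 'yes' to every question."""
    return 100.0 * sum(label == "yes" for label in labels) / len(labels)


# Hypothetical target: four instruction-format tokens the model already
# predicts well, then one answer token it is effectively guessing.
probs = [0.99, 0.99, 0.99, 0.99, 0.5]
uniform = weighted_cross_entropy(probs, [1, 1, 1, 1, 1])
answer_heavy = weighted_cross_entropy(probs, [1, 1, 1, 1, 10])  # assumed weight of 10
print(f"uniform={uniform:.3f} answer_heavy={answer_heavy:.3f}")

# Without negative samples, answering 'yes' everywhere looks perfect.
print(always_yes_accuracy(["yes"] * 4))        # positives only
print(always_yes_accuracy(["yes", "no"] * 2))  # balanced with negatives
```

With uniform weights the well-predicted format tokens dilute the answer token's contribution, so the loss looks small even though the answer is a coin flip; the answer-heavy weighting keeps that error visible.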