Evaluation Setup
Language modeling, factual precision benchmarks, and machine unlearning scenarios
Benchmarks:
- Wikipedia Validation Set (language modeling, perplexity)
- TOFU (Machine Unlearning)
- FactScore (Long-form biography generation)
- T-REx (Short-form factual completion)
- PopQA (Long-tail QA)
Metrics:
- Perplexity (Static, Dynamic, Normalized)
- Model Utility (ROUGE, Probability, Truth Ratio)
- Forget Quality (p-value)
- FactScore (%)
- Exact Match (EM)
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
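As a rough illustration of two of the metrics listed above (a minimal sketch, not the paper's evaluation code), perplexity and exact match can be computed as:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

def exact_match(prediction, reference):
    """Exact Match (EM): 1 if the normalized strings are identical, else 0."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return int(norm(prediction) == norm(reference))

# A model assigning probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))  # 2.0
print(exact_match("Paris ", "paris"))   # 1
```

The exact normalization used for EM (casing, whitespace, articles) varies between benchmarks; the version here is a simple common choice, not necessarily the one used by T-REx evaluations.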
Key Results
Factual precision comparisons showing LmLm outperforms standard models of the same size and approaches much larger models:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FactScore | % | 13.5 | 31.4 | +17.9 |
| T-REx | EM | 20.6 | 26.7 | +6.1 |
| PopQA | Accuracy | 14.4 | 42.5 | +28.1 |

Perplexity results showing LmLm is more efficient at modeling text when allowed to look up facts:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Wikipedia Validation | Dynamic Perplexity | Not reported in the paper | Not reported in the paper | Not reported in the paper |
Main Takeaways
- LmLm achieves competitive performance compared to significantly larger LLMs (e.g., 382M LmLm matching 7B LLaMA2 in factual precision).
- Decoupling knowledge enables perfect unlearning by simply deleting database entries, preserving model utility where other methods (e.g., NPO) degrade it.
- Learning to lookup facts is empirically easier for the model than memorizing them, reflected in faster convergence and lower perplexity.
- LmLm preserves general NLU capabilities, performing on par with standard models on tasks like ARC, HellaSwag, and MMLU.
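The decoupled-knowledge takeaway can be sketched as a toy example (all names here — `facts_db`, `answer` — are hypothetical illustrations, not the paper's actual interface):

```python
# Toy sketch of knowledge decoupling: the "model" resolves facts against an
# external database instead of recalling memorized parameters, so deleting a
# database entry unlearns the fact while leaving the model itself untouched.
facts_db = {("Marie Curie", "born"): "1867"}

def answer(entity, relation):
    # Look the fact up rather than generating it from memorized weights.
    return facts_db.get((entity, relation), "[unknown]")

print(answer("Marie Curie", "born"))   # fact found via lookup
del facts_db[("Marie Curie", "born")]  # "perfect unlearning": delete the entry
print(answer("Marie Curie", "born"))   # fact is gone; model utility intact
```

This is only the conceptual shape of the claim: because the fact never lives in the weights, removal is exact, in contrast to optimization-based unlearning methods that must perturb the model itself.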