
LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang
Zhejiang University, National University of Singapore, Nanjing University
arXiv.org (2025)
Memory · RAG · Benchmark

📝 Paper Summary

Memory organization · Layered memory · Memory efficiency
LightMem emulates human memory stages to filter redundant inputs and perform offline consolidation, drastically reducing token costs and latency while improving retrieval accuracy.
Core Problem
Current LLM memory systems incur high computational overhead because they process redundant raw data, and high latency because memory updates are tightly coupled to the online interaction.
Why it matters:
  • Long-context interactions (e.g., personal assistants) generate massive redundancy, inflating token costs without improving reasoning
  • Real-time memory updates during user interactions introduce unacceptable latency
  • Rigid segmentation (e.g., fixed token windows) often splits semantic units, leading to fragmented or inaccurate memory representations
Concrete Example: In a long dialogue, a user might provide 10 turns of chatty context before a core request. A standard system summarizes all 10 turns (high cost) and updates the memory database immediately (high latency). LightMem compresses the chatty turns first and delays the heavy database reorganization until the system is 'asleep' (offline).
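The split between a cheap online insert and a deferred offline reorganization can be illustrated with a minimal sketch. This is not the paper's implementation: the `MemoryStore` class, its method names, and the exact-match de-duplication rule are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStore:
    """Hypothetical store: fast appends online, heavy cleanup offline."""
    entries: List[str] = field(default_factory=list)
    pending: int = 0  # entries appended since the last consolidation

    def soft_update(self, summary: str) -> None:
        # Online path: append only, no reorganization, so the user does not wait.
        self.entries.append(summary)
        self.pending += 1

    def consolidate(self) -> int:
        # Offline ("sleep-time") path: drop entries that repeat an earlier one.
        # Assumption: duplicates are detected by case- and whitespace-insensitive
        # string equality; a real system would likely use semantic similarity.
        seen = set()
        kept = []
        for entry in self.entries:
            key = " ".join(entry.lower().split())
            if key not in seen:
                seen.add(key)
                kept.append(entry)
        removed = len(self.entries) - len(kept)
        self.entries = kept
        self.pending = 0
        return removed


store = MemoryStore()
store.soft_update("User prefers tea")
store.soft_update("user prefers  tea")    # redundant restatement
store.soft_update("User lives in Paris")
removed = store.consolidate()             # run later, while the system is idle
```

The design point is that `soft_update` does constant work per turn, while the cost of scanning and reorganizing the whole store is paid only in `consolidate`, outside the user-facing request path.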
Key Novelty
Three-Stage Cognitive Memory Architecture (Sensory, Short-Term, Long-Term)
  • Introduces a 'Sensory Memory' stage that uses lightweight compression to filter low-value tokens before they ever reach the main memory system
  • Implements 'Sleep-time' updates that allow the system to insert memories quickly during chats ('soft updates') and perform heavy reorganization/de-duplication offline
  • Uses dynamic topic-based segmentation for Short-Term Memory rather than fixed turn counts, ensuring semantically complete memory chunks
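The first and third points above can be sketched roughly as follows. The summary does not specify the compression model or the topic-boundary detector, so this example swaps in two simple heuristics that are assumptions, not the authors' method: a hypothetical filler-word list for the sensory filter, and Jaccard word overlap between adjacent turns for segmentation.

```python
import re

# Hypothetical filler vocabulary; a real system would use a learned compressor.
FILLER = {"hi", "hello", "thanks", "ok", "okay", "lol", "hmm", "sure"}


def tokens(text):
    """Lowercase word tokens of a turn."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sensory_filter(turns):
    """Drop turns that consist only of filler words."""
    return [t for t in turns if any(w not in FILLER for w in tokens(t))]


def jaccard(a, b):
    """Jaccard similarity of two token lists (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segment_by_topic(turns, threshold=0.1):
    """Open a new segment when overlap with the previous turn drops below threshold."""
    segments = []
    for turn in turns:
        if segments and jaccard(tokens(segments[-1][-1]), tokens(turn)) >= threshold:
            segments[-1].append(turn)   # same topic: extend the current segment
        else:
            segments.append([turn])     # topic shift: start a new segment
    return segments


turns = [
    "ok thanks",
    "I need a flight to Tokyo",
    "The flight to Tokyo should be in May",
    "What is a good pasta recipe",
]
segments = segment_by_topic(sensory_filter(turns))
```

Here the filler turn is discarded before segmentation, the two flight-related turns land in one segment, and the recipe question opens a second one, so segment boundaries follow content rather than a fixed turn count.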
Evaluation Highlights
  • Achieves up to a 29.3% accuracy improvement on the LoCoMo benchmark with a Qwen backbone, compared to strong baselines
  • Reduces total token usage by up to 38x (GPT backbone) and online test-time token usage by over 100x compared to standard memory systems
  • Lowers API calls by up to 30x on LongMemEval while consistently surpassing the strongest baseline (A-MEM) in accuracy
Breakthrough Assessment
8/10
Significant efficiency breakthrough (order-of-magnitude reduction in tokens/calls) while simultaneously improving accuracy. The 'sleep-time' update mechanism effectively addresses the latency bottleneck of prior memory systems.