Think-in-Memory: Recalling and post-thinking enable LLMs with long-term memory

📝 Paper Summary

Memory organization, Memory recall

TiM enables LLMs to maintain long-term memory by storing evolved thoughts rather than raw history, utilizing a post-thinking stage to update memory via insert, forget, and merge operations.

Core Problem

Existing memory-augmented LLMs store raw historical text, necessitating repeated reasoning over the same history for new queries, which leads to inconsistent reasoning paths and high retrieval costs.

Why it matters:

Repeated reasoning over raw history causes LLMs to generate biased or contradictory thoughts for the same context
Calculating pairwise similarity between queries and extensive raw history is computationally expensive and time-consuming for long-term dialogues
Without proper memory management, LLMs fail to maintain accurate long-term context, critical for applications like medical diagnosis

Concrete Example: In a math word problem involving egg sales, an LLM reasoning over raw history calculates profit differently in Turn 2 vs. Turn 3 because it re-processes the original text each time. In Turn 2, it says Janet has $10; in Turn 3, it re-calculates and erroneously concludes she has $20 due to an inconsistent reasoning path.

Key Novelty

Think-in-Memory (TiM) Framework

Decouples reasoning from recalling by storing 'thoughts' (conclusions) instead of raw text, preventing the need to re-reason over the same history
Introduces a 'post-thinking' stage where the agent analyzes its own response to update memory using human-like operations: insert, forget, and merge
Utilizes Locality-Sensitive Hashing (LSH) for efficient storage and retrieval, grouping similar thoughts to speed up access without full pairwise comparisons

Architecture

Overview of the TiM framework showing the Recalling and Post-thinking stages.

Evaluation Highlights

Achieves 0.970 retrieval accuracy on the KdConv dataset (Music topic) with ChatGLM, significantly outperforming baselines
Improves response correctness to 0.843 on the Real-world Medical Dataset (RMD) using ChatGLM, compared to 0.806 for the baseline
Reduces retrieval time to 0.5305 ms per retrieval compared to 0.6287 ms for the baseline pairwise similarity method

Breakthrough Assessment

7/10

TiM offers a logical evolution from raw-text memory to thought-based memory with specific maintenance operations (merge/forget). The reduction in repeated reasoning is a strong conceptual contribution, though the evaluation relies mainly on small or simulated datasets.

⚙️ Technical Details

Problem Definition

Setting: Long-term conversation generation where an agent must generate response R_y for query Q_x while maintaining historical context

Inputs: Current user query Q and historical conversation context stored as thoughts in memory M

Outputs: Natural language response R and updated memory M' containing new thoughts

Pipeline Flow

Input Query Processing: Generate embedding for query Q
Recall Stage: LSH Retrieval → Similarity Retrieval → Concatenate Thoughts
Response Generation: LLM generates Response R based on Q and Recalled Thoughts
Post-thinking Stage: LLM analyzes (Q, R) pair → Generates/Updates Thoughts → Updates Memory M
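
The four pipeline steps above can be sketched as a single conversational turn. This is a minimal illustration, not the paper's implementation: `tim_turn`, `embed`, `bucket_of`, `generate`, and `post_think` are hypothetical names, the callables stand in for the embedding model, the LSH function, and the LLM, and similarity is assumed to be a plain dot product. Only the insert path of post-thinking is shown; forget and merge are omitted.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def tim_turn(query, memory, embed, bucket_of, generate, post_think, k=3):
    """Run one turn: recall thoughts, generate a response, then post-think.

    memory maps a bucket index to a list of (embedding, thought) pairs.
    All callables are caller-supplied stand-ins (assumptions of this sketch).
    """
    # 1. Input query processing: embed the query.
    q_vec = embed(query)

    # 2. Recall stage: pick the LSH bucket, then rank thoughts inside it.
    bucket = memory.get(bucket_of(q_vec), [])
    ranked = sorted(bucket, key=lambda item: dot(q_vec, item[0]), reverse=True)
    recalled = [thought for _, thought in ranked[:k]]

    # 3. Response generation from the query and the recalled thoughts.
    response = generate(query, recalled)

    # 4. Post-thinking: derive new thoughts from (Q, R) and store them.
    for thought in post_think(query, response):
        t_vec = embed(thought)
        memory.setdefault(bucket_of(t_vec), []).append((t_vec, thought))

    return response
```

Because the second turn recalls thoughts written by the first, the model reads stored conclusions instead of re-reasoning over the raw dialogue.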

System Modules

Agent A

Core LLM responsible for generating responses and performing post-thinking operations (generating inductive thoughts)

Model or implementation: ChatGLM-6B or Baichuan2-13B

Memory Cache M

Stores history as key-value pairs, where the key is a hash index and the value is a thought

Model or implementation: Hash table structure
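
A hash-table memory with the three maintenance operations might look like the sketch below. The class name `ThoughtMemory` and its method signatures are hypothetical. In TiM the decisions about what to forget or merge come from prompting the LLM; here they are represented by caller-supplied functions, which is an assumption made to keep the example self-contained.

```python
from collections import defaultdict


class ThoughtMemory:
    """Hash table mapping a bucket index to a list of thought strings."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def insert(self, index, thought):
        """Append a new thought to the bucket at `index`."""
        self.buckets[index].append(thought)

    def forget(self, index, should_drop):
        """Remove every thought in the bucket for which should_drop(thought) is true."""
        self.buckets[index] = [t for t in self.buckets[index] if not should_drop(t)]

    def merge(self, index, combine):
        """Replace the bucket's thoughts with combine(thoughts), a list of merged thoughts."""
        if self.buckets[index]:
            self.buckets[index] = list(combine(self.buckets[index]))
```

Restricting forget and merge to a single bucket reflects the idea that similar thoughts are grouped under the same hash index, so maintenance never has to scan the whole memory.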

LSH Function F(x)

Maps embedding vectors to hash indices to group similar thoughts

Model or implementation: Random projection, F(x) = argmax([xR; -xR]), where R is a random projection matrix and [u; v] denotes concatenation
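
The hash function can be written in a few lines of plain Python. This is a sketch under stated assumptions: the entries of R are drawn from a standard Gaussian (the summary only says "random projection"), and `make_projection` and `lsh_index` are hypothetical names. With R of shape (dim, b), the concatenation [xR; -xR] has 2b entries, so there are 2b buckets.

```python
import random


def make_projection(dim, half_buckets, seed=0):
    """Build a random matrix R of shape (dim, half_buckets) with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(half_buckets)] for _ in range(dim)]


def lsh_index(x, R):
    """Return argmax over the concatenation [xR; -xR]."""
    # Project the embedding x onto each random direction (the columns of R).
    proj = [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(len(R[0]))]
    # Concatenate the projections with their negations and take the argmax.
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=scores.__getitem__)
```

The index depends only on the direction of x, not its length, so embeddings that point the same way land in the same bucket regardless of scale.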

Novel Architectural Elements

Post-thinking feedback loop: Explicit step after generation where the model self-updates memory via prompted operations (Insert, Forget, Merge)
Thought-based storage: Storing 'Inductive Thoughts' (entity relations) rather than raw tokens or text chunks
Two-stage Retrieval within Memory: LSH bucket selection followed by intra-bucket similarity search
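
The two-stage retrieval can be illustrated as follows. The sketch assumes cosine similarity for the intra-bucket search and takes the bucket index as an argument (it would come from the LSH function); `recall` and `cosine` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(query_vec, bucket_index, memory, k=2):
    """Stage 1: select one bucket. Stage 2: rank only that bucket's thoughts."""
    candidates = memory.get(bucket_index, [])
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [thought for _, thought in ranked[:k]]


# Toy memory: bucket index -> list of (embedding, thought) pairs.
memory = {
    0: [([1.0, 0.0], "A"), ([0.0, 1.0], "B"), ([0.9, 0.1], "C")],
    1: [([1.0, 0.0], "D")],
}
print(recall([1.0, 0.0], 0, memory))  # thought "D" in bucket 1 is never compared
```

The similarity computation touches only the selected bucket, which is where the saving over full pairwise comparison comes from.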

Modeling

Base Model: ChatGLM (6.2B) and Baichuan2 (13B)

Training Method: Supervised Fine-Tuning with LoRA (Low-Rank Adaptation)

Adaptation: LoRA (rank r=16)

Key Hyperparameters:

lora_rank: 16
epochs: 10
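
To give a sense of what rank 16 means in practice, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. LoRA learns two factors, B (d_out x r) and A (r x d_in), so the count is r(d_out + d_in). The 4096 x 4096 layer shape is an illustrative assumption, not a figure taken from the paper.

```python
def lora_trainable_params(d_out, d_in, rank=16):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


# Hypothetical layer shape, chosen only for illustration.
d_out, d_in = 4096, 4096
trainable = lora_trainable_params(d_out, d_in)
fraction = trainable / full_params(d_out, d_in)
print(trainable, fraction)
```

For this assumed shape the adapter trains under 1% of the layer's weights, which is why fine-tuning a 6B or 13B model this way is feasible on modest hardware.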

Compute: Not reported in the paper

Comparison to Prior Work

vs. SiliconFriend: TiM stores 'thoughts' (reasoning results) instead of raw text and uses LSH for faster retrieval
vs. LongMem: TiM is LLM-agnostic (external memory) whereas LongMem requires modifying the LLM architecture
vs. SCM: TiM supports complex memory evolution (Merge/Forget) based on semantic content, whereas SCM mainly handles simple read/write

Limitations

Reliance on the LLM for thought generation means memory quality depends heavily on the base model's capability
LSH collision handling and optimal hash size tuning are not detailed extensively
The Real-world Medical Dataset evaluation is relatively small (80 test conversations)

📊 Experiments & Results

Evaluation Setup

Multi-turn dialogue generation with long-term context requirements

Benchmarks:

Generated Virtual Dataset (GVD): simulated long-term conversation
KdConv: multi-domain knowledge-driven conversation
Real-world Medical Dataset (RMD): medical consultation dialogue [New]

Metrics:

Retrieval Accuracy
Response Correctness
Contextual Coherence

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GVD (Chinese)	Retrieval Accuracy	0.840	0.850	+0.010
GVD (Chinese)	Response Correctness	0.418	0.605	+0.187
GVD (Chinese)	Contextual Coherence	0.428	0.665	+0.237
RMD (Medical)	Response Correctness	0.806	0.843	+0.037
RMD (Medical)	Contextual Coherence	0.893	0.943	+0.050
Simulated Retrieval	Retrieval Time (ms)	0.6287	0.5305	-0.0982
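
The Δ column is the absolute difference between the two systems. A short script can reproduce it and also express each change relative to the baseline, which makes gains on different scales easier to compare. The numbers are copied from the table above; the helper name `deltas` is hypothetical.

```python
rows = [
    ("GVD (Chinese)", "Retrieval Accuracy", 0.840, 0.850),
    ("GVD (Chinese)", "Response Correctness", 0.418, 0.605),
    ("GVD (Chinese)", "Contextual Coherence", 0.428, 0.665),
    ("RMD (Medical)", "Response Correctness", 0.806, 0.843),
    ("RMD (Medical)", "Contextual Coherence", 0.893, 0.943),
    ("Simulated Retrieval", "Retrieval Time (ms)", 0.6287, 0.5305),
]


def deltas(baseline, ours):
    """Return (absolute change, change as a percentage of the baseline)."""
    absolute = round(ours - baseline, 4)
    relative_pct = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative_pct


for bench, metric, base, ours in rows:
    absolute, relative_pct = deltas(base, ours)
    print(f"{bench:20s} {metric:22s} {absolute:+.4f} ({relative_pct:+.1f}%)")
```

Read this way, the retrieval-time saving is roughly 15.6% of the baseline latency, while the GVD correctness and coherence gains are much larger in relative terms than the 0.010 change in retrieval accuracy.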

Experiment Figures

Tendency of retrieval accuracy with different k values on KdConv dataset.

Comparison of medical agent performance with and without TiM in a multi-turn diagnosis scenario.

Main Takeaways

TiM consistently improves Contextual Coherence across all datasets (GVD, KdConv, RMD), suggesting that storing 'thoughts' helps maintain better narrative flow than raw text.
The method is LLM-agnostic, showing improvements with both ChatGLM and Baichuan2, though ChatGLM generally showed higher absolute performance in the reported experiments.
LSH retrieval reduces latency compared to full pairwise similarity, making the system more scalable for very long contexts.
Top-k recall analysis shows that retrieval accuracy improves significantly as k increases, with Top-10 achieving 0.973 accuracy on KdConv.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs)
Retrieval-Augmented Generation (RAG)
Locality-Sensitive Hashing (LSH)

Key Terms

TiM: Think-in-Memory, a framework where LLMs store reasoning results ('thoughts') rather than raw text to avoid repeated processing

Inductive Thought: A concise text summary containing a relation between two entities, often formatted as a triple (Head, Relation, Tail)
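
As a rough illustration of the triple format, an inductive thought could be represented like this. The class name, field names, and the rendering in `as_text` are assumptions for this sketch; the example values reuse the egg-sales scenario from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InductiveThought:
    """A (head, relation, tail) triple describing one relation between two entities."""
    head: str
    relation: str
    tail: str

    def as_text(self):
        """Render the triple as the short sentence that would be stored in memory."""
        return f"{self.head} {self.relation} {self.tail}"


thought = InductiveThought(head="Janet", relation="has", tail="$10")
print(thought.as_text())
```

Keeping the head entity as a separate field makes it straightforward to find thoughts about the same entity when deciding what to merge.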

Post-thinking: A stage after response generation where the LLM analyzes the interaction to generate, merge, or forget thoughts in the memory

LSH: Locality-Sensitive Hashing, an algorithmic technique that hashes similar input items into the same 'buckets' with high probability, enabling fast approximate nearest neighbor search

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

MemoryBank: A baseline memory mechanism inspired by the Ebbinghaus forgetting curve that stores raw dialogue text