
The Missing Memory Hierarchy: Demand Paging for LLM Context Windows

Tony Mason
University of British Columbia, Georgia Institute of Technology
arXiv (2026)
Memory Agent

📝 Paper Summary

Context Window Management, Agentic AI Systems
Pichay is a virtual memory manager for LLMs: it treats the context window as an L1 cache, uses demand paging to evict stale content, and handles retrieval faults through a transparent proxy.
Core Problem
Current agentic systems treat the context window as infinite memory rather than expensive cache, leading to structural waste and quadratic cost scaling as sessions grow.
Why it matters:
  • Accumulating every tool definition and stale result in context creates structural waste (21.8% of tokens in production sessions)
  • Attention costs scale quadratically with context length, making long-running agentic sessions economically unviable
  • Current approaches increase window size (vertical scaling) instead of managing working sets (architectural scaling), leading to 'thrashing' where systems struggle to maintain state
Concrete Example: In a coding session, an agent reads 'file.py', edits it, and never references the original read result again. Standard systems keep the stale read result in context forever. Pichay evicts it, replacing it with a handle like '[Paged out: Read file.py...]', saving tokens while allowing restoration if needed.
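The eviction step in this example can be sketched in a few lines of Python. This is a minimal illustration, not Pichay's implementation: the `PagedContext` class, the `max_age` staleness threshold, and the simplified handle text are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PagedContext:
    """Toy context window that pages out stale tool results (hypothetical)."""
    max_age: int = 3          # assumed staleness threshold, in turns
    turn: int = 0
    entries: dict = field(default_factory=dict)        # key -> (text, last_used_turn)
    backing_store: dict = field(default_factory=dict)  # evicted text, by key

    def add(self, key: str, content: str) -> None:
        """Place a tool result in the context, stamped with the current turn."""
        self.entries[key] = (content, self.turn)

    def end_turn(self) -> None:
        """Advance one turn and evict entries untouched for max_age turns."""
        self.turn += 1
        for key, (content, last_used) in list(self.entries.items()):
            if key in self.backing_store:
                continue  # already replaced by a handle
            if self.turn - last_used >= self.max_age:
                self.backing_store[key] = content
                # Leave a short handle in context (simplified, assumed format).
                self.entries[key] = (f"[Paged out: {key}]", last_used)

    def restore(self, key: str) -> str:
        """Bring evicted content back into the context on request."""
        content = self.backing_store.pop(key)
        self.entries[key] = (content, self.turn)
        return content

    def size(self) -> int:
        """Total characters currently held in context."""
        return sum(len(text) for text, _ in self.entries.values())


ctx = PagedContext(max_age=2)
original = "x = 1\n" * 500
ctx.add("Read file.py", original)
before = ctx.size()
ctx.end_turn()
ctx.end_turn()            # the read result is now stale and gets paged out
after = ctx.size()
handle = ctx.entries["Read file.py"][0]
restored = ctx.restore("Read file.py")
```

After two idle turns the read result shrinks to its handle, and `restore` returns the original text unchanged, mirroring the save-then-restore behavior described above.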
Key Novelty
Virtual Memory Hierarchy for LLMs (Pichay)
  • Treats the context window as L1 cache and introduces a demand paging system that evicts cold content to a backing store (L2), leaving a small retrieval handle in context
  • Implements fault-driven pinning, where content that is evicted and immediately re-requested is 'pinned' to prevent future eviction (learning the working set)
  • Uses 'phantom tools' as a side channel, allowing the model to cooperatively signal when to release files or request restoration of paged-out content
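Fault-driven pinning and the phantom-tool side channel might fit together roughly as in the sketch below. Everything named here is hypothetical: the `PinningPager` class, the `refault_window` parameter, and the tool names `release_file` and `restore_page` are assumptions for illustration, and the choice to let an explicit release override a pin is a design decision of this sketch, not something the summary states.

```python
class PinningPager:
    """Toy pager: content that faults soon after eviction gets pinned."""

    def __init__(self, refault_window: int = 2):
        self.refault_window = refault_window  # assumed: turns that count as "immediate"
        self.turn = 0
        self.resident = {}    # key -> content currently in context
        self.store = {}       # key -> content paged out to the backing store
        self.evicted_at = {}  # key -> turn at which it was evicted
        self.pinned = set()   # keys that must not be evicted

    def load(self, key: str, content: str) -> None:
        self.resident[key] = content

    def tick(self) -> None:
        self.turn += 1

    def evict(self, key: str) -> bool:
        """Page out a key; refuse if it is pinned or not resident."""
        if key in self.pinned or key not in self.resident:
            return False
        self.store[key] = self.resident.pop(key)
        self.evicted_at[key] = self.turn
        return True

    def fault(self, key: str) -> str:
        """Restore paged-out content; pin it if the fault came quickly."""
        content = self.store.pop(key)
        self.resident[key] = content
        if self.turn - self.evicted_at.pop(key) <= self.refault_window:
            self.pinned.add(key)  # part of the working set, keep it resident
        return content

    def handle_tool_call(self, name: str, key: str) -> str:
        """Phantom tools: handled by the pager itself, never run as real tools."""
        if name == "release_file":
            self.pinned.discard(key)  # an explicit release overrides a pin
            return "released" if self.evict(key) else "not resident"
        if name == "restore_page":
            return self.fault(key)
        raise ValueError(f"not a phantom tool: {name}")


pager = PinningPager(refault_window=2)
pager.load("config.py", "DEBUG = True")
pager.evict("config.py")
pager.tick()
restored = pager.handle_tool_call("restore_page", "config.py")
pinned_after_fault = "config.py" in pager.pinned
evict_blocked = not pager.evict("config.py")
release_result = pager.handle_tool_call("release_file", "config.py")
```

In this run the file faults one turn after eviction, so it is pinned and a later eviction attempt is refused until the model signals `release_file`.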
Evaluation Highlights
  • Reduces context consumption by 93% (5,038KB to 339KB) in a live 681-turn production session, with the session remaining operational throughout
  • Achieves a low page fault rate of 0.0254% across 1.4 million simulated evictions in offline replay
  • Identifies that 21.8% of production tokens are structural waste, primarily from unused tool schemas (11.0%) and stale tool results (8.7%)
Breakthrough Assessment
9/10
Fundamentally reframes context management using proven OS principles (virtual memory) rather than ad-hoc compression. Provides strong empirical evidence of structural waste and a viable system solution.