The Missing Memory Hierarchy: Demand Paging for LLM Context Windows

📝 Paper Summary

Context Window Management, Agentic AI Systems

Pichay acts as a virtual memory manager for LLMs by treating the context window as L1 cache, implementing demand paging to evict stale content and handling retrieval faults via a transparent proxy.

Core Problem

Current agentic systems treat the context window as infinite memory rather than expensive cache, leading to structural waste and quadratic cost scaling as sessions grow.

Why it matters:

Accumulating every tool definition and stale result in context creates structural waste (21.8% of tokens in production sessions)
Attention costs scale quadratically with context length, making long-running agentic sessions economically unviable
Current approaches increase window size (vertical scaling) instead of managing working sets (architectural scaling), leading to 'thrashing' where systems struggle to maintain state
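The quadratic claim above can be made concrete with a short worked relation. This is a sketch of the standard self-attention cost argument, not a formula taken from the paper; the symbols C_attn (attention cost of one full pass), n (context length in tokens), and k (shrink factor) are introduced here for illustration. Linear terms are ignored, and KV caching changes the incremental per-token cost.

```latex
C_{\text{attn}}(n) \propto n^{2}
\quad\Longrightarrow\quad
\frac{C_{\text{attn}}(n/k)}{C_{\text{attn}}(n)} = \frac{1}{k^{2}},
\qquad \text{e.g. } k = 2 \;\Rightarrow\; \tfrac{1}{4}
```

Under this simplification, halving the resident context cuts the quadratic term to a quarter, which is why managing the working set can pay off more than the raw token savings suggest.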

Concrete Example: In a coding session, an agent reads 'file.py', edits it, and never references the original read result again. Standard systems keep the stale read result in context forever. Pichay evicts it, replacing it with a handle like '[Paged out: Read file.py...]', saving tokens while allowing restoration if needed.
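The eviction step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the ToolResult type, the page_out function, and the key format are hypothetical, and reading tau as an age threshold in turns and s_min as a minimum size in bytes is an assumption about the parameters the summary reports (tau=4, s_min=500 bytes).

```python
from dataclasses import dataclass

TAU = 4      # assumed meaning: evict results older than this many turns
S_MIN = 500  # assumed meaning: only evict results at least this many bytes

@dataclass
class ToolResult:
    tool: str      # e.g. "Read"
    arg: str       # e.g. "file.py"
    content: str   # raw tool output held in context
    turn: int      # turn at which the result was produced

def page_out(results, current_turn, store):
    """Return a new result list with cold, large entries replaced by handles.

    Evicted content is saved in `store` (the backing store) so that it can
    be restored later if the model asks for it again.
    """
    kept = []
    for r in results:
        age = current_turn - r.turn
        size = len(r.content.encode("utf-8"))
        if age > TAU and size >= S_MIN:
            store[f"{r.tool}:{r.arg}"] = r.content
            handle = f"[Paged out: {r.tool} {r.arg}...]"
            kept.append(ToolResult(r.tool, r.arg, handle, r.turn))
        else:
            kept.append(r)
    return kept
```

Calling page_out at turn 10 on a 1,000-byte read of file.py from turn 1 leaves only the short handle in context, while a read from turn 9 stays resident because it is still recent.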

Key Novelty

Virtual Memory Hierarchy for LLMs (Pichay)

Treats the context window as L1 cache and introduces a demand paging system that evicts cold content to a backing store (L2), leaving a small retrieval handle in context
Implements fault-driven pinning, where content that is evicted and immediately re-requested is 'pinned' to prevent future eviction (learning the working set)
Uses 'phantom tools' as a side channel, allowing the model to cooperatively signal when to release files or request restoration of paged-out content
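Fault-driven pinning, as described in the list above, can be sketched as follows. The PinningPager class and its pin_window parameter are hypothetical: the summary says only that content re-requested "immediately" after eviction is pinned, so the window of turns that counts as immediate is an assumption made here.

```python
class PinningPager:
    """Toy pager: a fault soon after eviction pins the page permanently."""

    def __init__(self, pin_window=2):
        self.pin_window = pin_window  # assumed: turns that count as "immediately"
        self.store = {}               # key -> (content, turn evicted)
        self.pinned = set()           # keys that must stay in context

    def try_evict(self, key, content, turn):
        """Move content to the backing store unless the key is pinned."""
        if key in self.pinned:
            return False
        self.store[key] = (content, turn)
        return True

    def fault(self, key, turn):
        """Restore evicted content; pin it if the fault came quickly."""
        content, evicted_at = self.store.pop(key)
        if turn - evicted_at <= self.pin_window:
            self.pinned.add(key)
        return content
```

A page that faults one turn after eviction is pinned and later eviction attempts are refused, whereas a page that faults many turns later is restored but remains evictable. In this way the pinned set gradually approximates the working set.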

Evaluation Highlights

Reduces context consumption by 93% (5,038KB to 339KB) in a live production session of 681 turns while maintaining operation
Achieves a low page fault rate of 0.0254% across 1.4 million simulated evictions in offline replay
Identifies that 21.8% of production tokens are structural waste, primarily from unused tool schemas (11.0%) and stale tool results (8.7%)
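The headline figures above can be cross-checked with simple arithmetic. The snippet below uses only numbers stated in this summary; the implied fault count is approximate because "1.4 million" is a rounded figure, and treating the fault rate as faults per eviction is an assumption.

```python
before_kb, after_kb = 5038, 339
reduction = (before_kb - after_kb) / before_kb

fault_rate = 0.0254 / 100   # 0.0254% expressed as a fraction
evictions = 1_400_000       # "1.4 million", rounded in the source
implied_faults = fault_rate * evictions

# Share of structural waste not explained by the two named categories
other_waste = 21.8 - 11.0 - 8.7

print(f"{reduction:.1%}")      # 93.3%
print(round(implied_faults))   # 356
print(round(other_waste, 1))   # 2.1
```

The 93% reduction is consistent with the reported sizes, the fault rate corresponds to roughly 356 faults over the replay, and about 2.1 percentage points of the 21.8% waste come from categories other than unused tool schemas and stale tool results.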

Breakthrough Assessment

9/10

Fundamentally reframes context management using proven OS principles (virtual memory) rather than ad-hoc compression. Provides strong empirical evidence of structural waste and a viable system solution.

⚙️ Technical Details

Problem Definition

Setting: Management of finite context window resources during long-context agentic AI sessions

Inputs: API requests containing system prompts, tool definitions, and message history

Outputs: Inference provider responses (unmodified by proxy except for side-channel handling)

Pipeline Flow

Client Request → Pichay Proxy (Interposition)
Eviction/GC Policy (Filter Message Array)
Inference API (External LLM)
Response Stream Interception (Phantom Tool Handling)
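The four stages above can be sketched as a single function. This is an illustrative skeleton rather than the paper's proxy: the request and response are plain dictionaries with hypothetical keys (messages, tools, tool_calls), the phantom tool name memory_release is invented for this example, and the eviction policy and model call are passed in as callables instead of being tied to any real provider API.

```python
# Hypothetical phantom tool; the real names used by Pichay are not given here.
PHANTOM_TOOLS = [
    {"name": "memory_release", "description": "Signal that a file is no longer needed."}
]

def proxy_turn(request, evict, call_model, on_phantom):
    """Run one request through the interposition pipeline."""
    # 1. Interposition: work on a copy so the client's request is untouched.
    outgoing = dict(request)
    # 2. Eviction/GC policy: filter the message array.
    outgoing["messages"] = evict(request["messages"])
    # Inject phantom tools that the client framework never sees.
    outgoing["tools"] = request.get("tools", []) + PHANTOM_TOOLS
    # 3. Inference API: forward to the external model.
    response = call_model(outgoing)
    # 4. Response interception: consume phantom calls, pass the rest through.
    phantom_names = {tool["name"] for tool in PHANTOM_TOOLS}
    visible = []
    for call in response.get("tool_calls", []):
        if call["name"] in phantom_names:
            on_phantom(call)      # control signal for the memory manager
        else:
            visible.append(call)  # ordinary tool call for the client
    return dict(response, tool_calls=visible)
```

Keeping the policy and the model call as parameters mirrors the transparency goal: the client sees an ordinary response, while phantom calls are routed to the memory manager.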

System Modules

Pichay Proxy

Transparent HTTP proxy that intercepts the message array before it reaches the inference provider

Model or implementation: N/A (System Component)

Eviction Engine (Memory Management)

Applies policies to remove content. Distinguishes 'Garbage Collection' (ephemeral outputs) from 'Paging' (addressable content).

Model or implementation: Rule-based (FIFO)

Fault Handler (Memory Management)

Detects when model invokes a tool matching an evicted entry (Page Fault) and restores content

Model or implementation: Hash-based lookup
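A hash-based fault handler of this kind might look like the sketch below. The key scheme (SHA-256 over the canonical JSON of tool name and arguments) and the BackingStore class are assumptions for illustration; the summary states only that lookup is hash-based. Computing the fault rate as faults divided by evictions is likewise an assumed reading of the reported metric.

```python
import hashlib
import json

def page_key(tool, args):
    """Stable key for a tool invocation, independent of argument order."""
    canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class BackingStore:
    """Holds evicted tool results and detects page faults on re-invocation."""

    def __init__(self):
        self._pages = {}
        self.evictions = 0
        self.faults = 0

    def evict(self, tool, args, content):
        self._pages[page_key(tool, args)] = content
        self.evictions += 1

    def on_tool_call(self, tool, args):
        """Return stored content if this call matches an evicted entry."""
        content = self._pages.pop(page_key(tool, args), None)
        if content is not None:
            self.faults += 1  # page fault: the model asked for evicted content
        return content

    def fault_rate(self):
        return self.faults / self.evictions if self.evictions else 0.0
```

A tool call that matches an evicted entry returns the stored content and counts as a fault; any other call returns None and proceeds as a normal tool invocation.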

Novel Architectural Elements

Proxy-based memory hierarchy implementation enabling L1/L2 management without model training
Use of 'Phantom Tools' as a cooperative side-channel for the model to explicitly manage its own memory
Distinction between 'Garbage Collection' (unrecoverable) and 'Paging' (recoverable via fault)

Modeling

Base Model: Agentic AI tools (e.g., Claude Code, Cursor) utilizing various underlying LLMs

Comparison to Prior Work

vs. MemGPT: Pichay uses cost-driven eviction and specific fault-handling handles, whereas MemGPT triggers summarization based on capacity limits.
vs. LLMLingua: Pichay removes structural blocks (waste) rather than performing lossy token-level compression.
vs. CMV: Pichay implements a full paging hierarchy with retrieval/faulting, whereas CMV focuses on irreversible trimming.
vs. SideQuest [not cited in paper]: SideQuest uses a parallel reasoning thread to evict KV cache; Pichay operates at the prompt level via a proxy, requiring no inference engine access.

Limitations

Thrashing pathology observed under extreme pressure when working set exceeds window size
Current eviction policy is simple FIFO; more sophisticated policies (e.g., LRU) not yet evaluated at scale
L3 (compaction) and L4 (cross-session) levels described but not fully evaluated in production
Relies on the model correctly interpreting the 'Paged out' retrieval handles

Reproducibility

The paper does not indicate code availability. The system, Pichay, is described as a proxy implementation, so the methodology should be reproducible from standard proxy patterns and the heuristic parameters reported (tau=4, s_min=500 bytes).

📊 Experiments & Results

Evaluation Setup

Analysis of production sessions from agentic AI coding assistants

Benchmarks:

Production Corpus (Real-world agentic coding sessions) [New]

Metrics:

Structural Waste Percentage
Page Fault Rate
Amplification Factor
Context Reduction Percentage

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Empirical analysis of waste in standard production sessions:
Production Corpus	Structural Waste, Total (% of tokens)	0	21.8	+21.8
Production Corpus	Unused Tool Schemas (% of tokens)	0	11.0	+11.0
Production Corpus	Amplification Factor, Stale Results (×)	1.0	84.4	+83.4
Performance of the Pichay system in simulation and deployment:
Simulation (1.4M evictions)	Fault Rate (%)	0	0.0254	+0.0254
681-turn Session	Context Size (KB)	5038	339	-4699

Main Takeaways

Context windows behave like L1 cache; treating them as persistent memory causes massive structural waste (21.8% of tokens).
Simple FIFO eviction with fault-driven pinning is highly effective, achieving a negligible fault rate (0.0254%) in typical workloads.
Models demonstrate emergent cooperation: they intuitively understand 'Paged out' handles and utilize phantom tools to manage their own working set without specific fine-tuning.
The amplification factor of 84.4x for stale results confirms the quadratic cost explosion predicted by theoretical models.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Operating System memory management (virtual memory, paging, faults)
Familiarity with Agentic AI tool use flows
Knowledge of LLM context window limits and attention mechanisms

Key Terms

L1 Cache: The active context window used for generation: small, fast, and expensive, analogous to CPU cache

Demand Paging: Loading data into memory only when explicitly requested (faulted in) rather than keeping everything resident

Page Fault: An event where the model requests content that has been evicted; the system must retrieve it from backing store

Structural Waste: Tokens occupying context that serve no functional purpose, such as unused tool schemas or stale outputs

Thrashing: A pathological state where the system spends more resources moving data in and out of memory (faulting) than performing useful work

Phantom Tools: Tool definitions injected by the proxy (invisible to the client framework) that allow the model to send control signals to the memory manager

Working Set: The subset of information the model is actively using at a given time
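The working-set and thrashing definitions above can be tied together with a small sketch in the style of the classic operating-systems working-set model. The function names, the window parameter (how many recent turns count as "actively using"), and the size and capacity units are all hypothetical; the paper's own criteria may differ.

```python
def working_set(references, now, window):
    """Pages referenced in the last `window` turns.

    `references` is a list of (turn, page_key) pairs.
    """
    return {key for turn, key in references if now - window < turn <= now}

def is_thrashing_risk(references, now, window, sizes, capacity):
    """True when the working set no longer fits in the context window."""
    ws = working_set(references, now, window)
    return sum(sizes[key] for key in ws) > capacity
```

With references to c.py, a.py, and d.py in the last three turns and sizes summing to 110 units, a capacity of 100 signals a thrashing risk: every eviction would remove something the model is about to need again.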