Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents

📝 Paper Summary

Memory recall, Agentic AI

Agentic Plan Caching reduces serving costs by extracting high-level plan templates from past executions and adapting them for new, semantically similar requests using lightweight models.

Core Problem

Existing LLM caching methods (context/semantic caching) fail for agents because agent outputs depend on dynamic external data, meaning identical queries require different specific actions despite sharing high-level plans.

Why it matters:

Plan-Act agents incur substantial latency and cost due to complex reasoning and repeated planning steps (often using expensive models like reasoning LLMs).
Standard caching misses reuse opportunities because it cannot separate a query's core intent from its data-dependent context (e.g., specific file names or GUI coordinates).
Current memory approaches focus on capability (accuracy/hallucination) rather than the critical need for efficient, low-cost serving.

Concrete Example: For a data analysis request 'summarize key statistics,' standard caching fails because the specific actions depend on the dataset (different columns/values). APC recognizes the shared intent 'summary,' retrieves a general plan template, and fills in the specific dataset details.

Key Novelty

Agentic Plan Caching (APC)

Shifts from query-level caching (exact text match) to task-level caching by extracting reusable 'plan templates' that strip away specific context (e.g., entity names).
Uses keyword extraction rather than semantic embeddings for cache lookups, avoiding false positives caused by irrelevant details in complex agent queries.
Employs a lightweight 'adapter' model to fill in a retrieved template with current context, bypassing the expensive 'planner' model used on cache misses.
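
A plan template can be pictured with a minimal sketch. The template text, the placeholder names (`dataset`, `columns`), and the `instantiate` helper are hypothetical and do not come from the paper. In APC the filling step is performed by a lightweight LM; plain string substitution only stands in for it here.

```python
from string import Template

# Hypothetical plan template: request-specific details are replaced by placeholders.
SUMMARY_TEMPLATE = Template(
    "1. Load $dataset.\n"
    "2. Compute count, mean, and standard deviation for: $columns.\n"
    "3. Report the statistics in a short summary."
)


def instantiate(template: Template, context: dict[str, str]) -> str:
    """Fill a template with request-specific context (stand-in for the small LM)."""
    return template.substitute(context)


# Two requests with the same intent but different data reuse one template.
plan_a = instantiate(SUMMARY_TEMPLATE, {"dataset": "sales.csv", "columns": "price, quantity"})
plan_b = instantiate(SUMMARY_TEMPLATE, {"dataset": "patients.csv", "columns": "age, bmi"})
```

Both plans share the same three-step structure; only the data-dependent slots differ, which is the property that makes the template reusable.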

Architecture

The end-to-end inference flow of Agentic Plan Caching, showing the Hit/Miss paths and the template extraction process.

Evaluation Highlights

Reduces average cost by 50.31% and latency by 27.28% across five agent workloads compared to baselines.
Retains 96.61% of the application performance achieved when the expensive planner handles every request.
Compatible with existing LLM serving frameworks and can function alongside standard caching techniques.

Breakthrough Assessment

8/10

Significant practical contribution for deploying agents at scale. Addresses a specific failure mode of standard caching (data-dependency) with a cost-effective architectural solution.

⚙️ Technical Details

Problem Definition

Setting: Optimization of serving costs and latency for Plan-Act agentic workflows

Inputs: Natural language task query and external context (e.g., environment, data)

Outputs: Agent execution actions and final response

Pipeline Flow

Keyword Extraction (GPT-4o-mini)
Cache Lookup (Exact Match)
Plan Generation/Adaptation (Large or Small LM)
Actor Execution (Model dependent on task)
Post-Execution Template Extraction (Offline/Async)
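
The five steps above can be sketched as a single request handler. This is an illustration under stated assumptions, not the authors' implementation: every function name and signature is hypothetical, the model calls are passed in as plain callables, and template extraction runs inline here although the summary describes it as offline/asynchronous.

```python
from typing import Callable


def serve_request(
    query: str,
    context: str,
    cache: dict[str, str],
    extract_keyword: Callable[[str], str],
    adapt_template: Callable[[str, str, str], str],
    generate_plan: Callable[[str, str], str],
    run_actor: Callable[[str, str], str],
    extract_template: Callable[[str, str], str],
) -> tuple[str, bool]:
    """Handle one request and return (response, cache_hit)."""
    keyword = extract_keyword(query)                     # 1. keyword extraction
    template = cache.get(keyword)                        # 2. exact-match lookup
    hit = template is not None
    if hit:
        plan = adapt_template(template, query, context)  # 3a. hit: small LM adapts
    else:
        plan = generate_plan(query, context)             # 3b. miss: large LM plans
    response = run_actor(plan, context)                  # 4. actor execution
    if not hit:
        cache[keyword] = extract_template(plan, context)  # 5. template extraction
    return response, hit


# Toy stand-ins so the sketch runs without any model calls.
planner_calls: list[str] = []


def fake_keyword(query: str) -> str:
    return "summary" if "summar" in query.lower() else "other"


def fake_adapt(template: str, query: str, context: str) -> str:
    planner_calls.append("small")
    return template.replace("<DATA>", context)


def fake_generate(query: str, context: str) -> str:
    planner_calls.append("large")
    return f"load {context}; describe {context}"


def fake_actor(plan: str, context: str) -> str:
    return f"executed: {plan}"


def fake_extract(plan: str, context: str) -> str:
    return plan.replace(context, "<DATA>")


stubs = (fake_keyword, fake_adapt, fake_generate, fake_actor, fake_extract)
plan_cache: dict[str, str] = {}
first = serve_request("Summarize key statistics", "sales.csv", plan_cache, *stubs)
second = serve_request("Summarise the main stats", "patients.csv", plan_cache, *stubs)
```

The first request misses and pays for the large planner; the second, which shares only the intent keyword, is served by the adaptation path.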

System Modules

Keyword Extractor

Extracts a keyword capturing the high-level intent of the task query

Model or implementation: GPT-4o-mini

Plan Cache

Stores and retrieves (keyword, plan template) pairs

Model or implementation: Hash Map / Key-Value Store
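
A minimal sketch of such a key-value store follows. The class name, the key normalization, and the hit/miss counters are assumptions added for illustration; the summary only states that the cache is a hash map of (keyword, plan template) pairs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlanCache:
    """Exact-match store of (keyword, plan template) pairs."""

    _store: dict[str, str] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    @staticmethod
    def _normalize(keyword: str) -> str:
        # Assumption: case and extra whitespace should not cause a miss.
        return " ".join(keyword.lower().split())

    def get(self, keyword: str) -> Optional[str]:
        """Return the stored template, or None on a miss, and update counters."""
        template = self._store.get(self._normalize(keyword))
        if template is None:
            self.misses += 1
        else:
            self.hits += 1
        return template

    def put(self, keyword: str, template: str) -> None:
        self._store[self._normalize(keyword)] = template

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = PlanCache()
before = cache.get("summary")            # miss: nothing stored yet
cache.put("Summary ", "1. Load <DATA>. 2. Describe <DATA>.")
after = cache.get("  summary")           # hit after normalization
```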

Small Planner LM

Adapts a retrieved plan template to the specific context of the current request

Model or implementation: LLaMA-3.1-8B

Large Planner LM

Generates a plan from scratch when no template is found

Model or implementation: Not explicitly specified (task dependent, assumed expensive)

Template Extractor

Generalizes successful execution logs into reusable templates

Model or implementation: Rule-based filter + Lightweight LLM
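
The two stages can be illustrated with a toy sketch. The log schema (`role`/`content` entries), the placeholder tokens, and the regular expressions are all hypothetical; in particular, the `generalize` function replaces the lightweight LLM with simple pattern masking purely so the example runs without a model.

```python
import re


def filter_log(log: list[dict[str, str]]) -> list[str]:
    """Rule-based filter: keep the planner's steps and drop tool observations."""
    return [entry["content"] for entry in log if entry["role"] == "plan"]


def generalize(steps: list[str]) -> list[str]:
    """Stand-in for the lightweight LLM: mask file names and numbers."""
    masked = []
    for step in steps:
        step = re.sub(r"\b[\w-]+\.(?:csv|json|txt)\b", "<FILE>", step)
        step = re.sub(r"\b\d+(?:\.\d+)?\b", "<NUM>", step)
        masked.append(step)
    return masked


# Hypothetical execution log from one successful run.
log = [
    {"role": "plan", "content": "Load sales_2023.csv"},
    {"role": "observation", "content": "Loaded 1200 rows"},
    {"role": "plan", "content": "Report the top 5 rows by revenue"},
]
template = generalize(filter_log(log))
```

Filtering first keeps the input to the generalization step short, which matters because that step is meant to run on an inexpensive model.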

Novel Architectural Elements

Test-time template extraction pipeline that converts execution logs into generalized plan templates
Two-path planning architecture: Adaptation path (Small LM + Template) vs. Generation path (Large LM)

Modeling

Base Model: LLaMA-3.1-8B (Small Planner), GPT-4o-mini (Keyword Extraction)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Context Caching: APC matches requests on extracted intent keywords rather than on exact prompt prefixes, and it handles data-dependent variations.
vs. Semantic Caching: APC stores 'templates' rather than final outputs, allowing adaptation to new contexts (e.g., different screen coordinates).
vs. MemGPT [not cited in paper]: APC focuses on serving cost/latency reduction via templating, whereas MemGPT focuses on context window management for long-term coherence.

Limitations

Relies on the assumption that tasks with the same keyword share structural plan similarities.
Requires a warm-up phase to populate the cache; initial performance resembles the baseline.
The 'Small LM' adapter must be capable enough to instantiate templates correctly; if it fails, the cost advantage is lost.
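
These limitations can be made concrete with a back-of-the-envelope cost model. The formula and all prices below are assumptions for illustration, not figures from the paper: it covers planning cost only, charges keyword extraction on every request, and assumes a failed adaptation falls back to the large planner.

```python
def expected_planning_cost(
    hit_rate: float,
    large_cost: float,
    small_cost: float,
    keyword_cost: float,
    adapter_failure_rate: float = 0.0,
) -> float:
    """Expected per-request planning cost under the assumptions stated above."""
    miss_cost = (1.0 - hit_rate) * large_cost
    # On a hit the small LM runs; if it fails, the large planner is paid as well.
    hit_cost = hit_rate * (small_cost + adapter_failure_rate * large_cost)
    return keyword_cost + miss_cost + hit_cost


# Made-up relative prices: the large planner costs 1 unit per request.
PRICES = dict(large_cost=1.0, small_cost=0.1, keyword_cost=0.02)

cold = expected_planning_cost(hit_rate=0.0, **PRICES)    # empty cache (warm-up)
warm = expected_planning_cost(hit_rate=0.6, **PRICES)    # populated cache
broken = expected_planning_cost(hit_rate=0.6, adapter_failure_rate=1.0, **PRICES)
```

Under these made-up prices, a cold cache costs slightly more than always calling the large planner because of the keyword-extraction overhead, a warm cache costs less, and an adapter that always fails erases the saving.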

Reproducibility

The paper text does not state whether code is available. The method relies on standard LLMs (GPT-4o-mini, LLaMA-3.1-8B). Prompts for extraction and adaptation are described conceptually.

📊 Experiments & Results

Evaluation Setup

Evaluation on 5 diverse agent workloads measuring cost, latency, and accuracy.

Benchmarks:

Five diverse agent workloads. The Introduction suggests they span tasks such as coding and web navigation; specific dataset names are not listed in the text segments available for this summary.

Metrics:

Cost (USD/tokens)
Latency (time)
Accuracy (success rate)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across 5 workloads	Cost Reduction (%)	0.00	50.31	+50.31
Average across 5 workloads	Latency Reduction (%)	0.00	27.28	+27.28
Average across 5 workloads	Performance Retention (%)	100.00	96.61	-3.39

Main Takeaways

Query-based similarity matching (standard semantic caching) is sub-optimal for agents due to high false positives/negatives from context details.
Small planner LMs struggle with long-context raw execution logs; structured 'plan templates' are necessary for effective reuse.
The system effectively separates core intent from dynamic context, enabling reuse where traditional caching fails.
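
The first takeaway can be illustrated with a toy comparison. Token-set (Jaccard) overlap is used here as a crude stand-in for embedding similarity, and taking the first word as the intent is a stand-in for the LLM keyword extractor; both substitutions, and the example queries, are assumptions made so the sketch runs without any model.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap, a crude stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def intent(query: str) -> str:
    """Toy keyword extractor: treat the leading verb as the intent."""
    return query.lower().split()[0]


plot_q = "plot revenue by region from sales_q4.csv in the finance workspace"
summ_q = "summarize revenue by region from sales_q4.csv in the finance workspace"
other_summ_q = "summarize patients.csv"

# Different intents, shared context details: query similarity is high.
false_positive_score = jaccard(plot_q, summ_q)
# Same intent, different context details: query similarity is low.
false_negative_score = jaccard(summ_q, other_summ_q)
```

Whole-query similarity groups the plotting and summarizing requests together because they share context tokens, while the two summarizing requests look unrelated; matching on the extracted intent reverses both outcomes.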

📚 Prerequisite Knowledge

Prerequisites

Understanding of the ReAct (Reason+Act) agent loop
Familiarity with LLM caching mechanisms (KV cache, semantic cache)
Basic knowledge of LLM inference costs (tokens)

Key Terms

Plan-Act Agent: An agent architecture that alternates between generating a strategy (Plan) and executing it (Act) using external tools

Context Caching: Storing internal model states (KV pairs) to speed up generation for prompts that share an identical prefix

Semantic Caching: Storing (input, output) pairs to reuse responses for semantically similar queries

KV Cache: Key-Value cache; the attention key and value tensors of earlier tokens, stored during LLM inference so they are not recomputed at every decoding step

SGLang: A framework for efficient execution of structured language model programs

ReAct: Reasoning and Acting, a paradigm in which models interleave reasoning traces with actions

Chain-of-Thought: Prompting technique where the model generates intermediate reasoning steps before the final answer