Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

📝 Paper Summary

Context Optimization, Self-Improving Agents, LLM Memory

ACE treats context as an evolving, itemized playbook managed by specialized agents (Generator, Reflector, Curator) to accumulate detailed strategies without the information loss caused by iterative rewriting.

Core Problem

Current context adaptation methods suffer from brevity bias (optimizing for short, generic prompts) and context collapse (iterative rewriting by LLMs erases critical details over time).

Why it matters:

Agents and domain-specific tasks (e.g., finance) require accumulating detailed, comprehensive heuristics rather than concise summaries
Iterative rewriting methods like Reflexion or Dynamic Cheatsheet often degrade performance over long horizons by compressing away necessary knowledge
Monolithic context regeneration is computationally expensive and high-latency for real-time applications

Concrete Example: In AppWorld, a context at step 60 contained ~18k tokens with 66.7% accuracy. The next update collapsed it to 122 tokens, dropping accuracy to 57.1%—worse than the unadapted baseline.
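The size drop in this example is easy to quantify. The following is a minimal sketch of a guard that flags a suspicious shrink between two context versions. It is not part of ACE as summarized here: the function names and the 0.5 threshold are hypothetical, and 18000 stands in for the approximate "~18k" figure.

```python
def shrink_ratio(prev_tokens: int, new_tokens: int) -> float:
    """Fraction of the previous context size that survives the update."""
    return new_tokens / prev_tokens if prev_tokens else 1.0


def looks_like_collapse(prev_tokens: int, new_tokens: int, min_ratio: float = 0.5) -> bool:
    """Flag an update that keeps less than `min_ratio` of the previous tokens.

    `min_ratio` is an illustrative threshold, not a value from the paper.
    """
    return shrink_ratio(prev_tokens, new_tokens) < min_ratio


# AppWorld example from the text: roughly 18k tokens collapsing to 122 tokens.
collapsed = looks_like_collapse(18000, 122)
print(collapsed, f"{shrink_ratio(18000, 122):.2%} of the context retained")
```

Under these assumptions the update retains well under 1% of the previous context, which is the kind of discontinuity the term "context collapse" describes.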

Key Novelty

Agentic Context Engineering (ACE)

Treats context as a collection of structured 'bullets' (metadata + content) rather than a monolithic text block, allowing granular management
Decomposes adaptation into three roles: Generator (acts), Reflector (extracts lessons), and Curator (formats lessons), preventing the bottleneck of a single model doing everything
Uses 'Delta Updates' to append small batches of insights and a deterministic 'Grow-and-Refine' mechanism to merge/deduplicate, avoiding the variance of full LLM rewrites
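One way to picture the itemized playbook is as a keyed collection of small records that deltas are merged into. The sketch below is illustrative only: the summary says bullets carry "metadata + content" but does not list fields, so `bullet_id`, `helpful`, and `harmful` are assumed names, and `Playbook.apply_delta` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    """One playbook entry: metadata plus content (field names are assumptions)."""
    bullet_id: str
    content: str
    helpful: int = 0  # assumed counter: times the bullet was judged useful
    harmful: int = 0  # assumed counter: times the bullet was judged misleading


@dataclass
class Playbook:
    bullets: dict[str, Bullet] = field(default_factory=dict)

    def apply_delta(self, delta: list[Bullet]) -> None:
        """Merge a small batch of bullets without rewriting existing text."""
        for new in delta:
            existing = self.bullets.get(new.bullet_id)
            if existing is None:
                self.bullets[new.bullet_id] = new  # new insight: append
            else:
                # Known bullet: only update counters, keep the original content.
                existing.helpful += new.helpful
                existing.harmful += new.harmful

    def render(self) -> str:
        """Serialize the playbook as one line per bullet for use in a prompt."""
        return "\n".join(
            f"[{b.bullet_id}] (helpful={b.helpful}, harmful={b.harmful}) {b.content}"
            for b in self.bullets.values()
        )


playbook = Playbook()
playbook.apply_delta([Bullet("b1", "Check pagination before summing API results.", helpful=1)])
playbook.apply_delta([Bullet("b1", "", helpful=2), Bullet("b2", "Validate date formats.")])
print(playbook.render())
```

Because each update touches only the bullets named in the delta, earlier entries cannot be silently compressed away, which is the property the monolithic-rewrite approaches lack.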

Architecture

The agentic workflow of ACE, showing the interaction between the Generator, Reflector, Curator, and the Context Storage

Evaluation Highlights

Matches top-ranked IBM-CUGA (GPT-4.1) on AppWorld leaderboard using the smaller DeepSeek-V3.1, surpassing it by +8.4% on the 'test-challenge' split
+10.6% average gain on agent benchmarks and +8.6% on financial benchmarks compared to strong baselines like GEPA and Dynamic Cheatsheet
Reduces adaptation latency by 86.9% on average compared to baseline adaptation methods such as GEPA and Dynamic Cheatsheet by using incremental delta updates instead of full context rewrites

Breakthrough Assessment

9/10

Addresses the fundamental 'context collapse' problem in self-improving agents. Matching the top-ranked GPT-4.1-based agent with an open-weights model while drastically cutting adaptation latency is a significant practical advance.

⚙️ Technical Details

Problem Definition

Setting: Context adaptation (offline system prompt optimization or online memory updates) without weight updates

Inputs: Task query q, current context C (playbook), execution environment signals

Outputs: Answer/Action a, Updated Context C'

Pipeline Flow

Generator (Produces trajectory/trace)
Reflector (Critiques trace, extracts lessons)
Curator (Synthesizes lessons into Delta Context)
Updater (Non-LLM logic merges Delta into Main Context via Grow-and-Refine)
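The four stages above can be sketched as a single adaptation step that maps (q, C) to (a, C'). This is a minimal sketch under stated assumptions: the three LLM roles are passed in as plain callables, the `fake_*` stubs are hypothetical placeholders for model calls, and the merge uses exact-match deduplication as a simplified stand-in for Grow-and-Refine.

```python
from typing import Callable

Context = list[str]  # simplification: the playbook as a list of bullet strings


def adaptation_step(
    query: str,
    context: Context,
    generate: Callable[[str, Context], str],
    reflect: Callable[[str], list[str]],
    curate: Callable[[list[str], Context], list[str]],
) -> tuple[str, Context]:
    """Run Generator -> Reflector -> Curator -> Updater once."""
    trace = generate(query, context)   # Generator: produce a trajectory/answer
    lessons = reflect(trace)           # Reflector: critique the trace
    delta = curate(lessons, context)   # Curator: turn lessons into a delta
    merged = list(context)             # Updater: deterministic, no LLM call
    for item in delta:
        if item not in merged:         # exact-match dedup (simplified)
            merged.append(item)
    return trace, merged


# Hypothetical stubs standing in for the three LLM-backed roles.
def fake_generate(query: str, context: Context) -> str:
    return f"answer to {query!r} using {len(context)} bullets"


def fake_reflect(trace: str) -> list[str]:
    return ["lesson: confirm API auth before calling"]


def fake_curate(lessons: list[str], context: Context) -> list[str]:
    return [lesson.replace("lesson: ", "", 1) for lesson in lessons]


answer, new_context = adaptation_step(
    "q1", ["always log in first"], fake_generate, fake_reflect, fake_curate
)
print(answer)
print(new_context)
```

Keeping the Updater as ordinary code means the only nondeterminism in a step comes from the three model calls, not from the merge itself.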

System Modules

Generator

Solves the task using the current context and produces reasoning trajectories

Model or implementation: DeepSeek-V3.1 (non-thinking mode)

Reflector (Adaptation)

Analyzes the Generator's trace to identify specific successes and failures

Model or implementation: DeepSeek-V3.1 (non-thinking mode)

Curator (Adaptation)

Formats extracted lessons into structured 'bullets' with metadata

Model or implementation: DeepSeek-V3.1 (non-thinking mode)

Updater

Merges Delta Context into Main Context using deterministic logic

Model or implementation: Non-LLM Python Logic

Novel Architectural Elements

Separation of Reflector (analysis) and Curator (formatting) roles
Delta-based update mechanism (merging itemized bullets instead of text rewriting)
Grow-and-refine algorithm relying on embedding similarity rather than LLM summarization
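A similarity-based merge of this kind might look like the sketch below. It is an illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the 0.8 threshold is an assumed value, and near-duplicates are simply skipped rather than merged.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def grow_and_refine(context: list[str], delta: list[str], threshold: float = 0.8) -> list[str]:
    """Append delta bullets unless one is a near-duplicate of a kept bullet.

    `threshold` is an illustrative cutoff, not a value from the paper.
    """
    merged = list(context)
    for candidate in delta:
        cand_vec = embed(candidate)
        is_duplicate = any(cosine(cand_vec, embed(kept)) >= threshold for kept in merged)
        if not is_duplicate:
            merged.append(candidate)  # grow: genuinely new insight
        # refine: near-duplicates are dropped without any LLM rewrite
    return merged


context = ["check pagination before summing results"]
delta = ["Check pagination before summing results", "validate date formats first"]
print(grow_and_refine(context, delta))
```

Since the decision depends only on vector similarity, the same inputs always produce the same merged context, in contrast to asking a model to summarize.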

Modeling

Base Model: DeepSeek-V3.1 (non-thinking mode)

Comparison to Prior Work

vs. GEPA: ACE avoids brevity bias by accumulating detailed 'playbooks' rather than evolving a single prompt instruction
vs. Dynamic Cheatsheet: ACE uses itemized delta updates and a dedicated Reflector/Curator to prevent context collapse, whereas DC relies on full rewrites
vs. OPRO [not cited in paper]: ACE evolves a structured, itemized context via agentic feedback loops rather than just optimizing the instruction string via hill-climbing

Limitations

Depends on the availability of reliable execution signals (e.g., code success); performance degrades without them
If feedback is noisy, the context can be polluted with spurious or misleading signals
Requires a model capable of following the Reflector/Curator breakdown (DeepSeek-V3.1 level)

📊 Experiments & Results

Evaluation Setup

Offline context optimization (on train set) and Online test-time adaptation (on test set)

Benchmarks:

AppWorld (Autonomous Agent, API/Coding)
FiNER (Financial NER (XBRL))
Formula (Financial Numerical Reasoning)

Metrics:

Task Goal Completion (TGC)
Scenario Goal Completion (SGC)
Accuracy (Pass@1)
Adaptation Latency
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Offline adaptation results showing ACE's ability to construct superior system prompts compared to optimization baselines.
AppWorld	TGC (Task Goal Completion)	44.6	56.5	+11.9
FiNER	Accuracy	78.4	89.3	+10.9
Online adaptation results demonstrating the benefit of evolving memory during test time.
AppWorld	TGC (Task Goal Completion)	51.8	59.4	+7.6
AppWorld (Test-Challenge)	TGC	50.0	58.4	+8.4
Efficiency metrics showing ACE is faster and cheaper due to delta updates.
AppWorld (Offline)	Adaptation Latency Reduction (%)	0	82.3	+82.3
FiNER (Online)	Adaptation Latency Reduction (%)	0	91.5	+91.5
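The Δ column and the 86.9% average latency reduction quoted in the highlights can be checked directly from the rows above. The short script below uses only numbers from this table; the row labels are shorthand introduced here for readability.

```python
# (label, baseline, this paper) taken from the accuracy rows of the table
rows = [
    ("AppWorld offline TGC", 44.6, 56.5),
    ("FiNER offline accuracy", 78.4, 89.3),
    ("AppWorld online TGC", 51.8, 59.4),
    ("AppWorld test-challenge TGC", 50.0, 58.4),
]

# Delta is "this paper" minus "baseline", rounded to one decimal place
deltas = {label: round(ours - base, 1) for label, base, ours in rows}

# Latency reductions (%) from the two efficiency rows
latency_reductions = [82.3, 91.5]
mean_reduction = round(sum(latency_reductions) / len(latency_reductions), 1)

for label, delta in deltas.items():
    print(f"{label}: +{delta}")
print(f"mean latency reduction: {mean_reduction}%")
```

The mean of the two latency rows reproduces the 86.9% headline figure, so that average appears to be taken over these two settings.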

Experiment Figures

A visualization of 'Context Collapse' in prior methods vs. ACE

AppWorld Leaderboard snapshot

Main Takeaways

ACE allows a smaller open-weights model (DeepSeek-V3.1) to match or beat an agent built on a proprietary model (GPT-4.1) by saturating context with high-quality, evolved strategies
Separating Reflection and Curation prevents the 'context collapse' seen in methods that rely on end-to-end rewriting
Incremental 'delta updates' drastically reduce latency and cost, making test-time adaptation practical for real-time agents
Effectiveness is maintained even without ground-truth labels, leveraging natural execution feedback (e.g., code success) for self-improvement

📚 Prerequisite Knowledge

Prerequisites

In-context learning (ICL)
ReAct agent framework
Prompt optimization techniques

Key Terms

Brevity Bias: The tendency of optimization methods to converge on short, generic instructions that lose domain-specific nuance

Context Collapse: The phenomenon where iterative rewriting of context by an LLM causes a sudden, drastic loss of accumulated information and performance

ReAct: Reason+Act, a paradigm in which agents interleave reasoning traces with actions and the resulting observations

GEPA: Genetic-Pareto—a prompt optimization method using evolutionary algorithms and natural language reflection

XBRL: eXtensible Business Reporting Language—a standard format for financial reporting used in the FiNER/Formula benchmarks

Delta Updates: Small, incremental additions to the context (like a patch) rather than regenerating the entire context text

DeepSeek-V3.1: The specific open-weights large language model used as the backbone for all agents in this paper