PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories

A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution

📝 Paper Summary

Self-evolving, Agentic reasoning, Memory organization

PRECEPT enables LLM agents to adapt to drift and compose rules deterministically by combining exact-match retrieval, Bayesian conflict resolution, and evolutionary prompt optimization into a unified test-time framework.

Core Problem

LLM agents relying on verbal memory suffer from retrieval errors that scale exponentially with condition count, struggle to compose atomic rules, and fail to detect stale knowledge under environmental drift.

Why it matters:

Current verbal reflection methods reach ~94% interpretation error rates on complex conditions (N=10), making them unreliable for multi-condition tasks.
Reinforcement learning is too sample-inefficient for deployment (it needs >100 samples) and must be retrained whenever the environment drifts.
Static agents fail 72% of the time when environment dynamics change, so deployed systems need to detect drift and adapt online.

Concrete Example: In a logistics task with 10 conditions, a standard verbal reflection agent attempting to retrieve relevant rules suffers a 94.4% interpretation error rate due to partial matching. PRECEPT uses structured keys to guarantee 0% retrieval error on the deterministic path.
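
The exponential scaling behind this figure can be illustrated with a simple independence model. This is only a sketch: it assumes each condition is interpreted correctly with the same probability, independently of the others, and the per-condition accuracy of 0.75 is an assumption chosen because it reproduces the reported ~94.4% at N=10. The summary does not state the per-condition accuracy, and the function name is illustrative.

```python
def interpretation_error(per_condition_accuracy: float, n_conditions: int) -> float:
    """Probability that at least one of n independently read conditions is misread."""
    return 1.0 - per_condition_accuracy ** n_conditions


# ASSUMPTION: 0.75 is not taken from the paper; it is picked to match the
# reported ~94.4% interpretation error at N=10 under the independence model.
ASSUMED_ACCURACY = 0.75

for n in (1, 2, 5, 10):
    print(f"N={n:2d}  error={interpretation_error(ASSUMED_ACCURACY, n):.1%}")
```

Under this model the chance of reading every condition correctly shrinks as 0.75^N, which is why partial matching that looks tolerable for one or two conditions becomes almost always wrong at ten.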

Key Novelty

Unified Framework for Deterministic Adaptation (PRECEPT)

Replaces fuzzy natural language retrieval with deterministic exact-match lookup using structured keys, enabling reliable rule stacking via a semantic hierarchy.
Treats memory conflicts as a reliability problem using Bayesian tracking to distinguish between temporary outliers and genuine environmental drift.
Optimizes agent prompts using an evolutionary outer loop (COMPASS) that selects based on a Pareto frontier of success and efficiency, rather than just gradients or heuristics.

Evaluation Highlights

+41.1pp first-try success advantage over Full Reflexion (effect size d>1.9) across 9-10 seeds.
+55.0pp recovery from environmental drift (d=0.95, p=0.031) compared to baselines.
100% P1 score on 2-way logistics compositional tasks (d=2.64), demonstrating reliable rule composition.

Breakthrough Assessment

9/10

Addresses critical reliability bottlenecks in agents (determinism, drift, compositionality) with a theoretically grounded, unified architecture. It reports strong empirical gains (+41pp) and a novel integration of evolutionary methods.

⚙️ Technical Details

Problem Definition

Setting: Test-time adaptation for LLM agents in environments with hidden constraints, compositional rules, and non-stationary dynamics (drift).

Inputs: Task description, environmental feedback, growing history of execution traces.

Outputs: Action plan executing the task while satisfying hidden constraints.

Pipeline Flow

Task Parsing (Rule-based/Hybrid)
COMPASS High-Freq Monitor (Constraint Checks)
Dual-Mode Retrieval & Conflict Resolution (grouped as Retrieval & Selection)
Decision Making (Exact-Match/LLM)
Domain Execution
Feedback & Learning (Evo-Memory Update)

System Modules

RefineInterceptor

Client-side guard that actively prunes actions violating known forbidden constraints.

Model or implementation: Rule-based Logic
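
A minimal sketch of how such a guard could work, assuming failures are remembered as (context key, action) pairs. The class layout, method names, and key format are hypothetical illustrations, not the paper's interface.

```python
from dataclasses import dataclass, field


@dataclass
class RefineInterceptor:
    """Hypothetical guard: remembers failed (context, action) pairs and prunes them."""

    forbidden: set[tuple[str, str]] = field(default_factory=set)

    def record_failure(self, context_key: str, action: str) -> None:
        # Store the failed action under the context in which it failed.
        self.forbidden.add((context_key, action))

    def prune(self, context_key: str, candidates: list[str]) -> list[str]:
        # Drop every candidate already known to fail in this context.
        return [a for a in candidates if (context_key, a) not in self.forbidden]


guard = RefineInterceptor()
guard.record_failure("dest=EU|hazmat", "ship_air")  # made-up context and action names
print(guard.prune("dest=EU|hazmat", ["ship_air", "ship_sea"]))
print(guard.prune("dest=US", ["ship_air", "ship_sea"]))
```

Because pruning is a set lookup rather than an LLM judgment, an action that failed once in a given context cannot be proposed again in that context, while the same action stays available elsewhere.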

Dual-Mode Retriever (Retrieval & Selection)

Fetches knowledge using both exact hash keys (Tier 1) and vector similarity (Tier 2).

Model or implementation: Hybrid (Hash Table + ChromaDB/BM25)
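
The two tiers can be sketched as follows. This is an illustration under stated assumptions: the key format (sorted condition tokens joined by `|`) is hypothetical, and Jaccard overlap stands in for the vector similarity that the summary attributes to ChromaDB/BM25, so no real vector store is called here.

```python
from typing import Optional


def make_key(conditions: set[str]) -> str:
    """Canonical structured key: sorted condition tokens joined by '|'."""
    return "|".join(sorted(conditions))


class DualModeRetriever:
    def __init__(self) -> None:
        self.exact: dict[str, str] = {}                      # Tier 1: hash table
        self.fuzzy: list[tuple[frozenset[str], str]] = []    # Tier 2 stand-in index

    def add_rule(self, conditions: set[str], rule: str) -> None:
        self.exact[make_key(conditions)] = rule
        self.fuzzy.append((frozenset(conditions), rule))

    def retrieve(self, conditions: set[str], min_sim: float = 0.5) -> tuple[Optional[str], str]:
        # Tier 1: O(1) exact lookup; a hit always overrides similarity search.
        hit = self.exact.get(make_key(conditions))
        if hit is not None:
            return hit, "tier1-exact"
        # Tier 2: best Jaccard overlap (ASSUMPTION: stand-in for vector similarity).
        best_rule, best_sim = None, 0.0
        for stored, rule in self.fuzzy:
            sim = len(stored & conditions) / len(stored | conditions)
            if sim > best_sim:
                best_rule, best_sim = rule, sim
        if best_sim >= min_sim:
            return best_rule, "tier2-similar"
        return None, "miss"


retriever = DualModeRetriever()
retriever.add_rule({"hazmat", "eu"}, "use_sea_freight")
print(retriever.retrieve({"eu", "hazmat"}))              # same conditions, any order
print(retriever.retrieve({"eu", "hazmat", "fragile"}))   # no exact key, partial overlap
```

Sorting the tokens before hashing makes the key order-independent, so the same set of conditions always maps to the same entry. Returning the tier label alongside the rule lets downstream logic treat fuzzy hits with more caution than exact ones.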

Conflict Ensemble (Retrieval & Selection)

Detects contradictions between static KB and dynamic experience.

Model or implementation: Ensemble (6 methods including Bayesian)
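
As an illustration of the Bayesian member of this ensemble only, the sketch below tracks each source's reliability with a Beta-Bernoulli posterior and picks a source by Thompson-style sampling. The priors, the decay rule, and all names are assumptions made for this example; the other five ensemble methods are not shown and the paper's exact update rules are not given in this summary.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaReliability:
    """Beta-Bernoulli posterior over how often a knowledge source is right."""

    alpha: float = 1.0  # pseudo-count of confirmations (prior is an assumption)
    beta: float = 1.0   # pseudo-count of contradictions

    def update(self, confirmed: bool) -> None:
        if confirmed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def decay(self, gamma: float = 0.9) -> None:
        # Hypothetical confidence decay: shrink evidence toward a uniform prior.
        self.alpha = 1.0 + gamma * (self.alpha - 1.0)
        self.beta = 1.0 + gamma * (self.beta - 1.0)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.alpha, self.beta)


def resolve(static_kb: BetaReliability, experience: BetaReliability, rng: random.Random) -> str:
    """Thompson-style choice: trust the source with the higher sampled reliability."""
    return "static" if static_kb.sample(rng) >= experience.sample(rng) else "dynamic"


rng = random.Random(0)
static_kb = BetaReliability(alpha=9.0, beta=1.0)  # initially trusted static knowledge
experience = BetaReliability()

static_kb.update(False)  # one contradiction: posterior mean moves only slightly
print(f"after one outlier: {static_kb.mean:.2f}")

for _ in range(20):      # sustained contradictions: consistent with genuine drift
    static_kb.update(False)
    experience.update(True)
print(f"after sustained conflict: {static_kb.mean:.2f}", resolve(static_kb, experience, rng))
```

A single contradicting observation barely moves a well-supported posterior, whereas repeated contradictions pull it down until the dynamic source wins. That is the sense in which reliability tracking can separate temporary outliers from drift.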

COMPASS

Optimizes system prompts and strategies based on success/efficiency metrics.

Model or implementation: Evolutionary Engine (GEPA extension)
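
Pareto-based selection over two objectives can be sketched as below. The assumption here is that "success" is a success rate to maximize and "efficiency" is a step count to minimize; the candidate values are made up for illustration, and the names `Candidate`, `dominates`, and `pareto_front` are not from the paper. Mutation and the rest of the evolutionary loop are omitted.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    prompt: str
    success: float  # success rate, higher is better
    steps: float    # average step count, lower is better (assumed efficiency measure)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.success >= b.success and a.steps <= b.steps
    strictly_better = a.success > b.success or a.steps < b.steps
    return no_worse and strictly_better


def pareto_front(population: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in population if not any(dominates(o, c) for o in population)]


# Toy population with made-up scores, not results from the paper.
population = [
    Candidate("prompt_a", success=0.9, steps=12),
    Candidate("prompt_b", success=0.8, steps=6),
    Candidate("prompt_c", success=0.7, steps=9),
    Candidate("prompt_d", success=0.9, steps=15),
]
for survivor in pareto_front(population):
    print(survivor)
```

Selecting from the frontier instead of a single weighted score keeps both the most accurate and the most economical prompts alive as parents, so the search does not collapse onto one trade-off too early.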

Novel Architectural Elements

Tightly coupled 'Dual-Mode' retrieval where O(1) exact keys override semantic search to force determinism.
Bayesian 'Conflict-Aware' memory layer that acts as a reliability filter between retrieval and generation.
Split-frequency architecture: High-frequency runtime monitor + Low-frequency evolutionary prompt architect (COMPASS).

Modeling

Base Model: LLM agent (backbone not specified in the paper text)

Comparison to Prior Work

vs. Full Reflexion: Uses deterministic structured retrieval to avoid interpretation degradation (0% error vs 94.4%).
vs. DRQ: Adds compositional rule learning and exact retrieval to the adversarial robustness concept.
vs. RL: Achieves adaptation with drastically fewer samples (β ≤ 3 vs. β > 100) and handles drift without retraining.

Limitations

Conditional compositional guarantee depends on the semantic tier hierarchy matching the domain's true constraint priorities.
Requires domain-provided atomic condition vocabularies (does not discover arbitrary symbolic tokens from raw text).
Theoretical guarantees are based on modeled stationary segments and independence assumptions.

Reproducibility

No code link is provided. The paper mentions an MCP server implementation and domain-specific executors, but the provided text does not specify the LLM backbones or hardware used.

📊 Experiments & Results

Evaluation Setup

Agentic execution in Logistics, Booking, and Integration domains under static and drifting constraints.

Benchmarks:

Logistics Domain (Constraint Satisfaction / Planning) [New]
Booking Domain (Constraint Satisfaction) [New]
Integration Domain (Complex Multi-step Planning) [New]

Metrics:

First-try success rate
Compositional Generalization (P1 score)
Drift Recovery (pp gain)
Step count
Statistical methodology: Reported p-values (e.g., p<0.001, p=0.031) for core comparisons.
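
The d values quoted next to the p-values read like standardized effect sizes. Assuming they are Cohen's d (this summary does not define d), the sketch below shows how a percentage-point gain and a pooled-variance Cohen's d would be computed from per-seed success rates. The per-seed numbers are toy values, not results from the paper.

```python
import math
import statistics


def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * statistics.variance(treatment)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_t + n_c - 2)
    return (statistics.mean(treatment) - statistics.mean(control)) / math.sqrt(pooled_var)


def pp_gain(treatment: list[float], control: list[float]) -> float:
    """Difference in mean success rate, expressed in percentage points."""
    return 100.0 * (statistics.mean(treatment) - statistics.mean(control))


# Toy per-seed first-try success rates (made up for illustration only).
method_seeds = [0.9, 1.0, 0.8]
baseline_seeds = [0.5, 0.6, 0.4]
print(f"gain = {pp_gain(method_seeds, baseline_seeds):+.1f}pp")
print(f"d = {cohens_d(method_seeds, baseline_seeds):.2f}")
```

Reporting d alongside the pp gain matters because the same gain can be a large or a small effect depending on how much the results vary across seeds.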

Key Results

Performance comparisons against the Full Reflexion baseline across different difficulty settings and metrics.
Benchmark	Metric	Baseline	This Paper	Δ
Logistics/Integration (Aggregated)	First-try advantage	Not reported	Not reported	+41.1pp
Logistics (2-way composition)	P1 Score	Not reported	100%	+33.3pp
Drifting Environment	Recovery Rate	Not reported	Not reported	+55.0pp
General Execution	Step Count Reduction	Not reported	Not reported	-61%

Main Takeaways

Deterministic retrieval eliminates the interpretation errors that plague verbal reflection agents, enabling reliable compositional stacking.
Bayesian conflict resolution successfully identifies and overrides stale knowledge, providing significant resilience against environmental drift.
The COMPASS evolutionary loop improves prompt quality over time, contributing to continuous learning gains of +40-55pp.
The framework achieves these gains while reducing computational steps by 61%, validating the efficiency of the exact-match fast path.

📚 Prerequisite Knowledge

Prerequisites

Agentic workflows (Reflection, Planning)
Bayesian inference (Posterior updates, Thompson Sampling)
Evolutionary Algorithms (Pareto optimization, MAP-Elites)

Key Terms

PRECEPT: The proposed framework: Planning Resilience via Experience, Context Engineering & Probing Trajectories.

COMPASS: Context-aware Multi-objective Pareto-guided Adaptive Strategy Search—the evolutionary outer loop that optimizes prompts.

RefineInterceptor: A client-side guard module that guarantees zero repeated failed actions by pruning actions that violate known forbidden constraints.

Type I Conflict: Static vs. Dynamic source conflict (e.g., old knowledge contradicts new observation), handled by Bayesian reliability tracking.

Type II Conflict: Environmental Drift (e.g., rules that were valid become invalid), handled by confidence decay.

Evo-Memory: A growing history of failure constraints and experiences used to detect conflicts.

Thompson Sampling: A randomized algorithm that balances exploration and exploitation by sampling from probability distributions (posteriors) of arm rewards.

MAP-Elites: Multi-dimensional Archive of Phenotypic Elites—an evolutionary algorithm that maintains a diverse collection of high-performing solutions.

GEPA: Genetic-Pareto, a prior reflective prompt-evolution method that COMPASS extends.

MCP: Model Context Protocol—a standard for connecting AI assistants to systems and data (implied context for 'MCP server').