
PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories
A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution

Arash Shahmansoori
arXiv (2026)
Memory Agent RAG Reasoning

📝 Paper Summary

Self-evolving Agentic reasoning Memory organization
PRECEPT enables LLM agents to adapt to drift and compose rules deterministically by combining exact-match retrieval, Bayesian conflict resolution, and evolutionary prompt optimization into a unified test-time framework.
Core Problem
LLM agents relying on verbal memory suffer from retrieval errors that scale exponentially with condition count, struggle to compose atomic rules, and fail to detect stale knowledge under environmental drift.
Why it matters:
  • Current verbal reflection methods reach ~94% interpretation error once a task involves N=10 conditions, making them unreliable for multi-condition tasks.
  • Reinforcement learning is too sample-inefficient for deployment (it needs >100 samples) and must be retrained whenever the environment drifts.
  • Static agents fail 72% of the time when environment dynamics change, so deployed systems need to detect the change and adapt online.
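The ~94% figure at N=10 is consistent with a simple compounding model in which each condition is interpreted correctly with some fixed probability and a single misread condition spoils the whole retrieval. The sketch below is an illustration under that assumption, not the paper's derivation: the independence assumption and the per-condition accuracy of 0.75 are not stated in this summary, and 0.75 is chosen only because it reproduces the quoted number.

```python
def interpretation_error(per_condition_accuracy: float, n_conditions: int) -> float:
    """Probability that at least one of n independent conditions is misread."""
    return 1.0 - per_condition_accuracy ** n_conditions


# ASSUMPTION: 0.75 per-condition accuracy is illustrative, not taken from the paper.
ASSUMED_ACCURACY = 0.75

for n in (1, 2, 5, 10):
    error = interpretation_error(ASSUMED_ACCURACY, n)
    print(f"N={n:2d}  error={error:.3f}")
```

Under this model the probability of a fully correct interpretation decays geometrically in N, which is one way to read the "scales exponentially with condition count" claim.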
Concrete Example: In a logistics task with 10 conditions, a standard verbal reflection agent attempting to retrieve relevant rules suffers a 94.4% interpretation error rate due to partial matching. PRECEPT uses structured keys to guarantee 0% retrieval error on the deterministic path.
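A minimal sketch of the exact-match idea, assuming a structured key is a canonical, order-independent set of condition strings. The key format, the `RuleStore` class, and the condition and action names are hypothetical illustrations; the summary does not specify PRECEPT's actual key schema.

```python
from typing import Dict, FrozenSet, Iterable, Optional


def make_key(conditions: Iterable[str]) -> FrozenSet[str]:
    """Build a canonical key: normalized condition strings, order ignored."""
    return frozenset(c.strip().lower() for c in conditions)


class RuleStore:
    """Hypothetical rule memory with exact-match lookup only."""

    def __init__(self) -> None:
        self._rules: Dict[FrozenSet[str], str] = {}

    def learn(self, conditions: Iterable[str], action: str) -> None:
        self._rules[make_key(conditions)] = action

    def lookup(self, conditions: Iterable[str]) -> Optional[str]:
        # A hit requires the full condition set to match; a partial
        # overlap returns None instead of a "close enough" rule.
        return self._rules.get(make_key(conditions))


store = RuleStore()
store.learn(["port=hamburg", "cargo=hazmat"], "route_via_rail")

same_set_reordered = store.lookup(["cargo=hazmat", "port=hamburg"])
partial_overlap = store.lookup(["port=hamburg"])
print(same_set_reordered, partial_overlap)
```

The design point is that a miss is explicit: a query sharing only some conditions with a stored rule returns nothing, so the caller can fall back to another path rather than act on a partially matching rule.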
Key Novelty
Unified Framework for Deterministic Adaptation (PRECEPT)
  • Replaces fuzzy natural language retrieval with deterministic exact-match lookup using structured keys, enabling reliable rule stacking via a semantic hierarchy.
  • Treats memory conflicts as a reliability problem using Bayesian tracking to distinguish between temporary outliers and genuine environmental drift.
  • Optimizes agent prompts using an evolutionary outer loop (COMPASS) that selects based on a Pareto frontier of success and efficiency, rather than just gradients or heuristics.
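The second and third bullets can be illustrated with two small, generic building blocks. This is a sketch under stated assumptions, not PRECEPT's or COMPASS's implementation: it assumes a Beta-Bernoulli model for rule reliability with a uniform prior, a hypothetical staleness threshold of 0.5, and "efficiency" measured as average steps (lower is better). All names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RuleBelief:
    """Beta(alpha, beta) belief over a rule's success probability."""
    alpha: float = 1.0  # ASSUMPTION: uniform Beta(1, 1) prior
    beta: float = 1.0

    def update(self, success: bool) -> None:
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def looks_stale(self, threshold: float = 0.5) -> bool:
        # ASSUMPTION: hypothetical threshold on the posterior mean.
        return self.mean < threshold


def pareto_front(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of non-dominated (name, success_rate, avg_steps) candidates."""
    front = []
    for name, success, steps in candidates:
        dominated = any(
            (s2 >= success and c2 <= steps) and (s2 > success or c2 < steps)
            for _, s2, c2 in candidates
        )
        if not dominated:
            front.append(name)
    return front


# One failure after three successes reads as an outlier ...
belief = RuleBelief()
for outcome in (True, True, True, False):
    belief.update(outcome)
after_outlier = belief.looks_stale()

# ... while a run of further failures pushes the rule below the threshold.
for outcome in (False, False, False):
    belief.update(outcome)
after_drift = belief.looks_stale()

prompts = [("A", 0.9, 12.0), ("B", 0.8, 6.0), ("C", 0.7, 9.0), ("D", 0.9, 15.0)]
print(after_outlier, after_drift, pareto_front(prompts))
```

In the usage above, prompt C is dropped because B is both more successful and cheaper, and D is dropped because A matches its success rate in fewer steps. A and B both survive since neither beats the other on both objectives.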
Evaluation Highlights
  • +41.1pp first-try success advantage over Full Reflexion (effect size d>1.9) across 9-10 seeds.
  • +55.0pp recovery from environmental drift (d=0.95, p=0.031) compared to baselines.
  • 100% P1 score on 2-way logistics compositional tasks (d=2.64), demonstrating reliable rule composition.
Breakthrough Assessment
9/10
Addresses critical reliability bottlenecks in agents (determinism, drift, compositionality) with a theoretically grounded, unified architecture. Strong empirical gains (+41pp) and novel integration of evolutionary methods.