Distilling Feedback into Memory-as-a-Tool

📝 Paper Summary

Memory organization · Self-evolving · Agentic reasoning

The framework amortizes the high cost of iterative self-correction by distilling transient feedback into persistent, file-based guidelines that agents retrieve to improve future performance zero-shot.

Core Problem

Iterative self-correction ('System 2' reasoning) is computationally expensive and episodic; models 'forget' improvements once the context closes, forcing them to re-derive the same corrections for every new query.

Why it matters:

Models repeatedly re-derive the same insights for similar tasks, wasting compute
Fine-tuning is too costly and inflexible for adapting rapidly to user-defined rubrics
Standard context-based reasoning treats interactions in isolation, failing to consolidate 'lessons learned' over time

Concrete Example: A model fails to use 'synesthetic language' in a creative writing task. Standard methods critique and fix it, but the lesson is lost next time. This framework writes 'Prioritize synesthetic blending' to a file, enabling the model to retrieve and apply this rule zero-shot in future tasks.

Key Novelty

Memory-as-a-Tool

Treats memory as a file system managed via tools ('ls', 'read_file', 'write_file') rather than a passive vector database
Requires the LLM to actively 'abstract' raw feedback into generalizable principles before writing them to files
Amortizes the cost of reasoning: the expensive critique happens once, but the resulting 'lesson' file aids all future generations at low cost
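
The file-system view of memory described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name `FileMemory`, the flat directory layout, and the path check are assumptions; only the tool names `ls`, `read_file`, and `write_file` come from the summary.

```python
from pathlib import Path
import tempfile


class FileMemory:
    """Hypothetical file-backed memory exposing ls / read_file / write_file."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, name: str) -> Path:
        # Keep every access inside the memory directory.
        path = (self.root / name).resolve()
        if self.root not in path.parents:
            raise ValueError(f"path escapes memory root: {name}")
        return path

    def ls(self) -> list[str]:
        # Filenames are the only index the agent reasons over before reading.
        return sorted(p.name for p in self.root.iterdir() if p.is_file())

    def read_file(self, name: str) -> str:
        return self._resolve(name).read_text(encoding="utf-8")

    def write_file(self, name: str, content: str) -> None:
        self._resolve(name).write_text(content, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        memory = FileMemory(Path(tmp))
        memory.write_file("creative_writing.md", "- Prioritize synesthetic blending\n")
        print(memory.ls())
        print(memory.read_file("creative_writing.md"))
```

Because the lessons are plain text files, a person can inspect or edit them directly, which is harder to do with entries in a vector store.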

Architecture

Diagrammatic representation of the complete generation and learning process

Evaluation Highlights

Matches or exceeds compute-heavy 'Self-Critique' performance after just 2 rounds of feedback on the Rubric Feedback Bench
Maintains performance on a long horizon of 12 mixed/interleaved tasks, accumulating 8 memory files without forgetting
Achieves a Pareto-optimal trade-off: comparable scores to iterative refinement but with inference costs much closer to the zero-shot baseline

Breakthrough Assessment

8/10

Novel framing of memory as an explicit file-system tool for consolidating abstract principles. Addresses the 'forgetting' problem of inference-time self-correction while keeping inference costs much closer to the zero-shot baseline.

⚙️ Technical Details

Problem Definition

Setting: Continual learning environment with sequential tasks where an agent must improve adherence to a rubric R over time

Inputs: Task prompt z, persistent memory state M (file system)

Outputs: Refined response x'

Pipeline Flow

Retrieval Group: Memory Listing (ls) → Memory Reading (read_file)
Generation Group: Draft Generation → Evaluator Feedback
Update Group: Abstraction → Conflict Resolution → Memory Writing (write_file/edit_file)
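
The three groups above can be read as one loop per task. The sketch below is an assumption-laden outline: the function `run_episode`, the four injected callables, and the dictionary standing in for the file system are all hypothetical, and the duplicate check is only a naive stand-in for the conflict-resolution step.

```python
from typing import Callable

Memory = dict[str, str]  # filename -> guideline text (stand-in for the file system)


def run_episode(
    task: str,
    memory: Memory,
    select_files: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str, str], str],
    abstract: Callable[[str, str], tuple[str, str]],
) -> str:
    # Retrieval group: list the files, then read the ones judged relevant.
    names = select_files(task, sorted(memory))
    lessons = [memory[name] for name in names]

    # Generation group: draft with the retrieved lessons, then collect feedback.
    draft = generate(task, lessons)
    feedback = evaluate(task, draft)

    # Update group: abstract the feedback into a principle and write it back.
    filename, principle = abstract(task, feedback)
    existing = memory.get(filename, "")
    if principle not in existing:  # naive stand-in for conflict resolution
        memory[filename] = existing + f"- {principle}\n"
    return draft


if __name__ == "__main__":
    memory: Memory = {}
    stubs = dict(
        select_files=lambda task, names: names,
        generate=lambda task, lessons: f"{task} | lessons used: {len(lessons)}",
        evaluate=lambda task, draft: "The draft lacks synesthetic language.",
        abstract=lambda task, feedback: (
            "creative_writing.md",
            "Prioritize synesthetic blending",
        ),
    )
    print(run_episode("Write a poem about rain", memory, **stubs))
    print(run_episode("Write a poem about dusk", memory, **stubs))
    print(memory)
```

In the stubbed run, the first episode drafts with no lessons and the second episode retrieves the lesson written by the first, which is the carry-over the framework relies on.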

System Modules

Memory Manager

Identify and retrieve relevant past lessons before generation

Model or implementation: Claude Sonnet 4.5 / GPT-5.1 / Gemini 3 Pro

Generator (Generation Group)

Generate the response conditioned on task and retrieved memories

Model or implementation: Same as Memory Manager

Evaluator (Generation Group)

Simulate human feedback based on a private rubric

Model or implementation: Claude variant (different from agent)

Memory Writer

Consolidate transient feedback into persistent guidelines

Model or implementation: Same as Memory Manager

Novel Architectural Elements

File-based memory abstraction: Representing memory as human-readable files managed by explicit 'ls/read/write' tools rather than vector embeddings
Abstraction step: Explicitly prompting the model to convert specific errors into abstract 'Key Principles' before storage
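
One way to picture the abstraction step is as a prompt template that asks for task-agnostic principles rather than a fix to the current draft. The wording below is hypothetical and is not the prompt used in the paper; `ABSTRACTION_PROMPT` and `build_abstraction_prompt` are illustrative names.

```python
# Hypothetical prompt wording; the paper's actual prompt is not reproduced here.
ABSTRACTION_PROMPT = """\
You received feedback on a draft for the task below.

Task:
{task}

Feedback:
{feedback}

Existing guidelines in memory:
{existing}

Rewrite the feedback as short, task-agnostic Key Principles.
Do not mention details that apply only to this draft.
If a principle conflicts with an existing guideline, state which one should be kept.
"""


def build_abstraction_prompt(task: str, feedback: str, existing: str = "") -> str:
    """Fill the template; an empty memory is shown explicitly as '(none)'."""
    return ABSTRACTION_PROMPT.format(
        task=task.strip(),
        feedback=feedback.strip(),
        existing=existing.strip() or "(none)",
    )


if __name__ == "__main__":
    print(
        build_abstraction_prompt(
            task="Write a short poem about rain.",
            feedback="The draft does not use synesthetic language.",
        )
    )
```

Passing the existing guidelines into the same prompt lets the model notice overlaps or contradictions before anything is written to a file.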

Modeling

Base Model: Claude Sonnet 4.5, GPT-5.1, Gemini 3 Pro

Compute: Not reported in the paper

Comparison to Prior Work

vs. Self-Refine: Persists critiques as 'lessons' for future tasks instead of discarding them after the episode
vs. RAG: Uses active tool-based management (read/write) of synthesized rules rather than passive semantic search over raw documents
vs. Fine-tuning: Updates an external memory store rather than model parameters, allowing rapid adaptation to new rubrics
vs. MemGPT [not cited in paper]: Uses file-system metaphors for distilling abstract rules rather than managing raw context window overflow

Limitations

Retrieval relies on reasoning over filenames ('ls'), which may not scale to thousands of files without hierarchical traversal
Requires the model to have strong reasoning capabilities to synthesize abstract rules from raw feedback
Long-term memory management (active forgetting/pruning) is not yet implemented for lifetimes beyond 12 episodes

Reproducibility

Code: github.com/vicgalle/feedback-memory-as-a-tool

📊 Experiments & Results

Evaluation Setup

Continual learning simulation over sequential tasks (Horizon H=3 and H=12)

Benchmarks:

Rubric Feedback Bench (Open-ended writing with structured rubrics) [New]

Metrics:

Rubric Score (based on multi-dimensional criteria)
Inference Cost (token count/compute)
Statistical methodology: Not explicitly reported in the paper

Key Results

Quantitative scores are presented in charts (Figure 2) without raw numbers in the text, but specific parameters for the long-horizon experiment are reported.

Benchmark	Metric	Baseline	This Paper	Δ
Rubric Feedback Bench (Mixed Task)	Files Accumulated	0	8	+8

Experiment Figures

Mean scores across three models (Claude Sonnet 4.5, GPT-5.1, Gemini 3 Pro) over sequential task episodes

Pareto frontier comparing Average Cost per Task vs Average Score for Claude Sonnet 4.5

Main Takeaways

The Memory + Feedback approach matches or exceeds the performance of expensive inference-time 'Self-Critique' baselines after just two rounds of feedback.
The method reaches a Pareto-optimal trade-off: it achieves scores comparable to self-correction at inference costs much closer to the cheap zero-shot baseline.
Claude Sonnet 4.5 demonstrated the steepest learning curve, suggesting superior reasoning in synthesizing abstract rules.
The framework successfully generalizes across a long horizon of 12 interleaved tasks (mixed types), maintaining robust knowledge consolidation without significant interference.

📚 Prerequisite Knowledge

Prerequisites

Understanding of 'System 2' reasoning (iterative refinement)
Familiarity with LLM tool use (function calling)
Basic concepts of retrieval-augmented generation (RAG)

Key Terms

System 2 scaling: Trading test-time computation (e.g., via iterative self-correction or search) for higher accuracy, mimicking deliberate human reasoning

Amortization: Spreading the initial high cost of a resource (here, the compute for generating a critique) over many future uses, reducing the average cost per use
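
A small arithmetic sketch makes the amortization argument concrete. The cost model is an assumption: every task pays a base generation cost plus an optional retrieval cost, and the critique cost is paid only during the first few feedback rounds. The numbers in the usage example are arbitrary units chosen for illustration, not measurements from the paper.

```python
def average_cost_per_task(
    base_cost: float,
    critique_cost: float,
    feedback_rounds: int,
    num_tasks: int,
    retrieval_cost: float = 0.0,
) -> float:
    """Average cost when the critique is paid only in the first feedback rounds."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    paid_rounds = min(feedback_rounds, num_tasks)
    total = num_tasks * (base_cost + retrieval_cost) + paid_rounds * critique_cost
    return total / num_tasks


if __name__ == "__main__":
    # Arbitrary units: generation costs 1, one critique costs 3.
    with_memory = average_cost_per_task(1.0, 3.0, feedback_rounds=2, num_tasks=12)
    always_critique = average_cost_per_task(1.0, 3.0, feedback_rounds=12, num_tasks=12)
    print(with_memory)      # 1.5
    print(always_critique)  # 4.0
```

Under this model the average cost approaches the base cost plus the retrieval cost as the number of tasks grows, while critiquing every task keeps the average fixed at the base cost plus the critique cost.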

Rubric: A scoring guide used to evaluate performance, consisting of multi-dimensional criteria and behavioral descriptors

Zero-shot: Attempting a task without providing specific examples in the prompt; here, generating a response using only the retrieved memory guidelines

Episodic nature: The limitation where an AI model treats every interaction as a blank slate, forgetting previous context once the session ends

Tool calling: The capability of an LLM to invoke external functions (like 'write_file') to perform actions outside its text generation loop
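
As a rough illustration of tool calling, the sketch below routes a model-emitted call to a Python function and returns the result as JSON. The call shape (`name` plus `arguments`) is a generic assumption and not the wire format of any particular provider; `make_tools` and `dispatch` are hypothetical helpers, and a dictionary stands in for the file system.

```python
import json
from typing import Any, Callable


def make_tools(store: dict[str, str]) -> dict[str, Callable[..., Any]]:
    """Build the three memory tools over an in-memory store."""

    def ls() -> list[str]:
        return sorted(store)

    def read_file(path: str) -> str:
        return store[path]

    def write_file(path: str, content: str) -> str:
        store[path] = content
        return f"wrote {path}"

    return {"ls": ls, "read_file": read_file, "write_file": write_file}


def dispatch(tools: dict[str, Callable[..., Any]], call_json: str) -> str:
    """Execute one tool call and return a JSON string to feed back to the model."""
    call = json.loads(call_json)
    handler = tools.get(call["name"])
    if handler is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    result = handler(**call.get("arguments", {}))
    return json.dumps({"result": result})


if __name__ == "__main__":
    tools = make_tools({})
    write_call = json.dumps(
        {
            "name": "write_file",
            "arguments": {"path": "style.md", "content": "- Prioritize synesthetic blending"},
        }
    )
    print(dispatch(tools, write_call))
    print(dispatch(tools, json.dumps({"name": "ls"})))
```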