AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

📝 Paper Summary

Agent Evolution Memory Organization

Memento enables LLM agents to continuously adapt and improve on deep research tasks via a growing episodic memory and a learned case-retrieval policy, eliminating the need for expensive parameter fine-tuning.

Core Problem

Current LLM agents either rely on rigid, static workflows that cannot adapt, or require computationally expensive fine-tuning (SFT/RL) to update model parameters, which is inefficient for open-ended continuous learning.

Why it matters:

Continuous adaptation is essential for generalist agents in changing environments, but frequent retraining is cost-prohibitive
Static prompting strategies fail to incorporate online feedback from successes and failures
Existing memory systems often suffer from retrieval swamping without selective curation mechanisms

Concrete Example: In a deep research scenario, an agent might fail a complex web-search task. A standard agent repeats the mistake or requires a full model update to learn. Memento stores the failure trace in memory; when a similar task appears, it retrieves the failure case to guide the planner away from the previous error.

Key Novelty

Memory-augmented MDP with Neural Case-Selection Policy

Formalizes agent planning as a Memory-augmented MDP where the state includes both the environment status and a retrieval-based case bank
Optimizes a 'case retrieval policy' using online reinforcement learning (Soft Q-Learning) to select the most useful past experiences (successes or failures) for the current context
Updates the retrieval mechanism (Q-function) rather than the LLM parameters, allowing the agent to 'learn on the fly' by curating its episodic memory
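As a rough illustration of treating case selection as a soft (entropy-regularized) policy, the sketch below turns Q-values over candidate cases into selection probabilities with a softmax at temperature alpha. The function names and the example Q-values are hypothetical and are not taken from the Memento codebase.

```python
import math
import random

def soft_q_policy(q_values, alpha):
    """Return softmax(Q / alpha) as a list of selection probabilities."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_case(q_values, alpha, rng):
    """Sample the index of one case according to the soft policy."""
    probs = soft_q_policy(q_values, alpha)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

# Hypothetical Q-values for three stored cases.
q_values = [0.9, 0.2, 0.5]
probs_sharp = soft_q_policy(q_values, alpha=0.1)   # close to greedy
probs_flat = soft_q_policy(q_values, alpha=10.0)   # close to uniform
chosen = sample_case(q_values, alpha=0.1, rng=random.Random(0))
```

A small alpha concentrates probability on the highest-valued case, while a large alpha keeps the selection close to uniform and so favors exploration.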

Architecture

The dual-stage architecture of Memento, alternating between Case-Based Planning and Tool-Based Execution.

Evaluation Highlights

Attains top-1 on GAIA validation with 87.88% Pass@3 and 79.40% on the private test leaderboard
Outperforms state-of-the-art training-based methods on DeepResearcher dataset, achieving 66.6% F1 and 80.4% PM
Case-based memory adds 4.7 to 9.6 absolute percentage points on out-of-distribution tasks compared to baselines

Breakthrough Assessment

9/10

Achieves SOTA on the challenging GAIA benchmark without fine-tuning, offering a scalable, non-parametric alternative to costly agent training. The formalization of retrieval as an RL policy over memory is a significant methodological advance.

⚙️ Technical Details

Problem Definition

Setting: Memory-Based Markov Decision Process (M-MDP)

Inputs: Task instruction and current environment state s

Outputs: Action a (plan generation) conditioned on retrieved case c

Pipeline Flow

Input Processing
Case-Based Planning (Planner + Memory)
Tool-Based Execution (Executor + Tools)
Memory Update
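The four stages above can be sketched as a single episode loop. This is a toy illustration only: the planner and executor are stand-in functions rather than LLM calls, retrieval uses simple word overlap, and every class and method name here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    task: str
    plan: list
    reward: int  # 1 = success, 0 = failure

@dataclass
class Agent:
    memory: list = field(default_factory=list)

    def retrieve(self, task, k=2):
        # Placeholder retrieval: rank stored cases by word overlap with the task.
        words = set(task.lower().split())
        def overlap(case):
            return len(words & set(case.task.lower().split()))
        ranked = sorted(self.memory, key=overlap, reverse=True)
        return [c for c in ranked[:k] if overlap(c) > 0]

    def plan(self, task, cases):
        # Stand-in for the LLM planner: successes become hints to reuse,
        # failures become hints to avoid.
        hints = [f"reuse:{c.task}" if c.reward else f"avoid:{c.task}" for c in cases]
        return hints + [f"search:{task}", f"answer:{task}"]

    def execute(self, steps):
        # Stand-in for the tool-based executor.
        return [f"done({step})" for step in steps]

    def run_episode(self, task, judge):
        cases = self.retrieve(task)               # Case-Based Planning: read memory
        steps = self.plan(task, cases)            # Case-Based Planning: build plan
        results = self.execute(steps)             # Tool-Based Execution
        reward = 1 if judge(results) else 0
        self.memory.append(Case(task, steps, reward))  # Memory Update
        return steps, reward

agent = Agent()
plan1, r1 = agent.run_episode("find paper citation count", judge=lambda res: False)
plan2, r2 = agent.run_episode("find paper author list", judge=lambda res: True)
```

In the second episode the failed first case is retrieved and surfaces as an "avoid" hint, which mirrors the failure-reuse behavior described in the concrete example.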

System Modules

Planner (Case-Based Planning)

Decompose the high-level task into subtasks using retrieved cases as context

Model or implementation: GPT-4o

Case Memory (Parametric/Non-Parametric) (Case-Based Planning)

Store and retrieve past trajectories (state, action, reward)

Model or implementation: Vector store + Q-Network (for parametric)
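A minimal sketch of the non-parametric variant, assuming cases are stored as (embedding, action, reward) tuples and read back by cosine similarity. The embeddings here are hand-written toy vectors; a real system would use an encoder and a vector store, and the class and method names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class CaseBank:
    def __init__(self):
        self.cases = []  # list of (embedding, action, reward)

    def write(self, embedding, action, reward):
        """Append one experienced case to memory."""
        self.cases.append((embedding, action, reward))

    def read(self, query, k=2):
        """Return the k stored cases most similar to the query embedding."""
        ranked = sorted(self.cases, key=lambda c: cosine(query, c[0]), reverse=True)
        return ranked[:k]

bank = CaseBank()
bank.write([1.0, 0.0, 0.0], "use web search", 1)
bank.write([0.0, 1.0, 0.0], "parse spreadsheet", 1)
bank.write([0.9, 0.1, 0.0], "guess without searching", 0)
top = bank.read([1.0, 0.05, 0.0], k=2)
```

Note that the read returns both a success and a failure here, so the planner can see what worked and what did not for similar states.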

Executor

Execute generated subtasks using external tools

Model or implementation: o3-mini (for GAIA) / o4-mini (default)

Novel Architectural Elements

Integration of a learnable 'retrieval policy' (Q-function) directly into the planning loop, treating case selection as an RL action
Separation of Planner (CBR-based) and Executor (Tool-based) where only the Planner's context is actively optimized via memory selection

Modeling

Base Model: GPT-4o (Planner), o3-mini/o4-mini (Executor)

Training Method: Online Soft Q-Learning for the Retrieval Policy (NOT the LLM)

Objective Functions:

Purpose: Maximize expected reward and entropy of case selection.

Formally: J(π) = Σ_t E[r(s_t, a_t) + α H(π(·|s_t))]

Purpose: Minimize TD error for Q-function (parametric memory).

Formally: Binary Cross Entropy loss approximating the likelihood that a retrieved case leads to success (r=1)
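For reference, the two objectives can be written out in standard form. The softmax policy below is the usual closed-form solution of entropy-regularized Q-learning, and the second line is the ordinary binary cross-entropy loss. The symbol θ for the Q-network parameters and the exact notation are assumptions; the paper's own notation may differ.

```latex
% Soft Q-learning policy over candidate cases c given state s
\pi(c \mid s) = \frac{\exp\big(Q(s, c)/\alpha\big)}{\sum_{c'} \exp\big(Q(s, c')/\alpha\big)}

% Binary cross-entropy loss for the parametric Q-function, with r \in \{0, 1\}
\mathcal{L}(\theta) = -\,\mathbb{E}\big[\, r \log Q_\theta(s, c) + (1 - r) \log\big(1 - Q_\theta(s, c)\big) \big]
```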

Adaptation: None (LLM is frozen)

Training Data:

Online experience collection: M_{t+1} = M_t ∪ {(s_t, a_t, r_t)}

Key Hyperparameters:

gamma: Discount factor (standard RL)
alpha: Entropy weight
eta: Learning rate for Q-function
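To show how eta enters the update, here is a hedged sketch of one way a small Q-function could be trained with the binary cross-entropy objective. It assumes a logistic model over hand-made features, which is a simplification for illustration and not the paper's network; all names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def q_value(weights, features):
    """Predicted probability that using this case leads to success."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def bce_loss(q, reward):
    """Binary cross-entropy between prediction q and binary reward."""
    eps = 1e-12  # guards against log(0)
    return -(reward * math.log(q + eps) + (1 - reward) * math.log(1 - q + eps))

def sgd_step(weights, features, reward, eta):
    """One gradient step; for a sigmoid output the BCE gradient is (q - r) * x."""
    q = q_value(weights, features)
    return [w - eta * (q - reward) * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]
features = [1.0, 0.5]  # toy (state, case) features
loss_before = bce_loss(q_value(weights, features), 1)
for _ in range(50):
    weights = sgd_step(weights, features, reward=1, eta=0.5)
loss_after = bce_loss(q_value(weights, features), 1)
```

Repeatedly observing a successful outcome pushes the predicted Q-value for this state-case pair toward 1, and the loss falls accordingly.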

Compute: Low-cost (updates small Q-network only); Inference uses API calls to GPT-4o/o3-mini

Comparison to Prior Work

vs. Reflexion: Memento learns a retrieval policy via RL rather than just appending text reflections
vs. RA-DIT: Memento freezes the LLM and only learns the retrieval function, avoiding expensive LLM fine-tuning
vs. Mem0: Memento uses reward-driven Q-learning for memory selection, whereas Mem0 focuses on structured CRUD operations

Limitations

Dependency on proprietary closed-source models (GPT-4o, o3-mini) for the underlying planner/executor
Performance gain depends on the availability of relevant past cases (cold start problem)
Binary reward signal (success/fail) may be sparse for very long-horizon tasks

Reproducibility

Code: https://github.com/Agent-on-the-Fly/Memento

The code is publicly available at the repository linked above. The paper details the M-MDP formulation and memory update rules. The base models are proprietary (GPT-4o, o3-mini).

📊 Experiments & Results

Evaluation Setup

Deep research and long-horizon agentic tasks

Benchmarks:

GAIA (Long-horizon tool use)
DeepResearcher (Real-time web research)
SimpleQA (Factual precision)
HLE (Long-tail academic reasoning)

Metrics:

Pass@3
F1 Score
Process Match (PM)
Statistical methodology: Not explicitly reported in the paper
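The summary does not state how F1 is computed. One common choice for question-answering benchmarks is token-overlap F1 between the predicted and reference answers, sketched below as an assumption rather than as the paper's exact scoring script.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two answer strings (whitespace tokens, lowercased)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the Eiffel Tower in Paris", "Eiffel Tower")
```

Here precision is 2/5 and recall is 2/2, so the score is 4/7, which shows how verbose answers are penalized even when they contain the reference.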

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GAIA Validation	Pass@3	Not reported in the paper	87.88	Not reported in the paper
DeepResearcher	F1	Not reported in the paper	66.6	Not reported in the paper
DeepResearcher	PM	Not reported in the paper	80.4	Not reported in the paper
Out-of-distribution tasks	Absolute Improvement	Not reported in the paper	Not reported in the paper	+4.7 to +9.6
SimpleQA	PM	Not reported in the paper	95.0	Not reported in the paper

Main Takeaways

Parametric memory retrieval driven by Soft Q-Learning allows the agent to distinguish between high-utility and low-utility cases, significantly boosting performance over static retrieval.
The system demonstrates strong generalization to out-of-distribution tasks, leveraging the case bank to find relevant precedents even for unseen scenarios.
Memento achieves state-of-the-art results on GAIA and DeepResearcher without any gradient updates to the underlying LLM, which supports the efficacy of memory-based adaptation.

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDP)
Reinforcement Learning (specifically Soft Q-Learning)
Case-Based Reasoning (CBR)
Retrieval-Augmented Generation (RAG)

Key Terms

M-MDP: Memory-augmented Markov Decision Process—an extension of MDPs where the state space includes an evolving external memory of past experiences

CBR: Case-Based Reasoning—a problem-solving paradigm that solves new problems by retrieving and adapting solutions from similar past problems

Pass@3: A metric measuring the probability that at least one of 3 generated solutions to a task is correct
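A minimal sketch of the empirical version of this metric, assuming each task has a list of boolean outcomes, one per attempt. The data below is made up for illustration.

```python
def pass_at_k(attempts_per_task, k=3):
    """Fraction of tasks where at least one of the first k attempts succeeded."""
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)

# Hypothetical outcomes for four tasks, three attempts each.
results = [
    [False, True, False],
    [False, False, False],
    [True, True, True],
    [False, False, True],
]
rate = pass_at_k(results, k=3)
first_try = pass_at_k(results, k=1)
```

With this data three of four tasks are solved within three attempts, but only one is solved on the first attempt, which is why Pass@3 is always at least as high as Pass@1.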

PM: Process Match, a metric that likely measures how closely the agent's execution path aligns with a reference or desired workflow; the source does not define it precisely

MCP: Model Context Protocol—a standardized interface for connecting AI models to external tools and data sources

Soft Q-Learning: An RL algorithm that maximizes both the expected reward and the entropy of the policy, encouraging exploration and robustness

Episodic Control: A learning method that rapidly estimates values (Q-values) based on highly similar past events stored in memory, rather than slow gradient updates
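One simple way to estimate a value from similar stored events is a kernel-weighted average over memory. The sketch below assumes a Gaussian kernel over toy 2-D state vectors; the kernel choice, bandwidth, and function names are illustrative assumptions.

```python
import math

def gaussian_kernel(u, v, bandwidth=1.0):
    """Similarity that decays with squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def episodic_q(query, memory, bandwidth=1.0):
    """Kernel-weighted average of stored values; memory is a list of (state, value)."""
    weights = [gaussian_kernel(query, state, bandwidth) for state, _ in memory]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * value for w, (_, value) in zip(weights, memory)) / total

# Two stored events: a success near the origin and a failure near (3, 3).
memory = [([0.0, 0.0], 1.0), ([3.0, 3.0], 0.0)]
near = episodic_q([0.1, 0.0], memory)
far = episodic_q([2.9, 3.0], memory)
```

No gradient step is involved: adding a new (state, value) pair to memory changes the estimate for nearby queries immediately.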

TD learning: Temporal Difference learning—an RL method that updates estimates based on other learned estimates, bootstrapping from the future to the present
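A worked one-step example of the standard TD(0) update, using gamma for the discount factor and eta for the learning rate as in the hyperparameter list. The numbers are arbitrary.

```python
def td0_update(value, reward, next_value, gamma, eta):
    """Move the current estimate toward the bootstrapped target r + gamma * V(s')."""
    td_error = reward + gamma * next_value - value
    return value + eta * td_error, td_error

# Terminal transition: the next state's value is 0, so the target is just the reward.
new_value, td_error = td0_update(value=0.5, reward=1.0, next_value=0.0, gamma=0.9, eta=0.1)
```

The TD error is 1.0 + 0.9 * 0.0 - 0.5 = 0.5, so the estimate moves from 0.5 to 0.55.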