Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

📝 Paper Summary

Agentic RAG pipeline Self-evolving Agentic reasoning Cross-framework knowledge transfer

AGENTKB is a universal memory infrastructure that allows heterogeneous agent frameworks to share and reuse problem-solving experiences without retraining, using a two-stage retrieval process for planning and feedback.

Core Problem

Current AI agent frameworks operate in silos with incompatible memory systems, forcing agents to rediscover solutions and repeat mistakes rather than learning from collective experience.

Why it matters:

Representation heterogeneity prevents transferring effective solutions between different frameworks (e.g., smolagents vs. OpenHands)
Context mismatch causes valid solutions in one environment to fail in another due to API or reasoning protocol differences
Knowledge interference risks destabilizing an agent's reasoning flow when naively injecting external execution traces

Concrete Example: In a PDB protein distance calculation, a standard agent naively reads the first two lines of a file, selecting solvent records and calculating a spurious distance (0.961 Å). It fails to learn from past correct workflows that filter for 'ATOM' entries and sanity-check bond lengths.
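
The corrected workflow can be sketched as below, assuming the standard fixed-width PDB column layout for coordinates. The function names, the synthetic sample records, and the 1.0 to 2.0 Å plausibility window are illustrative assumptions, not values taken from the paper.

```python
import math


def atom_coordinates(pdb_text):
    """Return (x, y, z) tuples for ATOM records only, skipping HETATM/solvent lines."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM  "):  # record name occupies columns 1-6
            # Fixed-width columns (1-indexed): x = 31-38, y = 39-46, z = 47-54.
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def first_two_atom_distance(pdb_text, lo=1.0, hi=2.0):
    """Distance in Å between the first two ATOM records, plus a plausibility flag.

    lo/hi are hypothetical sanity bounds chosen for illustration.
    """
    coords = atom_coordinates(pdb_text)
    if len(coords) < 2:
        raise ValueError("need at least two ATOM records")
    d = math.dist(coords[0], coords[1])
    return round(d, 3), lo <= d <= hi


def pdb_line(record, serial, name, x, y, z):
    """Build a synthetic, column-aligned PDB record for the demo below."""
    return "{:<6}{:>5} {:<4} {:<3} {}{:>4}    {:>8.3f}{:>8.3f}{:>8.3f}".format(
        record, serial, name, "XXX", "A", 1, x, y, z
    )


# Synthetic file: two solvent-like HETATM records come first, then two ATOM records.
SAMPLE = "\n".join([
    pdb_line("HETATM", 1, "O", 0.0, 0.0, 0.0),
    pdb_line("HETATM", 2, "H1", 0.961, 0.0, 0.0),
    pdb_line("ATOM", 3, "N", 10.0, 10.0, 10.0),
    pdb_line("ATOM", 4, "CA", 11.456, 10.0, 10.0),
])

distance, plausible = first_two_atom_distance(SAMPLE)
```

Reading the first two lines of SAMPLE would pick the HETATM pair; filtering on the record name first is what avoids the spurious value.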

Key Novelty

Universal Cross-Framework Memory Layer

Abstracts execution traces from diverse frameworks (smolagents, OpenHands, etc.) into a unified, framework-agnostic schema containing constraints, action-reasoning pairs, and metadata
Implements a 'Reason-Retrieve-Refine' loop at two distinct stages: initially to seed planning with past workflows, and subsequently to inject targeted fixes based on execution feedback
Uses a 'disagreement gate' during feedback integration to ensure retrieved knowledge corrects rather than disrupts the agent's reasoning process

Architecture

The end-to-end workflow of AGENTKB, covering construction, evolution, and the two-stage inference process.

Evaluation Highlights

+18.7pp improvement on GAIA pass@3 (55.2% → 73.9%) for smolagents using AGENTKB
+17.0pp improvement on SWE-bench Lite success rate at 100 iterations (28.7% → 45.7%) for OpenHands using AGENTKB
On Humanity's Last Exam (Bio/Chem), OpenHands improves from 9.5% to 14.1% pass@3, outperforming specialized systems like Biomni

Breakthrough Assessment

9/10

First system to demonstrate effective zero-shot knowledge transfer across completely different agent architectures (e.g., reusing experiences collected with OpenHands in smolagents) with substantial gains.

⚙️ Technical Details

Problem Definition

Setting: General agentic problem solving across diverse domains (reasoning, coding, science) using heterogeneous agent frameworks

Inputs: Task description (natural language) and optional execution feedback

Outputs: Executable plan or refined solution adapted to the specific agent framework

Pipeline Flow

Experience Abstraction (Trace → Structured Experience)
Planning Stage: Task → Reason → Retrieve → Refine → Plan
Execution (by base agent)
Feedback Stage: Trace/Error → Reason → Retrieve → Refine (via Disagreement Gate) → Fix
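
A minimal sketch of the control flow above, with every component passed in as a plain callable. The function names and the lower-casing stand-in for the LLM-driven Reason step are assumptions made for illustration; the actual interfaces are not described in this summary.

```python
from typing import Callable, List, Tuple


def reason_retrieve_refine(
    context: str,
    retrieve: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
) -> str:
    """One pass of the loop: form a query, fetch experiences, adapt them."""
    query = context.strip().lower()  # placeholder for the LLM 'Reason' step
    experiences = retrieve(query)
    return refine(context, experiences)


def solve(
    task: str,
    retrieve: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    gate: Callable[[str, str], bool],
) -> str:
    """Planning stage, execution by the base agent, then an optional feedback stage."""
    plan = reason_retrieve_refine(task, retrieve, refine)   # planning stage
    succeeded, trace = execute(plan)                         # base agent runs the plan
    if succeeded:
        return plan
    fix = reason_retrieve_refine(trace, retrieve, refine)   # feedback stage
    # The disagreement gate decides whether the fix replaces the original plan.
    return fix if gate(plan, fix) else plan
```

Both stages share one loop and differ only in the context they start from: the task description for planning, the execution trace or error for feedback.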

System Modules

Experience Abstractor

Converts raw execution logs into structured experiences

Model or implementation: LLM-based abstraction (implied)
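
One way to picture the structured experience is as a small record holding constraints, action-reasoning pairs, and metadata. The sketch below is an assumption about shape only: the field names, the `TOOL_ALIASES` table, and the tool names in it are hypothetical, and a lookup table stands in for the LLM-based abstraction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Experience:
    """Framework-agnostic record of one solved (or failed) task."""
    task: str
    constraints: List[str]
    steps: List[Tuple[str, str]]  # (action, reasoning) pairs
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical mapping from framework-specific tool names to generic actions.
TOOL_ALIASES = {"fa_open_file": "read_file", "fb_cat": "read_file"}


def abstract_trace(
    task: str,
    raw_steps: Sequence[Tuple[str, str]],
    framework: str,
    constraints: Sequence[str] = (),
) -> Experience:
    """Replace framework-specific tool names with generic ones and wrap the result."""
    steps = [(TOOL_ALIASES.get(tool, tool), why) for tool, why in raw_steps]
    return Experience(
        task=task,
        constraints=list(constraints),
        steps=steps,
        metadata={"source_framework": framework},
    )
```

Two traces from different frameworks that perform the same generic action end up with identical `steps`, which is what makes them retrievable by either agent.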

Hybrid Retriever

Retrieves relevant experiences using both lexical and semantic similarity

Model or implementation: BM25 (lexical) + all-MiniLM-L6-v2 (semantic)
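
The fusion step can be sketched as a weighted sum of normalized scores. This assumes a linear combination controlled by a weight alpha and min-max normalization, neither of which is confirmed by this summary; the lexical and semantic scores are taken as precomputed inputs rather than produced by BM25 or an embedding model.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_top_k(ids, lexical, semantic, alpha=0.5, k=3):
    """Rank candidates by alpha * lexical + (1 - alpha) * semantic and keep the top k."""
    lex, sem = min_max(lexical), min_max(semantic)
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    ranked = sorted(zip(ids, fused), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:k]]


ids = ["a", "b", "c", "d"]
lexical = [8, 0, 4, 2]            # e.g. keyword-match scores
semantic = [0.2, 0.6, 1.0, 0.0]   # e.g. embedding similarities
top = hybrid_top_k(ids, lexical, semantic)
```

Normalizing before fusing keeps one score scale from dominating the other; setting alpha to 1.0 or 0.0 recovers purely lexical or purely semantic ranking.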

Refiner

Adapts retrieved experiences to the current agent's specific tools and APIs

Model or implementation: Base LLM (e.g., GPT-4.1, Claude-3.7)

Disagreement Gate

Filters feedback refinements to prevent unstable updates

Model or implementation: Cosine similarity check on embeddings
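
A minimal sketch of a cosine-similarity gate over plan embeddings. The direction of the comparison is an assumption: here a refinement is accepted when it stays at least beta-similar to the original plan, and the inequality would need flipping if the gate is meant to admit only refinements that diverge from it. The vectors are toy stand-ins for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_gate(plan_vec, refined_vec, beta=0.8):
    """Accept the refined plan only if its similarity to the original meets beta.

    Assumed direction: similarity >= beta means 'accept'.
    """
    return cosine(plan_vec, refined_vec) >= beta
```

With beta at 0.8, a refinement whose embedding has cosine similarity of about 0.95 to the plan passes, while one at about 0.71 is rejected.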

Novel Architectural Elements

Disagreement gate mechanism for filtering feedback-driven plan updates
Dual-stage Reason-Retrieve-Refine loop applied to both planning (task-based) and feedback (trace-based) phases
Framework-agnostic experience schema decoupling logic from specific agent implementations

Modeling

Base Model: Evaluated with GPT-4o, GPT-4.1, Claude-3.7, Qwen-3 32B, DeepSeek-R1, o3-mini

Training Method: In-context learning / Retrieval-Augmented Generation only (no fine-tuning of base models reported)

Adaptation: None (Plug-and-play memory layer)

Training Data:

Bootstrapped with 80 human seed trajectories
Expanded via automatic rollouts on BrowseComp, HopRAG, HLE, WebWalkerQA, RepoClassBench, SWE-Gym-Raw, RepoEval
Total ~9k workflow summaries and 7k execution snippets

Key Hyperparameters:

retrieval_top_k: 3
hybrid_fusion_alpha: 0.5
disagreement_gate_beta: 0.8
deduplication_threshold_tau: 0.8
temperature: 1.0
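
How a deduplication threshold of this kind might be applied is sketched below. It assumes tau is a cosine-similarity cutoff on experience embeddings and that a new entry is dropped when it is at least tau-similar to one already kept; the summary states only the parameter name and value, so both points are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(vectors, tau=0.8):
    """Greedy pass: keep an entry only if it is below tau similarity to every kept entry.

    Returns the indices of the retained entries, in input order.
    """
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < tau for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second is a near-duplicate of the first, the third is distinct.
kept_indices = deduplicate([[1, 0], [3, 1], [0, 1]])
```

The greedy pass is order-dependent, so whichever of two near-duplicates arrives first is the one retained.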

Compute: Not reported in the paper

Comparison to Prior Work

vs. A-Mem: Enables cross-framework transfer (e.g., OpenHands to smolagents) vs. single-framework only
vs. MemoryBank: Uses hybrid retrieval and disagreement gating vs. graph-based storage [not cited in paper]
vs. ReadAgent: Focuses on execution traces and workflows vs. reading comprehension/summary memory

Limitations

Software engineering knowledge does not generalize well to reasoning tasks (asymmetric transferability)
Quality of abstraction is a bottleneck for complex tasks; simply increasing store size yields diminishing returns for hard reasoning
Perception errors (image/video understanding) remain constrained by the underlying tool capabilities, though reduced by better planning

Reproducibility

Code: https://github.com/OPPO-PersonalAI/Agent-KB

The paper details the datasets used for construction (GAIA, HLE, etc.) and the bootstrapping process using 80 seed trajectories.

📊 Experiments & Results

Evaluation Setup

Multi-pass evaluation (pass@1, pass@2 with feedback, pass@3 with expanded retrieval) on reasoning and coding benchmarks

Benchmarks:

GAIA: General AI Assistants (Levels 1-3)
SWE-bench Lite: Software Engineering (GitHub issues)
Humanity's Last Exam (HLE): Multi-modal scientific reasoning (Bio/Chem subset)
GPQA: Graduate-level science QA

Metrics:

pass@1
pass@2
pass@3
Success Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
GAIA Benchmark results showing improvements across different models and difficulty levels.
GAIA	pass@3	55.2	73.9	+18.7
GAIA	pass@3	58.8	75.2	+16.4
GAIA Level 2	pass@3	53.5	73.3	+19.8
GAIA	pass@3	43.6	63.6	+20.0
SWE-bench Lite results demonstrating efficacy in software engineering tasks.
SWE-bench Lite	Success Rate (100 iter)	28.7	45.7	+17.0
SWE-bench Lite	Success Rate (50 iter)	30.0	51.0	+21.0
Scientific reasoning benchmarks (HLE and GPQA).
HLE (Bio/Chem)	pass@3	9.5	14.1	+4.6
GPQA	pass@3	62.6	72.7	+10.1
Ablation studies validating system components.
GAIA	Average Accuracy	61.21	55.15	-6.06

Experiment Figures

Ablation analysis plots for disagreement gate threshold, retrieval strategy, and knowledge base size.

Main Takeaways

Consistent gains across all tested frameworks (smolagents, OWL, SWE-Agent, OpenHands) and models (GPT, Claude, Qwen, DeepSeek), suggesting broad applicability.
Hybrid retrieval outperforms single-modality retrieval, especially at top-k=3, balancing precise tool matches with conceptual similarity.
Asymmetric transfer: Reasoning experiences transfer partially to coding tasks, but coding experiences generalize poorly to reasoning tasks.
Automatically generated experiences perform on par with or better than human-curated ones, particularly on harder tasks (GAIA Level 3, SWE-bench).

📚 Prerequisite Knowledge

Prerequisites

Understanding of AI agent frameworks (OpenHands, smolagents)
Retrieval-Augmented Generation (RAG) concepts
Basic knowledge of vector embeddings and similarity search

Key Terms

AGENTKB: A universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining

disagreement gate: A mechanism that selectively integrates feedback only when the refined plan significantly differs from the original plan (based on embedding similarity), ensuring stability

pass@k: An evaluation metric measuring the percentage of problems solved correctly given k attempts
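
Under this definition the metric can be computed directly from per-task attempt outcomes, as in the sketch below. It follows the sequential-attempt reading (a task counts as solved if any of its first k attempts succeeds) rather than the sampling-based estimator used in some code-generation work; the function name and data layout are illustrative assumptions.

```python
def pass_at_k(outcomes, k):
    """Percentage of tasks with at least one success among the first k attempts.

    outcomes: list of per-task lists of booleans, in attempt order.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return 100.0 * solved / len(outcomes)


# Four toy tasks: solved on attempt 1, 2, 3, and never.
outcomes = [
    [True],
    [False, True],
    [False, False, True],
    [False, False, False],
]
scores = [pass_at_k(outcomes, k) for k in (1, 2, 3)]
```

Because each additional attempt can only add solved tasks, pass@k never decreases as k grows.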

hybrid retrieval: A search strategy combining lexical (keyword-based, e.g., BM25) and semantic (embedding-based) similarity scores

heterogeneous agent frameworks: Different software architectures for building AI agents (e.g., smolagents, OpenHands) that typically have incompatible internal representations

smolagents: A lightweight library for building agentic systems

OpenHands: An open-source platform for software development agents

GAIA: A benchmark for General AI Assistants covering reasoning and tool use

SWE-bench: A benchmark for evaluating large language models on software engineering tasks via GitHub issues

HLE: Humanity's Last Exam, a difficult multi-modal benchmark for reasoning

GPQA: A challenging dataset of graduate-level questions in biology, physics, and chemistry