HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research

📝 Paper Summary

Self-evolving, Agentic reasoning, Agentic healthcare data analysis

HealthFlow is an autonomous healthcare agent that improves its high-level research strategies over time by distilling successful task executions into a structured experience memory.

Core Problem

Current AI agents rely on static, hard-coded strategic frameworks for task decomposition, preventing them from learning how to orchestrate complex healthcare workflows or adapt plans based on previous failures.

Why it matters:

Healthcare research involves open-ended problems and noisy data where rigid, predefined strategies often fail to adapt to intermediate findings
Existing agents optimize tool usage (component-level) but cannot refine their overarching management policy, limiting autonomy in high-stakes domains
The lack of meta-level learning means agents repeat strategic errors rather than accumulating procedural wisdom like human researchers

Concrete Example: In a data visualization task involving blood pressure, a standard agent creates a plot immediately, ignoring outliers that distort the scale. HealthFlow, recalling a 'warning' experience from a prior task, proactively inserts a data filtering step to remove unrealistic values before plotting, ensuring interpretability.
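
The filtering step in the example above can be sketched as follows. This is a minimal illustration only: the function name and the 40 to 300 mmHg plausibility bounds are assumptions made for the sketch, not values reported for HealthFlow.

```python
from typing import Iterable, List

# Assumed plausibility bounds for systolic blood pressure in mmHg.
# These are illustrative values, not taken from the paper.
SBP_MIN, SBP_MAX = 40.0, 300.0


def filter_plausible(values: Iterable[float],
                     low: float = SBP_MIN,
                     high: float = SBP_MAX) -> List[float]:
    """Keep only readings inside [low, high]."""
    kept = []
    for v in values:
        # NaN fails the chained comparison, so missing readings are dropped too.
        if low <= v <= high:
            kept.append(v)
    return kept


readings = [118.0, 125.0, 0.0, 9999.0, 141.0, float("nan")]
clean = filter_plausible(readings)
print(clean)  # [118.0, 125.0, 141.0]
```

Plotting `clean` instead of `readings` keeps the axis scale driven by realistic values rather than by entries such as 9999.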

Key Novelty

Meta-Level Evolution via Structured Experience Memory

Treats every completed task as a learning opportunity by reflecting on execution traces to synthesize durable 'experiences' (heuristics, code snippets, warnings)
Updates the agent's high-level planning policy by retrieving these structured experiences for new tasks, allowing it to start with better strategies rather than just better tools
Decouples strategic evolution (learning *how* to plan) from static execution, moving beyond simple tool-library expansion found in prior work

Architecture

The self-evolving architecture of HealthFlow, detailing the interaction between the four agents (Meta, Executor, Evaluator, Reflector) and the Experience Memory.

Evaluation Highlights

+15.99pp success rate on MedAgentBoard (81.89% vs 65.90% for next best) when using ToolUniverse
Achieves 3.98/5.0 on the new EHRFlowBench, significantly outperforming general agent AFlow (3.31) and biomedical agent STELLA (2.39)
Dominant win rate in head-to-head comparisons on EHRFlowBench (e.g., >90% win rate against Biomni and STELLA)

Breakthrough Assessment

8/10

A significant step for self-evolving agents: it formalizes meta-level strategic learning rather than just tool tuning. Empirical gains on complex tasks are strong, though they depend on large external LLM backbones (DeepSeek-V3/R1, Claude Code).

⚙️ Technical Details

Problem Definition

Setting: Autonomous execution of healthcare research tasks T to produce solution S via iterative planning and execution

Inputs: Research task description T (e.g., 'Develop a model to predict patient outcomes using MIMIC-IV')

Outputs: Solution artifacts S (code, reports, visualizations, trained models) and updated experience memory M

Pipeline Flow

Experience Retrieval (fetch relevant past insights)
Meta Agent (generate strategic plan)
Executor Agent (execute plan via code/tools)
Evaluator Agent (score & feedback)
Reflector Agent (synthesize new experiences if successful)
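
The five stages above can be sketched as one control loop. This is a hedged sketch, not the authors' implementation: the `Memory` class, the `run_task` function, and the callable signatures are hypothetical, and treating `max_retries` as the total number of attempts is an assumption. The defaults `k=5`, `theta=6.0`, and `max_retries=3` mirror the hyperparameters listed later in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical stand-in for the experience memory M."""
    experiences: List[str] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 5) -> List[str]:
        # Placeholder retrieval: return the k most recent experiences.
        return self.experiences[-k:]


def run_task(task: str,
             memory: Memory,
             plan: Callable[[str, List[str], str], str],
             execute: Callable[[str], str],
             evaluate: Callable[[str, str], Tuple[float, str]],
             reflect: Callable[[str, str, str], List[str]],
             theta: float = 6.0,
             max_retries: int = 3) -> Tuple[str, float]:
    context = memory.retrieve(task)                  # 1. experience retrieval
    feedback, result, score = "", "", 0.0
    for _ in range(max_retries):                     # assumed: total attempts
        strategy = plan(task, context, feedback)     # 2. meta agent
        result = execute(strategy)                   # 3. executor agent
        score, feedback = evaluate(task, result)     # 4. evaluator agent
        if score >= theta:
            # 5. reflector agent runs only after a successful attempt
            memory.experiences.extend(reflect(task, strategy, result))
            break
    return result, score


# Toy stubs standing in for the LLM-backed agents.
calls = {"n": 0}

def toy_plan(task, context, feedback):
    return f"plan({task}; hints={len(context)}; feedback={feedback})"

def toy_execute(strategy):
    return strategy.upper()

def toy_evaluate(task, result):
    calls["n"] += 1
    return (7.0, "ok") if calls["n"] >= 2 else (4.0, "add validation")

def toy_reflect(task, strategy, result):
    return [f"heuristic: validate inputs for '{task}'"]

memory = Memory()
result, score = run_task("plot blood pressure", memory,
                         toy_plan, toy_execute, toy_evaluate, toy_reflect)
print(score, len(memory.experiences))  # 7.0 1
```

The first attempt scores below the threshold, so its feedback is passed to the next planning call; the second attempt passes and triggers reflection, which is the only place the memory grows.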

System Modules

Experience Retrieval (Planning & Memory)

Augment context with relevant past lessons

Model or implementation: LLM-based re-ranker (DeepSeek-V3)
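
A minimal sketch of the retrieval step, under stated assumptions: bag-of-words cosine similarity stands in for whatever similarity measure the system actually uses, the LLM re-ranking pass is omitted, and the function names are hypothetical. Only the default `k=5` comes from the hyperparameters in this summary.

```python
import math
from collections import Counter
from typing import List, Tuple


def _vec(text: str) -> Counter:
    # Toy representation: lowercase token counts (assumed, not the real encoder).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(task: str, experiences: List[str], k: int = 5) -> List[Tuple[float, str]]:
    """Return the k experiences most similar to the task, best first."""
    query = _vec(task)
    scored = [(cosine(query, _vec(e)), e) for e in experiences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


experiences = [
    "filter unrealistic blood pressure values before plotting",
    "check missing values before training a model",
    "use stratified splits for imbalanced outcomes",
]
top = retrieve("plot blood pressure over time", experiences, k=2)
print(top[0][1])  # filter unrealistic blood pressure values before plotting
```

Exact token matching does not link "plot" to "plotting", which is a toy analog of the phrasing sensitivity listed under Limitations.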

Meta Agent (Planning & Memory)

Cognitive orchestrator; generates high-level strategic plans

Model or implementation: DeepSeek-V3 (or DeepSeek-R1 in variants)

Executor Agent

Grounds plans into actions via CodeAct and ToolUniverse

Model or implementation: DeepSeek-V3 (or Claude Code in variants)

Evaluator Agent (Evaluation & Reflection)

Short-term corrector; critiques execution against requirements

Model or implementation: DeepSeek-V3

Reflector Agent (Evaluation & Reflection)

Long-term knowledge synthesizer; distills durable insights

Model or implementation: DeepSeek-V3

Novel Architectural Elements

Meta-level evolution loop: The feedback loop does not just correct the *current* task but updates a persistent 'Experience Memory' that modifies the *planning policy* for all future tasks
Structured Experience Memory: Stores insights as typed records (heuristic, code_snippet, workflow_pattern, warning) rather than raw text logs, enabling precise retrieval
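
The typed records described above could be modeled roughly as follows. The four type names come from this summary; the `Experience` fields, the `source_task` provenance field, and the `by_type` helper are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ExperienceType(str, Enum):
    HEURISTIC = "heuristic"
    CODE_SNIPPET = "code_snippet"
    WORKFLOW_PATTERN = "workflow_pattern"
    WARNING = "warning"


@dataclass(frozen=True)
class Experience:
    type: ExperienceType
    content: str
    source_task: str  # hypothetical provenance field, not from the paper


def by_type(memory: List[Experience]) -> Dict[ExperienceType, List[Experience]]:
    """Group records by type so a planner can request, for example, only warnings."""
    grouped: Dict[ExperienceType, List[Experience]] = {t: [] for t in ExperienceType}
    for record in memory:
        grouped[record.type].append(record)
    return grouped


memory = [
    Experience(ExperienceType.WARNING,
               "Unrealistic blood pressure values distort plot scales.",
               "bp-visualization"),
    Experience(ExperienceType.HEURISTIC,
               "Always check for missing values before training.",
               "outcome-prediction"),
]
grouped = by_type(memory)
print(len(grouped[ExperienceType.WARNING]))  # 1
```

Typing the records, rather than storing raw logs, lets retrieval filter on the kind of lesson as well as on its text.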

Modeling

Base Model: DeepSeek-V3 (main experiments) and DeepSeek-R1 (variants)

Training Method: In-context learning and memory accumulation (no gradient updates to model weights)

Adaptation: None (uses off-the-shelf LLMs via API)

Trainable Parameters: 0 (System evolves via external memory M)

Training Data:

Bootstrapping Phase: 10 tasks from EHRFlowBench and 10 from CureBench used to pre-populate memory
Evaluator uses ground truth reference answers during this phase to ensure high-quality initial experiences

Key Hyperparameters:

retrieval_k: 5
success_threshold_theta: 6.0 (out of 10)
max_retries: 3

Compute: Experiments run on Mac Studio M3 Ultra (512GB RAM). Inference only.

Comparison to Prior Work

vs. Biomni: HealthFlow evolves its planning strategy via reflection, whereas Biomni uses a static loop
vs. STELLA: HealthFlow evolves high-level orchestration/strategy, whereas STELLA optimizes component-level templates and tools
vs. AFlow: HealthFlow uses memory-based experience retrieval for efficiency, whereas AFlow relies on computationally intensive search (MCTS) for workflow generation
vs. TextGrad [not cited in paper]: TextGrad backpropagates feedback to optimize prompts, whereas HealthFlow accumulates feedback into a retrieval-based memory system
vs. Voyager [not cited in paper]: Voyager similarly builds a code-based skill library (in Minecraft), but HealthFlow adds 'heuristics' and 'warnings' specifically for the stochastic nature of clinical data analysis

Limitations

Performance is tethered to the underlying LLM's capabilities (biases in LLM propagate to plans)
Risk of synthesizing flawed heuristics from idiosyncratic successes, potentially degrading future performance
Experience retrieval relies on semantic similarity, which may miss relevant strategies if tasks are phrased differently
Relies on large external LLM backbones for best performance (DeepSeek-V3/R1 are open-weight but used via API here; Claude Code is proprietary)

Reproducibility

Code: https://github.com/githubCode/daabaseDataset

Available: The EHRFlowBench dataset construction methodology is described in detail.

Missing: The code repository link is a placeholder, and prompt templates are not explicitly provided in the main text.

Model dependencies: The system relies heavily on DeepSeek-V3/R1 and on Claude Code for execution; of these, Claude Code is closed-source.

📊 Experiments & Results

Evaluation Setup

End-to-end autonomous execution of healthcare research tasks

Benchmarks:

EHRFlowBench (Complex, open-ended health data analysis tasks derived from papers) [New]
MedAgentBoard (Structured EHR data analysis (MIMIC-IV, TJH))
MedAgentsBench (Medical knowledge reasoning (QA))
Humanity’s Last Exam (HLE) (Expert-level medical reasoning)
CureBench (Tool-augmented clinical reasoning)

Metrics:

Success Rate (%)
LLM-as-a-judge Score (1-5 scale)
Accuracy (%)
Statistical methodology: Test-set samples are bootstrapped 100 times to report the mean and standard deviation of each metric
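
The bootstrap procedure named in the metrics above can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and reporting the mean and standard deviation of the resampled means is one common reading of the procedure, not a detail confirmed by the source. Only the count of 100 resamples comes from the summary.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean_std(scores: List[float],
                       n_resamples: int = 100,
                       seed: int = 0) -> Tuple[float, float]:
    """Resample with replacement and summarize the distribution of sample means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(scores, k=len(scores))  # same size as the test set
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)


# Toy per-task success indicators (1 = success), not data from the paper.
outcomes = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
mean, std = bootstrap_mean_std(outcomes)
print(f"success rate: {mean:.2%} ± {std:.2%}")
```

The same function applies unchanged to judge scores on the 1-5 scale, since it only assumes a list of per-task numbers.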

Key Results

HealthFlow demonstrates superior performance on complex data analysis benchmarks (EHRFlowBench and MedAgentBoard) compared to general and domain-specific baselines.

Benchmark	Metric	Baseline	This Paper	Δ
EHRFlowBench	LLM Score (1-5)	3.31 (AFlow)	3.98	+0.67
EHRFlowBench	LLM Score (1-5)	2.39 (STELLA)	3.98	+1.59
MedAgentBoard	Success Rate (%)	45.61	81.89	+36.28

Ablation studies confirm the critical role of feedback and experience memory in the system's performance.

Benchmark	Metric	Baseline	This Paper	Δ
EHRFlowBench	LLM Score (1-5)	2.78	3.82	+1.04
EHRFlowBench	LLM Score (1-5)	3.63	3.82	+0.19

Performance on knowledge-intensive QA benchmarks shows competitive but less dramatic gains, as these tasks require less complex planning.

Benchmark	Metric	Baseline	This Paper	Δ
MedAgentsBench	Accuracy (%)	30.30	30.68	+0.38

Experiment Figures

Head-to-head win rates of HealthFlow vs. baselines (AFlow, Alita, Biomni, STELLA) on EHRFlowBench and MedAgentBoard.

Distribution of synthesized and retrieved experience types (heuristic, code_snippet, workflow_pattern, warning) across benchmarks.

Main Takeaways

Meta-level evolution is most effective for complex, multi-step workflows (EHR analysis) rather than static QA tasks
The integration of a rich tool ecosystem (ToolUniverse) combined with strategic planning yields the highest performance (+15.99pp success rate on MedAgentBoard over the next-best system)
Dynamic adaptation observed: Agent prioritizes 'code snippets' for novel open-ended tasks but relies on 'heuristics' for structured, routine pipelines
Robustness: HealthFlow maintains high success rates even when baselines fail due to lack of data validation (e.g., handling outliers in visualization)

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (planning, execution, reflection)
Familiarity with RAG (Retrieval-Augmented Generation) concepts
Basic knowledge of clinical data analysis (EHRs, predictive modeling)

Key Terms

Meta-Level Learning: Learning how to learn or strategize; here, it means improving the high-level planning process itself rather than just optimizing specific low-level tool calls

EHR: Electronic Health Record—digital version of a patient's paper chart, containing medical history, diagnoses, medications, etc.

MCP: Model Context Protocol—a standard for connecting AI assistants to systems and tools (e.g., databases, local files)

Reflector Agent: A specific agent role responsible for analyzing past execution traces to distill abstract lessons (experiences) for future use

CodeAct: A framework where agents execute actions by writing and running code (usually Python) rather than calling rigid APIs, allowing for more flexible problem solving

MIMIC-IV: Medical Information Mart for Intensive Care IV—a large, freely available database of de-identified health data widely used for critical care research

SFT: Supervised Fine-Tuning—training a model on labeled examples

Cold-Start Problem: The difficulty of an adaptive system performing well before it has accumulated enough data or experience; addressed here by pre-populating memory with training tasks

Heuristic: A rule-of-thumb or strategic guideline synthesized from experience (e.g., 'Always check for missing values before training')

Trajectory: The sequence of actions, observations, and thoughts generated by an agent during the execution of a task