A co-evolving agentic AI system for medical imaging analysis

📝 Paper Summary

Medical Agentic AI Tool-use and Workflow Planning

TissueLab is an agentic AI system that orchestrates specialized medical tools to build executable imaging workflows, using clinician feedback and active learning to refine lightweight models in real time.

Core Problem

Medical image analysis requires highly specialized, manually constructed pipelines that VLMs cannot generate reliably due to hallucination, lack of specific quantification tools, and inability to adapt to new disease morphologies without retraining.

Why it matters:

Clinicians rely on precise quantifications (e.g., tumor-to-duct ratio) for staging and treatment, which general VLMs fail to provide accurately
Existing agentic systems rely on fixed toolboxes that become obsolete and lack mechanisms for real-time expert refinement or preference retention
The gap between AI research and clinical adoption is widened by 'black box' models that cannot be inspected, corrected, or grounded in authoritative guidelines

Concrete Example: When asked to calculate tumor invasion depth, GPT-4o-vision produces a hallucinated estimate that correlates poorly with expert measurements (Pearson ρ=0.376) because it cannot perform precise geometric measurement. TissueLab constructs a workflow to segment tissue, extract contours, and compute the exact distance, achieving expert-level correlation (Pearson ρ=0.843).
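The segment, extract-contours, measure sequence in this example can be illustrated with a toy sketch. Everything here is an assumption made for illustration: masks are small nested lists, the surface is given as a list of points, depth is taken as the largest distance from any tumor contour pixel to its nearest surface point, and `mm_per_pixel` is a hypothetical pixel spacing. The summary does not state how TissueLab defines or computes the measurement.

```python
import math

def contour_pixels(mask):
    """Foreground pixels that touch background or the image border (4-neighborhood)."""
    h, w = len(mask), len(mask[0])
    contour = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < h and 0 <= cc < w)
                if outside or not mask[rr][cc]:
                    contour.append((r, c))
                    break
    return contour

def invasion_depth_mm(surface_points, tumor_mask, mm_per_pixel):
    """Largest distance from a tumor contour pixel to its nearest surface point."""
    tumor_contour = contour_pixels(tumor_mask)
    if not tumor_contour or not surface_points:
        raise ValueError("need a non-empty tumor mask and surface")
    deepest = max(
        min(math.dist(t, s) for s in surface_points) for t in tumor_contour
    )
    return deepest * mm_per_pixel

# Toy 5x4 image: the surface runs along row 0, the tumor fills rows 1-3, cols 1-2.
tumor_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
surface_points = [(0, c) for c in range(4)]
depth = invasion_depth_mm(surface_points, tumor_mask, mm_per_pixel=0.5)
```

The point of the sketch is that the number comes from explicit geometry on masks that can be inspected, rather than from a text estimate.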

Key Novelty

Co-evolving Agentic Ecosystem (TissueLab)

Modules are 'co-evolving': clinician feedback on intermediate results (e.g., segmentation errors) is immediately converted into training data for lightweight model fine-tuning via active learning
Combines LLM orchestration with a 'Factory Method' architecture where diverse specialized models are wrapped as standardized plugins, enabling dynamic workflow planning
Integrates the Model Context Protocol (MCP) to retrieve live, authoritative clinical guidelines (e.g., AJCC) to ground diagnostic reasoning in external standards rather than model weights

Architecture

The TissueLab ecosystem architecture, illustrating the flow from user query to tool selection, workflow generation, distributed inference, and feedback loops.

Evaluation Highlights

Achieved 99.8% accuracy in prostate tumor-to-duct ratio measurement after 2 minutes of active learning feedback, compared to <12% for most baselines
Attained 0.843 Pearson correlation with expert annotations for tumor invasion depth, significantly outperforming GPT-4o-agent (0.376)
Raised mean AUC from 0.6959 to 0.8284 on NIH Chest X-ray classification by leveraging candidate pooling and clinician preference updates

Breakthrough Assessment

9/10

A major step forward in medical agents, moving beyond simple VLM prompting to a system that builds executable code workflows, integrates real-time active learning, and grounds decisions in retrieval-augmented guidelines.

⚙️ Technical Details

Problem Definition

Setting: Automated medical image analysis via natural language queries, requiring multi-step workflow generation, execution, and expert-guided refinement

Inputs: Natural language query q, Medical image I (2D/3D/4D), Optional clinician feedback

Outputs: Executable workflow W, Quantitative analysis results R, Natural language summary S, Visualizations V

Pipeline Flow

Entrance Agent (Receives query)
Workflow Agent (Selects tools & plans DAG)
Execution Engine (Runs tools with parallelization)
Code Analysis Agent (Computes metrics on tool outputs)
Summary Agent (Generates final answer)
Feedback Loop (Optional active learning update)
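The planning and parallel execution steps above can be sketched with Python's standard `graphlib` module. This is a minimal illustration, not TissueLab's scheduler: the step names and their dependencies are hypothetical, and `plan_stages` is an illustrative helper that groups steps into stages whose members could run in parallel.

```python
from graphlib import TopologicalSorter

def plan_stages(dependencies):
    """Group workflow steps into stages; steps within a stage do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose inputs are complete
        stages.append(ready)
        sorter.done(*ready)  # mark the stage finished so dependents become ready
    return stages

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "tissue_segmentation": set(),
    "nuclei_segmentation": set(),
    "cell_classification": {"nuclei_segmentation"},
    "code_analysis": {"tissue_segmentation", "cell_classification"},
    "summary": {"code_analysis"},
}
stages = plan_stages(workflow)
```

Here the two segmentation steps land in the same stage because neither depends on the other, while the remaining steps run one after another.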

System Modules

Workflow Agent

Selects appropriate AI tools from the factory and constructs a dependency graph

Model or implementation: LLM (Specific model not stated, likely GPT-4 based on context)

Tool Factories

Standardized wrappers for domain-specific models (Pathology, Radiology, Omics)

Model or implementation: Various domain-specific models (e.g., TotalSegmentator, NuClass)

Memory Layer

Stores intermediate results and clinician feedback for transparency and training

Model or implementation: HDF5 / Local Data Container

Code Analysis Agent

Generates and executes Python code to derive metrics from tool outputs

Model or implementation: LLM code generator

Novel Architectural Elements

Co-evolving feedback loop: Clinician corrections on visual outputs directly trigger lightweight fine-tuning of downstream modules via active learning
Factory-based abstraction: Encapsulates diverse medical AI models into unified interfaces, allowing the agent to swap tools without code changes
Editable Memory Layer: Persists all intermediate states (masks, arrays) to allow both user visualization/correction and agent reuse
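The factory-based abstraction can be sketched as a small plugin registry. This is an assumed design for illustration only: the `Tool` interface, the `register` decorator, `create_tool`, and both toy segmenters are hypothetical names, not TissueLab's API.

```python
from abc import ABC, abstractmethod

class Tool(ABC):
    """Uniform interface that every wrapped model exposes to the agent."""

    @abstractmethod
    def run(self, image):
        ...

_REGISTRY = {}

def register(task, name):
    """Class decorator that files a tool under a (task, name) key."""
    def decorator(cls):
        _REGISTRY[(task, name)] = cls
        return cls
    return decorator

def create_tool(task, name):
    """Factory: look up and instantiate a tool without the caller knowing its class."""
    try:
        return _REGISTRY[(task, name)]()
    except KeyError:
        raise ValueError(f"no tool registered for {task}/{name}") from None

@register("segmentation", "threshold")
class ThresholdSegmenter(Tool):
    def run(self, image):
        # Toy stand-in for a real model: bright pixels become foreground.
        return [[1 if px > 128 else 0 for px in row] for row in image]

@register("segmentation", "inverted_threshold")
class InvertedThresholdSegmenter(Tool):
    def run(self, image):
        # Toy stand-in for a second model: dark pixels become foreground.
        return [[1 if px <= 128 else 0 for px in row] for row in image]

image = [[0, 200], [255, 10]]
mask_a = create_tool("segmentation", "threshold").run(image)
mask_b = create_tool("segmentation", "inverted_threshold").run(image)
```

Because the caller only passes a task and a name, swapping one segmenter for another changes a string rather than the calling code.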

Modeling

Base Model: Foundation models vary by task (e.g., NuClass for pathology, TotalSegmentator for radiology, proprietary LLM for orchestration)

Training Method: Active Learning for lightweight classifier adaptation

Adaptation: Lightweight fine-tuning of task heads (e.g., cell classifiers) based on user feedback

Training Data:

User-provided annotations during the 'co-evolving' feedback phase

Compute: Adaptation occurs in 'minutes' (e.g., 2 minutes for prostate task, 10-30 minutes for colon task). Inference time: 10 seconds per iteration for feedback refinement.
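The active-learning adaptation described in this section can be illustrated with a deliberately small sketch. It swaps in a nearest-centroid classifier with margin-based uncertainty sampling; the summary does not specify TissueLab's classifier heads or query strategy, and the `oracle` function stands in for the clinician. All names and data below are hypothetical.

```python
import math

def centroids(labeled):
    """Mean feature vector per class from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(cents, x):
    """Label of the closest class centroid."""
    return min(cents, key=lambda y: math.dist(cents[y], x))

def margin(cents, x):
    """Gap between the two nearest centroids; a small gap means an uncertain sample."""
    d = sorted(math.dist(c, x) for c in cents.values())
    return d[1] - d[0]

def active_learning(labeled, pool, oracle, rounds):
    """Repeatedly ask the oracle to label the most uncertain pool sample, then refit."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(rounds, len(pool))):
        cents = centroids(labeled)
        query = min(pool, key=lambda x: margin(cents, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return centroids(labeled)

# Hypothetical 2-D cell features; the seed set needs one example per class.
seed = [((0.0, 0.0), "stroma"), ((4.0, 4.0), "tumor")]
pool = [(1.0, 1.0), (1.9, 2.0), (3.0, 3.5), (0.5, 0.2)]
oracle = lambda x: "tumor" if x[0] + x[1] > 3.5 else "stroma"  # simulated clinician
model = active_learning(seed, pool, oracle, rounds=2)
```

Only the samples nearest the decision boundary are sent for labeling, which is why a few corrections can shift a lightweight classifier quickly.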

Comparison to Prior Work

vs. VLMs (Quilt-LLaVA, MedGemma): TissueLab generates executable code and uses specialized tools rather than relying on end-to-end pixel-to-text generation
vs. Standard Agents (GPT-4o-agent): TissueLab incorporates a 'co-evolving' loop where user feedback fine-tunes tools in real-time, and uses MCP for guideline retrieval
vs. M3D: TissueLab handles 3D/4D data by decomposing tasks into segmentation and geometric analysis rather than end-to-end processing
vs. Med-Flamingo [not cited in paper]: Med-Flamingo uses few-shot prompting for adaptation; TissueLab uses active learning to update model weights

Limitations

Performance is bottlenecked by the quality of the underlying tools (e.g., segmentation models), particularly when a tool fails completely
Requires clinician-in-the-loop for the 'co-evolving' benefits, which consumes expert time
The orchestrator LLM's prompts and costs are not fully described in the text
Limited to the tools currently integrated into the factory ecosystem (though extensible)

Reproducibility

Code: https://tissuelab.org

Code and ecosystem released at tissuelab.org. System available for Windows, macOS, and Linux. Specific hyperparameters for the underlying foundation models (e.g., NuClass) and the orchestrator LLM (e.g., temperature) are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Comparative analysis across Pathology (2D), Radiology (3D/4D), and Spatial Omics tasks using public and competition datasets.

Benchmarks:

LNCO2 (Pathology): tumor invasion depth and lymph node metastasis count
NIH Chest X-ray: thoracic disease classification
UNIFESP Chest CT: fatty liver diagnosis (3D)
PhysioNet ICH: intracranial hemorrhage detection (3D)
Visium HD (Colon/Prostate): cell quantification (spatial omics)

Metrics:

Accuracy
F1 score (weighted)
Mean Absolute Error (MAE)
Pearson Correlation (ρ)
AUC
Task Success Rate
Statistical methodology: Not explicitly reported in the paper
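Of the metrics listed above, mean AUC is the least obvious to compute by hand. The sketch below uses the rank-based (Mann-Whitney) formulation: AUC is the fraction of positive/negative pairs in which the positive case scores higher, with ties counted as half. The function names and toy scores are illustrative only and are not taken from the paper.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def mean_auc(per_class):
    """Unweighted average of per-class AUCs, given {class: (scores, labels)}."""
    return sum(auc(s, y) for s, y in per_class.values()) / len(per_class)

# Made-up scores for two hypothetical findings.
per_class = {
    "finding_a": ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    "finding_b": ([0.7, 0.2, 0.6, 0.1], [1, 0, 1, 0]),
}
result = mean_auc(per_class)
```

For multi-label chest X-ray classification, averaging per-class AUCs this way gives each finding equal weight regardless of how often it occurs.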

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Pathology: Tumor Invasion Depth (DoI) prediction on LNCO2 dataset. TLAgent significantly outperforms VLMs in correlation with expert ground truth.
LNCO2 (DoI)	Pearson Correlation (ρ)	0.376	0.843	+0.467
LNCO2 (DoI)	MAE (mm)	78.183	2.047	-76.136
Radiology: 3D/4D analysis tasks (Fatty Liver, ICH, Cardiac MRI). TLAgent succeeds where standard VLMs fail due to dimensionality.
UNIFESP Chest CT (Fatty Liver)	F1 score	0.788	0.870	+0.082
PhysioNet ICH (Hemorrhage)	Accuracy	0.573	0.787	+0.214
Co-evolution: Adaptive learning from clinician feedback on spatial omics and chest X-ray tasks.
Visium HD (Prostate)	Accuracy	0.039	0.998	+0.959
NIH Chest X-ray	Mean AUC	0.6959	0.8284	+0.1325
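The depth-of-invasion rows in the table report both Pearson correlation and MAE. A minimal sketch of the two metrics shows why both are useful: predictions can track expert values perfectly in trend and still be offset in absolute terms. The numbers below are made up for illustration and are not from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mae(pred, truth):
    """Mean absolute error, in the same unit as the inputs (e.g., mm)."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

# Made-up depths in mm: predictions follow the expert values with a constant offset.
expert = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
rho = pearson(predicted, expert)
err = mae(predicted, expert)
```

In this toy case the correlation is perfect while the error stays at 0.5 mm, so a high ρ alone does not guarantee accurate absolute measurements.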

Experiment Figures

Comparison of Tumor Invasion Depth (DoI) prediction on Colon Cancer WSI between TLAgent and baselines (GPT-4o, MedGemma, etc.).

Performance curves for the 'co-evolving' capability on cell classification tasks (Colon and Prostate).

Main Takeaways

Orchestration beats End-to-End: Decomposing medical tasks into workflow steps (segmentation -> measurement) yields valid results where VLMs hallucinate.
Guideline Retrieval: Using MCP to fetch external criteria (e.g., AJCC staging) ensures diagnoses align with standards, unlike static model knowledge.
Real-time Adaptation: The 'co-evolving' feature allows the system to go from <5% to >90% accuracy on novel tasks (like cell counting) in minutes via user feedback.
Transparency: Storing intermediate results (segmentation masks) in the memory layer allows verification, building trust compared to black-box VLM outputs.

📚 Prerequisite Knowledge

Prerequisites

Principles of Agentic AI (planning, tool use)
Medical Imaging modalities (Pathology WSI, CT, MRI)
Active Learning concepts
Basic computer vision tasks (Segmentation, Classification)

Key Terms

WSI: Whole-Slide Image—high-resolution digital pathology scans often gigapixels in size

MCP: Model Context Protocol—a standard used here to dynamically retrieve external clinical guidelines (e.g., AJCC staging) to ground model decisions

Active Learning: A machine learning strategy where the algorithm queries the user to label new data points that are most informative, allowing rapid adaptation with few examples

DoI: Depth of Invasion—a specific histopathological measurement critical for tumor staging

Topological Sorting: An algorithm used here to organize dependent tasks in the generated workflow graph to ensure correct execution order and parallelization

VLM: Vision-Language Model—AI models capable of processing both images and text

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

Factory Method: A design pattern used here to abstract diverse imaging tools into standardized operations (segmentation, classification) for the agent