Meissa: Multi-modal Medical Agentic Intelligence

📝 Paper Summary

Medical Multi-modal Large Language Models, Agentic AI, Model Distillation

Meissa distills complex medical agent behaviors (tool use, multi-step reasoning) from large proprietary models into a lightweight 4B-parameter model that runs offline. It does so by training on stratified, error-driven trajectories.

Core Problem

Current high-performance medical agents rely on proprietary frontier models (e.g., GPT-4) via cloud APIs, making them unsuitable for clinical settings due to high cost, latency, and privacy risks.

Why it matters:

Patient data privacy regulations often prohibit sending medical images to external cloud APIs.
Repeated API calls for multi-step agentic reasoning create prohibitive costs and latency that disrupt real-time clinical workflows.
Existing small models lack the 'agentic' ability to decide when to use tools and when to answer directly, which limits them to single-pass tasks.

Concrete Example: A clinician needs a diagnosis from a CT scan. A frontier agent might invoke a segmentation tool, analyze the mask, and debate with a sub-agent, taking 30+ seconds and costing $0.50 via API. A standard small model just guesses immediately and incorrectly. Meissa runs this multi-step process locally in ~1.4s.

Key Novelty

Unified Agentic Behavior Distillation via Stratified Supervision

Treats 'strategy selection' (whether to use tools) as a learned behavior by training on a mix of direct answers (for easy queries) and tool-use trajectories (for hard queries), with query difficulty determined by model error rates.
Unifies heterogeneous agent environments (tool calling, visual reasoning, multi-agent debate) into a single state-action-observation format, allowing one model to master all interaction modes.
Uses 'prospective-retrospective' training: pairs exploratory forward traces (what happened) with clean hindsight summaries (why it happened) to stabilize policy learning.
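A minimal sketch of how error-driven stratification could map a query to a supervision tier. The function names, the idea of sampling several attempts per query, and the thresholds (0.2 and 0.6) are illustrative assumptions, not values taken from the paper.

```python
from typing import List


def error_rate(attempt_correct: List[bool]) -> float:
    """Fraction of sampled attempts on one query that were wrong."""
    if not attempt_correct:
        raise ValueError("need at least one attempt")
    return 1.0 - sum(attempt_correct) / len(attempt_correct)


def assign_tier(rate: float, easy_max: float = 0.2, medium_max: float = 0.6) -> str:
    """Map an error rate to a supervision tier (thresholds are hypothetical)."""
    if rate <= easy_max:
        return "direct"      # easy: supervise with a direct answer
    if rate <= medium_max:
        return "enhanced"    # medium: supervise with richer reasoning
    return "agentic"         # hard: supervise with a multi-step tool trajectory


# One query answered correctly in 1 of 4 attempts lands in the hardest tier.
print(assign_tier(error_rate([True, False, False, False])))
```

Because the tier decides which kind of target the model sees, the choice between answering directly and calling a tool ends up encoded in the training data rather than in a separate routing component.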

Architecture

Overview of the Meissa framework, illustrating the stratified data synthesis pipeline and the unified trajectory learning.

Evaluation Highlights

Matches or exceeds proprietary frontier agents (Gemini-Pro-1.5, GPT-4o) in 10 of 16 evaluation settings across 13 medical benchmarks.
Reduces end-to-end latency by ~22x compared to API-based frontier agent deployment.
Uses >25x fewer parameters (4B) than typical frontier models like Gemini-3 while maintaining competitive performance.

Breakthrough Assessment

9/10

Successfully distills complex agentic reasoning—usually reserved for massive models—into a deployable 4B model. The stratified supervision strategy elegantly solves the 'when to act' routing problem without a separate router.

⚙️ Technical Details

Problem Definition

Setting: Medical Visual Question Answering and Clinical Reasoning with optional tool/external interaction

Inputs: Medical image(s) and natural language query

Outputs: Final text answer, reached via an optional sequence of actions (tools/sub-agents)

Pipeline Flow

Unified Agent (Meissa) processes Input (Image + Text)
Action Generation (Autoregressive prediction of next token)
If Action == <|call|>: Execute Tool/Sub-agent → Receive Observation → Loop
If Action == <|assistant|>: Output Final Answer
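The loop above can be sketched as follows. Only the two action tokens come from the summary. The `run_agent` function, the `name:argument` payload convention, the step budget, and the toy policy and tool are hypothetical stand-ins for the fine-tuned model and real tools.

```python
from typing import Callable, Dict, List, Tuple

CALL, ASSISTANT = "<|call|>", "<|assistant|>"

History = List[Tuple[str, str]]                    # (role, text) pairs
Policy = Callable[[History], Tuple[str, str]]      # returns (action token, payload)


def run_agent(policy: Policy, tools: Dict[str, Callable[[str], str]],
              query: str, max_steps: int = 8) -> str:
    """Alternate between policy actions and tool observations until an answer."""
    history: History = [("user", query)]
    for _ in range(max_steps):
        action, payload = policy(history)
        if action == ASSISTANT:
            return payload                         # final answer ends the episode
        if action == CALL:
            # Assumed payload format: "<tool name>:<argument>".
            tool_name, _, arg = payload.partition(":")
            observation = tools[tool_name](arg)
            history.append(("call", payload))
            history.append(("observation", observation))
            continue
        raise ValueError(f"unknown action: {action}")
    raise RuntimeError("step budget exhausted without a final answer")


def toy_policy(history: History) -> Tuple[str, str]:
    """Stand-in policy: call the segmentation stub once, then answer."""
    if history[-1][0] != "observation":
        return CALL, "segment:ct_scan_001"
    return ASSISTANT, f"answer based on {history[-1][1]}"


toy_tools = {"segment": lambda arg: f"mask({arg})"}
print(run_agent(toy_policy, toy_tools, "Is there a lesion?"))
# prints: answer based on mask(ct_scan_001)
```

A policy that emits `<|assistant|>` on its first step never enters the tool branch, which corresponds to the direct-answer path.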

System Modules

Meissa

Decides whether to answer directly or call tools; generates actions and final answers

Model or implementation: Qwen3-VL-4B (fine-tuned)

Tool Executor

Executes external actions requested by the model

Model or implementation: Deterministic functions or external APIs (e.g., SAM2, BioMedParse)

Novel Architectural Elements

Implicit Strategy Routing: The model architecture is standard, but the 'routing' (choosing between depth=0 direct answer vs. depth>0 agentic path) is internalized into the first token prediction via stratified training data, rather than using an external router module.
Unified State-Action-Observation formalism: A single schema representing widely different agent interactions (from debating sub-agents to segmentation tools) allowing cross-environment generalization.
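One way to picture a single schema that covers both direct answers and multi-step interactions is sketched below. The class and field names are assumptions made for illustration and do not reflect the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str                         # e.g. "<|call|>" or "<|assistant|>"
    content: str                        # tool request or final answer text
    observation: Optional[str] = None   # environment reply; None for a final answer


@dataclass
class Trajectory:
    environment: str                    # e.g. "tool_calling" or "multi_agent_debate"
    query: str
    steps: List[Step] = field(default_factory=list)

    @property
    def depth(self) -> int:
        """Number of steps that received an observation (0 means direct answer)."""
        return sum(1 for s in self.steps if s.observation is not None)


direct = Trajectory("tool_calling", "Is the heart enlarged?",
                    [Step("<|assistant|>", "No.")])
agentic = Trajectory("tool_calling", "Where is the lesion?",
                     [Step("<|call|>", "segment:image_0", "mask returned"),
                      Step("<|assistant|>", "Left lower lobe.")])
print(direct.depth, agentic.depth)  # prints: 0 1
```

Under this representation a depth-0 trajectory and a depth>0 trajectory differ only in their first action, which is why routing can reduce to a first-token prediction.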

Modeling

Base Model: Qwen3-VL-4B

Training Method: Supervised Fine-Tuning (SFT) / Behavioral Cloning

Objective Functions:

Purpose: Maximize likelihood of the next token in the trajectory.

Formally: Standard autoregressive language modeling loss.
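Written out, a standard autoregressive objective over trajectories takes the form below. The notation is introduced here for illustration: x is the input (image and query), y_1, ..., y_T are the tokens of the target trajectory, D is the training set, and θ are the model parameters. The summary does not say whether observation tokens are masked out of the loss, so the sum simply runs over all T target tokens.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]
```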

Adaptation: Full fine-tuning

Trainable Parameters: 4B (all parameters)

Training Data:

~40K total trajectories derived from Gemini-3-flash teacher
8.2K Direct Reasoning (Tier 1: easy)
9.8K Enhanced Reasoning (Tier 2: medium)
23.9K Agentic Trajectories (Tier 3: hard, multi-step)

Key Hyperparameters:

training_time: ~12 hours
gpu_config: 8x A6000 GPUs
batch_size: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Training: ~12 hours on 8x A6000. Inference: ~22x lower latency than API-based agents.

Comparison to Prior Work

vs. Med-Gemini: Meissa is open-weights, has only 4B parameters (versus a far larger model), and runs offline while matching performance on many tasks.
vs. MedRAX: Meissa learns a unified policy across 4 environments (not just X-rays) and learns *when* to use tools via error-driven stratification.
vs. RouteLLM: Meissa internalizes routing as an emergent behavior of the generation policy rather than training a separate classification head.
vs. STeP [not cited in paper]: STeP also distills reasoning, but Meissa specifically distills *interactive* agentic policies (multi-turn tool use) rather than just Chain-of-Thought paths.

Limitations

Relies on a proprietary teacher (Gemini-3) for data synthesis; quality is bounded by the teacher's capabilities.
Maximum trajectory depth is fixed per environment type during data generation, although the depth actually used for a given query is learned.
Evaluation focuses on medical domain; generalization to general-purpose agentic tasks is not tested.

Reproducibility

Code: https://github.com/Schuture/Meissa

Artifacts available at https://github.com/Schuture/Meissa: code, data (40K trajectories), and models. The paper relies on proprietary Gemini-3-flash for data generation, which is a closed-source dependency for the *creation* of the training data, though the resulting dataset is released.

📊 Experiments & Results

Evaluation Setup

Medical Visual Question Answering and Clinical Reasoning across 4 heterogeneous agent environments

Benchmarks:

MIMIC-CXR (Radiology Report Generation/VQA)
VQA-RAD (Radiology VQA)
Slake (Bilingual Medical VQA)
NEJM (Clinical Case Reasoning)
AgentClinic (Interactive Clinical Diagnosis)

Metrics:

Accuracy
End-to-end Latency
Strategy Selection Accuracy (routing)
Statistical methodology: Not explicitly reported in the paper

Key Results

Meissa outperforms comparable open-source models and rivals proprietary frontier models on varied medical benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
MIMIC-CXR (Visual Grounding)	Accuracy	32.4	68.2	+35.8
VQA-RAD	Accuracy	78.2	82.5	+4.3
NEJM (Clinical Reasoning)	Accuracy	38.6	39.4	+0.8
Average across tasks	Latency (seconds)	31.2	1.4	-29.8
Strategy Selection (Routing)	Accuracy	Not reported in the paper	Near-oracle	Not reported in the paper
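As a quick check, the latency row of the table is consistent with the roughly 22x speedup quoted elsewhere in this summary. The variable names below are illustrative; the two latency values are the ones reported in the table.

```python
baseline_s = 31.2   # average latency of the API-based baseline, in seconds
meissa_s = 1.4      # average latency of Meissa, in seconds

delta = meissa_s - baseline_s      # absolute change in seconds
speedup = baseline_s / meissa_s    # how many times faster Meissa is

print(f"delta = {delta:.1f} s, speedup = {speedup:.1f}x")
# prints: delta = -29.8 s, speedup = 22.3x
```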

Experiment Figures

Visualization of the four distinct agent environments Meissa is trained on.

Main Takeaways

Small, specialized agentic models (4B) can outperform much larger general-purpose models (GPT-4o, 72B-parameter models) when equipped with tools and trained on high-quality agentic trajectories.
Stratified supervision effectively teaches the model *when* to use tools, avoiding unnecessary computational cost for easy queries.
The unified state-action representation allows a single model to handle diverse medical tasks (radiology, pathology, clinical interaction) without task-specific architectural changes.

📚 Prerequisite Knowledge

Prerequisites

Multi-modal Large Language Models (MM-LLMs)
Agentic AI concepts (tool use, reasoning chains)
Knowledge Distillation / Behavioral Cloning

Key Terms

MM-LLMs: Multi-modal Large Language Models—AI models capable of processing and generating both text and images.

Agentic behavior: The ability of a model to autonomously decide to take external actions (like calling a tool or asking a sub-agent) before answering, rather than just generating text immediately.

Trajectory: The recorded sequence of thoughts, actions, and observations an agent takes to solve a problem.

Stratified supervision: A training strategy where data is organized by difficulty; easy samples teach direct answering, while hard samples teach complex tool use.

Frontier models: The most advanced, usually proprietary and closed-source, AI models available (e.g., GPT-4, Gemini Ultra).

SFT: Supervised Fine-Tuning—training a model on labeled examples.

Prospective trajectories: Traces recorded during the agent's actual forward attempt to solve a problem (exploratory).

Retrospective trajectories: Traces generated after the fact, rewriting the reasoning path to be cleaner and more logical based on the known outcome (hindsight).

Behavioral Cloning: A method where a student model learns to mimic the exact actions taken by a teacher model in a given situation.

OOD: Out-Of-Distribution—data or tasks that differ significantly from what the model was trained on.