MedCollab: Causal-Driven Multi-Agent Collaboration for Full-Cycle Clinical Diagnosis via IBIS-Structured Argumentation

📝 Paper Summary

Medical Multi-Agent Systems, Clinical Diagnosis, Reasoning & Argumentation

MedCollab improves clinical diagnosis by using a multi-agent team that structures reasoning into causal chains and traceable arguments, audited by a consensus mechanism to reduce hallucinations.

Core Problem

Existing medical LLMs and agents treat diagnosis as independent associations between symptoms and diseases, failing to model causal progression and leading to hallucinations lacking traceable evidence.

Why it matters:

Correlation-based diagnosis often misses the root cause or pathological progression (e.g., treating a symptom rather than the underlying condition)
Unstructured model outputs lack traceability, making it impossible for clinicians to audit the logic behind a high-stakes medical decision
Hallucinations in clinical settings are dangerous; systems must strictly ground assertions in patient-specific examination results

Concrete Example: A patient might present with Anemia. A standard model might simply output 'Anemia' as the diagnosis. MedCollab traces the causal chain: 'Trauma → Rib Fracture → Lung Hemorrhage → Anemia', identifying the root trauma and intermediate hemorrhage that require treatment, rather than just the final symptom.
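The chain in this example can be treated as a small directed graph and walked backwards from the presenting finding to its root. The sketch below is only an illustration of that idea; the function name `root_causes` and the edge-list representation are assumptions, not the paper's implementation.

```python
from collections import defaultdict

def root_causes(edges, finding):
    """Walk cause -> effect edges backwards from a finding to its root cause(s)."""
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)

    roots, stack, seen = set(), [finding], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if not parents[node]:
            roots.add(node)              # nothing upstream: this node is a root
        else:
            stack.extend(parents[node])  # keep walking towards the causes
    return roots

# Edges taken from the example chain above (cause, effect).
edges = [
    ("Trauma", "Rib Fracture"),
    ("Rib Fracture", "Lung Hemorrhage"),
    ("Lung Hemorrhage", "Anemia"),
]
print(root_causes(edges, "Anemia"))
```

Starting from 'Anemia', the walk ends at 'Trauma', which is the point the example makes: the final symptom is not the condition that most needs treatment.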

Key Novelty

Causal-Driven IBIS Argumentation

Transforms unstructured agent dialogue into a structured Issue-Based Information System (IBIS), requiring every diagnostic claim to be a 'Position' supported by an explicit 'Argument' and traceable 'Evidence'
Constructs a Hierarchical Disease Causal Chain (HDCC) that links isolated diagnoses into a directed graph of pathological progression, distinguishing causal links from comorbidities
Implements a logic auditing mechanism where a General Practitioner agent iteratively penalizes and down-weights specialists whose arguments contradict the evidence
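One way to picture the IBIS constraint is as a record type whose evidence citations must resolve against the Evidence Base. The following is a minimal sketch under assumed names (`IBISEntry`, `untraceable_evidence`, `is_grounded`, and the evidence ids are all hypothetical); the paper's actual schema is not given in this summary.

```python
from dataclasses import dataclass, field

@dataclass
class IBISEntry:
    issue: str       # the diagnostic question under discussion
    position: str    # the claimed diagnosis
    argument: str    # the reasoning offered for the position
    evidence_ids: list = field(default_factory=list)  # keys into the Evidence Base

def untraceable_evidence(entry, evidence_base):
    """Return cited evidence ids that do not exist in the Evidence Base."""
    return [eid for eid in entry.evidence_ids if eid not in evidence_base]

def is_grounded(entry, evidence_base):
    """Grounded means: at least one citation, and every citation resolves."""
    return bool(entry.evidence_ids) and not untraceable_evidence(entry, evidence_base)

# Illustrative Evidence Base; ids and report text are made up for this sketch.
evidence_base = {
    "CT-01": "Chest CT report: rib fracture with adjacent lung hemorrhage",
    "CBC-01": "Blood count report: hemoglobin below reference range",
}
good = IBISEntry("Cause of anemia?", "Lung hemorrhage",
                 "Bleeding explains the low hemoglobin", ["CT-01", "CBC-01"])
bad = IBISEntry("Cause of anemia?", "Iron deficiency",
                "Dietary history suggests deficiency", ["MRI-07"])

print(is_grounded(good, evidence_base), untraceable_evidence(bad, evidence_base))
```

A position that cites nothing, or cites an id that is missing from the Evidence Base, fails the check, which is the kind of claim the auditing step is described as penalizing.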

Architecture

The MedCollab framework workflow, detailing the transition from Agent Recruitment to IBIS Argumentation, Causal Chain Construction, and Consensus Optimization.

Evaluation Highlights

Achieves 76.9% Accuracy on ClinicalBench, outperforming the strongest multi-agent baseline (ClinicalAgent) by 8.2 percentage points
Reaches 72.4% Comprehensive Diagnostic Rate (CDR) on ClinicalBench, surpassing the best baseline (59.3%) by 13.1 percentage points on this multi-comorbidity metric
Diagnostic Basis RaTEScore of 62.0% on ClinicalBench, higher than leading LLMs such as Gemini-3-Flash, indicating stronger reasoning quality (no statistical significance testing is reported)

Breakthrough Assessment

8/10

Strong contribution in structuring medical reasoning. Moving from flat predictions to causal chains and IBIS-structured argumentation directly addresses the 'black box' and hallucination issues in medical AI.

⚙️ Technical Details

Problem Definition

Setting: Full-cycle clinical diagnosis from unstructured patient case data

Inputs: Patient clinical case S (chief complaint, medical history, raw examination findings)

Outputs: Final diagnostic report including primary diagnosis, causal progression chain, and treatment plan

Pipeline Flow

GP Agent (Recruitment)
→ Examination Agents (Evidence Base Construction)
→ Specialist Agents (IBIS Argumentation Generation)
→ HDCC Construction (Causal Chain Aggregation)
→ Consensus Mechanism (Logic Auditing & Weight Updates)
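The stage order above can be sketched as a control-flow skeleton. Everything here is an assumption made for illustration: the function `run_pipeline`, its callable parameters, the `max_rounds` cap, and the choice to repeat argumentation inside each consensus round are not specified in this summary, and the stubs stand in for LLM-backed agents.

```python
def run_pipeline(case, recruit, examine, argue, build_chain, audit, max_rounds=3):
    """Run the stages in order; each stage is passed in as a plain callable."""
    team = recruit(case)                               # GP Agent: recruitment
    evidence_base = examine(case, team["examiners"])   # Examination Agents
    weights = {name: 1.0 for name in team["specialists"]}
    chain = None
    for _ in range(max_rounds):                        # iterative consensus rounds
        entries = {name: argue(name, case, evidence_base)
                   for name in team["specialists"]}    # Specialist Agents (IBIS)
        chain = build_chain(entries, weights)          # HDCC construction
        weights, converged = audit(entries, evidence_base, weights)  # logic auditing
        if converged:
            break
    return {"evidence_base": evidence_base, "chain": chain, "weights": weights}

# Hypothetical stubs so the skeleton can be executed without any model calls.
def stub_recruit(case):
    return {"examiners": ["radiology"], "specialists": ["thoracic", "hematology"]}

def stub_examine(case, examiners):
    return {f"{e}-report": f"structured report from {e}" for e in examiners}

def stub_argue(name, case, evidence_base):
    return {"position": f"{name} hypothesis", "evidence": list(evidence_base)}

def stub_build_chain(entries, weights):
    # Placeholder: orders specialists by weight; a real builder would emit causal edges.
    return sorted(entries, key=lambda n: -weights[n])

def stub_audit(entries, evidence_base, weights):
    return weights, True   # placeholder: accept everything and stop after one round

result = run_pipeline("example case text", stub_recruit, stub_examine,
                      stub_argue, stub_build_chain, stub_audit)
print(result["chain"], result["weights"])
```

Passing the stages in as callables keeps the skeleton independent of any particular model or prompt, which seems appropriate given that the base model is not reported.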

System Modules

GP Agent

Analyzes the case, recruits relevant specialists/examination agents, and acts as the logic auditor

Model or implementation: LLM-based agent (Architecture not explicitly specified)

Examination Agents

Interpret raw medical findings to generate structured reports (Evidence Base)

Model or implementation: LLM-based agent

Specialist Agents

Generate diagnostic hypotheses using IBIS structure (Position + Argument + Evidence)

Model or implementation: LLM-based agent

Logic Auditor

Evaluates arguments against evidence and updates agent weights via exponential penalty

Model or implementation: Algorithm / GP Agent Component
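The summary names an exponential penalty and a coefficient λ but does not give the update rule. A plausible form, assumed here purely for illustration, multiplies each specialist's weight by exp(-λ·k), where k is the number of arguments the auditor flagged as contradicting the evidence, and then renormalizes. The function name `update_weights` and the default `lam=0.5` are hypothetical.

```python
import math

def update_weights(weights, inconsistencies, lam=0.5):
    """Down-weight agents by exp(-lam * k), then renormalize to sum to 1.

    weights: agent name -> current weight
    inconsistencies: agent name -> count of flagged contradictions (k)
    lam: penalty coefficient (assumed value; would need tuning)
    """
    penalized = {agent: w * math.exp(-lam * inconsistencies.get(agent, 0))
                 for agent, w in weights.items()}
    total = sum(penalized.values())
    return {agent: w / total for agent, w in penalized.items()}

weights = {"thoracic": 1.0, "hematology": 1.0}
new_weights = update_weights(weights, {"hematology": 2}, lam=0.5)
print(new_weights)
```

Under this assumed rule an agent with no flagged contradictions keeps its relative standing, while each additional contradiction shrinks an agent's weight by a constant factor, so repeated offenders lose influence quickly.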

Novel Architectural Elements

Integration of IBIS protocol into agent output structure to enforce evidence traceability
Hierarchical Disease Causal Chain (HDCC) construction module that explicitly models disease progression graphs from agent outputs
Iterative consensus mechanism using logical inconsistency penalties to dynamically re-weight agents

Modeling

Base Model: Not explicitly reported in the paper (likely GPT-4 or a similar high-capability LLM given the baselines, but the text only mentions DeepSeek-V3 for ground truth generation)

Comparison to Prior Work

vs. ClinicalAgent: MedCollab adds explicit causal chaining (HDCC) and IBIS-structured constraints, whereas ClinicalAgent uses standard reasoning, which may lack etiological consistency.
vs. MEDDxAgent: MedCollab handles full-cycle diagnosis including department routing and report generation, while MEDDxAgent focuses on dialogue.
vs. Standard LLMs (GPT-4, etc.): MedCollab distributes reasoning across specialists and enforces evidence grounding, reducing hallucinations compared to monolithic models.

Limitations

Relies on the quality of the underlying LLM's medical knowledge; logic auditing can only filter, not generate new knowledge
Computationally more intensive than single-turn diagnosis due to multi-agent recruitment and iterative consensus rounds
Requires raw examination findings to be available for the Evidence Base; may struggle with incomplete patient records
Specific penalty coefficient λ for logic auditing is a hyperparameter that may need tuning

Reproducibility

No code URL provided in the paper text. Dataset availability: ClinicalBench is an existing dataset; MIMIC-IV is public but requires credentialed access. Ground truth generation used DeepSeek-V3 with specific prompting (prompts not provided in text).

📊 Experiments & Results

Evaluation Setup

Full-cycle diagnosis on real-world clinical cases

Benchmarks:

ClinicalBench: real-world clinical case diagnosis (1,500 cases)
MIMIC-IV: critical care case diagnosis (subset of 595 cases)

Metrics:

Accuracy (ACC)
Comprehensive Diagnostic Rate (CDR)
Department Classification Accuracy (DCA)
Entity-F1 (Medical Factual Consistency)
RaTEScore (Reasoning Quality)

Statistical methodology: Not explicitly reported in the paper

Key Results

MedCollab demonstrates superior diagnostic accuracy and department routing on ClinicalBench compared to the strongest baselines.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
ClinicalBench	Accuracy (ACC)	68.7	76.9	+8.2
ClinicalBench	Comprehensive Diagnostic Rate (CDR)	59.3	72.4	+13.1
MIMIC-IV	Accuracy (ACC)	Not reported in the paper	57.7	Not reported in the paper

Ablation studies reveal the critical importance of the Logic Auditing and Causal Chain components.
Benchmark	Metric	Full MedCollab (%)	Ablated Variant (%)	Δ (pp)
ClinicalBench	Accuracy (ACC), without Logic Auditing	76.9	49.7	-27.2
ClinicalBench	Accuracy (ACC), without HDCC	76.9	52.9	-24.0
ClinicalBench	RaTEScore (Diagnostic Basis), ablated component not specified	62.0	51.7	-10.3
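The Δ column reports absolute differences in percentage points, which are easy to confuse with relative changes. The short check below recomputes both from the figures in the table; the row labels for the two accuracy ablations follow the Main Takeaways, and the helper name `deltas` is an assumption.

```python
def deltas(reference, value):
    """Return (absolute change in percentage points, relative change in percent)."""
    abs_pp = round(value - reference, 1)
    rel_pct = round(100 * (value - reference) / reference, 1)
    return abs_pp, rel_pct

rows = [
    ("ACC vs. strongest baseline", 68.7, 76.9),
    ("CDR vs. best baseline", 59.3, 72.4),
    ("ACC, full vs. without Logic Auditing", 76.9, 49.7),
    ("ACC, full vs. without HDCC", 76.9, 52.9),
    ("RaTEScore (Diagnostic Basis), full vs. ablated", 62.0, 51.7),
]
for label, reference, value in rows:
    abs_pp, rel_pct = deltas(reference, value)
    print(f"{label}: {abs_pp:+.1f} pp ({rel_pct:+.1f}% relative)")
```

For instance, the 27.2-point accuracy drop corresponds to losing roughly a third of the full system's accuracy in relative terms.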

Experiment Figures

Comparison of generation quality (RaTEScore, BLEU, ROUGE) across four report sections: Diagnostic Basis, Differential Diagnosis, Therapeutic Principle, and Treatment Plan.

Main Takeaways

Explicitly modeling disease progression via Causal Chains (HDCC) is as critical as the agent architecture itself; removing it drops accuracy by 24.0 percentage points (76.9% to 52.9%).
Logic Auditing is the most vital component; without the GP-led penalty for inconsistent arguments, accuracy falls by 27.2 percentage points (76.9% to 49.7%).
MedCollab effectively reduces hallucinations (higher Entity-F1) by forcing agents to cite traceable evidence from the Evidence Base via the IBIS protocol.
The system excels in complex multi-comorbidity cases (high CDR), likely because the causal chain distinguishes root causes from downstream complications.

📚 Prerequisite Knowledge

Prerequisites

Understanding of multi-agent systems (roles, collaboration)
Basic medical diagnostic workflow (triage, differential diagnosis)
Knowledge of argumentation structures

Key Terms

IBIS: Issue-Based Information System—a structured argumentation framework that breaks reasoning into Issues, Positions, and Arguments

HDCC: Hierarchical Disease Causal Chain—a directed graph modeling the pathological progression of diseases (e.g., A causes B) rather than just listing them

RaTEScore: A metric for evaluating medical reports that assesses the semantic correctness of entities and reasoning logic, rather than just text overlap

Entity-F1: A metric measuring the factual consistency of medical entities (symptoms, diseases) in the output compared to ground truth

CDR: Comprehensive Diagnostic Rate—measures the system's ability to correctly identify the full set of relevant diagnoses in complex cases with multiple conditions

DCA: Department Classification Accuracy—measures how accurately the system routes the patient to the correct medical specialty (e.g., Cardiology vs. Neurology)
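To make the set-based metrics above concrete, here is a simplified sketch. It assumes entities and diagnoses have already been extracted and normalized to comparable strings, and it approximates CDR as per-case coverage of the ground-truth diagnoses; the exact formulas used in the paper are not given in this summary, and the names `entity_f1` and `diagnosis_coverage` are hypothetical.

```python
def entity_f1(predicted, gold):
    """F1 between predicted and ground-truth entity sets (exact string match)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def diagnosis_coverage(predicted, gold):
    """Fraction of ground-truth diagnoses that the system also reported."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)

# Illustrative case, not taken from either benchmark.
predicted = {"anemia", "rib fracture", "pneumonia"}
gold = {"anemia", "rib fracture", "lung hemorrhage"}
print(entity_f1(predicted, gold), diagnosis_coverage(predicted, gold))
```

Exact string matching is the main simplification here; a real evaluation would need entity extraction and synonym handling before any set comparison.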