From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring

📝 Paper Summary

Agentic AI in Healthcare Remote Patient Monitoring (RPM)

Sentinel is an autonomous AI agent that uses 21 structured clinical tools to triage remote patient monitoring data, achieving higher sensitivity for emergencies than individual clinicians.

Core Problem

Remote patient monitoring (RPM) generates a flood of data in which most alerts are noise. Prior trials failed because simple threshold-based filtering lacks the clinical context needed to distinguish true emergencies from benign readings.

Why it matters:

Landmark heart failure trials (Tele-HF, BEAT-HF) failed to improve outcomes because clinicians were buried in irrelevant alerts, leading to alert fatigue and missed signals
Effective monitoring (TIM-HF2) requires 24/7 physician staffing to interpret context, which is prohibitively expensive and unscalable for widespread chronic disease management

Concrete Example: A weight gain of 3 lbs can trigger a heart failure alert. A rule-based system flags it on the number alone. Sentinel retrieves context showing that the patient recently increased their diuretic dosage and has stable breathing, and correctly classifies the reading as 'Monitor' rather than 'Emergency'.
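The contrast in this example can be made concrete with a small sketch. It is illustrative only: the threshold constant, the `PatientContext` fields, and the fallback label are assumptions invented here, and the hand-written `if` logic merely stands in for the LLM reasoning that Sentinel performs.

```python
from dataclasses import dataclass

# Hypothetical threshold; the summary only says a 3 lb gain "can trigger" an alert.
WEIGHT_GAIN_ALERT_LBS = 3.0


@dataclass
class PatientContext:
    # Illustrative fields, not the paper's schema.
    diuretic_dose_recently_increased: bool
    breathing_stable: bool


def rule_based_flag(weight_gain_lbs: float) -> bool:
    """Flag on the scalar value alone, with no chart context."""
    return weight_gain_lbs >= WEIGHT_GAIN_ALERT_LBS


def context_aware_severity(weight_gain_lbs: float, ctx: PatientContext) -> str:
    """Toy stand-in for contextual review of a flagged weight reading."""
    if not rule_based_flag(weight_gain_lbs):
        return "NOT AN ISSUE"
    if ctx.diuretic_dose_recently_increased and ctx.breathing_stable:
        # Context explains the reading, so it is downgraded.
        return "MONITOR"
    # Placeholder escalation level chosen for this sketch only.
    return "URGENT"


print(rule_based_flag(3.0))
print(context_aware_severity(3.0, PatientContext(True, True)))
```

The rule-based function sees the same reading in both cases; only the second function can tell a reassuring chart from a worrying one.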

Key Novelty

Context-Aware Autonomous Clinical Agent (Sentinel)

Equips an LLM with 21 structured tools (via Model Context Protocol) to autonomously retrieve patient history, medications, and notes, simulating a clinician's chart review process
Replaces fixed rule-based alerts with dynamic multi-step reasoning, allowing the agent to determine what data is necessary to evaluate a specific vital sign reading

Architecture

System architecture of Sentinel showing the interaction between the AI Agent, the Model Context Protocol (MCP) Host, and the various Tool Services.

Evaluation Highlights

Achieved 95.8% sensitivity for emergency classifications (23/24) and 88.5% sensitivity for all actionable alerts (92/104) against a human majority-vote reference standard
Outperformed every individual human clinician in leave-one-out analysis for emergency sensitivity (97.5% vs. clinician aggregate 60.0%)
Demonstrated almost perfect self-consistency (Fleiss' κ = 0.850) across 5 independent runs, whereas human raters reached pairwise exact agreement of only ~60% (a different statistic from κ, so the two figures are not directly comparable)
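A leave-one-out reference of the kind used in these comparisons can be sketched as follows. This is a minimal illustration under stated assumptions: the rater names and labels are made up, the ordinal ordering of severity levels is assumed, and the tie-breaking rule (ties go to the more severe label) is a choice made here, not something the summary specifies.

```python
from collections import Counter

# Assumed ordinal order, least to most severe.
SEVERITY_ORDER = ["NOT AN ISSUE", "MONITOR", "URGENT", "EMERGENCY"]


def majority_vote(labels):
    """Most common label; ties go to the more severe label (an assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    return max(tied, key=SEVERITY_ORDER.index)


def loo_reference(ratings_by_rater, held_out):
    """Per-item reference built from every rater except `held_out`."""
    others = [r for name, r in ratings_by_rater.items() if name != held_out]
    return [majority_vote(item) for item in zip(*others)]


# Made-up ratings for two readings from four hypothetical raters.
ratings = {
    "a": ["EMERGENCY", "MONITOR"],
    "b": ["EMERGENCY", "URGENT"],
    "c": ["MONITOR", "URGENT"],
    "d": ["EMERGENCY", "URGENT"],
}
print(loo_reference(ratings, held_out="a"))
```

Each held-out rater is then scored against a reference they did not contribute to, which is what keeps the head-to-head comparison fair.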

Breakthrough Assessment

8/10

Demonstrates the first deployed autonomous agent using Model Context Protocol for clinical RPM triage. On the reported results it exceeds individual clinicians in emergency sensitivity (97.5% vs. 60.0%) and the baseline in specificity (85.7% vs. 59.2%), at the cost of lower actionable sensitivity (88.5% vs. 98.1%), offering a scalable answer to the 'data flood' problem that plagued prior RPM trials.

⚙️ Technical Details

Problem Definition

Setting: Retrospective clinical triage of vital sign readings (BP, SpO2, Weight) from home devices

Inputs: Raw vital sign reading, patient ID, reading metadata

Outputs: Severity classification (EMERGENCY, URGENT, MONITOR, NOT AN ISSUE) and Action Type

Pipeline Flow

Vital Sign Input
Agent Reasoning Loop (Thought -> Tool Selection -> Tool Execution -> Observation)
Final Classification & Action Plan
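The three stages above can be sketched as a loop. This is a hedged illustration, not Sentinel's implementation: the two tool names, their return values, the `scripted_policy` function (a deterministic stand-in for the LLM), and the fallback on an exhausted step budget are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Decision:
    tool: Optional[str] = None       # next tool to call, or None to stop
    args: dict = field(default_factory=dict)
    severity: Optional[str] = None   # set only on the final step


# Hypothetical tool registry; names and return shapes are invented.
TOOLS = {
    "get_medications": lambda patient_id: {"diuretic_increased": True},
    "get_recent_notes": lambda patient_id: {"breathing": "stable"},
}


def scripted_policy(reading, observations):
    """Stand-in for the LLM: picks the next tool or emits a classification."""
    for name in ("get_medications", "get_recent_notes"):
        if name not in observations:
            return Decision(tool=name, args={"patient_id": reading["patient_id"]})
    meds = observations["get_medications"]
    notes = observations["get_recent_notes"]
    reassuring = meds["diuretic_increased"] and notes["breathing"] == "stable"
    return Decision(severity="MONITOR" if reassuring else "URGENT")


def triage(reading, policy=scripted_policy, max_steps=10):
    observations = {}
    for _ in range(max_steps):
        decision = policy(reading, observations)   # thought + tool selection
        if decision.tool is None:
            return decision.severity, observations  # final classification
        # Tool execution; the result becomes the next observation.
        observations[decision.tool] = TOOLS[decision.tool](**decision.args)
    # Placeholder fallback if the step budget runs out (an assumption).
    return "URGENT", observations


severity, seen = triage({"patient_id": "p1", "vital": "weight", "value": 3.0})
print(severity, list(seen))
```

The point of the structure is that the policy, not a fixed pipeline, decides which tool to call next and when enough context has been gathered.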

System Modules

Clinical Data Service

Provide structured access to patient EHR data

Model or implementation: Anthropic Claude Opus (model identifier claude-opus-4-6)

Terminology Service

Look up diagnostic codes

Model or implementation: N/A (Database lookup)

Reasoning Core

Synthesize context and assign triage level

Model or implementation: claude-opus-4-6

Novel Architectural Elements

Integration of Model Context Protocol (MCP) for standardized clinical tool access in an autonomous agent loop
Dynamic context retrieval where the agent autonomously decides which of 21 tools to call based on the specific patient case, rather than a fixed pre-processing pipeline
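For readers unfamiliar with MCP, a tool invocation travels as a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds such a message; the tool name `get_medications` and its `patient_id` argument are hypothetical, since the summary does not list Sentinel's 21 tools.

```python
import json
from itertools import count

_request_ids = count(1)  # simple monotonically increasing request ids


def tools_call_request(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Hypothetical tool name and argument, for illustration only.
print(tools_call_request("get_medications", {"patient_id": "p1"}))
```

Because every tool shares this envelope, the agent can choose among tools at run time without tool-specific client code.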

Modeling

Base Model: Anthropic claude-opus-4-6

Compute: Inference only; median 94.5 seconds per triage; mean cost $0.34 per triage

Comparison to Prior Work

vs. Rule-based/Threshold systems: Sentinel uses LLM reasoning with full EHR context to filter noise, whereas rules only look at scalar values
vs. Standard LLM Triage: Sentinel uses agentic tool use to actively retrieve missing context, rather than relying on a static context window
vs. Human Triage Centers (TIM-HF2): Sentinel automates the contextual review process at $0.34/triage, enabling scalability humans cannot match

Limitations

Retrospective evaluation only; no prospective clinical outcomes data
Reference standard relies on human consensus, which showed high variability (only ~60% pairwise exact agreement)
Study population restricted to a single care program (AnsibleHealth), potentially limiting generalizability
Agent overtriage rate (22.5%) is clinically safe but still generates notable workload
Dependency on proprietary model (Claude Opus) and specific EHR tool integrations

📊 Experiments & Results

Evaluation Setup

Retrospective triage of 500 RPM readings from 340 polychronic patients

Metrics:

Sensitivity (Emergency, Actionable)
Specificity
Fleiss' Kappa (Inter-rater reliability)
Quadratic Weighted Kappa (Agreement with reference)
Statistical methodology: Bootstrap resampling (2000 resamples) for confidence intervals; Leave-One-Out (LOO) analysis for human comparison
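The sensitivity and specificity metrics, together with a bootstrap confidence interval, can be sketched as below. The percentile method and the fixed seed are assumptions made here; the summary states only that 2000 resamples were used. The example data reproduce the reported emergency count (23 of 24 detected) but are otherwise synthetic.

```python
import random


def sensitivity(pred, ref):
    """True positives / all reference positives."""
    tp = sum(p and r for p, r in zip(pred, ref))
    fn = sum((not p) and r for p, r in zip(pred, ref))
    return tp / (tp + fn)


def specificity(pred, ref):
    """True negatives / all reference negatives."""
    tn = sum((not p) and (not r) for p, r in zip(pred, ref))
    fp = sum(p and (not r) for p, r in zip(pred, ref))
    return tn / (tn + fp)


def bootstrap_ci(metric, pred, ref, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (the exact variant used in the paper is assumed)."""
    rng = random.Random(seed)
    n = len(pred)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample items with replacement
        try:
            stats.append(metric([pred[i] for i in idx], [ref[i] for i in idx]))
        except ZeroDivisionError:
            continue  # resample lacked the class the metric needs; skip it
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# 24 reference emergencies, 23 of them flagged by the agent.
ref = [True] * 24
pred = [True] * 23 + [False]
print(round(100 * sensitivity(pred, ref), 1))
print(bootstrap_ci(sensitivity, pred, ref))
```

With only 24 emergencies the interval is wide, which is worth keeping in mind when reading the headline sensitivity figure.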

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Majority Vote Reference Standard	Actionable Sensitivity	98.1%	88.5%	-9.6 pp
Majority Vote Reference Standard	Specificity	59.2%	85.7%	+26.5 pp
Leave-One-Out Reference	Emergency Sensitivity	60.0%	97.5%	+37.5 pp
Leave-One-Out Reference	Actionable Sensitivity	69.5%	90.9%	+21.4 pp
Self-Consistency	Fleiss' Kappa	0.60	0.85	+0.25

Main Takeaways

Sentinel addresses the 'data flood' problem by filtering noise with high specificity (85.7%) while maintaining higher emergency sensitivity (97.5%) than individual clinicians (60.0%).
The agent's disagreements are clinically defensible: independent adjudication confirmed 100% of severe overtriage cases were 'Justified' or 'Debatable', with 0% deemed 'True Overtriage' after consensus.
Cost and speed ($0.34/triage, ~95s) make contextual monitoring scalable, potentially enabling the mortality benefits of TIM-HF2 without the prohibitive staffing costs.

📚 Prerequisite Knowledge

Prerequisites

Remote Patient Monitoring (RPM) workflows
Large Language Models (LLMs) and tool use
Clinical triage protocols

Key Terms

RPM: Remote Patient Monitoring—collection of health data (vitals) from patients outside traditional care settings

MCP: Model Context Protocol—a standard interface for connecting AI agents to structured data sources and tools

Agentic AI: LLM systems capable of autonomous multi-step reasoning and tool execution to solve complex tasks

Alert Fatigue: Desensitization of clinicians to alarms due to a high volume of false or insignificant alerts

Leave-One-Out (LOO): An evaluation method where each rater is compared against a reference standard formed by the remaining raters, ensuring fair head-to-head comparison

Fleiss' Kappa: A chance-corrected measure of agreement among a fixed number of raters assigning categorical labels; here it also quantifies the agent's consistency across its 5 independent runs

QWK: Quadratic Weighted Kappa—a metric for inter-rater agreement that penalizes larger disagreements more heavily than smaller ones
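The two kappa statistics defined above can be computed with the standard formulas, sketched here in plain Python. The ordinal ordering in `LEVELS` is an assumption, and the example labels are synthetic; treating each of the agent's runs as one "rater" is how a self-consistency κ would be obtained.

```python
from collections import Counter

# Assumed ordinal order, least to most severe.
LEVELS = ["NOT AN ISSUE", "MONITOR", "URGENT", "EMERGENCY"]


def fleiss_kappa(ratings):
    """ratings: one label list per item, each with the same number of raters."""
    n_items, n_raters = len(ratings), len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Per-item observed agreement P_i, then its mean.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)


def quadratic_weighted_kappa(a, b, levels=LEVELS):
    """Agreement between two label lists, penalizing distant disagreements more."""
    k, n = len(levels), len(a)
    index = {label: i for i, label in enumerate(levels)}
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2      # quadratic disagreement weight
            num += w * observed[i][j]            # weighted observed disagreement
            den += w * row[i] * col[j] / n       # weighted chance disagreement
    return 1.0 - num / den


# Five synthetic "runs" that agree on every item.
runs = [["MONITOR"] * 5, ["URGENT"] * 5]
print(fleiss_kappa(runs))
print(quadratic_weighted_kappa(["MONITOR", "URGENT"], ["MONITOR", "URGENT"]))
```

Both functions are undefined (division by zero) when every label falls in a single category, a degenerate case that real evaluation code would need to handle.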