Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy

Junyang Wu, Mingyi Luo, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Chunxi Zhang, Junhao Wang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang
Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai Chest Hospital
arXiv (2026)
Agent MM RL

📝 Paper Summary

Robotic Surgery Autonomy · Imitation Learning for Control · Multi-Agent Hierarchical Control
A vision-only hierarchical control system for robotic bronchoscopy that combines a fast reactive agent for continuous motion with a slow strategic agent (using LLMs) for complex decision points, achieving expert-level navigation without external tracking sensors.
Core Problem
Robotic bronchoscopy navigation typically relies on complex external hardware (electromagnetic tracking) that is prone to registration errors and interference. Vision-only alternatives avoid that hardware but fail over long horizons because the bronchial tree lacks distinctive visual landmarks.
Why it matters:
  • External tracking hardware increases cost, complexity, and setup time in clinical workflows
  • Registration divergence between preoperative CT and intraoperative anatomy (due to breathing) causes navigation failure in current systems
  • Existing vision-only methods struggle with long-horizon tasks in deformable, repetitive environments like the lungs, limiting autonomy to short segments
Concrete Example: At a deep airway bifurcation filled with mucus, a standard visual servoing agent can be misled by the mucus artifact and the repetitive texture and fail to choose the correct branch. The proposed system uses a 'Strategic Agent' to recognize the ambiguity and query a large multimodal model for high-level semantic guidance, which corrects the path.
Key Novelty
Hierarchical Long-Short Term Multi-Agent Framework
  • Decomposes navigation into two agents: a high-frequency 'Short-term Reactive Agent' that handles immediate visual alignment and motion, and a sparse 'Long-term Strategic Agent' that intervenes at bifurcations or ambiguities.
  • Integrates a 'World Model Critic' that resolves conflicts between agents by simulating future video frames for candidate actions and selecting the one that best matches the virtual target view.
  • Eliminates the need for electromagnetic sensors by relying entirely on a sequence of virtual views rendered from preoperative CT scans as the navigation reference.
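The interaction between the two agents and the critic can be illustrated with a minimal sketch. This is not the authors' implementation: the frame representation (a plain list of numbers), the squared-distance score, and every name below (`critic_select`, `navigation_step`, the toy agents and toy predictor) are hypothetical stand-ins chosen only to show the control flow described in the bullets above.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[float]  # hypothetical stand-in for a camera image
Action = str

def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity score: squared distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def critic_select(current: Frame, target: Frame, candidates: List[Action],
                  predict: Callable[[Frame, Action], Frame]) -> Action:
    """Simulate the future frame for each candidate action and keep the
    action whose predicted frame is closest to the virtual target view."""
    return min(candidates,
               key=lambda act: frame_distance(predict(current, act), target))

def navigation_step(current: Frame, target: Frame,
                    reactive: Callable[[Frame, Frame], Action],
                    strategic: Callable[[Frame, Frame], Optional[Action]],
                    predict: Callable[[Frame, Action], Frame]) -> Action:
    """One control step: the reactive agent always proposes an action; the
    strategic agent proposes one only when it intervenes (otherwise None)."""
    fast = reactive(current, target)
    slow = strategic(current, target)
    if slow is None or slow == fast:
        return fast  # no conflict, follow the reactive agent
    return critic_select(current, target, [fast, slow], predict)

# Toy 1-D world, purely illustrative: a frame is a single position value.
def toy_predict(frame: Frame, action: Action) -> Frame:
    step = {"left": -1.0, "right": 1.0}[action]
    return [frame[0] + step]

def toy_reactive(frame: Frame, target: Frame) -> Action:
    return "left"  # deliberately wrong, to force a conflict

def toy_strategic(frame: Frame, target: Frame) -> Optional[Action]:
    return "right"  # the intervention proposes the other branch

def toy_silent_strategic(frame: Frame, target: Frame) -> Optional[Action]:
    return None  # strategic agent does not intervene

chosen = navigation_step([0.0], [3.0], toy_reactive, toy_strategic, toy_predict)
fallback = navigation_step([0.0], [3.0], toy_reactive, toy_silent_strategic, toy_predict)
```

In this toy setup the critic sides with the strategic agent because moving right brings the predicted frame closer to the target. In the system described above, the predictor would be a learned world model producing video frames and the target would be a view rendered from the preoperative CT.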
Evaluation Highlights
  • Achieved 100% success rate (7/7 targets) in live porcine models, matching the terminal reach of senior bronchoscopists.
  • Maintained >80% navigation success rate up to the 8th bronchial generation in ex vivo porcine lungs, significantly outperforming standard visual navigation baselines.
  • Reduced control actions by ~20% compared to expert manual teleoperation in phantom trials (275.8 vs 346.8), suggesting smoother, more efficient trajectories.
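The ~20% figure in the last highlight can be checked directly from the two quoted counts. The short calculation below assumes, from the order in which the numbers are given, that 275.8 is the autonomous system's count and 346.8 is the expert's; the variable names are illustrative.

```python
autonomous_actions = 275.8  # assumed: autonomous system (phantom trials)
expert_actions = 346.8      # assumed: expert manual teleoperation

# Relative reduction with respect to the expert baseline.
relative_reduction = (expert_actions - autonomous_actions) / expert_actions
print(f"{relative_reduction:.1%}")  # prints 20.5%
```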
Breakthrough Assessment
8/10
Demonstrates convincing sim-to-real transfer in a highly complex medical domain (live animal bronchoscopy) without external sensors. The hierarchical agent design effectively bridges the gap between low-level control and high-level reasoning.