← Back to Paper List

DriveMLM: aligning multi-modal large language models with behavioral planning states for autonomous driving

Wenhai Wang, Jiangwei Xie, Chuanyan Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai
Tsinghua University, SenseTime Research, Shanghai AI Laboratory, Nanjing University, Wuhan University
Visual Intelligence (2023)
MM Agent Benchmark

📝 Paper Summary

Autonomous Driving Multi-Modal Large Language Models (MLLMs)
DriveMLM aligns the linguistic decisions of a multi-modal LLM with standardized behavioral planning states to enable closed-loop autonomous driving in realistic simulators.
Core Problem
Existing LLM-based driving agents produce linguistic outputs that cannot directly control vehicles, preventing closed-loop operation in realistic environments.
Why it matters:
  • Traditional modular AD systems lack the semantic understanding to handle corner cases and complex user instructions.
  • End-to-end models lack world knowledge and reasoning capabilities found in LLMs.
  • Current LLM driving approaches are limited to open-loop QA or trajectory prediction without bridging the gap to actionable control signals.
Concrete Example: When an ambulance approaches, a standard planner might just follow the lane, whereas a human or LLM knows to yield. However, an LLM's text output 'yield to ambulance' is not a signal the motion controller understands; it needs a specific state like 'RIGHT_CHANGE' or 'DECELERATE'.
Key Novelty
Aligning LLM outputs with Behavioral Planning States
  • Bridges the gap between language and control by mapping LLM outputs to the specific decision states (e.g., speed and path modes) used by standard modular planning systems like Apollo.
  • Uses a multi-modal tokenizer to unify diverse inputs (LiDAR, images, traffic rules) for the LLM decoder to predict these standardized states alongside explanations.
  • Introduces an efficient data engine to collect decision states and explanation annotations from expert driving in simulators without manual frame-by-frame labeling.
Architecture
Architecture Figure Figure 3
The DriveMLM framework overview, detailing the inputs, multi-modal tokenizer, MLLM decoder, and the connection to the Apollo motion planner.
Evaluation Highlights
  • Achieves 76.1 Driving Score (DS) on CARLA Town05 Long benchmark, outperforming the Apollo baseline by 4.7 points.
  • Achieves 0.955 Miles Per Intervention (MPI) on CARLA Town05 Long, which is 1.25 times better than Apollo.
  • Demonstrates ability to handle complex instructions (e.g., 'hail an ambulance') where standard modular systems fail.
Breakthrough Assessment
8/10
Significantly advances the field by successfully integrating LLMs into a closed-loop control stack (Apollo) rather than just performing open-loop QA, with strong empirical results on CARLA.
×