DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

📝 Paper Summary

Autonomous Driving, Multi-Modal Large Language Models (MLLMs)

DriveMLM aligns the linguistic decisions of a multi-modal LLM with standardized behavioral planning states to enable closed-loop autonomous driving in realistic simulators.

Core Problem

Existing LLM-based driving agents produce linguistic outputs that cannot directly control vehicles, preventing closed-loop operation in realistic environments.

Why it matters:

Traditional modular AD systems lack the semantic understanding to handle corner cases and complex user instructions.
End-to-end models lack world knowledge and reasoning capabilities found in LLMs.
Current LLM driving approaches are limited to open-loop QA or trajectory prediction without bridging the gap to actionable control signals.

Concrete Example: When an ambulance approaches, a standard planner might just follow the lane, whereas a human or LLM knows to yield. However, an LLM's text output 'yield to ambulance' is not a signal the motion controller understands; it needs a specific state like 'RIGHT_CHANGE' or 'DECELERATE'.
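
The gap described in this example can be illustrated with a minimal sketch that accepts decoder text only when it contains exactly one speed state and one path state. Only `RIGHT_CHANGE` and `DECELERATE` are named in this summary; the other enum members, the class names, and the `parse_decision` helper are assumptions made for illustration and are not taken from the DriveMLM code.

```python
from enum import Enum
from typing import Optional, Tuple


class SpeedState(Enum):
    # Only DECELERATE is named in the summary; the others are illustrative.
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"


class PathState(Enum):
    # Only RIGHT_CHANGE is named in the summary; the others are illustrative.
    FOLLOW = "FOLLOW"
    LEFT_CHANGE = "LEFT_CHANGE"
    RIGHT_CHANGE = "RIGHT_CHANGE"


def parse_decision(text: str) -> Optional[Tuple[SpeedState, PathState]]:
    """Extract exactly one speed state and one path state from decoder text.

    Returns None when either state is missing or ambiguous, so free-form
    text such as 'yield to ambulance' is never handed to the planner.
    """
    # Split on whitespace, strip surrounding punctuation, compare in upper case.
    tokens = {tok.strip(".,;:'\"()").upper() for tok in text.split()}
    speed = [s for s in SpeedState if s.value in tokens]
    path = [p for p in PathState if p.value in tokens]
    if len(speed) != 1 or len(path) != 1:
        return None
    return speed[0], path[0]
```

Because the matching is case-insensitive, an explanation that happens to contain a word such as "stop" or "keep" would make the result ambiguous and return None; a real system would more plausibly emit the states as dedicated tokens, separate from the explanation text.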

Key Novelty

Aligning LLM outputs with Behavioral Planning States

Bridges the gap between language and control by mapping LLM outputs to the specific decision states (e.g., speed and path modes) used by standard modular planning systems like Apollo.
Uses a multi-modal tokenizer to unify diverse inputs (LiDAR, images, traffic rules) into one token sequence, from which the LLM decoder predicts these standardized states alongside explanations.
Introduces an efficient data engine to collect decision states and explanation annotations from expert driving in simulators without manual frame-by-frame labeling.

Architecture

The DriveMLM framework overview, detailing the inputs, multi-modal tokenizer, MLLM decoder, and the connection to the Apollo motion planner.

Evaluation Highlights

Achieves a Driving Score (DS) of 76.1 on the CARLA Town05 Long benchmark, outperforming the Apollo baseline (71.4) by 4.7 points.
Achieves 0.955 Miles Per Intervention (MPI) on CARLA Town05 Long, 1.25 times Apollo's 0.764.
Demonstrates the ability to handle special situations and instructions (e.g., yielding to an ambulance) where standard modular systems fail.

Breakthrough Assessment

8/10

Significantly advances the field by successfully integrating LLMs into a closed-loop control stack (Apollo) rather than just performing open-loop QA, with strong empirical results on CARLA.

⚙️ Technical Details

Problem Definition

Setting: Closed-loop autonomous driving in a simulator

Inputs: Multi-view images I, LiDAR point clouds L, System messages M (rules/task definitions), User instructions U

Outputs: Decision state tokens S (Speed and Path decisions) and textual Explanation E

Pipeline Flow

Input Processing: Multi-Modal Tokenizer (Images + LiDAR + Text) → Unified Tokens
Decision Making: MLLM Decoder → Decision States + Explanation
Execution: Motion Planning (Apollo) → Vehicle Control
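
The three stages above can be sketched as a single closed-loop step that composes a tokenizer, a decoder, and a planner. This is a structural sketch only: `Observation`, `Decision`, `drive_step`, and the `toy_*` functions are hypothetical names, and the toy functions are stand-ins rather than real models or the Apollo planner.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Observation:
    images: Sequence[str]   # stand-ins for the multi-view images I
    lidar: Sequence[float]  # stand-in for the LiDAR point cloud L
    system_message: str     # M: rules and task definition
    user_instruction: str   # U


@dataclass
class Decision:
    speed_state: str
    path_state: str
    explanation: str


def drive_step(
    obs: Observation,
    tokenize: Callable[[Observation], List[int]],
    decode: Callable[[List[int]], Decision],
    plan: Callable[[Decision], Dict[str, str]],
) -> Dict[str, str]:
    """One closed-loop step: tokenizer -> MLLM decoder -> motion planner."""
    tokens = tokenize(obs)
    decision = decode(tokens)
    return plan(decision)


# Toy stand-ins so the sketch runs; none of them is a real model or planner.
def toy_tokenize(obs: Observation) -> List[int]:
    return [len(obs.images), len(obs.lidar),
            len(obs.system_message), len(obs.user_instruction)]


def toy_decode(tokens: List[int]) -> Decision:
    return Decision("DECELERATE", "RIGHT_CHANGE", "Yielding to the ambulance.")


def toy_plan(decision: Decision) -> Dict[str, str]:
    # A real planner would turn the states into a trajectory and controls.
    return {"speed_state": decision.speed_state,
            "path_state": decision.path_state}


obs = Observation(
    images=["front.png", "rear.png"],
    lidar=[0.0, 1.0, 2.0],
    system_message="Obey traffic rules.",
    user_instruction="Drive to the hospital.",
)
command = drive_step(obs, toy_tokenize, toy_decode, toy_plan)
```

Passing the stages in as callables mirrors the modular design: the decision stage can be swapped without touching the planner interface.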

System Modules

Multi-Modal Tokenizer

Converts multi-view images, LiDAR, and text into a unified token sequence.

Model or implementation: CLIP-ViT-L/14 (Images) + SST (LiDAR)

MLLM Decoder

Predicts driving decision states and generates explanations.

Model or implementation: LLaMA-based MLLM (inferred from context; the exact LLaMA variant is not stated in the main text)

Motion Planner

Converts behavioral states into trajectory and control signals.

Model or implementation: Apollo Planning Module

Novel Architectural Elements

Alignment of MLLM output tokens directly to Apollo's behavioral planning state space (Speed/Path modes)
LiDAR-to-Image-to-Text alignment pipeline where LiDAR features are distilled from a frozen CLIP image encoder to align with text space
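
To make the distillation idea in the second item concrete, here is a minimal sketch of one possible objective: a cosine-distance loss that pulls a LiDAR (student) embedding toward the frozen image-encoder (teacher) embedding of the same scene. The summary does not state which loss the paper uses, so the choice of cosine distance and the function name `cosine_distill_loss` are assumptions.

```python
import math
from typing import Sequence


def cosine_distill_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Return 1 - cosine similarity between student and teacher embeddings.

    The teacher embedding is treated as a constant (frozen encoder); only
    the student would receive gradients in a real training loop.
    """
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    if norm_s == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_s * norm_t)
```

The loss is 0 when the two embeddings point in the same direction and 1 when they are orthogonal, independent of their magnitudes.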

Modeling

Base Model: LLaMA (inferred from context; the exact version is not specified in the main text)

Training Method: Supervised Fine-Tuning (SFT) with Next Token Prediction

Objective Functions:

Purpose: Train the model to predict decisions and explanations.

Formally: Cross-entropy loss on next token prediction.
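
As a worked illustration of this objective, the sketch below computes the mean next-token cross-entropy from per-position probability distributions. The function name and the plain-list representation are assumptions for illustration; an actual training setup would compute the same quantity from logits inside a deep learning framework.

```python
import math
from typing import Sequence


def next_token_cross_entropy(
    probs: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Mean negative log-likelihood of the target tokens.

    probs[t][v] is the model's probability for vocabulary item v at
    position t, conditioned on all earlier tokens; targets[t] is the
    index of the ground-truth token at position t.
    """
    if len(probs) != len(targets) or not targets:
        raise ValueError("probs and targets must be non-empty and equally long")
    total = sum(math.log(p[y]) for p, y in zip(probs, targets))
    return -total / len(targets)
```

With this setup, decision-state tokens and explanation tokens are all ordinary targets in the same sequence, so a single loss covers both outputs.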

Training Data:

280 hours of driving data collected in CARLA
Converted to decision states via rule-based annotation from expert trajectories
Explanations generated by GPT-3.5 based on scenario data
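
The rule-based annotation step can be sketched as a function that labels each time step of an expert speed trace with a speed decision state. The thresholds, the state names other than `DECELERATE`, and the function `label_speed_states` are hypothetical; the summary does not give the paper's actual rules.

```python
from typing import List, Sequence


def label_speed_states(
    speeds_mps: Sequence[float],
    accel_eps: float = 0.5,  # assumed threshold on speed change per step (m/s)
    stop_eps: float = 0.1,   # assumed speed below which the car counts as stopped
) -> List[str]:
    """Label every step after the first from consecutive expert speeds."""
    labels: List[str] = []
    for prev, cur in zip(speeds_mps, speeds_mps[1:]):
        if cur <= stop_eps:
            labels.append("STOP")
        elif cur - prev > accel_eps:
            labels.append("ACCELERATE")
        elif prev - cur > accel_eps:
            labels.append("DECELERATE")
        else:
            labels.append("KEEP")
    return labels
```

For example, the trace `[5.0, 5.2, 7.0, 4.0, 0.0]` yields `["KEEP", "ACCELERATE", "DECELERATE", "STOP"]`. Rules of this kind are what make frame-by-frame manual labeling unnecessary.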

Key Hyperparameters:

image_encoder: ViT-L/14 (CLIP)
lidar_encoder: SST (Single-Stride Sparse Transformer)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Apollo: DriveMLM replaces the rule-based decision module with an MLLM, keeping the downstream motion planner.
vs. End-to-End (Transfuser, ThinkTwice): DriveMLM uses a modular approach where the LLM acts as the high-level planner, allowing for interpretability and instruction following.
vs. DriveGPT4: DriveMLM supports multi-view images and LiDAR (DriveGPT4 is mono-camera) and performs closed-loop control in a realistic simulator (DriveGPT4 is evaluated largely open-loop on datasets).

Limitations

Heavy reliance on the underlying motion planner (Apollo); if Apollo fails to execute a valid decision, the system fails.
Latency concerns for real-time inference of MLLMs are not addressed in the performance metrics.
Dependency on GPT-3.5 for generating training explanations introduces potential hallucinations or biases in the explanation data.

Reproducibility

Code: https://github.com/OpenDriveLab/DriveMLM

The paper details the data engine for generating decision and explanation annotations. The specific model size (e.g., LLaMA-7B vs. 13B) and the training compute are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Closed-loop driving simulation in CARLA

Benchmarks:

CARLA Town05 Long (Closed-loop autonomous driving)

Metrics:

Driving Score (DS)
Route Completion (RC)
Infraction Score (IS)
Miles Per Intervention (MPI)
Prediction Accuracy (for decision states)
BLEU-4/CIDEr/METEOR (for explanations)

Statistical methodology: Not explicitly reported in the paper

Key Results

DriveMLM outperforms both the traditional Apollo stack and end-to-end baselines on the challenging CARLA Town05 Long benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
CARLA Town05 Long	Driving Score (DS)	71.4	76.1	+4.7
CARLA Town05 Long	Driving Score (DS)	39.2	76.1	+36.9
CARLA Town05 Long	Miles Per Intervention (MPI)	0.764	0.955	+0.191
CARLA Town05 Long	Route Completion (RC)	95.2	96.4	+1.2

Open-loop evaluation confirms the model's ability to accurately predict decision states and generate high-quality explanations.

Benchmark	Metric	Baseline	This Paper	Δ
Internal Validation Set	Speed Decision Accuracy	81.3	84.5	+3.2
Internal Validation Set	CIDEr (Explanation Quality)	119.5	127.3	+7.8
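
The Δ column and the "1.25 times" MPI claim can be checked with a few lines of arithmetic. The values below are copied from the results above; the variable and function names are just for illustration.

```python
# (benchmark, metric, baseline, this_paper) copied from the reported results.
rows = [
    ("CARLA Town05 Long", "DS", 71.4, 76.1),
    ("CARLA Town05 Long", "DS", 39.2, 76.1),
    ("CARLA Town05 Long", "MPI", 0.764, 0.955),
    ("CARLA Town05 Long", "RC", 95.2, 96.4),
    ("Internal Validation Set", "Speed Decision Accuracy", 81.3, 84.5),
    ("Internal Validation Set", "CIDEr", 119.5, 127.3),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to suppress floating-point noise."""
    return round(ours - baseline, 3)


deltas = [delta(b, o) for _, _, b, o in rows]

# Relative MPI improvement over the Apollo baseline.
mpi_ratio = round(0.955 / 0.764, 2)
```

The recomputed deltas match the table (4.7, 36.9, 0.191, 1.2, 3.2, 7.8), and the MPI ratio comes out to 1.25.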

Experiment Figures

Qualitative results showing the model handling special instructions in closed-loop scenarios.

Main Takeaways

Replacing the rule-based decision module of Apollo with DriveMLM improves both Driving Score (+4.7) and Miles Per Intervention (0.764 to 0.955).
The model successfully generalizes to handle special instructions (e.g., yielding to emergency vehicles) which rule-based systems struggle with.
Multi-modal inputs (Vision + LiDAR) consistently outperform vision-only baselines in both decision accuracy and explanation quality.
The 'Plug-and-Play' nature allows integration with existing stacks (Apollo) without retraining the motion control modules.

📚 Prerequisite Knowledge

Prerequisites

Autonomous Driving (AD) modular pipelines (Perception, Planning, Control)
Multi-Modal Large Language Models (MLLMs)
Transformer architectures (CLIP, ViT)

Key Terms

Behavioral Planning: A high-level decision-making layer in AD stacks that determines maneuvers (e.g., change lane, stop) which are then executed by a motion planner.

Apollo: An open-source, industrial-grade autonomous driving software platform.

CARLA: An open-source simulator for autonomous driving research.

Closed-loop driving: A testing setting where the model's decisions actively control the vehicle and influence future states, as opposed to passive dataset prediction.

MPI: Miles Per Intervention, a metric measuring the average distance a vehicle drives autonomously before requiring human takeover.

DS: Driving Score, a composite CARLA metric that combines route completion percentage with infraction penalties.

CLIP: Contrastive Language-Image Pre-training, a model used to align visual and text representations.

SST: Single-Stride Sparse Transformer, an architecture for processing LiDAR point clouds efficiently.

Route Completion (RC): The percentage of the route distance completed by the agent.
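
The two headline metrics can be sketched as follows. The Driving Score sketch assumes the usual CARLA Leaderboard convention, where each route's completion percentage is multiplied by an infraction score in [0, 1] and the products are averaged over routes; the summary itself only says the metric combines the two. The function names and the example numbers in the usage note are hypothetical.

```python
from typing import Sequence, Tuple


def driving_score(routes: Sequence[Tuple[float, float]]) -> float:
    """Average over routes of route_completion_percent * infraction_score.

    Each entry is (route completion in percent, infraction score in [0, 1]).
    """
    if not routes:
        raise ValueError("at least one route is required")
    return sum(rc * inf for rc, inf in routes) / len(routes)


def miles_per_intervention(total_miles: float, interventions: int) -> float:
    """Average autonomous distance between human takeovers."""
    if interventions == 0:
        return float("inf")  # no takeover was needed on the evaluated routes
    return total_miles / interventions
```

For instance, two routes with (100% completion, 0.5 infraction score) and (50% completion, 1.0 infraction score) both contribute 50, giving a Driving Score of 50.0; 10 miles driven with 4 interventions gives an MPI of 2.5.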