AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

📝 Paper Summary

End-to-end Autonomous Driving Vision-Language-Action (VLA) Models

AdaThinkDrive uses reinforcement learning to teach autonomous driving models when to use detailed reasoning (Chain-of-Thought) and when to act instantly, balancing trajectory accuracy with computational efficiency.

Core Problem

Existing Vision-Language-Action models for driving apply Chain-of-Thought reasoning indiscriminately, causing unnecessary latency and potential performance degradation in simple scenarios due to 'over-reasoning'.

Why it matters:

Always-on reasoning increases computational overhead, which is critical for real-time autonomous driving systems
In simple scenarios (e.g., highway cruising), explicit reasoning steps can introduce hallucinations or uncertainty, lowering decision quality compared to direct prediction
Current methods lack the flexibility to adapt their inference strategy based on scene complexity

Concrete Example: In a simple Level 1 scenario (e.g., straight road, no obstacles), a standard CoT model might over-analyze distant, irrelevant objects, increasing latency and potentially predicting erratic behavior. AdaThinkDrive recognizes the simplicity and outputs the trajectory directly.

Key Novelty

Fast Answering / Slow Thinking Mechanism

Implements a dual-mode strategy where the model learns to either output a trajectory directly (Fast) or generate a reasoning chain first (Slow) based on input complexity
Uses Group Relative Policy Optimization (GRPO) with an Adaptive Think Reward to dynamically optimize the decision of *when* to reason, rather than relying solely on static rules

Architecture

The AdaThinkDrive framework pipeline, illustrating data preparation, two-stage fine-tuning, and the adaptive reasoning mechanism via RL.

Evaluation Highlights

Achieves 90.3 PDMS on the NAVSIM benchmark, outperforming the best vision-only baseline by 1.7 points
Surpasses the 'Never-Think' and 'Always-Think' baselines by 2.0 and 1.4 PDMS points respectively, which supports the value of adaptive switching
Reduces inference time by 14% compared to the 'Always-Think' baseline by bypassing reasoning in 84% of simple scenarios

Breakthrough Assessment

8/10

Significantly improves efficiency and accuracy in VLA driving models by solving the 'when to reason' problem. The adaptive RL approach is a strong methodological contribution applicable beyond driving.

⚙️ Technical Details

Problem Definition

Setting: End-to-end trajectory prediction given visual and state inputs, with a latent discrete choice between reasoning modes

Inputs: Front-view image, navigation commands, ego state (velocity, acceleration), historical trajectory

Outputs: Planned trajectory (waypoints), optionally preceded by textual reasoning

Pipeline Flow

VLA Backbone (Encodes inputs)
Adaptive Policy (Selects the reasoning mode implicitly during generation)
Generation (Produces Reasoning+Trajectory OR Trajectory only)
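
Because the mode choice is implicit in the generated text, downstream code has to recover it from the output itself. The following is a minimal sketch of such a parser. It assumes hypothetical `<think>` and `<answer>` tags and waypoints written as `(x, y)` pairs; the tag names and trajectory format actually used by the paper are not given in this summary and may differ.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical tag names and waypoint format, chosen for illustration only.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
POINT_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


class ParsedOutput(NamedTuple):
    mode: str                             # "think" or "non_think"
    reasoning: Optional[str]              # reasoning text, if any
    waypoints: List[Tuple[float, float]]  # planned trajectory


def parse_output(text: str) -> Optional[ParsedOutput]:
    """Parse one generated sequence; return None if it is malformed."""
    answer = ANSWER_RE.search(text)
    if answer is None:
        return None  # no trajectory block at all
    waypoints = [(float(x), float(y)) for x, y in POINT_RE.findall(answer.group(1))]
    if not waypoints:
        return None  # trajectory block present but empty
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else ""
    # An absent or empty reasoning block counts as the direct (fast) mode.
    mode = "think" if reasoning else "non_think"
    return ParsedOutput(mode, reasoning or None, waypoints)
```

A `None` result is the natural place to apply a format penalty during RL training, while the `mode` field lets evaluation code count how often each mode is used.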

System Modules

VLA Backbone

Encodes multi-modal inputs (image, text, history) into a unified representation

Model or implementation: InternVL3 (implied by comparative study)

Adaptive Policy

Generates the output sequence, implicitly selecting 'Thinking' or 'Non-Thinking' mode based on learned probability

Model or implementation: Same VLA Backbone (Unified Interface)

Novel Architectural Elements

Unified output interface supporting both 'Thinking' (CoT) and 'Non-Thinking' (Direct) formats within a single model
Adaptive Think Reward mechanism integrated into RL training to dynamically regulate reasoning frequency
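
The summary describes the Adaptive Think Reward only as a dynamic reward that compares rollout performance, so its exact form is not known here. The sketch below shows one plausible shape of such a rule, not the paper's formula: within the group of rollouts sampled for a single scene, the mode with the higher mean trajectory reward earns a small bonus. The function name, the `bonus` value, and the tie-breaking rule are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Tuple


def adaptive_think_bonus(
    rollouts: List[Tuple[str, float]], bonus: float = 0.1
) -> List[float]:
    """Illustrative mode bonus for one scene (NOT the paper's exact rule).

    rollouts: one (mode, trajectory_reward) pair per sampled output,
              where mode is "think" or "non_think".
    """
    think = [r for m, r in rollouts if m == "think"]
    fast = [r for m, r in rollouts if m == "non_think"]
    if not think or not fast:
        # Only one mode was sampled, so there is nothing to compare.
        return [0.0] * len(rollouts)
    # Assumed tie-break: prefer the cheaper direct mode when means are equal.
    better = "think" if mean(think) > mean(fast) else "non_think"
    return [bonus if m == better else 0.0 for m, _ in rollouts]
```

Under this assumed rule, scenes where reasoning helps push the policy toward thinking, and scenes where it does not push the policy toward direct prediction, which is the behavior the summary attributes to the reward.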

Modeling

Base Model: InternVL3 (implied by preliminary study and VLA context)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected utility of the selected reasoning mode and trajectory.

Formally: J_GRPO(θ) = E[min(c_i A_i, clip(c_i, 1-ε, 1+ε) A_i) - β D_KL(π_θ || π_ref)], where c_i is the probability ratio between the current and old policy for rollout i, A_i is that rollout's group-relative advantage, ε is the clipping range, and β weights the KL penalty against the reference policy π_ref.

Purpose: Reward trajectory quality.

Formally: R_traj = PDMS score (0 to 1)

Purpose: Enforce output structure.

Formally: R_fmt (Discrete penalties for malformed tags)

Purpose: Reward endpoint accuracy.

Formally: R_endpoint (Piecewise L1 distance reward)

Purpose: Encourage adaptive mode switching.

Formally: R_adaptive (Dynamic reward comparing rollout performance)
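
The two core pieces of the GRPO objective can be sketched in a few lines: the group-relative advantage, which standardizes each rollout's reward against the other rollouts for the same input, and the clipped surrogate term. This is a generic sketch of the standard technique rather than the authors' implementation. It takes each rollout's total reward as given, omits the KL penalty, uses the population standard deviation, and sets `clip_eps=0.2` as a common default; none of these choices are confirmed by the summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-rollout term min(c * A, clip(c, 1 - eps, 1 + eps) * A), without the KL penalty."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Example: four rollouts for one scene, two good and two poor.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
objective_terms = [clipped_surrogate(1.0, a) for a in advantages]
```

Because advantages are computed relative to the group, no separate value network is needed; a rollout is rewarded only for doing better than its siblings on the same scene.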

Training Data:

Pre-training: DriveLM, LingoQA, ImpromptuVLA, NuScenes-QA, NuInstruct, OmniDrive
SFT: Custom NAVSIM dataset with dual annotations (Think/Non-Think) generated by an auxiliary model and Qwen2.5-VL-72B
Data Complexity Levels: Level 1 (Simple), Level 2 (Intermediate), Level 3 (Complex)

Compute: Not reported in the paper

Comparison to Prior Work

vs. EMMA/ReasonPlan: These use 'always-on' CoT. AdaThinkDrive selectively disables CoT for simple scenes to improve efficiency and reduce uncertainty.
vs. AdaptThink: AdaptThink uses RL for adaptive CoT in general LLMs; AdaThinkDrive adapts this to the specific constraints (PDMS, trajectory accuracy) of autonomous driving [not cited in paper].

Limitations

Reliance on a specific large teacher model (Qwen2.5-VL-72B) for data annotation
Performance depends heavily on the quality of the 'Think' style annotations generated during SFT
Adaptive mechanism requires careful tuning of reward components to prevent mode collapse (always think or never think)

Reproducibility

Code not provided. SFT dataset is custom-generated using Qwen2.5-VL-72B for annotations. Training relies on specific dual-mode data construction (Think/Non-Think pairs). GRPO hyperparameters (ε, β) appear as symbols in the objective, but their exact values are not given in the text summarized here.

📊 Experiments & Results

Evaluation Setup

Open-loop trajectory prediction on validation set

Benchmarks:

NAVSIM (End-to-end autonomous driving simulation)

Metrics:

PDMS (Predictive Driver Model Score)
Inference Time

Statistical methodology: Not explicitly reported in the paper

Key Results

AdaThinkDrive outperforms baselines in overall trajectory quality (PDMS). Ablation studies demonstrate the superiority of the adaptive strategy over fixed strategies.

Benchmark	Metric	Compared against	Baseline	This Paper	Δ
NAVSIM	PDMS	Best vision-only baseline	88.6	90.3	+1.7
NAVSIM	PDMS	Never-Think	88.3	90.3	+2.0
NAVSIM	PDMS	Always-Think	88.9	90.3	+1.4
NAVSIM	Inference Time	Always-Think	100%	86%	-14%
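
The reported deltas follow directly from the scores in the table. The short check below recomputes them; the baseline labels come from the Evaluation Highlights, and the variable names are illustrative.

```python
# Scores as reported in this summary (PDMS on a 0-100 scale).
adathinkdrive_pdms = 90.3
baselines = {
    "best vision-only baseline": 88.6,
    "Never-Think": 88.3,
    "Always-Think": 88.9,
}

# PDMS gain over each baseline, rounded to one decimal place.
deltas = {name: round(adathinkdrive_pdms - score, 1) for name, score in baselines.items()}

# Inference time relative to Always-Think (100%) is 86%, i.e. a 14% saving.
relative_time = 0.86
time_saving_pct = round((1.0 - relative_time) * 100)

for name, delta in deltas.items():
    print(f"vs {name}: +{delta} PDMS")
print(f"inference time saving vs Always-Think: {time_saving_pct}%")
```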

Experiment Figures

Performance comparison of Think vs Non-Think models across three levels of scene complexity.

Main Takeaways

Optimal reasoning strategy is dependent on scene complexity; 'Always-Think' is not optimal for simple scenes.
AdaThinkDrive successfully learns to distinguish scenarios: it uses CoT in 96% of challenging scenarios but defaults to direct prediction in 84% of simple scenarios.
The adaptive reward mechanism prevents the model from collapsing into a single mode (always/never think).

📚 Prerequisite Knowledge

Prerequisites

End-to-end Autonomous Driving
Vision-Language Models (VLMs)
Reinforcement Learning (PPO/GRPO)
Chain-of-Thought (CoT) Prompting

Key Terms

VLA: Vision-Language-Action—models that process visual and text inputs to generate physical actions (trajectories)

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy by comparing a group of outputs sampled from the same input

PDMS: Predictive Driver Model Score—the NAVSIM metric that aggregates safety, comfort, and progress sub-scores (such as collision avoidance and drivable-area compliance) into a single trajectory-quality score

SFT: Supervised Fine-Tuning—training the model on labeled examples (demonstrations) before reinforcement learning

Fast Answering: Directly predicting the driving trajectory without intermediate textual reasoning

Slow Thinking: Generating a detailed textual analysis of the scene before predicting the trajectory