
Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving

Jiazhuo Li, Linjiang Cao, Qi Liu, Xi Xiong
Key Laboratory of Road and Traffic Engineering
arXiv (2026)
MM RL Agent

📝 Paper Summary

Autonomous Driving · World Models · Model-Based Reinforcement Learning
A world model framework for autonomous driving that integrates explicit vehicle kinematics and spatial auxiliary tasks into the latent state to improve sample efficiency and imagination fidelity.
Core Problem
Standard vision-based world models struggle to infer precise vehicle dynamics solely from pixels and often learn latent representations that lack geometric consistency, leading to unreliable long-horizon imagination.
Why it matters:
  • Pure model-free RL requires millions of interactions, making it unsafe and costly for real-world driving.
  • Existing world models often hallucinate physically impossible transitions (e.g., cars shifting laterally without steering) because they lack grounding in physical laws.
  • Crucial driving semantics such as lane boundaries occupy only a small fraction of the pixels, so pixel-reconstruction losses give them little weight and fail to capture the spatial structure needed for safe driving.
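The last point can be checked with a few lines of arithmetic on a per-pixel mean-squared reconstruction loss. The sketch below is illustrative only: the 64×64 frame, the two 1-pixel lane lines, and the intensity values are hypothetical assumptions, not data from the paper. It compares the loss of a prediction that erases the lanes entirely against one that keeps them but adds a uniform brightness error of 0.2.

```python
# Illustrative only: how weakly a per-pixel MSE loss penalizes erasing thin
# lane markings. Frame size, lane layout, and intensities are hypothetical.

H, W = 64, 64
LANE_COLS = {20, 44}          # two 1-pixel-wide vertical lane lines
ROAD, LANE = 0.0, 1.0         # normalized pixel intensities

def make_frame(lanes=True, offset=0.0):
    """Build an H x W frame; optionally draw lanes and add a brightness offset."""
    return [[(LANE if lanes and c in LANE_COLS else ROAD) + offset
             for c in range(W)] for _ in range(H)]

def mse(a, b):
    """Mean squared error over all pixels of two equally sized frames."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

target = make_frame()
no_lanes = make_frame(lanes=False)   # lane markings erased entirely
dimmed = make_frame(offset=0.2)      # lanes kept, uniform brightness error

print(f"lane pixel fraction:  {len(LANE_COLS) / W:.2%}")
print(f"MSE, lanes erased:    {mse(target, no_lanes):.4f}")
print(f"MSE, brightness +0.2: {mse(target, dimmed):.4f}")
```

Under these assumptions, deleting every lane pixel costs less than a uniform brightness error, so a model minimizing reconstruction loss has little incentive to get lane geometry right.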
Concrete Example: In a standard vision-only world model, when the ego vehicle prepares to overtake, the predicted future frames might show the preceding vehicle blurring or shifting abruptly without physical cause. The proposed model conditions on kinematic data so that the imagined trajectory respects motion constraints.
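The motion constraint in this example can be made concrete with a kinematic bicycle model. This is a standard vehicle model chosen here as an assumption; the summary does not say which kinematic formulation the paper uses, and the wheelbase, time step, and control inputs below are illustrative. With zero steering the heading never changes, so a vehicle driving straight cannot drift sideways, whereas a left-then-right steering sequence produces a lateral offset of a few metres.

```python
import math

# Kinematic bicycle model (rear-axle reference). Wheelbase, time step, and
# control inputs are illustrative assumptions, not values from the paper.
WHEELBASE = 2.7   # metres
DT = 0.1          # seconds

def step(state, accel, steer):
    """Advance (x, y, heading, speed) by one time step.

    x is longitudinal and y lateral position in the initial frame;
    steer is the front-wheel steering angle in radians.
    """
    x, y, yaw, v = state
    x += v * math.cos(yaw) * DT
    y += v * math.sin(yaw) * DT
    yaw += v / WHEELBASE * math.tan(steer) * DT   # yaw rate requires steering
    v = max(0.0, v + accel * DT)
    return (x, y, yaw, v)

def rollout(state, controls):
    """Apply a sequence of (accel, steer) controls; return the final state."""
    for accel, steer in controls:
        state = step(state, accel, steer)
    return state

start = (0.0, 0.0, 0.0, 15.0)   # 15 m/s, heading straight along the lane
straight = rollout(start, [(0.0, 0.0)] * 30)
lane_change = rollout(start, [(0.0, 0.02)] * 15 + [(0.0, -0.02)] * 15)

print(f"no steering:      lateral offset {straight[1]:.2f} m")
print(f"steer left/right: lateral offset {lane_change[1]:.2f} m")
```

A lateral jump in an imagined rollout with zero steering would violate this model outright, which is the kind of inconsistency explicit kinematic inputs are meant to rule out.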
Key Novelty
Kinematics-Grounded Latent Dynamics
  • Augments the latent encoder inputs with explicit vehicle sensor data (speed, steering) rather than forcing the model to infer physics purely from images.
  • Regularizes the latent space using auxiliary prediction heads that must output lane geometry and neighbor vehicle states, forcing the hidden state to encode spatial structure.
  • Uses these structured latent dynamics to train a policy entirely in imagination, significantly reducing the need for real-world data collection.
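The first two bullets can be sketched as a training objective: the encoder receives image features concatenated with kinematic measurements, and two auxiliary heads regress lane geometry and neighbor-vehicle states from the resulting latent. Everything below is a hypothetical toy, not the paper's architecture: the linear maps, feature sizes, target encodings, and loss weights `LAMBDA_LANE` and `LAMBDA_NBR` are assumptions, and `world_model_loss` is a placeholder for whatever base loss the DreamerV3-style model uses.

```python
import random

# Hypothetical toy of kinematics-augmented encoding plus auxiliary heads.
# Dimensions, weights, and loss coefficients are illustrative assumptions.
random.seed(0)
IMG_DIM, KIN_DIM, LATENT_DIM = 8, 2, 4   # image features, (speed, steer), latent
LANE_DIM, NBR_DIM = 3, 4                 # e.g. lane-curve coeffs, neighbor (dx, dy, dvx, dvy)
LAMBDA_LANE, LAMBDA_NBR = 1.0, 1.0       # auxiliary loss weights (assumed)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

W_enc = rand_matrix(LATENT_DIM, IMG_DIM + KIN_DIM)   # encoder over fused input
W_lane = rand_matrix(LANE_DIM, LATENT_DIM)           # lane-geometry head
W_nbr = rand_matrix(NBR_DIM, LATENT_DIM)             # neighbor-state head

def encode(img_feat, kinematics):
    """Concatenate image features with explicit kinematics, then encode."""
    return matvec(W_enc, img_feat + kinematics)

def total_loss(img_feat, kinematics, lane_target, nbr_target, world_model_loss):
    """Base world-model loss plus weighted auxiliary prediction losses."""
    z = encode(img_feat, kinematics)
    lane_loss = mse(matvec(W_lane, z), lane_target)
    nbr_loss = mse(matvec(W_nbr, z), nbr_target)
    return world_model_loss + LAMBDA_LANE * lane_loss + LAMBDA_NBR * nbr_loss

img_feat = [random.random() for _ in range(IMG_DIM)]
kinematics = [12.0, 0.03]   # speed (m/s), steering (rad); real inputs would be normalized
loss = total_loss(img_feat, kinematics,
                  lane_target=[0.0, 0.01, 1.75],
                  nbr_target=[20.0, 0.0, -1.0, 0.0],
                  world_model_loss=0.5)   # placeholder for the base loss
print(f"total loss: {loss:.4f}")
```

Because both auxiliary heads read only from the latent `z`, minimizing their losses pushes lane and neighbor information into the same representation that the dynamics model and the imagination-trained policy consume; this illustrates the intuition behind the second bullet rather than the paper's exact heads.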
Evaluation Highlights
  • +23.1% improvement in Mean Return compared to an image-only baseline world model.
  • Reaches a stable high return (~200) within 80k steps, whereas PPO stays below a return of 150 even after 300k steps.
  • +16 percentage-point increase in Success Rate from adding lane/neighbor detection heads to the base vision model.
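The highlights mix two kinds of gain: a relative change in Mean Return (+23.1%) and an absolute change in Success Rate (+16 percentage points). The helpers below make the distinction explicit and compute the step ratio implied by the PPO comparison. The 60% to 76% success rates are hypothetical placeholders, since the summary does not report absolute values, and the step ratio assumes the world model passes a return of 150 no later than 80k steps.

```python
def relative_gain(new, base):
    """Relative improvement; 0.231 corresponds to +23.1%."""
    return (new - base) / base

def point_gain(new_rate, base_rate):
    """Absolute change between two rates, in percentage points."""
    return (new_rate - base_rate) * 100

# Hypothetical success rates, NOT from the paper: a +16 pp jump from 60% to
# 76% corresponds to a ~26.7% relative gain.
pp = point_gain(0.76, 0.60)
rel = relative_gain(0.76, 0.60)
print(f"{pp:.0f} pp absolute, {rel:.1%} relative")

# From the highlights: ~200 return within 80k steps vs. PPO below 150 at 300k.
# PPO never reaches 150 in that budget, so the ratio is only a lower bound.
steps_ratio = 300_000 / 80_000
print(f"steps to reach a return of 150: at least {steps_ratio:.2f}x fewer")
```

The same +16 figure would read very differently as a relative gain, which is why percentage points are the right unit for success rates.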
Breakthrough Assessment
7/10
Strong practical improvements in sample efficiency for driving. The integration of specific kinematic constraints into general world models is a logical and effective step, though the architecture is largely an enhancement of DreamerV3 rather than a fundamentally new paradigm.