
Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving

Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, Yong Liu
AAAI Conference on Artificial Intelligence (2024)

📝 Paper Summary

Autonomous Driving · World Models · Occupancy Networks
Drive-OccWorld integrates a vision-centric 4D world model with an occupancy-based planner, using conditional normalization and flexible action inputs to enable robust forecasting and safe trajectory selection.
Core Problem
End-to-end driving models often lack sufficient world knowledge to forecast dynamic environments accurately, while existing world models typically focus on video generation rather than ensuring safety and robustness for planning.
Why it matters:
  • Direct planning from raw sensors without predicting future states leads to poor generalization in complex dynamic scenarios.
  • Current world models focus on pre-training or video synthesis, missing the opportunity to use environmental forecasting directly for safe trajectory optimization.
  • Lack of fine-grained geometric constraints (like 3D occupancy) in planning can result in unsafe decisions.
Concrete Example: In a scenario with moving pedestrians and vehicles, a standard end-to-end planner might fail to anticipate a pedestrian stepping onto the road. Drive-OccWorld forecasts the 3D occupancy flow of the pedestrian based on history and ego-motion, allowing the planner to penalize trajectories that intersect with the predicted future location of the pedestrian.
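The pedestrian example can be made concrete with a toy cost function. This is a minimal sketch, not the paper's planner: it assumes the forecast is a list of small 2D occupancy grids (one per future step) and that each candidate trajectory is a list of grid cells. The names `occupancy_cost` and `select_trajectory` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]      # one BEV occupancy map: grid[row][col] in [0, 1]
Waypoint = Tuple[int, int]    # (row, col) cell the ego vehicle occupies at one step


def occupancy_cost(trajectory: Sequence[Waypoint], forecast: Sequence[Grid]) -> float:
    """Sum the forecasted occupancy probability under each waypoint.

    trajectory[t] and forecast[t] refer to the same future timestep.
    """
    cost = 0.0
    for (row, col), grid in zip(trajectory, forecast):
        cost += grid[row][col]
    return cost


def select_trajectory(candidates: Sequence[Sequence[Waypoint]],
                      forecast: Sequence[Grid]) -> Sequence[Waypoint]:
    """Return the candidate with the lowest occupancy cost (assumed selection rule)."""
    return min(candidates, key=lambda traj: occupancy_cost(traj, forecast))


# Toy 3x3 forecast over two future steps: a pedestrian enters cell (1, 1) at step 1.
empty = [[0.0] * 3 for _ in range(3)]
with_pedestrian = [[0.0, 0.0, 0.0],
                   [0.0, 0.9, 0.0],
                   [0.0, 0.0, 0.0]]
forecast = [empty, with_pedestrian]

straight = [(2, 1), (1, 1)]   # drives into the forecasted pedestrian cell
swerve = [(2, 1), (1, 2)]     # shifts one cell to the side

best = select_trajectory([straight, swerve], forecast)
```

Here `straight` accumulates the 0.9 occupancy probability at step 1, so `swerve` is selected. A real planner would likely combine such a term with comfort, progress, and road-adherence costs.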
Key Novelty
Drive-OccWorld: Action-Controllable 4D Occupancy World Model for Planning
  • Introduces a world model that forecasts future 3D occupancy and flow in Bird's Eye View (BEV) space, conditioned on diverse ego-actions (speed, steering, commands).
  • Applies semantic- and motion-conditional normalization in the memory module: historical features are modulated by semantic predictions and motion data (ego-pose and object flow) so that world knowledge is aggregated more effectively.
  • Integrates the generative model with a planner that evaluates candidate trajectories against the forecasted occupancy maps to ensure collision avoidance and road adherence.
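The idea of modulating normalized features with a condition can be sketched generically. The block below is an illustration of conditional normalization in general, not the paper's exact layer: features are normalized, then scaled and shifted by values computed from a condition vector. The linear maps `gamma_weights` and `beta_weights`, and the choice of a two-element condition, are assumptions made for this example.

```python
import math
from typing import List, Sequence


def normalize(features: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    return [(x - mean) / math.sqrt(var + eps) for x in features]


def modulation(condition: Sequence[float],
               weights: Sequence[Sequence[float]]) -> List[float]:
    """Linear map from the condition to one value per feature channel."""
    return [sum(w * c for w, c in zip(row, condition)) for row in weights]


def conditional_norm(features: Sequence[float],
                     condition: Sequence[float],
                     gamma_weights: Sequence[Sequence[float]],
                     beta_weights: Sequence[Sequence[float]]) -> List[float]:
    """Scale and shift normalized features using condition-dependent parameters."""
    gamma = [1.0 + g for g in modulation(condition, gamma_weights)]  # scale near 1
    beta = modulation(condition, beta_weights)                        # shift near 0
    return [g * x + b for x, g, b in zip(normalize(features), gamma, beta)]


# Hypothetical inputs: three feature channels, a two-element motion condition.
features = [1.0, 2.0, 3.0]
condition = [0.5, -1.0]
gamma_weights = [[0.2, 0.0], [0.0, 0.1], [0.4, 0.0]]
beta_weights = [[0.0, 0.3], [0.1, 0.0], [0.0, 0.0]]

modulated = conditional_norm(features, condition, gamma_weights, beta_weights)
unmodulated = conditional_norm(features, [0.0, 0.0], gamma_weights, beta_weights)
```

With an all-zero condition the layer reduces to plain normalization, which is why the scale is parameterized as `1.0 + g`. In a learned model the weights would be trained, and the condition would come from semantic predictions or motion cues rather than hand-picked numbers.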
Architecture
Figure 2: The overall architecture of Drive-OccWorld, illustrating the flow from historical images to future trajectory planning.
Evaluation Highlights
  • Outperforms previous methods by 9.5% in mIoU (mean Intersection over Union) for occupancy forecasting on the nuScenes dataset.
  • Achieves a 5.1% improvement in VPQ (Video Panoptic Quality) on nuScenes compared to prior state-of-the-art.
  • Surpasses baselines by 6.1% in mIoU and 5.2% in VPQ on the Lyft-Level5 dataset for forecasting movable objects and flow.
  • Improves fine-grained occupancy forecasting by 4.3% on the nuScenes-Occupancy benchmark.
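For readers unfamiliar with the mIoU metric cited above, the following is a minimal sketch of how mean Intersection over Union is commonly computed over voxel labels. It assumes flattened lists of integer class labels and skips classes absent from both prediction and ground truth. The benchmarks' exact protocols (class sets, averaging over future timesteps) may differ, and VPQ is not covered here.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes that appear in pred or target."""
    ious: List[float] = []
    for cls in range(num_classes):
        # Voxels where both prediction and ground truth carry this class.
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        # Voxels where either prediction or ground truth carries this class.
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six voxels, three classes (0 = free, 1 = vehicle, 2 = pedestrian; labels are illustrative).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

miou = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, giving a mean of 0.5.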
Breakthrough Assessment
7/10
Solid integration of world models with occupancy-based planning. The semantic/motion-conditional normalization is a clever architectural addition. Significant performance gains on standard benchmarks validate the approach.