Evaluation Setup
SceneDiffuser is evaluated across five diverse 3D tasks: human pose generation, human motion generation, dexterous grasp generation, 3D navigation path planning, and robot arm motion planning.
Benchmarks:
- PROX / LEMO (Human Pose & Motion Generation)
- MultiDex (Dexterous Grasp Generation)
- ScanNet (Custom graphs) (3D Navigation Path Planning) [New]
- MoveIt (Simulated) (Robot Arm Motion Planning) [New]
Metrics:
- Plausible Rate (Human & Auto)
- Non-collision Score
- Success Rate
- Diversity (APD)
- Statistical methodology: Not explicitly reported in the paper
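To make the diversity metric concrete, here is a minimal sketch of average pairwise distance (APD), assuming the common definition: the mean Euclidean distance over all unordered pairs of generated samples. The function name `average_pairwise_distance` and the flattened-feature representation are illustrative assumptions; this summary does not state which features or normalization the paper uses.

```python
import numpy as np


def average_pairwise_distance(samples: np.ndarray) -> float:
    """Mean Euclidean distance over all unordered pairs of samples.

    samples: array of shape (n, ...); each sample is flattened to a vector.
    Assumption: plain L2 distance, no normalization.
    """
    n = samples.shape[0]
    if n < 2:
        return 0.0  # no pairs, so diversity is defined as zero here
    flat = samples.reshape(n, -1)
    # Pairwise difference tensor of shape (n, n, d).
    diffs = flat[:, None, :] - flat[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep only the strict upper triangle so each pair is counted once.
    upper = np.triu_indices(n, k=1)
    return float(dists[upper].mean())


# Hypothetical usage: 10 generated poses with 63 parameters each.
rng = np.random.default_rng(0)
generated = rng.standard_normal((10, 63))
apd = average_pairwise_distance(generated)
```

A higher APD means the generated samples are more spread out; identical samples give an APD of zero.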
Key Results
Human pose generation results showing superior physical plausibility compared to cVAE baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| PROX | Plausible Rate | 14.64 | 49.35 | +34.71 |
| PROX | Non-collision Score | 99.75 | 99.93 | +0.18 |

Dexterous grasping results demonstrating SceneDiffuser's ability to generate valid grasps where baselines fail.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MultiDex | Success Rate | 0.00 | 71.27 | +71.27 |

Path planning results highlighting generalization to novel scenes in navigation tasks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ScanNet (Custom) | Success Rate | 13.50 | 73.75 | +60.25 |
| ScanNet (Custom) | Planning Steps | 137.98 | 90.38 | -47.60 |
Main Takeaways
- Optimization-guided sampling substantially increases physical plausibility (e.g., by reducing collisions) without sacrificing generation diversity.
- The unified framework generalizes to long-horizon planning tasks in unseen scenes, where heuristic and imitation-learning baselines struggle.
- Diffusion-based planning avoids the dead ends common in deterministic planners by maintaining a distribution over possible trajectories.
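The first takeaway can be illustrated with a toy sketch of one guided denoising step. This is an assumption-laden illustration, not the paper's implementation: it uses a classifier-guidance-style update in which the denoiser's predicted mean is shifted along the gradient of an objective before noise is added. The spherical obstacle, the function names, and the `scale` parameter are all hypothetical.

```python
import numpy as np


def collision_objective_grad(x: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """Gradient of J(x) = -(max(0, radius - ||x||))**2.

    J penalizes penetration into a sphere centered at the origin
    (a hypothetical obstacle); its gradient points away from the center.
    """
    dist = np.linalg.norm(x)
    if dist >= radius or dist == 0.0:
        return np.zeros_like(x)  # outside the obstacle, or direction undefined
    return 2.0 * (radius - dist) * x / dist


def guided_step(mean, sigma, objective_grad, scale, noise):
    """One guided sampling step (assumed form, classifier-guidance style).

    mean:  denoiser's predicted mean for the next sample
    sigma: standard deviation of the step's Gaussian noise
    scale: guidance strength (hypothetical hyperparameter)
    noise: standard-normal draw with the same shape as mean
    """
    shifted_mean = mean + scale * sigma**2 * objective_grad(mean)
    return shifted_mean + sigma * noise


# Hypothetical usage: a predicted point inside the obstacle is nudged outward.
rng = np.random.default_rng(0)
predicted_mean = np.array([0.5, 0.0])
sample = guided_step(predicted_mean, sigma=0.1,
                     objective_grad=collision_objective_grad,
                     scale=10.0, noise=rng.standard_normal(2))
```

Because the noise term is left untouched, the step still samples from a distribution rather than collapsing to a single optimized solution, which is consistent with the claim that plausibility improves without sacrificing diversity.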