Diffusion-based Generation, Optimization, and Planning in 3D Scenes

📝 Paper Summary

3D Scene Understanding, Physics-aware Generation, Motion Planning

SceneDiffuser unifies 3D generation, optimization, and planning into a single framework where physics constraints and task goals act as differentiable guidance during a diffusion-based denoising process.

Core Problem

Existing 3D scene understanding models treat generation, optimization, and planning as disparate tasks, leading to mode collapse in generation and inconsistent or physically implausible results when applying post-hoc optimization.

Why it matters:

Generative models like cVAEs often ignore 3D scene conditions (posterior collapse), resulting in objects penetrating walls or floating
Separating planning from generation prevents agents from utilizing learned data priors for long-horizon tasks, causing failures in novel scenes
Post-processing outputs with physics optimizers is slow and often breaks the semantic consistency of the original generation

Concrete Example: A cVAE might generate a human pose that intersects with a sofa. Applying a separate physics optimizer might snap the human to the surface but result in an unnatural, twisted posture because the optimizer doesn't understand human kinematics.

Key Novelty

Unified Diffusion-based Sampling for 3D Tasks

Models 3D trajectories as a diffusion process, replacing ad-hoc planners and optimizers with a single iterative sampling loop
Injects physics (collision/contact) and goals (target location) as differentiable guidance gradients at each denoising step, rather than as hard constraints or post-processing
Uses the forward diffusion process as data augmentation to cover diverse modes, mitigating the posterior collapse common in cVAE baselines

Architecture

The SceneDiffuser architecture showing how scene conditions and trajectories interact via attention mechanisms.

Evaluation Highlights

Achieves a 49.35% physically plausible rate in human pose generation, surpassing cVAE baselines (14.64%) by 34.71 percentage points
Attains 71.27% success rate in dexterous grasp generation where cVAE with test-time optimization fails completely (0.00%) due to strict physics checks
Outperforms Behavior Cloning (0% success) and heuristic planners (13.5%) in 3D navigation path planning with a 73.75% success rate on unseen scenes

Breakthrough Assessment

8/10

Unifies three distinct 3D tasks (generation, optimization, and planning) in one framework, with large empirical gains in physical plausibility and planning success.

⚙️ Technical Details

Problem Definition

Setting: Conditional trajectory generation and optimization in 3D scenes

Inputs: 3D scene point cloud S, optional goal G, optional start state s_0

Outputs: Trajectory tau (sequence of states/actions) that is physically plausible and reaches the goal

Pipeline Flow

Scene Encoder (Extracts features)
Trajectory Initialization (Gaussian Noise)
Iterative Denoising Loop (Applies Physics/Goal Guidance)
Output Trajectory
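
The flow above can be sketched as a toy guided sampling loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the learned network (it is the exact noise predictor only if the data are standard normal), the scene encoder is omitted, the goal objective is a simple quadratic, and the guidance term uses the standard classifier-guidance form (mean shifted by scale × variance × gradient), which may differ in detail from the paper's update.

```python
import math
import random


def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns (betas, cumulative alpha products)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(x_t, alpha_bar):
    """Hypothetical stand-in for the trained network: the exact noise
    prediction when the data distribution is a standard normal."""
    return [math.sqrt(1.0 - alpha_bar) * v for v in x_t]


def goal_gradient(x, goal):
    """Gradient of the objective -0.5 * ||x - goal||^2 with respect to x."""
    return [g - v for v, g in zip(x, goal)]


def guided_sample(goal, guidance_scale, rng, dim=2, num_steps=100):
    """Run the reverse process from Gaussian noise with goal guidance."""
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # trajectory initialization
    for t in reversed(range(num_steps)):
        beta, alpha_bar = betas[t], alpha_bars[t]
        eps_hat = toy_denoiser(x, alpha_bar)
        grad = goal_gradient(x, goal)
        coef = beta / math.sqrt(1.0 - alpha_bar)
        next_x = []
        for v, e, g in zip(x, eps_hat, grad):
            mean = (v - coef * e) / math.sqrt(1.0 - beta)  # DDPM posterior mean
            mean += guidance_scale * beta * g              # guidance shift
            noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise at last step
            next_x.append(mean + math.sqrt(beta) * noise)
        x = next_x
    return x
```

Calling `guided_sample([3.0, 3.0], 1.0, random.Random(0))` draws one sample pulled toward the goal; setting `guidance_scale` to `0.0` recovers unguided sampling from the toy prior.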

System Modules

Scene Encoder

Extracts hierarchical features from the 3D scene point cloud

Model or implementation: Point Transformer or PointNet

Diffusion Generator

Predicts the noise in the current noisy trajectory (equivalently, the less noisy trajectory at the previous timestep), conditioned on scene features

Model or implementation: Transformer or ResNet-based denoising network

Guidance Optimizer

Calculates gradients of physics and planning objectives to steer sampling

Model or implementation: Differentiable cost functions (Analytical or Learned)
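
As an illustration of an analytical differentiable cost, the sketch below penalizes waypoints that penetrate a single spherical obstacle, using a signed distance that is positive outside and negative inside. The sphere obstacle, the squared-penetration form, and the function names are assumptions made for this example; the paper's actual collision and contact objectives are not reproduced here.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: positive outside, negative inside."""
    return math.dist(point, center) - radius


def collision_cost(trajectory, center, radius):
    """Sum of squared penetration depths over all waypoints."""
    return sum(max(0.0, -sphere_sdf(p, center, radius)) ** 2
               for p in trajectory)


def collision_cost_grad(trajectory, center, radius):
    """Analytic gradient of collision_cost with respect to each waypoint."""
    grads = []
    for p in trajectory:
        dist = math.dist(p, center)
        depth = max(0.0, radius - dist)
        if depth == 0.0 or dist == 0.0:
            # Outside the obstacle (or exactly at its center): zero gradient.
            grads.append([0.0] * len(p))
            continue
        # d(depth^2)/dp = 2 * depth * d(depth)/dp = -2 * depth * (p - c) / dist
        grads.append([-2.0 * depth * (pi - ci) / dist
                      for pi, ci in zip(p, center)])
    return grads
```

Stepping a waypoint along the negative of this gradient moves it radially out of the obstacle, which is the kind of signal a guidance term can feed into each sampling step.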

Novel Architectural Elements

Integration of differentiable physics-based objectives (collision, contact) directly into the diffusion sampling loop as guidance terms
Unified formulation where planning is treated as a conditional inpainting task within the generative model
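
The inpainting view of planning can be sketched as clamping the known waypoints after every denoising step, so that only the unknown part of the trajectory is generated. This is a minimal sketch: `denoise_step` is a hypothetical callable standing in for one reverse-diffusion update, and `fixed_states` (a mapping from waypoint index to state) is an illustrative interface, not the paper's.

```python
def clamp_known_states(trajectory, fixed_states):
    """Return a copy of the trajectory with known waypoints overwritten."""
    clamped = [list(state) for state in trajectory]
    for index, state in fixed_states.items():
        clamped[index] = list(state)
    return clamped


def denoise_with_inpainting(denoise_step, trajectory, fixed_states, num_steps):
    """Run num_steps reverse updates, re-imposing known states each time."""
    trajectory = clamp_known_states(trajectory, fixed_states)
    for t in reversed(range(num_steps)):
        trajectory = denoise_step(trajectory, t)  # hypothetical reverse update
        trajectory = clamp_known_states(trajectory, fixed_states)
    return trajectory
```

For example, passing `fixed_states={0: start_state}` keeps the first waypoint pinned to the agent's current state while the remaining waypoints are sampled.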

Modeling

Base Model: Custom Diffusion Model with Transformer/ResNet backbone

Training Method: Diffusion Training (Denoising Score Matching) + Guidance Learning

Objective Functions:

Purpose: Train the base generator to reconstruct trajectories from noise.

Formally: MSE loss L = E[|| epsilon - epsilon_theta ||^2]

Purpose: (Optional) Learn optimization/planning objectives if not analytically defined.

Formally: MSE between predicted gradient and true gradient of cost function
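
The first objective (noise-prediction MSE) can be sketched as follows. This is a single-sample estimate under simplifying assumptions: `model` is a hypothetical callable taking the noisy trajectory and the step index, scene conditioning is omitted, and trajectories are flat lists of floats.

```python
import math
import random


def forward_noise(x0, alpha_bar, eps):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar) x_0, (1 - alpha_bar) I)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + b * e for v, e in zip(x0, eps)]


def denoising_loss(model, x0, alpha_bar, step, rng):
    """One-sample estimate of E[||eps - eps_theta||^2], averaged over dimensions."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # noise the model must recover
    x_t = forward_noise(x0, alpha_bar, eps)   # noised trajectory
    eps_hat = model(x_t, step)                # hypothetical network call
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)
```

In training, this quantity would be averaged over trajectories, noise draws, and uniformly sampled steps, then minimized with a gradient-based optimizer such as Adam.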

Training Data:

PROX (Human Pose)
LEMO (Human Motion)
MultiDex (Grasp)
ScanNet (Navigation graphs)
MoveIt (Robot Arm trajectories)

Key Hyperparameters:

guidance_scale_lambda: 1.0 (optimal for plausibility)
diffusion_steps: Not explicitly reported in the paper
optimizer: Adam

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. cVAE: The diffusion process mitigates posterior collapse, and guidance further improves physics compliance
vs. Optimization-based: Applies constraints gradually during generation rather than breaking consistency after generation
vs. Diffuser (Janner et al.) [cited]: Extends diffusion planning to 3D scene understanding with explicit physics guidance
vs. GendexGrasp [cited]: Replaces cVAE with diffusion for better grasp diversity and success

Limitations

Slow training and inference speed compared to one-step generative models (common to diffusion)
Performance depends heavily on the design and tuning of objective functions (e.g., guidance scale)
Requires explicit differentiable objectives for physics guidance, which may be complex to define for all tasks

Reproducibility

Code: https://scenediffuser.github.io

Project page and code linked (https://scenediffuser.github.io). Uses standard datasets (PROX, ScanNet). Explicit mathematical formulation of guidance provided.

📊 Experiments & Results

Evaluation Setup

Evaluated across 5 diverse 3D tasks: human pose generation, human motion generation, dexterous grasp generation, 3D navigation path planning, and robot arm motion planning.

Benchmarks:

PROX / LEMO (Human Pose & Motion Generation)
MultiDex (Dexterous Grasp Generation)
ScanNet (Custom graphs) (3D Navigation Path Planning) [New]
MoveIt (Simulated) (Robot Arm Motion Planning) [New]

Metrics:

Plausible Rate (Human & Auto)
Non-collision Score
Success Rate
Diversity (APD)

Statistical methodology: Not explicitly reported in the paper
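
The diversity metric listed above, APD, is commonly computed as the average pairwise distance between generated samples. A minimal sketch, assuming each sample is a flat vector and Euclidean distance is used (the paper's exact feature space and normalization may differ):

```python
import math
from itertools import combinations


def average_pairwise_distance(samples):
    """Mean Euclidean distance over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0  # fewer than two samples: no diversity to measure
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

A higher value indicates that samples generated for the same scene are more spread out; identical samples give zero.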

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Human pose generation results showing superior physical plausibility compared to cVAE baselines.
PROX	Plausible Rate (%)	14.64	49.35	+34.71
PROX	Non-collision Score	99.75	99.93	+0.18
Dexterous grasping results demonstrating SceneDiffuser's ability to generate valid grasps where baselines fail.
MultiDex	Success Rate (%)	0.00	71.27	+71.27
Path planning results highlighting generalization to novel scenes in navigation tasks.
ScanNet (Custom)	Success Rate (%)	13.50	73.75	+60.25
ScanNet (Custom)	Planning Steps	137.98	90.38	-47.60

Experiment Figures

Qualitative comparison of human poses generated by cVAE vs. SceneDiffuser in indoor scenes.

Visualization of path planning trajectories in ScanNet scenes.

Main Takeaways

Optimization-guided sampling dramatically increases physical plausibility (e.g., reducing collisions) without sacrificing generation diversity
Unified framework generalizes well to long-horizon planning tasks in unseen scenes, where heuristic and imitation learning baselines struggle
Diffusion-based planning avoids the 'dead-ends' common in deterministic planners by maintaining a distribution of possible trajectories

📚 Prerequisite Knowledge

Prerequisites

Diffusion Probabilistic Models (forward/reverse processes)
Conditional Variational Autoencoders (cVAE)
Trajectory Optimization
Classifier Guidance

Key Terms

cVAE: Conditional Variational Autoencoder—a generative model often used as a baseline, known for struggling to effectively use complex conditions (posterior collapse)

Posterior Collapse: A failure mode in VAEs where the decoder ignores the latent variable and generates generic outputs regardless of the input condition

SDF: Signed Distance Function—a geometric representation used to calculate collision costs; positive values usually indicate being outside an object, negative inside

Classifier Guidance: A technique in diffusion models where gradients from an external function (like a classifier or cost function) steer the generation process

BC: Behavior Cloning—an imitation learning approach that trains a policy to mimic expert demonstrations via supervised learning

Inpainting: A technique to fill in missing parts of data; used here to generate a trajectory connecting a fixed start and goal state

TTA: Test-Time Adaptation/Optimization—refining a model's output during inference, often used as a baseline strategy for fixing physics violations