Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

📝 Paper Summary

Interactive 3D Scene Synthesis, Embodied AI Environments, Neuro-symbolic 3D Generation

Scenethesis is a training-free framework that combines LLM reasoning for coarse planning with vision foundation models for spatial guidance and physics-based optimization to generate realistic, interactive 3D scenes.

Core Problem

Existing methods for text-to-3D scene generation either rely on small-scale datasets (limiting diversity) or use LLMs that lack spatial perception, resulting in unnatural layouts and physical violations.

Why it matters:

Current LLM-generated scenes often have floating objects or collisions, making them unusable for embodied AI simulation or gaming
Learning-based methods are constrained to indoor datasets like 3D-FRONT, failing to generalize to outdoor environments or novel object combinations
Manual design is unscalable, while procedural generation yields overly simplistic scenes lacking real-world functional relationships

Concrete Example: When an LLM attempts to generate a room, it might place chairs facing a cabinet instead of a table, or place a cabinet against a window (blocking it). Small objects might be restricted to tops of cabinets rather than inside shelves, or objects might interpenetrate.

Key Novelty

Vision-Perception Bridging for LLM Scene Planning

Uses 2D image generation models as a spatial prior to 'show' the LLM where objects should go, rather than relying solely on text-based spatial reasoning
Employs a novel optimization process that treats 3D layout generation as alignment between retrieved 3D assets and 2D visual guidance, enforced by semantic correspondence
Replaces standard bounding boxes with Signed Distance Fields (SDFs) for collision detection, allowing complex interactions like placing objects inside shelves
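
The difference between bounding-box and SDF collision checks can be illustrated with a toy example. The sketch below is not the paper's implementation: the shelf geometry, the book sizes, and all function names are assumptions made for illustration, and the book is tested only at its eight corners, where a real system would sample the surface densely.

```python
import numpy as np

def box_sdf(points, center, half_size):
    """Signed distance from points (N, 3) to a solid axis-aligned box."""
    q = np.abs(points - np.asarray(center)) - np.asarray(half_size)
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    inside = np.minimum(q.max(axis=1), 0.0)
    return outside + inside

# Hypothetical open shelf: (center, half_size) of each solid part, in meters.
SHELF_PARTS = [
    ((-0.49, 0.0, 0.50), (0.01, 0.15, 0.50)),  # left panel
    ((0.49, 0.0, 0.50), (0.01, 0.15, 0.50)),   # right panel
    ((0.00, 0.0, 0.01), (0.50, 0.15, 0.01)),   # bottom board
    ((0.00, 0.0, 0.99), (0.50, 0.15, 0.01)),   # top board
]

def shelf_sdf(points):
    """Union of the parts: the minimum of the per-part distances."""
    return np.min([box_sdf(points, c, h) for c, h in SHELF_PARTS], axis=0)

def box_corners(center, half_size):
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    return np.asarray(center) + signs * np.asarray(half_size)

def aabb_overlap(center_a, half_a, center_b, half_b):
    """Bounding-box test: overlap on every axis counts as a collision."""
    gap = np.abs(np.asarray(center_a) - np.asarray(center_b))
    return bool(np.all(gap <= np.asarray(half_a) + np.asarray(half_b)))

def sdf_collides(center, half_size):
    """SDF test: a collision only if a sampled point is inside solid material."""
    return bool(shelf_sdf(box_corners(center, half_size)).min() < 0.0)

shelf_center, shelf_half = (0.0, 0.0, 0.5), (0.5, 0.15, 0.5)
book_half = (0.02, 0.08, 0.10)
book_inside = (0.0, 0.0, 0.13)  # 1 cm above the bottom board, inside the shelf
book_sunk = (0.0, 0.0, 0.11)    # bottom face 1 cm into the bottom board

print(aabb_overlap(shelf_center, shelf_half, book_inside, book_half))  # True
print(sdf_collides(book_inside, book_half))                            # False
print(sdf_collides(book_sunk, book_half))                              # True
```

The bounding-box test rejects the valid placement because the book lies within the shelf's outer box, while the SDF test accepts it and still flags the book that sinks into the board.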

Architecture

The complete Scenethesis pipeline from input prompt to final 3D scene.

Evaluation Highlights

Outperforms Holodeck and PhyScene in physical plausibility, significantly reducing collision rates and instability
Demonstrates superior scene diversity by generating valid outdoor scenes (e.g., beaches) which dataset-dependent baselines cannot handle
Achieves higher layout realism and object interactivity scores in human evaluations compared to SOTA methods

Breakthrough Assessment

8/10

Significant advance in bridging the gap between text reasoning and 3D spatial reality without training. The move from bounding boxes to SDFs for layout optimization enables much more realistic object containment.

⚙️ Technical Details

Problem Definition

Setting: Text-to-3D Interactive Scene Generation

Inputs: Text prompt (simple description or detailed plan)

Outputs: Interactive 3D scene with arranged 3D assets

Pipeline Flow

LLM Module: Prompt -> Coarse Layout Plan
Vision Module: Plan -> Image Guidance -> Scene Graph
Optimization Module: Scene Graph -> SDF Physics & Visual Alignment -> Final Layout
Judge Module: Layout -> Verification -> (Loop if needed)
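
The control flow of these four stages can be sketched as a simple loop. This is a minimal illustration under stated assumptions: `run_pipeline`, the `toy_*` stubs, the `threshold` value, and the `max_rounds` cap are all hypothetical stand-ins, since the summary does not give the actual interfaces or the judge's scoring scale.

```python
def run_pipeline(prompt, plan_fn, guide_fn, optimize_fn, judge_fn,
                 threshold=0.8, max_rounds=3):
    """Plan -> guide -> optimize -> judge, re-planning while the score is low."""
    feedback = None
    layout = None
    for round_idx in range(1, max_rounds + 1):
        plan = plan_fn(prompt, feedback)      # LLM Module
        scene_graph = guide_fn(plan)          # Vision Module
        layout = optimize_fn(scene_graph)     # Optimization Module
        score, feedback = judge_fn(layout)    # Judge Module
        if score >= threshold:
            break
    return layout, round_idx

# Hypothetical stubs standing in for the real models.
def toy_plan(prompt, feedback):
    return {"objects": ["sofa", "table"], "anchor": "sofa", "feedback": feedback}

def toy_guide(plan):
    return {"nodes": plan["objects"],
            "edges": [("table", "in front of", "sofa")],
            "revised": plan["feedback"] is not None}

def toy_optimize(scene_graph):
    positions = {name: (float(i), 0.0, 0.0)
                 for i, name in enumerate(scene_graph["nodes"])}
    return {"positions": positions, "revised": scene_graph["revised"]}

def toy_judge(layout):
    # Pretend the first attempt fails and the re-planned attempt passes.
    if layout["revised"]:
        return 0.9, None
    return 0.5, "table is too far from the sofa"

layout, rounds = run_pipeline("a cozy living room",
                              toy_plan, toy_guide, toy_optimize, toy_judge)
print(rounds)  # 2
```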

System Modules

LLM Planner

Draft coarse scene plan, identify anchor objects, and establish spatial hierarchy from text prompt

Model or implementation: Not explicitly specified (likely GPT-4 based on context)

Vision Module (Image Generation) (Visual Refinement)

Generate 2D image guidance to visualize the scene layout

Model or implementation: Image Generation Model (implied Stable Diffusion or similar)

Vision Module (Graph & Retrieval) (Visual Refinement)

Extract scene graph, estimate depth/bounding boxes, and retrieve 3D assets

Model or implementation: GPT-4o, Grounded-SAM, DepthPro
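
One common way to turn a 2D detection plus estimated depth into a coarse 3D position is pinhole back-projection. The summary does not say how Scenethesis performs this lifting, so the sketch below is an assumption for illustration only; the intrinsics `fx`, `fy`, `cx`, `cy` and the pixel values are made up.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to camera coordinates.

    Assumes a pinhole camera with x to the right, y down, z forward.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

# Center of a detected 2D box, with depth read from the depth map at that pixel.
center_3d = backproject(420.0, 240.0, 2.0, fx, fy, cx, cy)
print(center_3d)  # [0.4 0.  2. ]
```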

Physics-Aware Optimizer

Iteratively align objects to visual guidance and enforce physical constraints

Model or implementation: Optimization algorithm using RoMa (semantic matching) and SDFs

Scene Judge

Verify spatial coherence and trigger re-planning if metrics fall below threshold

Model or implementation: GPT-4o

Novel Architectural Elements

Integration of Vision Foundation Models to provide 'visual priors' that correct LLM spatial hallucinations
Iterative optimization loop combining Semantic Correspondence Matching (visual alignment) with SDF-based physics constraints (collision/stability)
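
The interplay between visual alignment and physics constraints can be sketched as a single objective minimized by gradient descent. This toy is not the paper's optimizer: fixed 2D target positions stand in for the correspondences that semantic matching would provide, circles stand in for mesh SDFs, and `weight`, `lr`, and `steps` are arbitrary assumed values.

```python
import numpy as np

def loss_and_grad(pos, targets, radii, weight):
    """Alignment term plus a squared penalty on pairwise circle overlap."""
    diff = pos - targets
    loss = float(np.sum(diff ** 2))       # stay close to the visual guidance
    grad = 2.0 * diff
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            delta = pos[i] - pos[j]
            dist = np.linalg.norm(delta) + 1e-9
            pen = radii[i] + radii[j] - dist  # > 0 means the circles overlap
            if pen > 0.0:
                loss += weight * pen ** 2
                g = -2.0 * weight * pen * delta / dist
                grad[i] += g
                grad[j] -= g
    return loss, grad

def max_penetration(pos, radii):
    worst = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            dist = np.linalg.norm(pos[i] - pos[j])
            worst = max(worst, radii[i] + radii[j] - dist)
    return worst

def optimize(targets, radii, weight=10.0, lr=0.02, steps=200):
    pos = targets.copy()  # start from the visually suggested layout
    for _ in range(steps):
        _, grad = loss_and_grad(pos, targets, radii, weight)
        pos -= lr * grad
    return pos

# Two objects whose guided positions overlap by 0.3 units.
targets = np.array([[0.0, 0.0], [0.5, 0.0]])
radii = np.array([0.4, 0.4])

initial_pen = max_penetration(targets, radii)
final_pos = optimize(targets, radii)
final_pen = max_penetration(final_pos, radii)
print(round(initial_pen, 3), round(final_pen, 3))  # 0.3 0.014
```

Because the collision term is a soft penalty, a small residual overlap remains; raising `weight` shrinks it at the cost of needing a smaller step size.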

Modeling

Base Model: Composite system using GPT-4o, Grounded-SAM, DepthPro, RoMa

Comparison to Prior Work

vs. Holodeck: Scenethesis uses vision guidance for placement and SDFs for physics, whereas Holodeck relies on LLM priors and bounding boxes (ignoring small object collisions)
vs. PhyScene: Scenethesis is training-free and handles outdoor scenes; PhyScene requires training on indoor datasets and uses relaxed bounding box constraints
vs. Text2Room/WonderWorld: Scenethesis generates interactive objects with separate meshes, whereas these methods generate single static geometries unsuitable for interaction

Limitations

Relies on the quality of the retrieved 3D assets; shape discrepancies between generated image and retrieved asset can hinder alignment
Dependent on the performance of the 2D image generation model; unrealistic 2D guidance leads to unrealistic 3D scenes
Processing speed/latency not explicitly reported but likely slower than end-to-end inference due to iterative optimization
Limited to static scene generation; does not model dynamic physics (like falling or breaking objects) post-generation

Reproducibility

The asset subset is constructed from Objaverse, and a custom environment map dataset is used. Specific LLM prompt templates and detailed scene graph formatting are provided in the appendix. The main text does not state whether code is available.

📊 Experiments & Results

Evaluation Setup

Comparison of generated 3D layouts against baselines using both automated metrics and human evaluation

Benchmarks:

Indoor/Outdoor Scene Generation (Qualitative and Quantitative Assessment) [New]

Metrics:

Collision Rate (physical plausibility)
Stability Rate (physical plausibility)
Scene Diversity (qualitative)
Layout Realism (human eval)

Statistical methodology: Not explicitly reported in the paper

Key Results

Physics-aware optimization significantly reduces physical violations compared to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Physical Plausibility	Collision Rate	15.0	Not reported in the paper	-

Experiment Figures

Visual comparison of Scenethesis vs. Pure LLM generation (Holodeck-style).

Visual explanation of the SDF-based physical optimization.

Main Takeaways

Scenethesis generates diverse scenes (indoor and outdoor) unlike dataset-constrained methods (e.g., ATISS, DiffuScene) which are limited to indoor bedrooms/living rooms.
The use of SDFs allows for complex object containment (e.g., books inside shelves) which is impossible with bounding-box-based collision methods like Holodeck.
Vision guidance effectively corrects the 'spatial common sense' errors of pure LLM approaches (e.g., preventing furniture from facing walls arbitrarily).
The framework ensures objects are 'interactive' (separate meshes, physics-ready) rather than just visual textures, enabling downstream embodied AI tasks.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs)
Vision Foundation Models (VFMs)
3D Geometry (transformations, bounding boxes)
Signed Distance Fields (SDFs)
Semantic Correspondence Matching

Key Terms

SDF: Signed Distance Field, a function whose value at a point is the distance to the nearest surface, with the sign indicating whether the point lies inside or outside the object (conventionally negative inside); useful for precise collision detection

3DBB: 3D Bounding Box—a simplified rectangular box enclosing an object, often used for rough collision detection but inaccurate for complex shapes

Agentic Framework: An AI system where an LLM acts as a controller that plans tasks and calls other models (tools) to execute them

Anchor Object: The central reference object in a scene (e.g., a sofa in a living room) around which other objects are positioned

PBR: Physically Based Rendering—rendering techniques that model how light interacts with materials realistically

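DoF: Degrees of Freedom, the number of free parameters in an object's pose. A full rigid pose has 6 (3 for position, 3 for orientation); fixing roll and pitch for upright objects leaves 4, so a 5-DoF setting presumably adds one more parameter such as scale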

Semantic Correspondence: Finding matching points between two images (or an image and a 3D render) based on what the object parts are, rather than just pixel color
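
As a rough illustration of matching by feature similarity rather than pixel color, the sketch below pairs descriptors from two views using mutual nearest neighbors under cosine similarity. This is a generic baseline, not how RoMa works, and the descriptor values and the function name are made up for the example.

```python
import numpy as np

def mutual_nearest_matches(feats_a, feats_b):
    """Return (i, j) pairs where a[i] and b[j] are each other's best match."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                       # cosine similarity matrix
    best_b_for_a = sim.argmax(axis=1)
    best_a_for_b = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_b_for_a)
            if best_a_for_b[j] == i]

# Hypothetical part descriptors: view B holds the same parts, permuted and noisy.
feats_a = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
feats_b = np.array([[0.0, 0.9, 0.1],
                    [0.1, 0.0, 0.9],
                    [0.9, 0.1, 0.0]])

matches = mutual_nearest_matches(feats_a, feats_b)
print(matches)  # [(0, 2), (1, 0), (2, 1)]
```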