
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Lirong Che, Shuo Wen, Shan Huang, Chuang Wang, Yuzhe Yang, Gregory Dudek, Xueqian Wang, Jian Su
Tsinghua University, McGill University, MILA - Quebec AI Institute
arXiv (2026)
MM Agent Benchmark Reasoning

Paper Summary

3D Scene Generation Embodied AI Simulation Procedural Content Generation
MANSION combines large language models with geometric solvers to generate interactive, building-scale multi-floor environments from text, enabling the evaluation of long-horizon embodied agents in complex spatial settings.
Core Problem
Existing embodied AI benchmarks are confined to single-floor, residential layouts that lack the vertical structure and scale required to test long-horizon spatial reasoning and planning.
Why it matters:
  • Real-world tasks (delivery, hospital transport) are inherently building-scale and multi-floor, requiring agents to navigate elevators and stairs
  • Current scene generation methods focus on single rooms or apartments, failing to model vertical constraints or cross-floor connectivity
  • Scanned 3D datasets are expensive to collect, hard to modify for specific tasks, and lack interactive semantic elements needed for simulation
Concrete Example: A standard generator might produce a house whose second-floor footprint does not align with the first, or whose stairs run into a ceiling. MANSION enforces vertical alignment and connectivity, so an agent can navigate from a first-floor lobby to a second-floor office.
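The lobby-to-office example can be read as a reachability question on a graph of rooms. The sketch below is illustrative only and is not taken from the paper: the `(floor, room)` node naming, the adjacency dictionaries, and the `reachable` helper are all assumptions made for the example. It uses a plain breadth-first search to show how a missing cross-floor stair link makes the second-floor office unreachable.

```python
from collections import deque


def reachable(adjacency, start, goal):
    """Breadth-first search over a room graph; True if goal can be reached from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        room = queue.popleft()
        if room == goal:
            return True
        for neighbor in adjacency.get(room, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical two-floor building; nodes are (floor, room_name) pairs.
connected_building = {
    (1, "lobby"): [(1, "stairwell")],
    (1, "stairwell"): [(1, "lobby"), (2, "stairwell")],
    (2, "stairwell"): [(1, "stairwell"), (2, "office")],
    (2, "office"): [(2, "stairwell")],
}

# Same rooms, but the stairwell does not link the two floors.
broken_building = {
    (1, "lobby"): [(1, "stairwell")],
    (1, "stairwell"): [(1, "lobby")],
    (2, "stairwell"): [(2, "office")],
    (2, "office"): [(2, "stairwell")],
}

print(reachable(connected_building, (1, "lobby"), (2, "office")))
print(reachable(broken_building, (1, "lobby"), (2, "office")))
```

A check of this kind only covers topological connectivity; it says nothing about whether the stair geometry itself is physically valid.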
Key Novelty
Hybrid MLLM-Geometry Pipeline for Vertical Structures
  • Decouples high-level semantic planning (via MLLMs) from low-level geometric validity (via constrained solvers) to ensure structural correctness
  • Enforces vertical alignment as a hard constraint, ensuring elevator shafts and stairwells physically connect across multiple floors
  • Introduces a Task-Semantic Scene Editing Agent that modifies static environments via tool usage to satisfy preconditions for specific language-defined tasks
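One way to picture vertical alignment as a hard constraint is a validity check that rejects any layout where a shaft's footprint shifts between floors. The following is a minimal sketch under stated assumptions, not MANSION's actual solver: the `Footprint` class, the axis-aligned rectangle representation, the tolerance value, and the `shaft_is_valid` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Axis-aligned rectangle in floor-plan coordinates (assumed representation)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def corners(self):
        return (self.x_min, self.y_min, self.x_max, self.y_max)


def aligned(a, b, tol=1e-6):
    """True if two footprints coincide within the tolerance."""
    return all(abs(p - q) <= tol for p, q in zip(a.corners(), b.corners()))


def shaft_is_valid(shaft_by_floor, tol=1e-6):
    """Check one elevator shaft or stairwell, given as {floor_index: Footprint}.

    The shaft passes only if it spans at least two consecutive floors and
    occupies the same footprint on every one of them.
    """
    floors = sorted(shaft_by_floor)
    if len(floors) < 2:
        return False
    # Reject shafts that skip a floor.
    if any(upper - lower != 1 for lower, upper in zip(floors, floors[1:])):
        return False
    reference = shaft_by_floor[floors[0]]
    return all(aligned(reference, shaft_by_floor[f], tol) for f in floors[1:])


elevator = {f: Footprint(4.0, 4.0, 6.0, 6.0) for f in (1, 2, 3)}
shifted = {1: Footprint(4.0, 4.0, 6.0, 6.0), 2: Footprint(4.5, 4.0, 6.5, 6.0)}
print(shaft_is_valid(elevator), shaft_is_valid(shifted))
```

Treating the check as a pass/fail gate, rather than a soft penalty, mirrors the idea of a hard constraint: a candidate layout that fails is discarded or repaired instead of being scored lower.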
Architecture
Figure 2
The hierarchical multi-agent framework of MANSION, showing the flow from user instruction to final 3D scene.
Evaluation Highlights
  • Generated MansionWorld dataset containing over 1,000 diverse multi-floor buildings (offices, hospitals, schools) with >10,000 rooms
  • State-of-the-art embodied agents show sharp performance degradation when moving from single-floor to multi-floor tasks in MANSION
  • Framework supports export to AI2-THOR, Blender, and NVIDIA Isaac Sim for broad simulator compatibility
Breakthrough Assessment
9/10
First framework to successfully generate navigable, building-scale multi-floor environments from language, addressing a major gap in embodied AI simulation for long-horizon planning.