
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Sizhong Qin, Ramon Elias Weber, Xinzheng Lu
Tsinghua University
arXiv (2026)
MM Reasoning Benchmark

📝 Paper Summary

Architectural Layout Generation, Multimodal Large Language Models, Spatial Design
HouseMind enables LLMs to generate, understand, and edit floor plans by discretizing room geometries and outlines into a unified sequence of tokens processed alongside text instructions.
Core Problem
Existing layout generation models (diffusion models and GANs) lack explicit semantic reasoning about spatial hierarchy, while generic multimodal LLMs treat layouts as pixels and fail to grasp room connectivity and structural logic.
Why it matters:
  • Architectural design requires complex reasoning about dependencies (e.g., adjacency, circulation), which sequential or purely visual models struggle to capture
  • Current tools are often 'black boxes' lacking interpretability or control, making them unsuitable for professional design workflows
  • Most existing systems are computationally heavy and cannot run locally, limiting practical adoption in design software
Concrete Example: When asked to 'add a bathroom next to the bedroom,' a generic diffusion model may generate a visually plausible image that violates topological constraints (e.g., by blocking a hallway) or lacks functional connectivity. HouseMind instead modifies only the relevant tokens, which keeps the rest of the layout structurally valid.
Key Novelty
Unified Room-Instance Tokenization for LLMs
  • Discretizes both the building outline and individual rooms into distinct token sequences using VQ-VAEs, creating a vocabulary that combines geometry with semantic labels
  • Treats understanding, generation, and editing as a single autoregressive sequence modeling task, allowing the LLM to 'read' and 'write' floor plans as if they were language
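The tokenization idea above can be illustrated with a minimal sketch. It assumes an encoder has already produced latent vectors for the outline and for each room, and it shows only the vector-quantization lookup (nearest codebook entry) and the assembly of one flat token sequence. The token spellings (`<outline>`, `<o_k>`, `<room:label>`, `<r_k>`), the function names, and the toy codebooks are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index = 0
    best_dist = float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def build_sequence(
    outline_latents: Sequence[Vector],
    rooms: Sequence[Tuple[str, Sequence[Vector]]],
    outline_codebook: Sequence[Vector],
    room_codebook: Sequence[Vector],
) -> List[str]:
    """Flatten a quantized outline and quantized rooms into one token list.

    The token format here is an illustrative assumption, not the paper's.
    """
    tokens = ["<outline>"]
    tokens += [f"<o_{nearest_code(z, outline_codebook)}>" for z in outline_latents]
    tokens.append("</outline>")
    for label, latents in rooms:
        # The semantic label travels with the room's geometry tokens.
        tokens.append(f"<room:{label}>")
        tokens += [f"<r_{nearest_code(z, room_codebook)}>" for z in latents]
        tokens.append("</room>")
    return tokens


# Toy 2-D codebooks and latents (made-up values for illustration only).
outline_codebook = [(0.0, 0.0), (1.0, 1.0)]
room_codebook = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
outline_latents = [(0.1, -0.1), (0.9, 1.2)]
rooms = [("bedroom", [(0.4, 0.6)]), ("bathroom", [(0.9, 0.1)])]

tokens = build_sequence(outline_latents, rooms, outline_codebook, room_codebook)
print(" ".join(tokens))
```

Under this framing, an edit such as adding or moving a room amounts to rewriting one `<room:...> ... </room>` span while leaving the other tokens untouched, which is one way to read the claim that all three tasks reduce to sequence modeling.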
Architecture
Figure 2: The unified framework of HouseMind, which handles the three tasks (understanding, generation, editing) through shared tokenization
Evaluation Highlights
  • Reduces FID (Fréchet Inception Distance) from 11.3 (ChatHouseDiffusion) to 1.9, indicating significantly higher realism and geometric fidelity
  • Improves Micro IoU by over 10% compared to ChatHouseDiffusion, achieving 0.71 Micro IoU on the generated layouts
  • Reduces mean room area estimation error from several square meters (vision-language baselines) to below 0.6 m²
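For readers unfamiliar with the Micro IoU metric mentioned above, the sketch below shows one common way to compute a micro-averaged IoU between two room-label grids: intersections and unions are summed over all room labels before dividing. This summary does not state the exact definition used in the paper, so the formula, the grid representation, and the choice to ignore a background label of 0 are all assumptions.

```python
from typing import Sequence

Grid = Sequence[Sequence[int]]


def micro_iou(pred: Grid, target: Grid, ignore_label: int = 0) -> float:
    """Micro-averaged IoU: sum of per-label intersections / sum of per-label unions.

    Both grids must have the same shape. Cells equal to `ignore_label`
    (assumed here to mean background) do not form a class of their own.
    """
    labels = {cell for row in pred for cell in row}
    labels |= {cell for row in target for cell in row}
    labels.discard(ignore_label)

    intersection = 0
    union = 0
    for label in labels:
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == label
                in_target = t == label
                if in_pred and in_target:
                    intersection += 1
                if in_pred or in_target:
                    union += 1
    # Two empty layouts are treated as a perfect match.
    return intersection / union if union else 1.0


# Toy 3x3 layouts: 1, 2, 3 are room labels and 0 is background.
target = [[1, 1, 2],
          [1, 1, 2],
          [0, 3, 3]]
pred = [[1, 1, 2],
        [1, 2, 2],
        [0, 3, 0]]

score = micro_iou(pred, target)
print(f"micro IoU = {score:.3f}")
```

Because the sums are pooled across labels, large rooms weigh more than small ones in this variant; a macro average (mean of per-room IoUs) would weigh every room equally.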
Breakthrough Assessment
8/10
Successfully unifies three distinct design tasks (understanding, generation, editing) into one lightweight model with superior geometric validity compared to diffusion baselines.