Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

📝 Paper Summary

Architectural Layout Generation, Multimodal Large Language Models, Spatial Design

HouseMind enables LLMs to generate, understand, and edit floor plans by discretizing room geometries and outlines into a unified sequence of tokens processed alongside text instructions.

Core Problem

Existing layout generation models (diffusion models and GANs) lack explicit semantic reasoning about spatial hierarchy, while generic multimodal LLMs treat layouts as pixels and so fail to grasp room connectivity and structural logic.

Why it matters:

Architectural design requires complex reasoning about dependencies (e.g., adjacency, circulation) which sequential or purely visual models struggle to capture
Current tools are often 'black boxes' lacking interpretability or control, making them unsuitable for professional design workflows
Most existing systems are computationally heavy and cannot run locally, limiting practical adoption in design software

Concrete Example: When asked to 'add a bathroom next to the bedroom,' a generic diffusion model may generate a visually plausible image that violates topological constraints (e.g., blocking a hallway) or lacks functional connectivity, whereas HouseMind modifies only the relevant tokens to keep the layout structurally valid.

Key Novelty

Unified Room-Instance Tokenization for LLMs

Discretizes both the building outline and individual rooms into distinct token sequences using VQ-VAEs, creating a vocabulary that combines geometry with semantic labels
Treats understanding, generation, and editing as a single autoregressive sequence modeling task, allowing the LLM to 'read' and 'write' floor plans as if they were language

Architecture

Figure: The unified HouseMind framework, which handles the three tasks (Understanding, Generation, Editing) through a shared tokenization

Evaluation Highlights

Reduces FID (Fréchet Inception Distance) from 11.3 (ChatHouseDiffusion) to 1.9, indicating that the distribution of generated layouts is much closer to that of real floor plans
Improves Micro IoU by over 10% compared to ChatHouseDiffusion, reaching 0.71 on the generated layouts
Reduces mean room area estimation error from several square meters (vision-language baselines) to below 0.6 m²

Breakthrough Assessment

8/10

Successfully unifies three distinct design tasks (understanding, generation, editing) into one lightweight model with superior geometric validity compared to diffusion baselines.

⚙️ Technical Details

Problem Definition

Setting: Unified sequence modeling for floor plan understanding, generation, and editing

Inputs: Text instructions s, Outline tokens z_o, and optionally existing layout tokens Z_src (for editing)

Outputs: Target layout token sequence Z_tgt (comprising interleaved semantic and geometric tokens)
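
The input and output definitions above can be made concrete with a small sketch of how one training example might be laid out as a token sequence. The special tokens (`<bos>`, `<sep>`, `<eos>`), the `<room:...>`, `<o...>` and `<r...>` token spellings, and the helper names are assumptions made for illustration; this summary does not give the paper's actual sequence format.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical special tokens; the paper's real markers are not given here.
BOS, SEP, EOS = "<bos>", "<sep>", "<eos>"

# A room is a semantic label plus the geometry code indices from the room tokenizer.
Room = Tuple[str, Sequence[int]]


def room_tokens(rooms: Sequence[Room]) -> List[str]:
    """Interleave each room's semantic token with its geometry tokens."""
    out: List[str] = []
    for label, codes in rooms:
        out.append(f"<room:{label}>")
        out.extend(f"<r{c}>" for c in codes)
    return out


def build_sequence(
    instruction: Sequence[str],
    outline_codes: Sequence[int],
    target_rooms: Sequence[Room],
    source_rooms: Optional[Sequence[Room]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (prompt, target) token lists for one example."""
    # Text instruction s followed by outline tokens z_o.
    prompt = [BOS, *instruction, SEP, *(f"<o{c}>" for c in outline_codes)]
    if source_rooms is not None:
        # Editing: also condition on the existing layout Z_src.
        prompt += [SEP, *room_tokens(source_rooms)]
    prompt.append(SEP)
    # Target layout Z_tgt, terminated by an end marker.
    target = room_tokens(target_rooms) + [EOS]
    return prompt, target


prompt, target = build_sequence(
    instruction=["add", "a", "bathroom"],
    outline_codes=[5, 17],
    target_rooms=[("bedroom", [3, 9]), ("bathroom", [12])],
    source_rooms=[("bedroom", [3, 9])],
)
print(prompt)
print(target)
```

Omitting `source_rooms` gives the generation case, where the model is conditioned only on the instruction and the outline.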

Pipeline Flow

Input Processing (Text + Outline)
Tokenization (VQ-VAE Discretization)
Sequence Modeling (Multimodal LLM)
Detokenization (Layout Reconstruction)

System Modules

Outline Encoder (VQ-VAE) (Tokenization)

Discretize the binary outline mask into a sequence of tokens

Model or implementation: CNN Encoder + Quantizer
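
The quantizer step of a VQ-VAE can be illustrated in isolation. The sketch below is a generic nearest-neighbour codebook lookup in plain Python, not the paper's implementation; the CNN encoder that would produce the latent vectors is omitted, and the toy codebook and latents are made up for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token index (decoder input)."""
    return [list(codebook[k]) for k in indices]


# Toy 2-D codebook with three entries and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.9, 0.1), (0.1, 0.2), (0.2, 0.8)]

tokens = quantize(latents, codebook)
print(tokens)  # [1, 0, 2]
print(dequantize(tokens, codebook))
```

The resulting indices are the discrete tokens; a decoder would receive the looked-up codebook vectors rather than the original latents.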

Room Encoder (Conditional VQ-VAE) (Tokenization)

Discretize individual room masks conditioned on the outline to capture adjacency

Model or implementation: Conditional CNN Encoder + Quantizer

Multimodal LLM

Autoregressively predict the next token in the sequence (handling text, outline, and room tokens jointly)

Model or implementation: Qwen3-0.6B

Geometry Decoders

Reconstruct pixel-level geometry from discrete tokens

Model or implementation: Transposed-CNN Decoders

Novel Architectural Elements

Hierarchical tokenization splitting global outline and local room instances into separate but conditioned discrete sequences
Unified vocabulary embedding where spatial codebook entries are treated as distinct tokens within the LLM's language space
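
One common way to place codebook entries in an LLM's vocabulary is to append them after the text tokens with fixed ID offsets. The sketch below shows that bookkeeping; the vocabulary and codebook sizes are illustrative placeholders, and the offset scheme itself is an assumption, since this summary does not say how HouseMind assigns token IDs.

```python
from typing import Tuple

# Illustrative sizes only; not taken from the paper.
TEXT_VOCAB_SIZE = 1000
OUTLINE_CODEBOOK_SIZE = 64
ROOM_CODEBOOK_SIZE = 128

# Spatial tokens are appended after the text vocabulary.
OUTLINE_OFFSET = TEXT_VOCAB_SIZE
ROOM_OFFSET = OUTLINE_OFFSET + OUTLINE_CODEBOOK_SIZE
TOTAL_VOCAB_SIZE = ROOM_OFFSET + ROOM_CODEBOOK_SIZE


def to_token_id(kind: str, index: int) -> int:
    """Convert a codebook index into an ID in the extended vocabulary."""
    if kind == "outline" and 0 <= index < OUTLINE_CODEBOOK_SIZE:
        return OUTLINE_OFFSET + index
    if kind == "room" and 0 <= index < ROOM_CODEBOOK_SIZE:
        return ROOM_OFFSET + index
    raise ValueError(f"invalid spatial token: {kind!r} {index}")


def from_token_id(token_id: int) -> Tuple[str, int]:
    """Recover which sub-vocabulary an ID belongs to, plus its local index."""
    if not 0 <= token_id < TOTAL_VOCAB_SIZE:
        raise ValueError(f"token id out of range: {token_id}")
    if token_id < OUTLINE_OFFSET:
        return ("text", token_id)
    if token_id < ROOM_OFFSET:
        return ("outline", token_id - OUTLINE_OFFSET)
    return ("room", token_id - ROOM_OFFSET)


print(to_token_id("outline", 5))  # 1005
print(to_token_id("room", 0))     # 1064
print(from_token_id(1064))        # ('room', 0)
```

With such a scheme the embedding table has `TOTAL_VOCAB_SIZE` rows, and generated IDs can be routed back to the matching geometry decoder via `from_token_id`.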

Modeling

Base Model: Qwen3-0.6B

Training Method: Three-stage training: Embedding Initialization -> Multimodal Pre-training -> Instruction Tuning (SFT)

Objective Functions:

Purpose: Predict the next token in the sequence.

Formally: Autoregressive language modeling objective (Cross-Entropy Loss)
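
Using the symbols from the Problem Definition, the next-token objective can be written out as follows. Here $y_1, \dots, y_T$ denote the tokens of the target sequence $Z_{tgt}$ and $\theta$ the trainable parameters; $Z_{src}$ is present only for editing. Restricting the sum to target tokens is an assumption, since this summary does not say whether prompt tokens also contribute to the loss.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\; s,\; z_o,\; Z_{src}\right),
\qquad Z_{tgt} = (y_1, \dots, y_T)
```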

Adaptation: Full fine-tuning of the LLM backbone

Trainable Parameters: Visual codebook embeddings and LLM weights

Training Data:

Dataset: RPLAN
Split: 76,122 training samples, 2,308 validation, 2,308 test
Text descriptions generated via Qwen3-30B-A3B (mix of simple summaries and detailed specs)

Compute: Inference runs on a single NVIDIA RTX 3090 GPU

Comparison to Prior Work

vs. ChatHouseDiffusion: HouseMind uses discrete tokens for explicit room reasoning rather than purely visual denoising, resulting in better topology
vs. FloorPlanLLaMA: HouseMind explicitly separates outline and room-instance tokens rather than encoding the whole plan, preserving boundaries better
vs. General MLLMs (LLaVA/Qwen-VL): HouseMind is instruction-tuned on structured layout data, reducing hallucinations in room count and connectivity

Limitations

Dependence on the quality of the VQ-VAE reconstruction
Currently limited to 2D floor plans, not full 3D structures
Evaluation relies on post-processing to normalize scale and boundaries

Reproducibility

The text does not state whether code is available. The paper says that implementation details are in the supplementary materials. Benchmarks are based on RPLAN.

📊 Experiments & Results

Evaluation Setup

Unified evaluation on 256x256 color-mapped layouts with wall boundaries

Benchmarks:

RPLAN-based Unified Benchmark (Layout Understanding, Generation, and Editing) [New]

Metrics:

Micro/Macro IoU (Intersection over Union)
FID (Fréchet Inception Distance)
SSIM (Structural Similarity)
Node F1 (Graph topology)
GED (Graph Edit Distance)

Statistical methodology: Not explicitly reported in the paper
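
To make the Micro/Macro IoU distinction listed above concrete, the sketch below computes both on small label grids. It assumes the usual convention that micro IoU pools intersections and unions over all room classes before dividing, while macro IoU averages the per-class ratios; the paper's exact definitions are not given in this summary, and label 0 is treated as background by assumption.

```python
from typing import Dict, List, Sequence, Tuple

Grid = Sequence[Sequence[int]]  # per-pixel room labels; 0 is background (assumed)


def per_class_counts(pred: Grid, truth: Grid) -> Dict[int, List[int]]:
    """Return {label: [intersection, union]} in pixels for each room label."""
    counts: Dict[int, List[int]] = {}
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            for label in {p, t} - {0}:
                entry = counts.setdefault(label, [0, 0])
                entry[1] += 1      # pixel belongs to this label's union
                if p == t:
                    entry[0] += 1  # and to its intersection
    return counts


def micro_macro_iou(pred: Grid, truth: Grid) -> Tuple[float, float]:
    """Micro IoU pools pixels over classes; macro IoU averages per-class IoU."""
    counts = per_class_counts(pred, truth)
    if not counts:
        raise ValueError("no foreground labels in either grid")
    micro = sum(i for i, _ in counts.values()) / sum(u for _, u in counts.values())
    macro = sum(i / u for i, u in counts.values()) / len(counts)
    return micro, macro


truth = [[1, 1, 2, 2],
         [1, 1, 2, 2]]
pred = [[1, 1, 1, 2],
        [1, 1, 2, 2]]

micro, macro = micro_macro_iou(pred, truth)
print(round(micro, 4), round(macro, 4))  # 0.7778 0.775
```

In the example, class 1 has IoU 4/5 and class 2 has IoU 3/4, so the macro value is their mean, while the micro value is (4 + 3) / (5 + 4).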

Key Results

Generation performance comparison showing HouseMind's advantage in realism (FID) and geometric accuracy (IoU) over diffusion and other baselines.
Benchmark	Metric	Baseline	This Paper	Δ
RPLAN test set	FID (lower is better)	11.3 (ChatHouseDiffusion)	1.9	-9.4
RPLAN test set	Micro IoU (higher is better)	Not reported in the paper	0.71	Not reported in the paper
RPLAN test set	Mean Room Area Error (m², lower is better)	>2.0 (vision-language baselines)	<0.6	< -1.4

Experiment Figures

Qualitative comparison of HouseMind against GPT-5, Gemini 2.5 Pro, and other baselines across three tasks

Main Takeaways

HouseMind significantly outperforms the diffusion-based ChatHouseDiffusion in realism (FID) and geometric accuracy (IoU), suggesting that discrete tokenization handles spatial constraints better than continuous denoising.
The unified model (HouseMind-O) performs comparably to task-specific models, supporting the claim that a single set of weights can handle understanding, generation, and editing without substantial degradation.
In editing tasks, HouseMind preserves the original structure (high Node F1, low GED) much better than general image-editing models like FLUX or Qwen-Image-Edit, which tend to introduce noise or irrelevant elements.
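
As a rough illustration of a node-level F1 score such as the one mentioned above, the sketch below compares the multiset of room-type labels in a predicted layout graph against a reference. This matching rule is an assumption made for the example; the paper's actual Node F1 definition is not given in this summary, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def node_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 over room-type nodes, matching labels as multisets (assumed rule)."""
    pred, ref = Counter(predicted), Counter(reference)
    matched = sum((pred & ref).values())  # multiset intersection size
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = ["living", "bedroom", "bedroom", "kitchen"]
predicted = ["living", "bedroom", "bedroom", "kitchen", "bathroom"]

score = node_f1(predicted, reference)
print(round(score, 4))  # 0.8889
```

Here the extra bathroom lowers precision to 4/5 while recall stays at 1, giving an F1 of 8/9.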

📚 Prerequisite Knowledge

Prerequisites

Vector-Quantized Variational Autoencoder (VQ-VAE)
Autoregressive Language Modeling
Architectural Floor Plan Semantics (RPLAN dataset)

Key Terms

VQ-VAE: Vector-Quantized Variational Autoencoder—a neural network that compresses high-dimensional data (like images) into discrete tokens from a learned codebook

IoU: Intersection over Union—a metric measuring the overlap between the predicted room area and the ground truth area

FID: Fréchet Inception Distance—a metric used to evaluate the quality of generated images by comparing their distribution to real images; lower is better
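
For reference, the standard FID formula is shown below, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images respectively. This is the general definition; any paper-specific choices, such as the feature layer or sample count, are not given in this summary.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```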

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset to adapt it for a particular task

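GED: Graph Edit Distance, a measure of dissimilarity between two graphs (here representing room connectivity), defined as the minimum total cost of the node and edge insertions, deletions, and substitutions needed to transform one into the other; lower is better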

SSIM: Structural Similarity Index Measure, a perceptual metric that compares two images by their local luminance, contrast, and structure; higher means the images are more alike

RPLAN: A large-scale dataset of residential floor plans used for training and benchmarking layout generation models