Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams

📝 Paper Summary

Neuro-symbolic planning, Multi-robot coordination

Scale-Plan improves planning scalability by using a pre-computed action graph to filter irrelevant objects and actions from the environment before using an LLM to decompose and allocate tasks.

Core Problem

Long-horizon planning in object-rich environments fails because irrelevant objects bloat the search space and cause LLMs to hallucinate or produce malformed PDDL problem files.

Why it matters:

In cluttered households, most perceived objects (e.g., a tomato when the task is 'switch off light') are irrelevant to the task, yet traditional planners ingest everything, causing combinatorial explosion
Pure LLM planners struggle with context limits and grounding, while hybrid approaches like LLM+P often fail to generate valid PDDL files when the scene description is noisy or overly detailed

Concrete Example: In a task to 'place the apple in the fridge and turn off the light,' standard approaches might include irrelevant objects like pots or dustbins in the problem definition. This causes the planner to explore useless interactions or the LLM to hallucinate constraints for the dustbin, leading to planning failure.

Key Novelty

Domain-Level Action Graph Filtering

Constructs a static directed graph of actions from the PDDL domain offline, where edges represent predicate dependencies (one action enables another)
At runtime, uses shallow LLM reasoning to identify goal actions, then performs a backward graph search to select ONLY the predecessor actions and objects strictly necessary for the task
Feeds this minimized, task-relevant sub-domain to the planning LLM, bypassing the need for full environment grounding or explicit PDDL problem file generation

Architecture

The overall architecture of Scale-Plan, divided into Offline and Runtime phases.

Evaluation Highlights

Outperforms the strongest baseline (LaMMA-P, LLM-corrected) by 25 percentage points in overall Task Completion Rate (TCR) on the MAT2-THOR benchmark (0.53 → 0.78)
Improves TCR on 'Complex' tasks by 35 percentage points over LaMMA-P (LLM-corrected) (0.24 → 0.59), suggesting better scalability in long-horizon scenarios
Maintains a 9-percentage-point higher Executability Rate (ER) than baselines, indicating generated plans are more robust to low-level simulator failures

Breakthrough Assessment

8/10

Significant improvement in handling cluttered environments for multi-robot systems by addressing the 'irrelevant context' problem via structured graph search. Skipping intermediate generation of the PDDL problem file is a strong design choice.

⚙️ Technical Details

Problem Definition

Setting: Long-horizon task planning for heterogeneous multi-robot systems in object-rich environments

Inputs: Natural language task instruction T, PDDL domain D, set of objects O, initial state I

Outputs: Executable multi-robot plan (sequence of actions allocated to specific robots)

Pipeline Flow

Offline: Action Graph Construction
Runtime: Relevance Filtering (LLM + Graph Search)
Planning: Task Decomposition → Allocation → Integration

System Modules

Action Graph Constructor

Builds a dependency graph of domain actions based on preconditions and effects

Model or implementation: Rule-based (Strict and Relaxed dependency rules)
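
The dependency idea behind the Action Graph Constructor can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Action` class, the toy domain, and the single edge rule (an edge A → B whenever an effect of A appears among B's preconditions) are assumptions made for the example, and the paper's Strict and Relaxed rules are not reproduced here because this summary does not spell them out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical, simplified view of a PDDL action (predicate names only)."""
    name: str
    preconditions: frozenset
    effects: frozenset


def build_action_graph(actions):
    """Return {action name: set of successor action names}.

    An edge A -> B is added when some effect of A appears among the
    preconditions of B, i.e. A can enable B.
    """
    graph = {a.name: set() for a in actions}
    for a in actions:
        for b in actions:
            if a.name != b.name and a.effects & b.preconditions:
                graph[a.name].add(b.name)
    return graph


# Toy household domain, invented for illustration only.
DOMAIN = [
    Action("goto", frozenset(), frozenset({"at"})),
    Action("open", frozenset({"at"}), frozenset({"opened"})),
    Action("pickup", frozenset({"at"}), frozenset({"holding"})),
    Action("put_in", frozenset({"holding", "opened"}), frozenset({"inside"})),
    Action("switch_off", frozenset({"at"}), frozenset({"off"})),
    Action("slice", frozenset({"holding_knife"}), frozenset({"sliced"})),
]

graph = build_action_graph(DOMAIN)
for name, successors in graph.items():
    print(name, "->", sorted(successors))
```

Because the graph depends only on the domain, it can be computed once offline and reused for every task at runtime.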

Relevance Filter

Identifies minimal set of relevant actions and objects

Model or implementation: GPT-5.2 (Shallow reasoning) + Backward DFS
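
A backward depth-first search over such a graph might look like the sketch below. The graph literal, the action names, and the `goal_actions` list (standing in for the output of the LLM's shallow reasoning step) are hypothetical. Only action filtering is shown; how the paper selects the matching objects is not described in this summary, so it is left out.

```python
def relevant_actions(graph, goal_actions):
    """Return the goal actions plus every action that can reach them.

    `graph` maps each action to the set of actions it enables, so the
    search walks those edges in reverse, starting from the goals.
    """
    # Invert the successor map to get predecessors of each action.
    predecessors = {node: set() for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            predecessors.setdefault(dst, set()).add(src)

    visited = set()
    stack = list(goal_actions)  # explicit stack gives an iterative DFS
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(predecessors.get(node, set()) - visited)
    return visited


# Hypothetical action graph: keys enable the actions in their value sets.
ACTION_GRAPH = {
    "goto": {"open", "pickup", "switch_off"},
    "open": {"put_in"},
    "pickup": {"put_in"},
    "put_in": set(),
    "switch_off": set(),
    "slice": set(),
}

# Stand-in for the LLM's answer to "which actions achieve the goal?"
goal_actions = ["put_in", "switch_off"]
kept = relevant_actions(ACTION_GRAPH, goal_actions)
print(sorted(kept))
```

In this toy case `slice` is never reached from the goals, so it is dropped from the sub-domain handed to the planning LLM.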

Task Decomposer (Planning Pipeline)

Breaks instruction into sub-tasks using filtered domain

Model or implementation: GPT-5.2

Task Allocator (Planning Pipeline)

Assigns sub-tasks to robots based on capabilities

Model or implementation: GPT-5.2
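
In Scale-Plan the allocation is performed by the LLM. The sketch below is a rule-based stand-in that only illustrates the capability constraint: a sub-task may go to a robot only if that robot's skills cover every action the sub-task needs. The robot names, skill sets, sub-tasks, and the least-loaded tie-break are all assumptions made for the example.

```python
def allocate(subtasks, robots):
    """Assign each sub-task to the least-loaded robot that can perform it.

    subtasks: {sub-task name: set of required actions}
    robots:   {robot name: set of actions the robot can execute}
    """
    load = {name: 0 for name in robots}
    assignment = {}
    for task, required in subtasks.items():
        # A robot is capable only if its skills are a superset of the needs.
        capable = [r for r, skills in robots.items() if required <= skills]
        if not capable:
            raise ValueError(f"no robot can perform {task!r}")
        chosen = min(capable, key=lambda r: (load[r], r))
        assignment[task] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical heterogeneous team and decomposed instruction.
ROBOTS = {
    "robot1": {"goto", "pickup", "open", "put_in"},
    "robot2": {"goto", "switch_off"},
}
SUBTASKS = {
    "store_apple": {"goto", "open", "pickup", "put_in"},
    "light_off": {"goto", "switch_off"},
}

plan = allocate(SUBTASKS, ROBOTS)
print(plan)
```

A greedy rule like this cannot reason about ordering between sub-tasks or about implicit goals, which is part of why the pipeline keeps decomposition and allocation as separate LLM steps.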

Novel Architectural Elements

Integration of an offline-computed Action Graph with runtime LLM reasoning to structurally filter the planning environment
Direct synthesis of multi-robot plans from filtered representations without generating an intermediate PDDL problem file

Modeling

Base Model: GPT-5.2

Compute: Not reported in the paper

Comparison to Prior Work

vs. LaMMA-P: Scale-Plan filters the environment BEFORE planning and skips PDDL problem file generation, preventing errors from noisy state descriptions
vs. LLM+P: Scale-Plan handles heterogeneous robot allocation and decomposition explicitly, rather than relying on a single PDDL solver which may fail on large problem files
vs. Scene Graph Approaches [not cited in paper]: Scale-Plan filters at the ACTION domain level rather than just the object/scene graph level

Limitations

Absence of explicit environmental grounding can still lead to hallucinations (e.g., assuming an object is open)
Vague tasks rely heavily on the LLM's ability to infer implicit goals, which can be inconsistent
Requires a predefined PDDL domain specification; cannot handle open-world actions outside the domain

Reproducibility

The paper introduces the MAT2-THOR benchmark (a cleaned version of MAT-THOR) but does not provide a direct URL for the code or benchmark in the text. The LLM used is GPT-5.2. Action graph rules are explicitly defined in the methodology.

📊 Experiments & Results

Evaluation Setup

Multi-robot task planning in AI2-THOR simulator using the MAT2-THOR benchmark

Benchmarks:

MAT2-THOR (Long-horizon multi-robot manipulation) [New]

Metrics:

Task Completion Rate (TCR)
Goal Condition Recall (GCR)
Executability Rate (ER)
Planning Time (PT)

Statistical methodology: Not explicitly reported in the paper

Key Results

Scale-Plan outperforms the baselines on the reported success metrics, with particularly large gains on complex tasks.

Benchmark	Metric	Baseline	This Paper	Δ
MAT2-THOR	Task Completion Rate (TCR)	0.53 (LaMMA-P, LLM-corrected)	0.78	+0.25
MAT2-THOR (Complex Tasks)	Task Completion Rate (TCR)	0.24 (LaMMA-P, LLM-corrected)	0.59	+0.35
MAT2-THOR	Executability Rate (ER)	Not reported in the paper	Not reported in the paper	+0.09
MAT2-THOR (Complex Tasks)	Task Completion Rate (TCR)	0.41 (without action-graph filtering)	0.59	+0.18
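
The Δ column reports absolute differences between rates, which is not the same as a relative improvement. The short check below recomputes both from the table's TCR values; the row labels and the `gains` helper are illustrative names only.

```python
# (label, baseline, this paper) taken from the TCR rows of the table.
ROWS = [
    ("overall TCR", 0.53, 0.78),
    ("complex-task TCR, first row", 0.24, 0.59),
    ("complex-task TCR, second row", 0.41, 0.59),
]


def gains(baseline, ours):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute_pp = round((ours - baseline) * 100, 1)
    relative_pct = round((ours - baseline) / baseline * 100, 1)
    return absolute_pp, relative_pct


for label, baseline, ours in ROWS:
    pp, rel = gains(baseline, ours)
    print(f"{label}: +{pp} points absolute, +{rel}% relative")
```

For the overall row this gives 25.0 points absolute but about 47.2% relative, so the headline gains are best read as percentage points.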

Experiment Figures

Visual explanation of the Strict vs. Relaxed rules for edge generation in the Action Graph.

Main Takeaways

Environment filtering is critical: Removing the action-graph filtering lowers TCR on complex tasks by 18 percentage points (0.59 → 0.41), indicating that irrelevant context confuses the planner.
Structured planning pipeline (Decomposition → Allocation) is superior to joint planning (doing everything in one prompt), as seen in ablation results.
Scale-Plan incurs higher planning time than pure LLM approaches due to multiple inference steps, but yields substantially higher success rates, which makes the extra time a worthwhile trade-off.
Common failures involve missing affordance checks (e.g., not opening a cabinet before placing an item), suggesting a need for better semantic state modeling.

📚 Prerequisite Knowledge

Prerequisites

Understanding of PDDL (Planning Domain Definition Language) and STRIPS
Familiarity with graph search algorithms (DFS)
Basic knowledge of Large Language Models (LLMs) in robotics

Key Terms

PDDL: Planning Domain Definition Language—a standard encoding for planning problems using predicates, actions, preconditions, and effects

Action Graph: A directed graph where nodes are parameterized actions and edges represent logical dependencies (e.g., action A produces an effect required by action B)

Grounding: The process of mapping symbolic representations (like 'apple') to specific, actionable objects in the simulator or real world

Hallucination: When an LLM generates plausible-sounding but factually incorrect information, such as inventing objects that don't exist in the scene

Heterogeneous Multi-Robot Systems: Teams of robots where different members have different physical capabilities (e.g., one can fly, one can manipulate objects)

TCR: Task Completion Rate—percentage of tasks where all ground-truth goal conditions are satisfied

GCR: Goal Condition Recall—proportion of goal conditions achieved, averaged across tasks

ER: Executability Rate—percentage of planned actions that successfully execute in the simulator

DFS: Depth-First Search—an algorithm for traversing or searching tree or graph data structures
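
The TCR, GCR, and ER definitions above can be made concrete with a small calculation. This is a sketch under assumptions: the per-episode record format is invented, and ER is pooled over all planned actions, which is one reading of the definition (the paper may instead average it per task).

```python
def evaluate(episodes):
    """Compute TCR, GCR and ER from per-episode counts.

    Each episode is a dict with:
      goals_met / goals_total      goal conditions satisfied / defined
      actions_ok / actions_total   planned actions executed / attempted
    """
    n = len(episodes)
    # TCR: fraction of tasks where every goal condition is satisfied.
    tcr = sum(e["goals_met"] == e["goals_total"] for e in episodes) / n
    # GCR: per-task fraction of satisfied goal conditions, averaged over tasks.
    gcr = sum(e["goals_met"] / e["goals_total"] for e in episodes) / n
    # ER: fraction of planned actions that executed (pooled; an assumption).
    er = (sum(e["actions_ok"] for e in episodes)
          / sum(e["actions_total"] for e in episodes))
    return {"TCR": tcr, "GCR": gcr, "ER": er}


# Two made-up episodes: one fully solved, one half solved.
EPISODES = [
    {"goals_met": 2, "goals_total": 2, "actions_ok": 5, "actions_total": 5},
    {"goals_met": 1, "goals_total": 2, "actions_ok": 3, "actions_total": 5},
]

scores = evaluate(EPISODES)
print(scores)
```

With these two episodes TCR is 0.5 while GCR is 0.75, which shows why GCR gives partial credit that TCR does not.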