Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification

📝 Paper Summary

Robotic Task Planning, Neuro-symbolic AI

HVR is a neuro-symbolic robotic planner that combines hierarchical decomposition, Knowledge Graph RAG for context, and formal symbolic verification to improve accuracy on long-horizon tasks.

Core Problem

LLM-based robotic planners struggle with long-horizon tasks due to poor hierarchical reasoning, lack of environment-specific knowledge, and the generation of hallucinated or logically inconsistent plans.

Why it matters:

Robots in specialized settings (e.g., healthcare, kitchens) require precision that statistical LLMs often lack.
Executing incorrect plans in physical environments can be dangerous or costly; formal correctness is essential before execution.
Existing RAG methods improve knowledge access but do not guarantee the logical validity of the generated action sequences.

Concrete Example: In a task like 'Serve wine', an LLM might generate 'pour wine' before 'pick up bottle'. Without verification, the robot fails. HVR decomposes this into macro-actions, retrieves relevant object states (e.g., bottle is corked), and uses a symbolic validator to catch the missing 'uncork' or 'pick up' steps.
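The kind of check described above can be illustrated with a minimal sketch. The paper uses a PDDL-based validator; the version below is a plain-Python stand-in, and the action names, predicates, and the `first_violation` helper are all hypothetical, chosen only to mirror the 'Serve wine' example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action: preconditions, added facts, deleted facts."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset = frozenset()


# Hypothetical mini-domain for the 'Serve wine' example (not from the paper).
DOMAIN = {
    "pick-up bottle": Action(
        "pick-up bottle", frozenset(), frozenset({"holding(bottle)"})
    ),
    "uncork bottle": Action(
        "uncork bottle",
        frozenset({"holding(bottle)", "corked(bottle)"}),
        frozenset({"open(bottle)"}),
        frozenset({"corked(bottle)"}),
    ),
    "pour wine": Action(
        "pour wine",
        frozenset({"holding(bottle)", "open(bottle)"}),
        frozenset({"filled(glass)"}),
    ),
}


def first_violation(plan, initial_state):
    """Simulate the plan; return (index, action, missing facts) at the first
    action whose preconditions do not hold, or None if the plan is valid."""
    state = set(initial_state)
    for index, name in enumerate(plan):
        action = DOMAIN[name]
        missing = action.preconditions - state
        if missing:
            return index, name, missing
        # Apply effects: remove deleted facts, then add new ones.
        state = (state - action.del_effects) | action.add_effects
    return None


initial = {"corked(bottle)"}
print(first_violation(["pour wine"], initial))
print(first_violation(["pick-up bottle", "uncork bottle", "pour wine"], initial))
```

The missing facts returned for the faulty plan are the kind of feedback that could be passed back to the LLM for correction.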

Key Novelty

HVR (Hierarchical, Verification, RAG)

Integrates three distinct components: Hierarchical planning (decomposing tasks into Macro Actions then Atomic Actions), KG-RAG (retrieving dynamic object states from a Knowledge Graph), and Symbolic Verification (using PDDL to check and correct logic).
Uses the Symbolic Validator not just for pre-execution checks but also as a runtime failure detector by comparing the expected 'ideal' world state with the observed scene graph.
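A rough sketch of the runtime comparison is given below, assuming both the expected world state and the observed scene graph can be expressed as sets of (subject, relation, object) triples. That representation, the `detect_failure` name, and the example facts are assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)


def detect_failure(expected: Set[Fact], observed: Set[Fact]) -> Dict[str, Set[Fact]]:
    """Return expected facts that were not observed, plus observed facts that
    share a subject and relation with a missing fact (likely contradictions)."""
    missing = expected - observed
    missing_keys = {(subj, rel) for subj, rel, _ in missing}
    conflicting = {
        fact for fact in observed - expected
        if (fact[0], fact[1]) in missing_keys
    }
    return {"missing": missing, "conflicting": conflicting}


# Hypothetical state after executing put-on(pan, stove).
expected_state = {("pan", "on", "stove"), ("stove", "state", "on")}
observed_scene = {
    ("pan", "on", "counter"),   # the pan never reached the stove
    ("stove", "state", "on"),
    ("apple", "on", "table"),   # unrelated fact, ignored
}

report = detect_failure(expected_state, observed_scene)
if report["missing"]:
    print("failure detected:", report)
```

Only expected facts are checked, so unrelated objects in the scene do not raise false alarms.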

Architecture

[Figure: the complete HVR pipeline workflow, from task input to execution.]

Evaluation Highlights

HVR with Gemini achieves 94.19% Plan Correctness across all tasks, significantly outperforming the standard LLM baseline (17.72%) and other ablated versions.
On high-complexity tasks (>20 steps), HVR maintains high performance (88.39% with Gemini), whereas the standard LLM baseline drops to 3.76%.
Symbolic verification preserves or slightly improves formal plan validity: Expanded Plan Verification (EPV) scores rise from 47.03% to 47.39% for Phi-3 and remain high at 88.11% for Gemini after corrections.

Breakthrough Assessment

8/10

Strong integration of symbolic methods with LLMs for robotics. The comprehensive evaluation on complex long-horizon tasks (up to 40+ steps) distinguishes it from simpler block-stacking benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Generating a sequence of executable atomic actions for a robot in a kitchen environment E, given a natural language task description t and an ontology O.

Inputs: Natural language task description t (constructive or goal-oriented) and a subgraph G' from the Knowledge Graph containing relevant objects.

Outputs: A sequence of atomic actions (AAs) executable by the robot (e.g., put-on(pan, stove)).

Pipeline Flow

KG-RAG Retrieval: Extract subgraph G' with relevant objects/states
Macro Plan Generation (Policy φ): LLM generates sequence of Macro Actions (MAs)
Symbolic Verification (Macro): Validator checks MAs; LLM corrects if invalid
Expansion (Policy π): LLM expands each MA into an AA-block (Atomic Actions)
Symbolic Verification (Atomic): Validator checks AAs; LLM corrects if invalid
Execution & Monitoring: Agent executes AAs; Symbolic Validator checks alignment between expected and observed states
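The planning stages above can be sketched as a small control loop. Every callable here (`retrieve`, `macro_policy`, `expand_policy`, the validators, `correct`) is a hypothetical placeholder for the corresponding LLM or validator component. The `max_rounds` retry limit and the choice to validate the concatenated atomic plan rather than each AA-block are assumptions, since the summary does not specify them.

```python
from typing import Callable, List, Optional

Plan = List[str]
Validator = Callable[[Plan], Optional[str]]  # error message, or None if valid
Corrector = Callable[[Plan, str], Plan]      # stands in for LLM correction


def verify_and_correct(plan: Plan, validate: Validator, correct: Corrector,
                       max_rounds: int = 3) -> Plan:
    """Validate the plan; on failure, ask the corrector to repair it."""
    for _ in range(max_rounds):
        error = validate(plan)
        if error is None:
            return plan
        plan = correct(plan, error)
    error = validate(plan)
    if error is not None:
        raise ValueError(f"plan still invalid after {max_rounds} corrections: {error}")
    return plan


def hvr_plan(task, retrieve, macro_policy, expand_policy,
             validate_macro, validate_atomic, correct) -> Plan:
    """Retrieval, macro planning, macro check, expansion, atomic check."""
    context = retrieve(task)
    macro_plan = verify_and_correct(macro_policy(task, context), validate_macro, correct)
    atomic_plan: Plan = []
    for macro_action in macro_plan:
        atomic_plan.extend(expand_policy(macro_action, context))  # one AA-block
    return verify_and_correct(atomic_plan, validate_atomic, correct)


# Toy stubs standing in for the KG, the LLM policies, and the validator.
blocks = {"open bottle": ["uncork bottle"], "serve": ["pour wine"]}
plan = hvr_plan(
    "Serve wine",
    retrieve=lambda task: {"objects": ["bottle", "glass"]},
    macro_policy=lambda task, ctx: ["open bottle", "serve"],
    expand_policy=lambda ma, ctx: blocks[ma],
    validate_macro=lambda p: None,
    validate_atomic=lambda p: None if p[:1] == ["pick-up bottle"] else "bottle not held",
    correct=lambda p, err: ["pick-up bottle"] + p,
)
print(plan)
```

Execution and runtime monitoring are left out of this sketch; they would consume the returned atomic plan.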

System Modules

Knowledge Graph RAG

Retrieves task-relevant objects and their dynamic states (e.g., is_cooked) from the Knowledge Graph to prevent hallucinations.

Model or implementation: Frozen LLM (Phi-3 or Gemini) for selection + Symbolic Querying
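One way to picture the retrieval step is as a filter over knowledge-graph triples. In the sketch below the graph contents, the `extract_subgraph` and `to_prompt_context` helpers, and the hard-coded list of selected objects (which stands in for the LLM selection step) are all illustrative assumptions rather than the paper's implementation.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical kitchen knowledge graph.
KG: List[Triple] = [
    ("bottle", "is_a", "Openable"),
    ("bottle", "state", "corked"),
    ("glass", "is_a", "Fillable"),
    ("glass", "state", "empty"),
    ("pan", "is_a", "Cookware"),
    ("pan", "on", "stove"),
]


def extract_subgraph(kg: Iterable[Triple], relevant_objects: Iterable[str]) -> List[Triple]:
    """Keep only triples whose subject or object is a task-relevant object."""
    relevant = set(relevant_objects)
    return [t for t in kg if t[0] in relevant or t[2] in relevant]


def to_prompt_context(subgraph: Iterable[Triple]) -> str:
    """Serialize the subgraph as one fact per line for the planner prompt."""
    return "\n".join(f"{s} {r} {o}" for s, r, o in subgraph)


selected = ["bottle", "glass"]  # in HVR this selection is made by the LLM
subgraph = extract_subgraph(KG, selected)
print(to_prompt_context(subgraph))
```

Passing only the filtered facts keeps the prompt short, which matters most for small-context models such as Phi-3-mini-4k-instruct.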

Macro Planner (Policy φ) (Planning)

Decomposes the high-level task into a sequence of Macro Actions (subtasks).

Model or implementation: Phi-3-mini-4k-instruct or Gemini-1.5-flash

Action Expander (Policy π) (Planning)

Expands each Macro Action into a sequence of executable Atomic Actions (AA-block).

Model or implementation: Phi-3-mini-4k-instruct or Gemini-1.5-flash

Symbolic Validator

Checks plan feasibility using PDDL. Acts as a failure detector at runtime by comparing ideal PDDL state vs. observed Scene Graph.

Model or implementation: Ad-hoc Python-based PDDL validator

Novel Architectural Elements

Dual-use Symbolic Validator: Used both for pre-execution plan verification (correcting LLM logic) and runtime failure detection (aligning ideal world state with observed scene graph).
Integration of dynamic KG-RAG specifically into a hierarchical neuro-symbolic planning pipeline.

Modeling

Base Model: Evaluated with both Phi-3-mini-4k-instruct (small) and Gemini-1.5-flash (large)

Compute: Not reported in the paper

Comparison to Prior Work

vs. HiP: HVR uses symbolic KG-based RAG and PDDL verification rather than just visual grounding.
vs. MLDT: HVR integrates dynamic Knowledge Graph retrieval to handle object state changes, unlike static decomposition.
vs. SayPlan: HVR focuses on complex manipulation tasks (cooking) rather than primarily navigation.
vs. Text2Motion [not cited in paper]: HVR focuses on high-level task logic and symbolic validity rather than low-level motion primitive generation.

Limitations

LLM-based planners (even with HVR) tend to generate unnecessarily long plans with redundant steps compared to minimal ground truth.
Limitations of the AI2Thor simulator sometimes cause execution failures even for formally correct plans (95% execution success rate).
RAG is less effective for larger LLMs such as Gemini, which can handle larger contexts without retrieval, though it remains critical for smaller models.
Performance drops significantly for open-ended objectives compared to tasks with specific goal states.

Reproducibility

The paper uses open-source models (Phi-3) and a freely available simulator (AI2Thor). The ontology (OntoThor) is cited from prior work. Code URL is not provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Robotic task planning in the AI2Thor simulator using the OntoThor ontology.

Benchmarks:

Custom Kitchen Tasks (Long-horizon robotic manipulation) [New]

Metrics:

Plan Correctness (PC)
Execution Success (ES)
Length Discrepancy (LD)
Expanded Plan Verification (EPV)
Macro Plan Verification (MPV)

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of the proposed HVR method against ablated baselines (HV, HR, VR, R, LLM) using two different LLMs (Phi-3 and Gemini).
Benchmark	Metric	Baseline	This Paper	Δ
Custom Kitchen Tasks	Plan Correctness (PC)	11.79	59.66	+47.87
Custom Kitchen Tasks	Plan Correctness (PC)	17.72	94.19	+76.47
Custom Kitchen Tasks	Plan Correctness (PC)	18.62	59.66	+41.04
Custom Kitchen Tasks	Plan Correctness (PC)	49.01	94.19	+45.18
Custom Kitchen Tasks	Length Discrepancy (Avg Abs)	0	562.50	+562.50
Custom Kitchen Tasks	Macro Plan Verification (MPV)	27.41	74.20	+46.79

Main Takeaways

HVR consistently outperforms baselines across varying task complexities and LLM sizes.
For small LLMs (Phi-3), RAG is the most critical component to compensate for limited reasoning/knowledge.
For large LLMs (Gemini), Hierarchical planning and Verification are the most impactful features; RAG helps less due to large context windows.
LLM-based planners prioritize success over efficiency, often generating plans that are 100-200% longer than necessary.
Symbolic verification is highly effective: there is a strong correlation between formally verified plans and correctly executed plans.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) for planning
Knowledge of PDDL (Planning Domain Definition Language)
Familiarity with Retrieval-Augmented Generation (RAG) concepts
Robotic simulation environments (specifically AI2Thor)

Key Terms

HVR: The proposed method standing for Hierarchical planning, Verification, and RAG.

Atomic Action (AA): A low-level, executable action the robot can perform directly (e.g., 'pick-up apple').

Macro Action (MA): A high-level subtask description (e.g., 'make coffee') that expands into a sequence of atomic actions.

PDDL: Planning Domain Definition Language—a formal language used to define actions, preconditions, and effects for automated planning.

Knowledge Graph (KG): A structured representation of the environment where nodes are objects/concepts and edges are relationships (e.g., Apple is-a Sliceable).

Scene Graph: A graph representation of the visual scene captured by the robot's camera, encoding objects and spatial relations.

KG-RAG: Knowledge Graph Retrieval-Augmented Generation—using a KG to retrieve relevant context for the LLM.

AA-block: An ordered sequence of atomic actions that expands a single macro action.

OntoThor: The specific ontology used to describe the AI2Thor kitchen environment.

Plan Correctness (PC): Metric measuring the ratio of correctly planned steps aligned with ground truth up to the first error.
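
A minimal sketch of how such a metric could be computed follows. It assumes steps are compared position by position and that the denominator is the ground-truth plan length; the summary does not state either detail, so the paper's exact alignment and normalization may differ.

```python
from typing import Sequence


def plan_correctness(plan: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of ground-truth steps matched before the first mismatch.

    Assumption: exact positional matching, normalized by ground-truth length.
    """
    if not ground_truth:
        raise ValueError("ground truth plan must not be empty")
    correct = 0
    for predicted, expected in zip(plan, ground_truth):
        if predicted != expected:
            break  # stop counting at the first error
        correct += 1
    return correct / len(ground_truth)


truth = ["pick-up bottle", "uncork bottle", "pour wine", "put-down bottle"]
generated = ["pick-up bottle", "uncork bottle", "put-down bottle", "pour wine"]
print(plan_correctness(generated, truth))  # first two steps match, then an error
```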