Can LLM be a Good Path Planner based on Prompt Engineering? Mitigating the Hallucination for Path Planning

📝 Paper Summary

Spatial reasoning, Embodied AI agent, Path planning

S2RCQL improves LLM maze navigation by converting raw coordinates into explicit entity relations to fix spatial hallucinations and using curriculum Q-learning to address long-term reasoning inconsistencies.

Core Problem

LLMs struggle with spatial reasoning and long-term path planning in maze environments due to spatial hallucinations (misunderstanding coordinates) and context inconsistency hallucinations (losing track during long reasoning chains).

Why it matters:

Standard prompt engineering (CoT) and even memory-augmented methods (Rememberer) often fail in simple mazes because LLMs intuitively favor shortest straight-line paths, ignoring obstacles.
Spatial reasoning is foundational for embodied intelligence, yet current LLMs perform poorly on these tasks compared to humans.

Concrete Example: In a maze with obstacles, an LLM might try to move from (1,0) to (1,1) because the coordinates look similar or geometrically close, even if a wall exists between them. Standard CoT agents often get stuck in forbidden zones or lose direction after a few steps.

Key Novelty

Spatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL)

Transforms implicit spatial coordinates (e.g., '(0,0) to (1,0)') into explicit entity relations (e.g., 'Node A connected to Node F') to prevent LLMs from hallucinating based on coordinate similarity.
Integrates Q-learning directly into the prompt context: the agent retrieves Q-values for state-action pairs to guide decision-making, replacing random exploration with LLM prior knowledge.
Uses Reverse Curriculum Learning (RCL) to generate simplified intermediate starting points, allowing the LLM to learn from easy-to-hard tasks and reduce reasoning chain length.

Evaluation Highlights

Outperforms the 'Rememberer' baseline by roughly 23 to 40 percentage points in Success Rate across 5x5, 7x7, and 10x10 mazes.
Achieves 23%–30% higher Optimality Rate (finding shortest paths) compared to Rememberer.
Removing the Spatial-to-Relational (S2R) module causes a ~15% drop in success rates, validating its role in mitigating spatial hallucination.

Breakthrough Assessment

7/10

Novel combination of symbolic transformation (spatial-to-relational) and RL-guided prompting. Significant empirical gains on maze tasks, though tested on a specific proprietary LLM (ERNIE-Bot) rather than open models.

⚙️ Technical Details

Problem Definition

Setting: Maze path planning where an agent must navigate from a start node to a goal node while avoiding obstacles.

Inputs: Textual description of maze size, obstacles, start point, and goal.

Outputs: A sequence of moves (path) from start to goal.

Pipeline Flow

Environment Parser (LLM extracts maze info to JSON)
Graph Constructor (Python converts JSON to connectivity graph)
Curriculum Generator (LLM or heuristic generates intermediate start points)
Agent Loop (State retrieval -> Prompt construction -> LLM Action selection -> Update)

System Modules

Environment Parser (Input Processing)

Extract structure from text description

Model or implementation: ERNIE-Bot 4.0

Graph Constructor (Input Processing)

Convert spatial coordinates to relational graph

Model or implementation: Python Script
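
The conversion this module performs can be illustrated with a minimal sketch. It is not the paper's script: the function names, the row-major A, B, C labelling, and the treatment of obstacles as blocked cells are all assumptions made for illustration.

```python
from string import ascii_uppercase


def grid_to_relations(rows, cols, obstacles):
    """Map free grid cells to letter labels and list which labels are adjacent.

    Assumption: obstacles are whole blocked cells and moves are 4-directional.
    """
    blocked = set(obstacles)
    free = [(r, c) for r in range(rows) for c in range(cols) if (r, c) not in blocked]
    if len(free) > len(ascii_uppercase):
        raise ValueError("this toy labelling only supports up to 26 free cells")
    # Hypothetical labelling scheme: A, B, C, ... in row-major order.
    label = {cell: ascii_uppercase[i] for i, cell in enumerate(free)}
    relations = {}
    for (r, c) in free:
        neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        # Keep only neighbours that are free cells; coordinates never reach the prompt.
        relations[label[(r, c)]] = sorted(label[n] for n in neighbours if n in label)
    return label, relations


def relations_to_text(relations):
    """Render the adjacency lists as plain sentences for an LLM prompt."""
    return "\n".join(
        f"Node {node} is connected to: {', '.join(adj) or 'nothing'}"
        for node, adj in relations.items()
    )


# 3x3 maze with the centre cell blocked.
label, relations = grid_to_relations(3, 3, obstacles=[(1, 1)])
print(relations_to_text(relations))
```

Because the prompt only ever sees node names and connectivity, two cells with similar coordinates are no more "alike" to the model than any other pair of nodes.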

Curriculum Generator

Generate intermediate goals to simplify the task

Model or implementation: ERNIE-Bot 4.0 or Hand-crafted heuristic
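
One way a hand-crafted heuristic could pick intermediate start points is sketched below. This is an assumption, not the paper's procedure: it runs a breadth-first search from the goal over the relational graph and picks nodes on one shortest path at increasing distance from the goal. The names `bfs_distances`, `reverse_curriculum`, and `stages` are hypothetical.

```python
from collections import deque


def bfs_distances(graph, goal):
    """Number of moves from every reachable node to the goal."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def reverse_curriculum(graph, start, goal, stages=3):
    """Return start nodes ordered from easy (near the goal) to hard (the real start)."""
    dist = bfs_distances(graph, goal)
    if start not in dist:
        raise ValueError("the goal is unreachable from the start node")
    total = dist[start]
    if total == 0:
        return [start]
    # Follow one shortest path from start to goal, indexing its nodes by distance.
    path = [start]
    while path[-1] != goal:
        current = path[-1]
        path.append(min((n for n in graph[current] if n in dist), key=lambda n: dist[n]))
    by_dist = {dist[node]: node for node in path}
    # Evenly spaced distances, always ending at the full task.
    targets = sorted({max(1, round(total * k / stages)) for k in range(1, stages + 1)})
    return [by_dist[d] for d in targets]


# Ring-shaped graph of a 3x3 maze whose centre cell is blocked.
graph = {
    "A": ["B", "D"], "B": ["A", "C"], "C": ["B", "E"], "D": ["A", "F"],
    "E": ["C", "H"], "F": ["D", "G"], "G": ["F", "H"], "H": ["E", "G"],
}
curriculum = reverse_curriculum(graph, start="A", goal="H", stages=3)
print(curriculum)
```

Each stage shortens the reasoning chain the LLM must hold in context, and the Q-values learned near the goal are already in place when the harder stages begin.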

Q-Learning Agent

Select next move based on context and Q-values

Model or implementation: ERNIE-Bot 4.0

Novel Architectural Elements

Integration of explicit Q-values into the LLM prompt context to guide decision making.
Substitution of random RL exploration with LLM-based prior knowledge exploration.
Hybrid pipeline converting spatial data to symbolic relational data before reasoning.
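
As a rough illustration of the first element, the sketch below assembles a prompt that lists each available move together with its current Q-value. The paper's prompt templates are not specified, so the wording, the `build_prompt` name, and the default Q-value of 0 for unseen pairs are all assumptions.

```python
def build_prompt(state, goal, neighbours, q_table, history):
    """Assemble a text prompt exposing Q-values for the moves available in `state`.

    q_table maps (state, next_node) pairs to floats; unseen pairs default to 0.0.
    """
    lines = [
        f"You are at Node {state}. The goal is Node {goal}.",
        f"Nodes visited so far: {', '.join(history) or 'none'}.",
        "Available moves and their current Q-values:",
    ]
    for nxt in neighbours:
        q_value = q_table.get((state, nxt), 0.0)
        lines.append(f"- move to Node {nxt}: Q = {q_value:.2f}")
    lines.append("Reply with the single node you move to next.")
    return "\n".join(lines)


prompt = build_prompt(
    state="A",
    goal="H",
    neighbours=["B", "D"],
    q_table={("A", "B"): 1.5},
    history=[],
)
print(prompt)
```

The returned string would then be sent to the LLM, whose reply is parsed into the next node.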

Modeling

Base Model: ERNIE-Bot 4.0

Training Method: Q-learning (Tabular RL approach assisted by LLM)

Objective Functions:

Purpose: Update Q-values based on rewards.

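Formally: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)], where alpha is the learning rate, gamma is the discount factor, r is the reward received, s' is the next state, and the max is taken over the actions a' available in s'.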
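
The update rule above can be written as a small tabular routine. This is a sketch under stated assumptions: the values `alpha=0.1` and `gamma=0.9` are illustrative and not taken from the paper, unseen pairs start at 0, and the step that reaches the goal is assumed to receive the goal reward of 30 on its own.

```python
def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    alpha and gamma are illustrative defaults, not values reported in the paper.
    """
    # Best value reachable from the next state; 0.0 if it is terminal.
    best_next = max(
        (q_table.get((next_state, a), 0.0) for a in next_actions),
        default=0.0,
    )
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]


q = {}
# Step E -> H reaches the goal (reward 30); H is terminal, so no next actions.
q_update(q, "E", "H", reward=30, next_state="H", next_actions=[])
# Step C -> E is an ordinary move (reward -1); it bootstraps from Q(E, H).
q_update(q, "C", "E", reward=-1, next_state="E", next_actions=["C", "H"])
print(q)
```

With these settings the first update gives Q(E,H) = 0.1 * 30 = 3.0, and the second gives Q(C,E) = 0.1 * (-1 + 0.9 * 3.0) = 0.17, showing how the goal reward propagates backwards one step at a time.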

Key Hyperparameters:

reward_step: -1
reward_goal: 30
epsilon: Variable (epsilon-greedy)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Rememberer: S2RCQL adds explicit spatial-to-relational transformation and curriculum learning, whereas Rememberer relies on raw memory and CoT.
vs. CoT/ReAct: S2RCQL uses an external Q-value table and curriculum structure to guide long-term planning, preventing the context loss common in standard prompting.
vs. Standard Q-Learning: S2RCQL replaces random exploration with LLM-guided exploration and uses natural language prompts for state representation.

Limitations

Relies on a proprietary model (ERNIE-Bot 4.0), limiting reproducibility.
Performance degrades as maze size increases: Success Rate falls from 96.7 on 5x5 mazes to 80.0 on 10x10 mazes, and Optimality Rate also drops.
Requires converting the environment to a text-based graph, which may not scale to continuous or highly complex 3D spaces.

Reproducibility

The code is not publicly available. The paper uses a closed-source model (ERNIE-Bot 4.0) and does not specify exact prompt templates or temperature settings.

📊 Experiments & Results

Evaluation Setup

Maze path planning simulation using OpenAI Gym.

Benchmarks:

5x5 Mazes (Path Planning) [New]
7x7 Mazes (Path Planning) [New]
10x10 Mazes (Path Planning) [New]

Metrics:

Success Rate
Optimality Rate
Statistical methodology: Not explicitly reported in the paper
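
The two metrics could be computed as in the sketch below. The exact definitions are an assumption: Success Rate is taken as the share of episodes that reach the goal, and Optimality Rate as the share of all episodes that reach it in the shortest possible number of steps. The episode field names are hypothetical.

```python
def evaluate(episodes):
    """Compute Success Rate and Optimality Rate as percentages.

    Each episode is a dict with 'reached_goal' (bool), 'steps' (int),
    and 'shortest' (int, the length of a shortest path for that maze).
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    n = len(episodes)
    successes = sum(1 for e in episodes if e["reached_goal"])
    optimal = sum(
        1 for e in episodes if e["reached_goal"] and e["steps"] == e["shortest"]
    )
    return {
        "success_rate": 100 * successes / n,
        "optimality_rate": 100 * optimal / n,
    }


episodes = [
    {"reached_goal": True, "steps": 8, "shortest": 8},
    {"reached_goal": True, "steps": 10, "shortest": 8},
    {"reached_goal": True, "steps": 8, "shortest": 8},
    {"reached_goal": False, "steps": 50, "shortest": 8},
]
scores = evaluate(episodes)
print(scores)
```

Under this definition the Optimality Rate can never exceed the Success Rate, which matches the 5x5 figures reported in the results.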

Key Results

S2RCQL significantly outperforms baselines in both success rate and optimality across all maze sizes.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
5x5 Mazes	Success Rate	73.3	96.7	+23.4
7x7 Mazes	Success Rate	60.0	90.0	+30.0
10x10 Mazes	Success Rate	40.0	80.0	+40.0
5x5 Mazes	Optimality Rate	66.7	90.0	+23.3
Averaged across mazes	Success Rate	75.0	90.0	+15.0

Main Takeaways

Explicitly converting spatial coordinates to entity relations (S2R) prevents hallucinations where LLMs confuse similar coordinates.
Curriculum learning is crucial; removing it drops performance by ~20%, especially in larger mazes.
LLM-generated curricula improve performance by ~10% over no curriculum, but hand-crafted curricula are still superior for complex tasks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning basics (Q-learning, states, actions, rewards)
Prompt Engineering techniques (CoT, few-shot)
Curriculum Learning concepts

Key Terms

Spatial-to-Relational Transformation: Converting grid coordinates (e.g., 1,0) into abstract node labels (e.g., Node F) and explicit connectivity lists to remove geometric bias.

Context Inconsistency Hallucination: Errors occurring during long reasoning chains where the model contradicts its previous context or loses track of the current state.

Spatial Hallucination: The tendency of LLMs to misunderstand spatial relationships, often assuming connectivity based on coordinate similarity rather than actual map structure.

Reverse Curriculum Learning: A learning strategy that starts training with tasks close to the goal (easy) and iteratively moves the starting point further away (harder).

Q-learning: A model-free reinforcement learning algorithm that learns the value of an action in a particular state.

Experience Replay Buffer: A memory mechanism that stores past experiences (state, action, reward, next state) to stabilize training by reusing them.

Epsilon-greedy: A policy where the agent chooses the best-known action most of the time but explores random actions with probability epsilon.
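
A minimal sketch of epsilon-greedy selection follows, with an optional hook that stands in for LLM-guided exploration in place of a uniformly random move. The `explore` callback and the function name are hypothetical; in S2RCQL the exploratory choice would come from the LLM, which is only mocked here.

```python
import random


def select_action(state, actions, q_table, epsilon, explore=None, rng=random):
    """Pick an action epsilon-greedily.

    With probability epsilon, explore: call `explore(state, actions)` if given
    (a stand-in for an LLM proposal), otherwise pick uniformly at random.
    Otherwise exploit: return the action with the highest known Q-value.
    """
    if rng.random() < epsilon:
        if explore is not None:
            return explore(state, actions)
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


q_values = {("A", "B"): 0.2, ("A", "D"): 1.1}

# epsilon = 0: always exploit the best-known action.
greedy = select_action("A", ["B", "D"], q_values, epsilon=0.0)

# epsilon = 1: always explore; here a mock "LLM" simply proposes the first option.
proposed = select_action(
    "A", ["B", "D"], q_values, epsilon=1.0, explore=lambda s, acts: acts[0]
)
print(greedy, proposed)
```

Decaying epsilon over episodes shifts the agent from exploration toward exploiting the learned Q-values.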