Understanding the planning of LLM agents: A survey

📝 Paper Summary

Agentic AI, Planning for LLM Agents

This survey provides a systematic taxonomy of planning methodologies for LLM-based agents, categorizing them into decomposition, selection, external assistance, reflection, and memory, while evaluating representative methods on interactive benchmarks.

Core Problem

Conventional planning methods (symbolic, RL) struggle with flexibility and sample efficiency, while LLMs show promise but lack a unified framework for understanding their planning capabilities.

Why it matters:

Autonomous agents need robust planning to handle complex, multi-step tasks in real-world environments.
Existing surveys focus on general LLM capabilities or tool learning, lacking specific depth on the planning mechanisms critical for agent autonomy.
Researchers need a structured view to identify gaps like hallucinations, inefficiency, and lack of fine-grained evaluation in current agent planning.

Concrete Example: In a complex task like 'make breakfast', a standard LLM might hallucinate steps or fail to check whether the ingredients exist. Decomposing the task into sub-goals such as 'buy eggs' and 'cook eggs', and reflecting on failures (e.g., 'no eggs found'), makes the resulting plan more likely to succeed.

Key Novelty

Five-Category Taxonomy for LLM-Agent Planning

Task Decomposition: Divide complex goals into sub-goals (Decomposition-First vs. Interleaved).
Multi-plan Selection: Generate multiple reasoning paths and select the best via search algorithms (e.g., Tree of Thoughts).
External Planner-Aided: Offload constraint handling to symbolic solvers (PDDL) or neural planners while using LLM for formalization.
Reflection & Refinement: Iteratively improve plans based on self-generated feedback or environmental signals.
Memory-Augmented: Retrieve past experiences or domain knowledge to guide current planning (RAG-based vs. Embodied/Fine-tuned).

Architecture

A taxonomy diagram of LLM-Agent planning, dividing the field into five branches: Task Decomposition, Plan Selection, External Module, Reflection, and Memory.

Evaluation Highlights

+0.14 absolute success rate (14 percentage points) for Reflexion over ReAct on ALFWorld (0.71 vs 0.57).
Reflexion achieves the highest success rate (0.71) on ALFWorld but also incurs the highest token expense ($220.17) among the compared methods.
ZeroShot-CoT almost never succeeds on HotPotQA (0.01 SR), whereas few-shot methods reach 0.32 SR, highlighting the need for in-context examples.

Breakthrough Assessment

7/10

A comprehensive survey that structures a rapidly growing field. While it doesn't propose a new model, its taxonomy and comparative evaluation of existing methods provide valuable insights for researchers.

⚙️ Technical Details

Problem Definition

Setting: Given an environment E, a goal g, LLM parameters Θ, and a prompt P, generate a sequence of actions p = (a_0, ..., a_t) that achieves the goal.

Inputs: Environment state E, task goal g, prompt P

Outputs: Plan p consisting of a sequence of actions a_t
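
The setting can be written compactly as a single mapping. In this sketch the function name plan and the semicolon separating task-side from model-side quantities are notational assumptions, and Θ is read as the LLM's parameters:

```latex
p = (a_0, a_1, \dots, a_t) = \mathrm{plan}(E, g; \Theta, P)
```

Read this way, the five planning categories differ in how plan is realized, not in what it takes as input or returns.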

Pipeline Flow

Input Task/Goal
Planning Module (Decomposition / Selection / External / Reflection / Memory)
Action Execution
Environment Feedback
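
The pipeline above can be sketched as a simple loop. This is a minimal illustration, not the survey's experimental harness: run_agent, toy_plan, toy_step, and SCRIPT are hypothetical names, and the planner and environment are scripted stand-ins so the code runs without an LLM.

```python
from typing import Callable, List, Tuple


def run_agent(goal: str,
              plan_fn: Callable[[str, List[str]], str],
              step_fn: Callable[[str], Tuple[str, bool]],
              max_steps: int = 10) -> List[str]:
    """Plan -> act -> observe loop; returns the action/observation history."""
    history: List[str] = []
    for _ in range(max_steps):
        action = plan_fn(goal, history)       # planning module (would be an LLM call)
        observation, done = step_fn(action)   # action execution + environment feedback
        history.append(f"{action} -> {observation}")
        if done:
            break
    return history


# Toy stand-ins (assumptions for illustration only).
SCRIPT = ["open fridge", "take egg", "cook egg"]


def toy_plan(goal: str, history: List[str]) -> str:
    # Picks the next scripted action based on how many steps were taken.
    return SCRIPT[len(history)]


def toy_step(action: str) -> Tuple[str, bool]:
    # The episode ends once the final scripted action is executed.
    return f"ok: {action}", action == "cook egg"


trace = run_agent("make breakfast", toy_plan, toy_step)
```

Passing the history back into plan_fn is what lets feedback from the environment influence the next planning step.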

System Modules

Task Decomposition (Planning Strategy)

Break down complex goals into sub-goals

Model or implementation: Various (e.g., GPT-4 in HuggingGPT)
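
The two decomposition styles named in the taxonomy (Decomposition-First vs. Interleaved) differ mainly in when sub-goals are chosen. The sketch below contrasts them under simplifying assumptions: all function names are hypothetical, and the decomposer is a hard-coded stand-in for an LLM.

```python
from typing import Callable, List, Optional


def decomposition_first(goal: str,
                        decompose: Callable[[str], List[str]],
                        execute: Callable[[str], str]) -> List[str]:
    """Plan every sub-goal up front, then execute them in order."""
    subgoals = decompose(goal)
    return [execute(sg) for sg in subgoals]


def interleaved(goal: str,
                next_subgoal: Callable[[str, List[str]], Optional[str]],
                execute: Callable[[str], str],
                max_steps: int = 10) -> List[str]:
    """Choose one sub-goal at a time, conditioned on results so far."""
    results: List[str] = []
    for _ in range(max_steps):
        sg = next_subgoal(goal, results)
        if sg is None:          # the planner signals that the goal is reached
            break
        results.append(execute(sg))
    return results


# Toy stand-ins (assumptions for illustration only).
STEPS = ["buy eggs", "cook eggs"]


def toy_decompose(goal: str) -> List[str]:
    return list(STEPS)


def toy_next(goal: str, results: List[str]) -> Optional[str]:
    return STEPS[len(results)] if len(results) < len(STEPS) else None


def toy_execute(subgoal: str) -> str:
    return f"done: {subgoal}"


first = decomposition_first("make breakfast", toy_decompose, toy_execute)
inter = interleaved("make breakfast", toy_next, toy_execute)
```

In this toy case both styles produce the same result; the interleaved form only pays off when execution results change which sub-goal should come next.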

Multi-plan Selection (Planning Strategy)

Generate and evaluate multiple candidate plans

Model or implementation: LLM as generator and evaluator
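
As a rough illustration of the generate-then-select pattern, the sketch below shows two simple selection rules: majority voting over sampled candidates and picking the highest-scoring candidate. Both are deliberately simpler than tree search such as ToT; the function names, candidate plans, and scores are all made up for the example.

```python
from collections import Counter
from typing import Callable, List


def select_by_vote(candidates: List[str]) -> str:
    """Return the most frequently sampled candidate."""
    return Counter(candidates).most_common(1)[0][0]


def select_by_score(candidates: List[str],
                    score: Callable[[str], float]) -> str:
    """Return the candidate that the evaluator scores highest."""
    return max(candidates, key=score)


# Hypothetical samples standing in for repeated LLM generations.
sampled = ["plan A", "plan B", "plan A"]
voted = select_by_vote(sampled)

# Hypothetical evaluator scores standing in for an LLM-as-evaluator call.
scores = {"fry": 0.2, "boil; fry": 0.5, "boil; fry; serve": 0.9}
best = select_by_score(list(scores), lambda plan: scores[plan])
```

Voting needs no evaluator but only works when candidates can be compared for equality; scoring handles free-form plans at the cost of an extra evaluation call per candidate.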

External Planner (Planning Strategy)

Solve formalized problems efficiently

Model or implementation: Symbolic solver (Fast-Downward) or Neural Planner
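
To make the division of labor concrete, the sketch below uses a tiny breadth-first search as a stand-in for the symbolic solver. It is not Fast-Downward and does not parse PDDL; the action dictionary plays the role of the formalization an LLM would produce, and every name in it is hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Tuple

# action name -> (preconditions, add effects, delete effects)
Action = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]


def bfs_plan(init: FrozenSet[str],
             goal: FrozenSet[str],
             actions: Dict[str, Action]) -> Optional[List[str]]:
    """Breadth-first search over fact sets; returns a shortest plan or None."""
    queue = deque([(init, [])])
    seen = {init}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:                      # all goal facts hold
            return plan
        for name, (pre, add, delete) in actions.items():
            if pre <= state:                   # action is applicable
                nxt = (state - delete) | add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None


# Hand-written problem standing in for LLM-produced formalization.
actions: Dict[str, Action] = {
    "buy_eggs": (frozenset({"at_store"}), frozenset({"have_eggs"}), frozenset()),
    "cook_eggs": (frozenset({"have_eggs"}), frozenset({"eggs_cooked"}),
                  frozenset({"have_eggs"})),
}
plan = bfs_plan(frozenset({"at_store"}), frozenset({"eggs_cooked"}), actions)
```

Because the solver only applies actions whose preconditions hold, it cannot produce a plan that cooks eggs it does not have, which is the kind of constraint handling that gets offloaded from the LLM.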

Reflection (Planning Strategy)

Critique and correct plans based on feedback

Model or implementation: LLM as reflector
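
A reflection loop can be sketched as retrying a task while carrying verbal feedback forward between trials. This is a simplified illustration in the spirit of Reflexion, not its actual implementation; all function names are hypothetical and the attempt, evaluation, and reflection steps are scripted stand-ins for LLM and environment calls.

```python
from typing import Callable, List, Optional, Tuple


def reflect_and_retry(task: str,
                      attempt: Callable[[str, List[str]], List[str]],
                      evaluate: Callable[[List[str]], Tuple[bool, str]],
                      reflect: Callable[[List[str], str], str],
                      max_trials: int = 3) -> Tuple[Optional[List[str]], int]:
    """Retry a task, feeding verbal reflections into each new attempt."""
    reflections: List[str] = []
    for trial in range(1, max_trials + 1):
        plan = attempt(task, reflections)
        ok, feedback = evaluate(plan)
        if ok:
            return plan, trial
        reflections.append(reflect(plan, feedback))  # stored as plain text
    return None, max_trials


# Toy stand-ins (assumptions for illustration only).
def toy_attempt(task: str, reflections: List[str]) -> List[str]:
    if any("buy eggs" in r for r in reflections):
        return ["buy eggs", "cook eggs"]
    return ["cook eggs"]


def toy_evaluate(plan: List[str]) -> Tuple[bool, str]:
    if plan[0] == "buy eggs":
        return True, "success"
    return False, "no eggs found"


def toy_reflect(plan: List[str], feedback: str) -> str:
    return f"Last attempt failed ({feedback}); buy eggs first."


final_plan, trials = reflect_and_retry("make breakfast", toy_attempt,
                                       toy_evaluate, toy_reflect)
```

Each extra trial repeats the full attempt, which is consistent with reflection-based methods costing more tokens than single-pass planning.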

Novel Architectural Elements

Taxonomy categorizing agent planning into 5 distinct architectural patterns
Integration of external symbolic planners with LLM as translator/formalizer
Self-reflection loops treating verbal feedback as reinforcement signal

Modeling

Base Model: text-davinci-003 (for evaluation experiments)

Training Method: Prompt Engineering and In-Context Learning (Survey covers fine-tuning methods but experiments use prompting)

Compute: Not reported in the paper

Comparison to Prior Work

vs. PDDL-only planners: LLM-Agents handle open-ended natural language tasks but lack guarantees; Hybrid methods (LLM+P) combine both.
vs. RL-only agents: LLM-Agents are sample-efficient (few-shot) but costly at inference; RL agents require massive training data.
vs. Standard CoT: Advanced agents add loops (Reflexion), search (ToT), or tools (ReAct) to enhance the basic reasoning chain.

Limitations

Hallucinations lead to infeasible plans or non-existent items.
High computational cost and latency for multi-step reasoning and reflection (e.g., Reflexion expenses).
Limited context length restricts handling of very long task horizons.
Evaluation relies mostly on success rates, lacking fine-grained step-wise metrics.

Reproducibility

No replication artifacts mentioned in the paper. The paper surveys existing works and runs standard benchmarks using API-based models, but does not provide a specific repository for the survey's experimental harness.

📊 Experiments & Results

Evaluation Setup

Evaluation of representative prompt-based methods on interactive environments

Benchmarks:

ALFWorld (Text-based interactive game / household tasks)
ScienceWorld (Text-based scientific experiment simulation)
HotPotQA (Multi-hop Question Answering)
FEVER (Fact extraction and verification)

Metrics:

Success Rate (SR)
Expenses (EX) in dollars (OpenAI API)
Average Rewards (AR)
Statistical methodology: Not explicitly reported in the paper

Key Results

On ALFWorld, Reflexion outperforms the other methods but at higher cost. The QA benchmarks highlight the failure of zero-shot prompting and show that simpler few-shot methods are sufficient. Method names are given where the highlights above identify them.
Benchmark	Metric	Baseline	Compared Method	Δ
ALFWorld	Success Rate (SR)	0.57 (ReAct)	0.71 (Reflexion)	+0.14
ALFWorld	Expenses ($)	152.18	220.17	+67.99
HotPotQA	Success Rate (SR)	0.01 (ZeroShot-CoT)	0.32 (few-shot)	+0.31
ScienceWorld	Average Reward (AR)	15.05	19.39	+4.34
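
The Δ column reports absolute differences. As a quick check, the sketch below recomputes them from the table values and also derives the relative ALFWorld changes, which come out to roughly a 24.6% relative gain in success rate for roughly 44.7% more expense. Variable names are illustrative.

```python
# (benchmark, metric, baseline value, compared value), copied from the table.
rows = [
    ("ALFWorld", "Success Rate (SR)", 0.57, 0.71),
    ("ALFWorld", "Expenses ($)", 152.18, 220.17),
    ("HotPotQA", "Success Rate (SR)", 0.01, 0.32),
    ("ScienceWorld", "Average Reward (AR)", 15.05, 19.39),
]

# Absolute differences, rounded to two decimals like the table.
deltas = {(bench, metric): round(new - base, 2)
          for bench, metric, base, new in rows}

# Relative changes on ALFWorld.
sr_relative_gain = (0.71 - 0.57) / 0.57   # fraction of the baseline SR
cost_ratio = 220.17 / 152.18              # compared expense / baseline expense

for key, delta in deltas.items():
    print(key, f"{delta:+.2f}")
print(f"relative SR gain: {sr_relative_gain:.1%}, cost ratio: {cost_ratio:.2f}x")
```

Keeping absolute and relative figures apart avoids reading the 0.14 gain as a 14% relative improvement.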

Main Takeaways

Performance generally correlates with expense: methods like Reflexion that consume more tokens (via iterative loops) achieve higher success rates.
Few-shot examples are critical: Zero-shot performance is near zero on complex QA tasks like HotPotQA, indicating need for guidance.
Reflection is a powerful mechanism: The ability to self-correct (Reflexion) consistently outperforms linear planning (ReAct, CoT) across benchmarks.
SayCan performs poorly on text-only benchmarks compared to CoT/ReAct, likely due to its design focus on robotic affordances not fully captured in these text simulations.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and prompting strategies (CoT, ReAct)
Familiarity with autonomous agents and planning concepts
Basic knowledge of reinforcement learning and symbolic planning (PDDL)

Key Terms

PDDL: Planning Domain Definition Language—a standard encoding language for classical planning problems used by symbolic solvers

CoT: Chain-of-Thought—a prompting technique encouraging the model to generate intermediate reasoning steps

ReAct: Reasoning and Acting—a paradigm where the model alternates between generating reasoning traces and executing actions in an environment

ToT: Tree of Thoughts—a method where the LLM explores multiple reasoning paths in a tree structure, evaluating states to guide search

RAG: Retrieval-Augmented Generation—enhancing model output by retrieving relevant information from an external knowledge base

Zero-shot CoT: Triggering reasoning with the prompt 'Let's think step by step' without providing example demonstrations

MCTS: Monte Carlo Tree Search—a heuristic search algorithm for decision processes, often used in game playing

Hallucination: The generation of factually incorrect or nonsensical information by an LLM, often leading to infeasible plans

Symbolic Planner: A classical AI system that uses logic and formal symbols (like PDDL) to find a guaranteed path to a goal

Embodied Memory: Fine-tuning the LLM weights with historical interaction data to internalize experiences
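
In contrast to embodied memory, RAG-based memory keeps past experiences outside the model and retrieves the most relevant ones at planning time. The sketch below illustrates only the retrieval step, using word-overlap (Jaccard) similarity as a stand-in for the embedding search a real system would more likely use; the function names and memory entries are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(query: str, memory: List[str], k: int = 1) -> List[str]:
    """Return the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda m: jaccard(query, m), reverse=True)
    return ranked[:k]


# Hypothetical stored experiences.
memory = [
    "eggs are in the fridge",
    "the mug is on the desk",
    "the stove must be on before cooking",
]
hits = retrieve("where are the eggs", memory, k=1)

# The retrieved entry would be prepended to the planning prompt.
prompt = f"Relevant experience: {hits[0]}\nTask: make breakfast"
```

Because the memory lives outside the model, it can be updated after every episode without retraining, at the cost of a retrieval step and a longer prompt.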