AdaPlanner: Adaptive Planning from Feedback with Language Models

📝 Paper Summary

Multi-call tool use with flexible plan
Closed-loop planning

AdaPlanner is a closed-loop agent that uses LLMs to generate code-based plans and adaptively refines them in response to environmental feedback without requiring auxiliary training.

Core Problem

Existing LLM agents are either open-loop (rigid plans that fail upon unexpected changes) or implicit closed-loop (correcting only the immediate action without updating the future plan), leading to myopia and error propagation.

Why it matters:

Open-loop systems cannot handle dynamic environments where actions may fail or have unexpected outcomes
Implicit closed-loop methods (like ReAct) make local adjustments but stick to an outdated high-level plan, leading to suboptimal long-term behavior
Training-based plan refiners (like DEPS) require expensive task-specific data and struggle to generalize

Concrete Example: In a task to 'clean lettuce', an agent might try to clean it while holding it at a countertop. If the environment returns 'failure', an implicit agent might just retry. AdaPlanner recognizes the failure via an assertion, pauses, and rewrites the plan to first 'go to sinkbasin' before cleaning.
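The lettuce scenario can be sketched in Python under a few assumptions. `ToyKitchen`, its action strings, and its observation strings are hypothetical stand-ins, not the ALFWorld API, and the refined plan is hand-written here where AdaPlanner would obtain it from the LLM. The sketch only illustrates how a failed assertion turns an unexpected observation into a signal for replanning.

```python
class ToyKitchen:
    """Hypothetical toy environment (assumption); not the real ALFWorld interface."""

    def __init__(self):
        self.location = "countertop 1"
        self.cleaned = set()

    def step(self, action: str) -> str:
        if action.startswith("go to "):
            self.location = action[len("go to "):]
            return f"You arrive at {self.location}."
        if action.startswith("clean "):
            obj = action[len("clean "):]
            # Cleaning only works at a sinkbasin in this toy world.
            if self.location.startswith("sinkbasin"):
                self.cleaned.add(obj)
                return f"You clean the {obj}."
        return "Nothing happens."


def initial_plan(env: ToyKitchen) -> None:
    obs = env.step("clean lettuce 1")
    # Out-of-plan feedback surfaces as a failed assertion with a readable report.
    assert "You clean" in obs, f"clean failed at {env.location}: {obs!r}"


def refined_plan(env: ToyKitchen) -> None:
    # Hand-written revision (assumption): move to the sinkbasin first.
    obs = env.step("go to sinkbasin 1")
    assert "arrive" in obs, f"navigation failed: {obs!r}"
    obs = env.step("clean lettuce 1")
    assert "You clean" in obs, f"clean failed at {env.location}: {obs!r}"


env = ToyKitchen()
try:
    initial_plan(env)
    outcome = "initial plan succeeded"
except AssertionError as err:
    error_report = str(err)  # this text is what a refiner would receive
    refined_plan(env)
    outcome = "refined plan succeeded"
```

Python skips `assert` statements when run with the `-O` flag, so a sketch like this depends on assertions being enabled.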

Key Novelty

Adaptive Closed-Loop Planning via Code and Skill Discovery

Refine-then-resume: The agent generates plans as Python code with built-in assertions; when an assertion fails (out-of-plan feedback), it rewrites the entire future plan and resumes from that specific step rather than restarting
Dual-role LLM: The LLM acts as both a planner (decomposing the task into sub-goals) and a refiner (parsing observations via a special 'ask_LLM' function to extract key info for next steps, and rewriting the plan when an assertion fails)
Skill Discovery: Successful plans are stored and used as few-shot exemplars for similar future tasks, reducing the need for extensive human demonstrations

Architecture

Comparison of Open-Loop, Implicit Closed-Loop, and AdaPlanner (Explicit Closed-Loop) architectures.

Evaluation Highlights

+3.73 percentage-point success rate improvement on ALFWorld (91.79% vs. 88.06%) over the state-of-the-art baseline (Reflexion) while using 2x fewer samples
91.11% success rate on MiniWoB++ tasks with feedback, outperforming RCI and significantly outperforming fine-tuned models like WebN-T5-3B
Maintains competitive performance (93.22%) on MiniWoB++ tasks without feedback, comparable to CC-Net but using ~600x fewer samples
Resilient to hallucination: Code-based prompting reduces hallucinations compared to natural language baselines, especially with weaker models like GPT-3.5

Breakthrough Assessment

8/10

Significantly improves sample efficiency and reliability in sequential decision making by treating plans as code and allowing mid-episode correction. The refine-then-resume mechanism is a strong practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Text-based sequential decision-making where an agent interacts with an environment to solve a task g

Inputs: Task description g, initial observation o1, and allowed action space A

Outputs: A sequence of actions to complete the task, dynamically generated via a planning policy ρ and action policy π

Pipeline Flow

Plan Generation (Code): LLM generates a Python function 'solution()' with sub-goals and assertions
Execution & Monitoring: Environment executes code; 'ask_LLM()' handles In-Plan Feedback; 'assert' checks Out-of-Plan Feedback
Refinement (If needed): If assertion fails, Refiner (LLM) receives error log, generates new plan, and resumes from breakpoint
Skill Memory: Successful plans are stored and retrieved as few-shot prompts for future tasks
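The four stages above can be sketched as a small driver loop. Everything here is an illustrative assumption: `stub_planner` and `stub_refiner` return hand-written plans where the real system would prompt an LLM, the `start_from` convention and the `max_refinements` budget are hypothetical names, and skill memory is reduced to a plain list.

```python
from typing import Callable, List, Tuple

# A plan appends what it does to `trace` and skips steps before `start_from`.
Plan = Callable[[List[str], int], None]


def stub_planner(task: str) -> Plan:
    """Stands in for the LLM planner (assumption): step 2 of this plan fails."""
    def plan(trace: List[str], start_from: int = 1) -> None:
        if start_from <= 1:
            trace.append("step 1: find object")
        if start_from <= 2:
            trace.append("step 2: clean object at countertop")
            assert False, "step 2: cleaning failed, agent is not at a sinkbasin"
    return plan


def stub_refiner(task: str, error_log: str) -> Tuple[Plan, int]:
    """Stands in for the LLM refiner (assumption): returns a new plan and a resume step."""
    def plan(trace: List[str], start_from: int = 1) -> None:
        if start_from <= 1:
            trace.append("step 1: find object")
        if start_from <= 2:
            trace.append("step 2: go to sinkbasin, then clean object")
    return plan, 2


def run_closed_loop(task, planner, refiner, skill_memory, max_refinements=3):
    trace: List[str] = []
    plan, start_from = planner(task), 1
    for _ in range(max_refinements + 1):
        try:
            plan(trace, start_from)           # execution and monitoring
        except AssertionError as err:         # out-of-plan feedback
            plan, start_from = refiner(task, str(err))
            continue                          # resume from the breakpoint
        skill_memory.append((task, plan))     # keep the successful plan
        return True, trace
    return False, trace


memory: list = []
success, trace = run_closed_loop("clean the lettuce", stub_planner, stub_refiner, memory)
```

In this run the first plan fails at step 2, the refined plan resumes at step 2 without repeating step 1, and the successful plan is appended to `memory`.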

System Modules

Planner

Generate initial Python code plan decomposing task into sub-goals with assertions

Model or implementation: GPT-3 (text-davinci-002) or GPT-3.5 (gpt-3.5-turbo)

Refiner

Rewrite the plan when an assertion fails (out-of-plan feedback), or parse observations to extract details needed for the next step (in-plan feedback)

Model or implementation: Same as Planner

Environment Interface

Execute Python code, map function calls to environment actions, return observations

Model or implementation: Deterministic code execution

Skill Memory

Store successful trajectories and retrieve them as prompts for new tasks

Model or implementation: Retrieval mechanism (implicit in prompt construction)
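Since the summary describes retrieval only as implicit in prompt construction, the following is one possible reading rather than the authors' implementation. It assumes plans are stored as code strings and looked up by a task-type key; `SkillMemory`, `task_type`, `max_examples`, and the `# Task:` prompt layout are all hypothetical.

```python
from collections import defaultdict


class SkillMemory:
    """Hypothetical store of successful plans, keyed by task type (assumption)."""

    def __init__(self):
        self._skills = defaultdict(list)

    def add(self, task_type: str, task: str, plan_code: str) -> None:
        # Only plans that completed the task should be added.
        self._skills[task_type].append((task, plan_code))

    def build_prompt(self, task_type: str, new_task: str, max_examples: int = 2) -> str:
        # Use the most recent stored plans of the same type as few-shot exemplars.
        examples = self._skills.get(task_type, [])[-max_examples:]
        parts = [f"# Task: {task}\n{code}" for task, code in examples]
        parts.append(f"# Task: {new_task}\n")
        return "\n\n".join(parts)


skills = SkillMemory()
skills.add("clean", "clean the lettuce", "def solution():\n    ...")
prompt = skills.build_prompt("clean", "clean the apple")
empty_prompt = skills.build_prompt("heat", "heat the egg")
```

When no plan of the requested type has been stored, the prompt contains only the new task header, which is where human-written demonstrations would still be needed.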

Novel Architectural Elements

Refine-then-resume mechanism: Dynamically rewriting future code blocks upon assertion failure while preserving program state to resume execution mid-episode
Hybrid feedback handling: Distinguishing between 'ask_LLM' (for expected info extraction) and 'assert' (for unexpected failure handling) within a single code structure
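A minimal sketch of how both feedback types might sit in one plan, under stated assumptions: `ask_llm_stub` replaces the real 'ask_LLM' call with a regular expression, the `start_from` and `state` parameters are a hypothetical way to skip completed steps and carry program state across a resume, and the observation strings are invented for illustration.

```python
import re


def ask_llm_stub(question: str, observation: str) -> str:
    """Stand-in for 'ask_LLM' (assumption): a regex instead of an LLM call."""
    match = re.search(r"lettuce \d+", observation)
    return match.group(0) if match else ""


def solution(env_step, start_from=1, state=None):
    state = {} if state is None else state
    if start_from <= 1:
        obs = env_step("go to fridge 1")
        # In-plan feedback: the observation is expected and is parsed for a detail.
        state["target"] = ask_llm_stub("Which lettuce is here?", obs)
        assert state["target"], f"[step 1] no lettuce found in: {obs!r}"
    if start_from <= 2:
        obs = env_step(f"take {state['target']}")
        # Out-of-plan feedback: an unexpected observation fails the assertion.
        assert "pick up" in obs, f"[step 2] could not take {state['target']}: {obs!r}"
    return state


actions = []


def env_step(action: str) -> str:
    """Hypothetical environment with canned observations (assumption)."""
    actions.append(action)
    if action == "go to fridge 1":
        return "You open the fridge 1. You see a lettuce 1 and a tomato 2."
    if action.startswith("take "):
        return f"You pick up the {action[len('take '):]}."
    return "Nothing happens."


state = solution(env_step)                                      # full run
resumed = solution(env_step, start_from=2, state=dict(state))   # resume at step 2
```

The resumed call reuses the object ID found in step 1 and issues only the step 2 action, which is the property that lets a rewritten plan continue mid-episode.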

Modeling

Base Model: GPT-3 (text-davinci-002) and GPT-3.5 (gpt-3.5-turbo)

Training Method: In-context learning (Prompting only)

Compute: Inference only. No training.

Comparison to Prior Work

vs. ReAct: AdaPlanner modifies the *future* plan structure via code rewriting, whereas ReAct only decides the immediate next action.
vs. Reflexion: AdaPlanner refines plans *within* the current episode (online), while Reflexion requires resetting the episode to apply lessons learned.
vs. DEPS: AdaPlanner is training-free and relies solely on prompting, whereas DEPS trains a plan selector.
vs. ProgPrompt [not cited in paper]: ProgPrompt generates code plans but does not include the adaptive 'refine-then-resume' mechanism for mid-execution correction.

Limitations

Still requires a few expert demonstrations (few-shot) for complex tasks, though fewer than baselines.
Performance depends heavily on the underlying LLM's code generation capability.
Skill discovery relies on the assumption that stored plans generalize well to similar tasks.

Reproducibility

Code: https://github.com/haotiansun14/AdaPlanner

📊 Experiments & Results

Evaluation Setup

Text-based embodied agents and web navigation

Benchmarks:

ALFWorld: text-based household robotics tasks (Pick, Clean, Heat, Cool, Examine, Pick Two)
MiniWoB++: web interaction tasks (filling forms, clicking buttons)

Metrics:

Success Rate (%)
Sample Efficiency (Number of demonstrations used)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
ALFWorld Results: AdaPlanner outperforms baselines using significantly fewer samples.
ALFWorld	Success Rate	88.06	91.79	+3.73
ALFWorld	Success Rate	61.94	91.79	+29.85
ALFWorld	Success Rate	37.00	91.79	+54.79
MiniWoB++ Results: High performance with extreme sample efficiency.
MiniWoB++ (With feedback)	Success Rate	81.56	91.11	+9.55
MiniWoB++ (With feedback)	Success Rate	38.50	91.11	+52.61
ALFWorld	Success Rate	46.00	81.00	+35.00
ALFWorld	Success Rate	19.00	38.00	+19.00
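The Δ column is an absolute difference in success rate, not a relative gain. The short check below recomputes it for two rows of the table and shows the relative gain for comparison; the helper name `gains` is hypothetical and the numbers are copied from the table.

```python
from typing import Tuple


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 2)
    return points, relative


# Rows copied from the table: (baseline %, this paper %).
alfworld_points, alfworld_relative = gains(88.06, 91.79)
miniwob_points, miniwob_relative = gains(81.56, 91.11)
```

For the first ALFWorld row the absolute gain matches the reported 3.73, while the relative gain over the baseline is a little over 4 percent.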

Experiment Figures

A concrete example of AdaPlanner's Python code generation and refinement process on an ALFWorld task.

Success rate vs. Number of samples for various methods.

Main Takeaways

Explicit closed-loop planning with code significantly outperforms implicit methods like ReAct.
The 'refine-then-resume' mechanism allows agents to recover from errors mid-task without restarting, saving time and compute.
Code-based prompting acts as a strong regularizer, reducing hallucinations compared to natural language prompts.
Skill discovery is highly effective for improving sample efficiency, achieving SOTA results with orders of magnitude less data than training-based methods.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting strategies
Reinforcement Learning basics (policy, trajectory, environment feedback)
Code generation and execution concepts

Key Terms

closed-loop system: A system that continuously monitors outputs/feedback and uses that information to adjust future actions (as opposed to an open-loop system, which executes a fixed plan)

in-plan feedback: Environmental observations that align with the agent's expectations and provide specific details needed for the next step (e.g., finding an object's ID)

out-of-plan feedback: Observations that contradict the agent's expectations (e.g., an action failure), triggering a replanning process

hallucination: When an LLM generates content that is nonsensical or unfaithful to the source/context

sub-goal: An intermediate objective within a larger plan (e.g., 'find the apple' is a sub-goal of 'clean the apple')

few-shot exemplars: A small number of example input-output pairs provided in the prompt to guide the model's behavior

API calls: Requests sent to the LLM service (like OpenAI's GPT-3) to generate text or code