P-RAG: Progressive retrieval augmented generation for planning on embodied everyday task

📝 Paper Summary

Agentic RAG pipeline
Embodied AI planning

P-RAG improves embodied agent planning by iteratively building a database of the agent's own successful historical trajectories (instead of ground truth) and retrieving them based on task and scene similarity.

Core Problem

Traditional embodied agents struggle with understanding natural language instructions and lack task-specific knowledge, while LLM-based planners often hallucinate invalid actions or rely on unrealistic ground-truth data for few-shot examples.

Why it matters:

Real-world environments have hidden constraints (e.g., table capacity limits) that pre-trained LLMs do not know.
Relying on ground truth for few-shot prompting is neither scalable nor realistic in novel, interactive environments where the optimal path is unknown.
Environments give only sparse, binary (0/1) success feedback rather than dense rewards, making it hard for agents to learn sub-steps without explicit guidance or prior experience.

Concrete Example: In a simulation where a table can only hold three items, a standard LLM might command 'put apple on table' as a fourth item because it lacks specific environmental knowledge. P-RAG would retrieve a past failed attempt or successful constraint-abiding trajectory to avoid this error.

Key Novelty

Progressive Retrieval Augmented Generation (P-RAG)

Builds a dynamic memory database from the agent's own interaction history (self-generated experience) rather than static expert demonstrations.
Updates the database iteratively after each round, progressively accumulating successful trajectories to serve as few-shot examples for future tasks.
Uses a dual-retrieval mechanism that matches not just similar task instructions (semantic similarity) but also similar visual scene graphs (situational similarity).

Architecture

High-level framework of P-RAG illustrating the progressive loop.

Evaluation Highlights

Outperforms standard RAG and LLM-Planner baselines without using any ground truth actions for few-shot prompting.
Demonstrates self-improvement capabilities, increasing success rates over iterations as the retrieval database populates with better self-generated trajectories.
Achieves competitive performance on ALFRED and MINI-BEHAVIOR benchmarks compared to methods requiring extensive training or ground truth.

Breakthrough Assessment

7/10

Novel approach to 'ground truth-free' planning by bootstrapping from self-experience. While the retrieval mechanics are standard, the iterative self-building database for embodied planning is a significant practical step forward.

⚙️ Technical Details

Problem Definition

Setting: Embodied everyday tasks, in which an agent must execute a sequence of actions that changes the environment state to satisfy a natural language goal.

Inputs: Natural language goal instruction I_g, current visual observation O_t (converted to scene graph), and retrieved historical context.

Outputs: A sequence of low-level actions A_t to interact with the environment.

Pipeline Flow

Data Collection (Interaction) -> Database Update -> Retrieval -> LLM Planning -> Execution
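
The flow above can be sketched as a loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `progressive_rounds`, `Record`, and the three `toy_*` functions are hypothetical names, the planner is a hard-coded stand-in for an LLM call, and the database is updated once per round (moving the `extend` call into the inner loop would give per-episode updates).

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One stored episode: the task, the observed scene, the plan, and the outcome."""
    instruction: str
    scene: str
    actions: list
    success: bool


def progressive_rounds(tasks, planner, executor, retriever, rounds):
    """Run several rounds over the same tasks, growing the database each round."""
    database = []
    success_rates = []
    for _ in range(rounds):
        new_records = []
        for instruction, scene in tasks:
            examples = retriever(database, instruction, scene)  # Retrieval
            actions = planner(instruction, scene, examples)     # LLM planning
            success = executor(instruction, actions)            # Execution
            new_records.append(Record(instruction, scene, actions, success))
        database.extend(new_records)                            # Database update
        success_rates.append(sum(r.success for r in new_records) / len(new_records))
    return database, success_rates


def toy_retriever(database, instruction, scene):
    # Hypothetical: return the most recent record for the same instruction, if any.
    return [r for r in database if r.instruction == instruction][-1:]


def toy_planner(instruction, scene, examples):
    # Hypothetical stand-in for an LLM: revise the plan after seeing a failed attempt,
    # reuse the plan after seeing a successful one, otherwise guess.
    if examples and not examples[0].success:
        return ["open fridge", "take apple"]
    if examples:
        return examples[0].actions
    return ["take apple"]


def toy_executor(instruction, actions):
    # Hypothetical environment: binary feedback, success only for the exact plan.
    return actions == ["open fridge", "take apple"]


tasks = [("get the apple", "apple in fridge; fridge closed")]
db, rates = progressive_rounds(tasks, toy_planner, toy_executor, toy_retriever, rounds=3)
print(rates)  # [0.0, 1.0, 1.0]
```

The toy numbers only show the control flow; they say nothing about the success rates reported in the paper.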

System Modules

Scene Graph Extractor

Converts visual observations into a structured text representation (objects and relationships) for the LLM.
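
One way to picture the extractor's output is a set of (subject, relation, object) triples rendered as text. The triple format, the `scene_graph_to_text` helper, and the example objects are illustrative assumptions; this summary does not specify the serialization the paper uses.

```python
def scene_graph_to_text(objects, relations):
    """Render a scene graph as a compact, deterministic text block for an LLM prompt."""
    lines = ["Objects: " + ", ".join(sorted(objects))]
    for subj, rel, obj in sorted(relations):
        lines.append(f"{subj} {rel} {obj}")
    return "\n".join(lines)


# Hypothetical scene: nodes are objects, edges are pairwise relationships.
objects = {"apple", "table", "fridge"}
relations = [("fridge", "next to", "table"), ("apple", "on top of", "table")]
text = scene_graph_to_text(objects, relations)
print(text)
```

Sorting the objects and relations keeps the text stable across runs, which matters if the same string is later embedded for similarity search.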

Model or implementation: Rule-based/API-based tools (ALFWORLD tools or environment APIs)

Dual Retriever

Retrieves similar historical trajectories based on both task instruction and visual situation.

Model or implementation: MiniLM (encoder) + Cosine Similarity
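
A minimal sketch of retrieval over both keys follows. To stay self-contained it swaps MiniLM for a toy bag-of-words embedding, and it combines the two cosine similarities with an equal-weight sum; the `alpha` weight, the `dual_retrieve` name, and the dictionary layout of database entries are assumptions, not details from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence encoder such as MiniLM."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dual_retrieve(database, instruction, scene, k=1, alpha=0.5):
    """Rank entries by a weighted sum of task similarity and scene similarity.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    q_task, q_scene = embed(instruction), embed(scene)

    def score(entry):
        task_sim = cosine(q_task, embed(entry["instruction"]))
        scene_sim = cosine(q_scene, embed(entry["scene"]))
        return alpha * task_sim + (1 - alpha) * scene_sim

    return sorted(database, key=score, reverse=True)[:k]


# Two stored episodes with the same instruction but different scenes.
database = [
    {"id": "A", "instruction": "put apple on table", "scene": "apple in fridge table empty"},
    {"id": "B", "instruction": "put apple on table", "scene": "apple on counter table full"},
]
top = dual_retrieve(database, "put apple on table", "apple on counter table full of items")
print(top[0]["id"])  # B
```

Both entries match the instruction equally well, so the scene term is what selects entry B. This is the case where task similarity alone cannot tell two past situations apart.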

LLM Planner

Generates a sequence of high-level actions based on inputs and retrieved context.

Model or implementation: GPT-4 or GPT-3.5

Action Processor

Validates format, filters text, and decomposes high-level actions into executable low-level actions.

Model or implementation: Regular Expressions + FMM (Fast Marching Method)
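
The two jobs of this module can be sketched as follows. The action grammar in `ACTION_RE` is hypothetical, since the real vocabulary depends on the simulator. For path decomposition the sketch uses breadth-first search on a 4-connected grid as a simpler stand-in for the Fast Marching Method; it is not an FMM implementation.

```python
import re
from collections import deque

# Hypothetical action grammar; the real action vocabulary depends on the simulator.
ACTION_RE = re.compile(r"^(go to|pick up|put down|open|close)\s+([a-z_ ]+)$")


def parse_plan(llm_output):
    """Keep only the lines that match the expected 'verb object' format."""
    actions = []
    for line in llm_output.splitlines():
        # Drop list markers such as '1.' or '-' and a trailing period.
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip().rstrip(".").lower()
        match = ACTION_RE.match(cleaned)
        if match:
            actions.append((match.group(1), match.group(2).strip()))
    return actions


def grid_path(grid, start, goal):
    """Shortest 4-connected path via BFS; '#' cells are obstacles. Returns None if unreachable."""
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


raw = "Sure, here is the plan:\n1. Go to fridge\n2. Open fridge.\n3. Pick up apple\nGood luck!"
plan = parse_plan(raw)
print(plan)  # [('go to', 'fridge'), ('open', 'fridge'), ('pick up', 'apple')]

grid = ["..#", "..#", "..."]
path = grid_path(grid, (0, 0), (2, 2))
print(len(path) - 1)  # 4 low-level moves
```

Lines that do not match the grammar, such as the chatty first and last lines above, are silently dropped, so the executor only ever sees well-formed actions.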

Novel Architectural Elements

Iterative database update loop: The system feeds its own interaction history (successful or not) back into the retrieval database after each episode.
Dual-key retrieval: Queries database using a composite score of both textual task similarity and visual scene graph similarity.

Modeling

Base Model: GPT-4 and GPT-3.5

Compute: Inference-only framework. Experiments run on ALFRED and MINI-BEHAVIOR simulators. Retrieval encoding uses MiniLM.

Comparison to Prior Work

vs. LLM-Planner: P-RAG does not use ground-truth action lists as few-shot samples; it generates data through interaction.
vs. RAP: P-RAG is designed for settings without ground truth guidance and updates dynamically.
vs. Standard RAG: P-RAG introduces a progressive/iterative update mechanism rather than a static one-shot retrieval.

Limitations

Dependency on LLM capabilities; weaker LLMs may fail to generate valid plans even with retrieved context.
The initial round (cold start) has no retrieval history, relying solely on zero-shot LLM performance.
Computational cost increases as the database grows, potentially requiring vector database optimizations for very long-term deployment.

Reproducibility

Code: https://github.com/Weiye-Xu/P-RAG

The paper describes scene graph extraction logic for both ALFRED and MINI-BEHAVIOR. The method uses off-the-shelf LLMs (GPT-4/3.5) via API.

📊 Experiments & Results

Evaluation Setup

Evaluated on two embodied AI simulation benchmarks: ALFRED and MINI-BEHAVIOR.

Benchmarks:

ALFRED: embodied instruction following (household tasks)
MINI-BEHAVIOR: grid-world embodied everyday tasks

Metrics:

Success Rate (SR)
Goal Condition Success (GC)
Statistical methodology: Not explicitly reported in the paper
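
The two metrics can be computed from per-episode logs as sketched below. The sketch assumes the usual ALFRED convention: an episode counts as a success only if every goal condition is met, and GC is the fraction of goal conditions met. Averaging GC per episode and the `met`/`total` field names are assumptions, and the episode numbers are made up for illustration, not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which every goal condition was met."""
    return sum(e["met"] == e["total"] for e in episodes) / len(episodes)


def goal_condition_success(episodes):
    """Mean per-episode fraction of goal conditions met."""
    return sum(e["met"] / e["total"] for e in episodes) / len(episodes)


# Illustrative episode logs (made-up numbers, not results from the paper).
episodes = [
    {"met": 2, "total": 2},
    {"met": 1, "total": 2},
    {"met": 0, "total": 4},
    {"met": 4, "total": 4},
]
sr = success_rate(episodes)
gc = goal_condition_success(episodes)
print(sr)  # 0.5
print(gc)  # 0.625
```

GC is never lower than SR on the same episodes, because a partially completed episode contributes to GC but not to SR.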

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
P-RAG shows iterative improvement across rounds in MINI-BEHAVIOR, validating the progressive mechanism.
MINI-BEHAVIOR	Success Rate	0.20	0.55	+0.35
P-RAG outperforms baselines that lack the specific progressive retrieval mechanism.
ALFRED / MINI-BEHAVIOR	Success Rate	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Detailed pipeline of Database Construction and Retrieval.

Main Takeaways

P-RAG successfully eliminates the need for ground-truth action sequences by leveraging the agent's own interaction history.
The progressive retrieval mechanism allows the agent to 'learn' from experience (self-iteration) without parameter updates, improving success rates over time.
Incorporating both scene graph similarity and task similarity in retrieval provides more relevant context than task similarity alone.
The method generalizes across different embodied environments (ALFRED and MINI-BEHAVIOR).

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) for planning
Embodied AI simulators (ALFRED, MINI-BEHAVIOR)
Scene Graphs

Key Terms

P-RAG: Progressive Retrieval Augmented Generation—the proposed framework that iteratively updates a database with the agent's own trajectories to aid future planning.

Scene Graph: A structured representation of an image where objects are nodes and their relationships (e.g., 'on top of') are edges, used here to help LLMs understand visual observations.

FMM: Fast Marching Method—a numerical technique used here to decompose high-level navigation actions (e.g., 'go to fridge') into low-level movement steps.

MiniLM: A compact language model used to encode text and scene graphs into vector embeddings for similarity search during retrieval.

ALFRED: A benchmark for embodied instruction following that requires agents to complete household tasks based on natural language.

MINI-BEHAVIOR: A simulation environment based on Gym-MiniGrid for complex embodied everyday tasks with logic-based states.