Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning

📝 Paper Summary

Self-evolving, Agentic reasoning, Neuro-symbolic AI

This approach uses an LLM to identify missing planning operators for novel objects and generates dense reward functions to guide reinforcement learning agents in learning the necessary control policies.

Core Problem

Traditional symbolic planners fail when novel objects appear because the planning domain lacks the specific operators needed to interact with them, and standard RL exploration is too inefficient to discover these operators from scratch.

Why it matters:

Common everyday objects introduced into a robot's environment often disrupt pre-defined planning domains, causing autonomous agents to fail.
Existing hybrid methods rely on random RL exploration to discover missing operators, which scales poorly in continuous domains where plannable states are hard to reach by chance.
Current approaches either assume fully specified domains or commit to single reward functions that may mislead training.

Concrete Example: In a 'Coffee-Drawer' task, a pod is inside a closed drawer. A standard robot knows how to pick items from a table but lacks an 'open-drawer' operator. Random exploration fails to open the drawer, so the robot cannot retrieve the pod.

Key Novelty

LLM-Guided Missing Operator Identification and Curriculum Learning

Uses an LLM's common sense to structurally define missing PDDL operators (preconditions and effects) when a planner hits an impasse due to a novel object.
Prompts the LLM to write dense reward functions (code) for specific sub-goals derived from the new operator's effects, creating a curriculum for RL agents.
Runs multiple RL agents in parallel with different LLM-generated reward functions, pruning the worst performers to find a working policy.
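
As a rough illustration of the first two points, a missing operator can be represented as a schema with preconditions and effects, and each effect can serve as one sub-goal in a curriculum. The sketch below is hypothetical: the `Operator` class, the `subgoals_from_effects` helper, and predicate names such as `drawer-open` are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator schema (hypothetical representation)."""
    name: str
    preconditions: tuple
    add_effects: tuple      # ordered, so the curriculum keeps this order
    del_effects: tuple = ()


def subgoals_from_effects(op: Operator) -> list:
    """Treat each add effect as one curriculum sub-goal, in the listed order."""
    return list(op.add_effects)


# Assumed example: an operator an LLM might propose for the Coffee-Drawer task.
open_drawer = Operator(
    name="open-drawer",
    preconditions=("gripper-free", "drawer-closed"),
    add_effects=("handle-grasped", "drawer-open"),
    del_effects=("drawer-closed",),
)

print(subgoals_from_effects(open_drawer))
```

Each returned sub-goal would then get its own LLM-written reward function.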

Architecture

The overall Plan-Learn-Execute loop.

Evaluation Highlights

Outperforms Operator Discovery (OD) and LEAGUE-sparse baselines across four continuous robotic manipulation domains (Kitchen, Nut Assembly, Coffee-Box, Coffee-Drawer).
Successfully learns complex multi-stage tasks like 'Coffee-Drawer' (opening a drawer to retrieve a pod) where random exploration baselines fail completely.
Speeds up learning by using LLM-generated dense rewards in place of sparse rewards; the comparison is reported qualitatively.

Breakthrough Assessment

8/10

Strong contribution addressing the critical 'open-world' problem in robotics. By using LLMs to propose valid PDDL operators and the corresponding reward code, it bridges symbolic planning and continuous RL in settings where the compared baselines fail.

⚙️ Technical Details

Problem Definition

Setting: Hybrid planning and learning in continuous robotic domains with novel objects.

Inputs: PDDL domain with available operators, PDDL problem with entities (including a novel entity), initial state, and goal state.

Outputs: An augmented set of operators including the missing ones, and learned neural control policies (skills) to execute them.

Pipeline Flow

Missing Operator Identification: Hybrid Planner → Search-and-Prompt → Augmented PDDL
Sub-goal Curriculum: LLM Reward Generation → Parallel RL Agents → Policy Learning

System Modules

Hybrid LLM Symbolic Planner

Identify missing operators to bridge the gap between initial and goal states.

Model or implementation: GPT-o3 (temperature 0.3)
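
A minimal sketch of the idea, assuming a set-based STRIPS encoding: when the planner finds no plan, candidate operators are proposed and each one is kept only if symbolic search confirms the goal becomes reachable. The `propose_operators` function is a hard-coded stand-in for the LLM query, and all predicate and operator names are hypothetical; the paper's search-and-prompt tree is more elaborate than this single-level loop.

```python
from collections import deque

# Each operator: (name, preconditions, add effects, delete effects). All assumed.
KNOWN_OPS = [
    ("pick-pod", frozenset({"drawer-open", "gripper-free"}),
     frozenset({"holding-pod"}), frozenset({"gripper-free"})),
]


def find_plan(init, goal, ops):
    """Breadth-first search over symbolic states; returns operator names or None."""
    start = frozenset(init)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for name, pre, add, delete in ops:
            if pre <= state:
                nxt = (state - delete) | add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None


def propose_operators(state, goal):
    """Stand-in for the LLM query; returns hard-coded candidates (assumption)."""
    return [
        ("push-table", frozenset({"gripper-free"}),
         frozenset({"table-moved"}), frozenset()),
        ("open-drawer", frozenset({"drawer-closed", "gripper-free"}),
         frozenset({"drawer-open"}), frozenset({"drawer-closed"})),
    ]


def search_and_prompt(init, goal, ops):
    """Return (missing operator name, plan); the name is None if nothing is missing."""
    plan = find_plan(init, goal, ops)
    if plan is not None:
        return None, plan  # no impasse
    for candidate in propose_operators(init, goal):
        plan = find_plan(init, goal, ops + [candidate])
        if plan is not None:
            return candidate[0], plan  # first candidate that makes the goal reachable
    return None, None


init = {"drawer-closed", "gripper-free"}
goal = frozenset({"holding-pod"})
print(search_and_prompt(init, goal, KNOWN_OPS))
```

Here the unhelpful `push-table` candidate is rejected by the search, while `open-drawer` is accepted because it yields a complete plan.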

Reward Function Generator

Write Python code for dense reward functions to guide RL agents toward sub-goals.

Model or implementation: GPT-o4-mini (temperature 0.3)
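
For intuition, the kind of function such a generator might emit for a drawer-opening sub-goal could look like the sketch below. It is not taken from the paper: the function name, the state variables (`gripper_pos`, `handle_pos`, `drawer_extension`), and every constant are assumptions for illustration.

```python
import math


def drawer_open_reward(gripper_pos, handle_pos, drawer_extension,
                       target_extension=0.15):
    """Hypothetical dense reward for an 'open the drawer' sub-goal."""
    # Shaping term 1: approaches 1 as the gripper nears the handle.
    reach = 1.0 - math.tanh(5.0 * math.dist(gripper_pos, handle_pos))
    # Shaping term 2: fraction of the target extension reached, capped at 1.
    progress = min(drawer_extension / target_extension, 1.0)
    # Sparse bonus, paid only when the sub-goal itself is satisfied.
    bonus = 10.0 if drawer_extension >= target_extension else 0.0
    return reach + progress + bonus


# Far from the handle with the drawer closed, versus at the handle with it open.
print(drawer_open_reward((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0))
print(drawer_open_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.15))
```

The shaping terms give the agent a gradient to follow at every step, which is what distinguishes this from a sparse success-only signal.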

RL Agent (Skill Learner)

Learn continuous control policies for the new operators.

Model or implementation: PPO (Proximal Policy Optimization)

Novel Architectural Elements

Search-and-prompt tree: interleaves LLM queries (to suggest operators) with symbolic search-ahead (to verify plan feasibility).
Genetic algorithm-inspired reward pruning: launches multiple RL agents with different LLM-written rewards and eliminates the worst-performing ones during training phases.
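
The pruning idea can be sketched as a loop that scores every surviving reward candidate after each training phase and keeps only the best fraction. This is a simplified, sequential sketch under stated assumptions: `prune_rewards`, `keep_fraction`, and the fake training function are hypothetical, and the paper trains its agents in parallel rather than one after another.

```python
def prune_rewards(candidates, train_phase, n_phases=3, keep_fraction=0.5):
    """Train one agent per candidate reward, dropping the worst after each phase.

    candidates: dict mapping a reward name to any object train_phase understands.
    train_phase: callable(name, candidate) -> score for that phase (higher is better).
    """
    alive = dict(candidates)
    for _ in range(n_phases):
        scores = {name: train_phase(name, cand) for name, cand in alive.items()}
        n_keep = max(1, int(len(alive) * keep_fraction))
        best = sorted(scores, key=scores.get, reverse=True)[:n_keep]
        alive = {name: alive[name] for name in best}
    return list(alive)


def fake_train_phase(name, quality):
    # Stand-in for an RL training phase: reports a fixed, made-up score.
    return quality


candidates = {"reward_a": 0.2, "reward_b": 0.9, "reward_c": 0.5, "reward_d": 0.1}
survivors = prune_rewards(candidates, fake_train_phase)
print(survivors)
```

Keeping several candidates alive early on hedges against a single misleading reward function, at the cost of extra training compute.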

Modeling

Base Model: GPT-o3 (Planning), GPT-o4-mini (Reward Writing)

Training Method: Reinforcement Learning (PPO) for robot skills; in-context learning for LLMs.

Objective Functions:

Purpose: Maximize cumulative reward consisting of LLM-shaped progress and sparse sub-goal bonuses.

Formally: Not explicitly formulated in paper text beyond description.
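
One form consistent with that description, stated here as an assumption rather than the paper's definition: let \phi be the LLM-written shaping term, g_1, ..., g_K the sub-goals, \tau_k the first step at which g_k holds, b_k > 0 the bonus for sub-goal k, and \gamma a discount factor. All of these symbols are introduced for illustration.

```latex
r_t = \underbrace{\phi(s_t, a_t)}_{\text{LLM-shaped progress}}
      + \underbrace{\sum_{k=1}^{K} b_k \,\mathbb{1}[t = \tau_k]}_{\text{sparse sub-goal bonuses}},
\qquad
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]
```

PPO would then maximize J(\pi) for each candidate reward.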

Key Hyperparameters:

rl_algorithm: PPO
llm_temperature: 0.3 (both models)
operator_discovery_timesteps: 1 million (baseline)
operator_learning_episode_length: 100 (initial), +50 per stage
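
Read literally, the episode-length entry describes a linear schedule. A minimal sketch, assuming curriculum stages are indexed from zero and the growth is strictly linear (the function name is hypothetical):

```python
def episode_length(stage, initial=100, increment=50):
    """Episode length for a curriculum stage: 100 at stage 0, then +50 per stage."""
    return initial + increment * stage


print([episode_length(s) for s in range(4)])
```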

Compute: Run on Ubuntu 20.04 with 24 Intel Core i9-12900K CPUs and one NVIDIA GeForce RTX 4090 GPU.

Comparison to Prior Work

vs. Operator Discovery: Structurally defines missing operators via LLM rather than relying on random exploration to stumble upon effects.
vs. LEAGUE++: Capable of domain expansion (adding missing operators) and uses multiple parallel reward candidates with pruning rather than a single function.
vs. Standard TAMP: Handles open-world novelties where operators are undefined.

Limitations

Relies on ground-truth state for effect checking (CheckEffectSatisfied), bypassing perception challenges.
Assumes the robot can detect and categorize novel objects using existing predicates.
Evaluation limited to simulation (MimicGen) and four specific domains.
Dependent on proprietary LLMs (GPT-o3/o4) for reasoning and code generation.

Reproducibility

Code availability is promised ('The code will be made publicly available') but no URL is currently in the text. Simulation environment is MimicGen. Uses proprietary models (GPT-o3, GPT-o4-mini).

📊 Experiments & Results

Evaluation Setup

Continuous robotic manipulation in the MimicGen simulation environment.

Benchmarks:

Kitchen (Easy) (Displace novel lid to access pot) [New]
Nut Assembly (Medium) (Pick up square nut from novel round peg) [New]
Coffee-Box (Medium) (Retrieve pod from novel box container) [New]
Coffee-Drawer (Hard) (Open novel drawer to retrieve pod) [New]

Metrics:

Success Rate (implied by qualitative comparison of 'finding a plan' vs failure)
Exploration efficiency (qualitative comparison)
Statistical methodology: Not explicitly reported in the paper

Key Results

The paper compares the proposed method against the Operator Discovery (OD), Reward Machine, and LEAGUE-sparse baselines. Specific numerical success rates are not provided in a table, but the text describes binary success/failure outcomes for the baselines in hard tasks.
Benchmark	Metric	Baseline	This Paper	Δ
Coffee-Drawer (Hard)	Success	Failure (0)	Success (implied)	Positive (Qualitative)

Experiment Figures

Illustration of the reward shaping function generation and filtering pipeline.

Main Takeaways

The method successfully identifies and learns operators for tasks where random exploration (Operator Discovery) fails completely (e.g., Coffee-Drawer).
LLM-generated dense rewards speed up learning compared to sparse reward baselines, although the paper reports this comparison qualitatively rather than with numerical results.
The search-and-prompt mechanism allows the planner to efficiently identify missing operators by using symbolic search-ahead to validate LLM suggestions.

📚 Prerequisite Knowledge

Prerequisites

Task and Motion Planning (TAMP)
Reinforcement Learning (PPO)
Planning Domain Definition Language (PDDL)
Large Language Models (LLMs)

Key Terms

PDDL: Planning Domain Definition Language—a standard encoding for defining environments, operators, and goals in symbolic planning.

TAMP: Task and Motion Planning—a framework integrating high-level symbolic reasoning (what to do) with low-level motion generation (how to move).

Operator Discovery: The process of identifying new action schemas (operators) when existing ones are insufficient to reach a goal.

PPO: Proximal Policy Optimization—a popular reinforcement learning algorithm used for training the robot's control policies.

Dense Reward: A continuous feedback signal provided at every step (e.g., distance to target) to guide learning, as opposed to a sparse reward given only upon success.

Reward Shaping: Modifying the reward function to include additional guidance, helping the agent learn faster.

MimicGen: A data generation system for robotic manipulation, used here as the simulation environment.

Grounding: Connecting abstract symbols (like 'open-drawer') to concrete physical states or actions in the real world.

Plannable States: States from which the symbolic planner, using only the operators it already knows, can find a sequence of operators to reach the goal.