LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots

📝 Paper Summary

Robotic Task Planning · Agentic Personalization · Household Robotics

LLM-Personalize aligns robotic agents to household-specific preferences by combining imitation learning for initial capability with reinforced self-training for iterative preference learning.

Core Problem

General-purpose LLM planners lack alignment with specific household preferences (personalization), often failing to place objects where specific users want them despite understanding physical affordances.

Why it matters:

General LLM knowledge reflects common sense, which may conflict with unique individual habits (e.g., placing a mug in a cabinet vs. on a table)
Existing grounding methods focus on physical feasibility (affordances) or static scene graphs, neglecting the personalization gap in household robotics
Scalability to long-horizon, partially observable tasks remains challenging for standard LLM prompting methods

Concrete Example: One household may prefer a coffee mug on the dining table, while another prefers it in a kitchen cabinet. A standard LLM might default to a generic location, failing to satisfy the specific user's 'correct' placement criteria.

Key Novelty

Imitation-Bootstrapped Reinforced Self-Training for Robotics

Bootstraps the planner using Imitation Learning on a demonstrator to ensure it can parse complex contexts and generate executable plans
Refines the planner via Reinforced Self-Training (ReST), where the model iteratively generates plans, filters them based on user preference rewards, and fine-tunes on the successful examples
Uses a dynamic scene graph within the prompt to handle partial observability, updating the agent's belief state as it explores the house

Architecture

The overall framework of LLM-Personalize in the Housekeep environment.

Evaluation Highlights

Reports a more than 30% increase in success rate over existing LLM planners (Song et al.; SayCan, Ahn et al.; SayPlan, Rana et al.) on the Housekeep benchmark
Demonstrates significantly improved alignment with human preferences in long-horizon object rearrangement tasks

Breakthrough Assessment

7/10

Addresses a critical gap (personalization) in embodied AI with a sound methodology (ReST), showing strong reported gains, though the core technique is an application of existing RL/LLM methods to a new domain.

⚙️ Technical Details

Problem Definition

Setting: Object rearrangement in multi-room, partially observable 3D household environments (Housekeep benchmark)

Inputs: Egocentric observations of receptacles/objects, current graph state G_t, and high-level task instructions

Outputs: Sequence of high-level actions (e.g., 'go to kitchen', 'pick up mug') to explore and rearrange objects

Pipeline Flow

Context Generator: Observation -> Update Graph -> Prompt
LLM Planner: Prompt -> Generate High-Level Plan
Controller: High-Level Plan -> Low-Level Actions -> Execution
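
The three stages above can be read as one replanning loop. The Python sketch below is illustrative only: every name in it (`update_graph`, `build_prompt`, `run_episode`, and the callables passed in) is a hypothetical stand-in, not the paper's code or the Housekeep API, and the scene graph is reduced to a flat receptacle-to-objects map for brevity.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases; the paper does not specify these interfaces.
Observation = Dict[str, List[str]]  # receptacle -> objects currently seen there
Graph = Dict[str, List[str]]        # accumulated receptacle -> objects map


def update_graph(graph: Graph, obs: Observation) -> Graph:
    """Merge the newest observation into the scene graph (rule-based)."""
    merged = dict(graph)
    merged.update(obs)
    return merged


def build_prompt(graph: Graph, instruction: str) -> str:
    """Serialize the task instruction and the current graph into a prompt."""
    lines = [f"{rec}: {', '.join(objs) or 'empty'}" for rec, objs in sorted(graph.items())]
    return instruction + "\nKnown state:\n" + "\n".join(lines)


def run_episode(
    observe: Callable[[], Observation],
    planner: Callable[[str], List[str]],
    execute: Callable[[str], None],
    instruction: str,
    max_rounds: int = 3,
) -> Graph:
    """Observe -> update graph -> prompt -> plan -> execute, repeated."""
    graph: Graph = {}  # the scene graph starts empty
    for _ in range(max_rounds):
        graph = update_graph(graph, observe())
        plan = planner(build_prompt(graph, instruction))
        if not plan:  # the planner returns no actions when it has nothing left to do
            break
        for action in plan:
            execute(action)  # the controller would map this to low-level control
    return graph
```

Passing the planner and controller in as callables keeps the loop independent of any particular LLM or simulator, which is a choice made for this sketch rather than something the summary states.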

System Modules

Context Generator

Maintains a dynamic scene graph of the household and constructs prompts including state description, instructions, and in-context examples

Model or implementation: Rule-based graph update
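
As a rough illustration of what a rule-based dynamic scene graph could look like, the sketch below keeps a room, receptacle, and object hierarchy and renders it as text for the prompt's state description. The `SceneGraph` class, its methods, and the "trust the newest sighting" update rule are assumptions made for this example; the summary does not describe the actual update rules.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class SceneGraph:
    """Hypothetical room -> receptacle -> objects belief state."""
    rooms: Dict[str, Dict[str, Set[str]]] = field(default_factory=dict)

    def locate(self, obj: str) -> Optional[str]:
        """Return the receptacle currently believed to hold obj, if any."""
        for receptacles in self.rooms.values():
            for rec, objs in receptacles.items():
                if obj in objs:
                    return rec
        return None

    def observe(self, room: str, receptacle: str, objects: Set[str]) -> None:
        """Assumed update rule: trust the newest sighting of each object."""
        for receptacles in self.rooms.values():
            for objs in receptacles.values():
                objs -= objects  # drop stale locations of re-sighted objects
        self.rooms.setdefault(room, {})[receptacle] = set(objects)

    def describe(self) -> str:
        """Render the graph as text for the state section of a prompt."""
        lines = []
        for room in sorted(self.rooms):
            for rec in sorted(self.rooms[room]):
                contents = ", ".join(sorted(self.rooms[room][rec])) or "nothing"
                lines.append(f"{room} / {rec}: {contents}")
        return "\n".join(lines)
```

Because the graph starts empty and only grows through `observe`, the text produced by `describe` reflects what the agent has seen so far rather than the full house, which is how partial observability shows up in the prompt.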

LLM Planner (Decision Making)

Generates a sequence of high-level actions based on the context prompt

Model or implementation: Pre-trained autoregressive LLM (fine-tuned)

Parser (Decision Making)

Extracts structured actions and target entities from the LLM's natural language output

Model or implementation: Rule-based parser
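
A rule-based parser of this kind can be approximated with a few regular expressions. In the sketch below the action grammar (`go to`, `pick up`, `place ... on/in ...`), the `Action` tuple, and `parse_plan` are all hypothetical; the paper's actual output format may differ.

```python
import re
from typing import List, NamedTuple, Optional


class Action(NamedTuple):
    verb: str
    target: str
    destination: Optional[str] = None


# Hypothetical action grammar, chosen only for illustration.
_PATTERNS = [
    (re.compile(r"^go to (?:the )?(?P<target>.+)$"), "go_to"),
    (re.compile(r"^pick up (?:the )?(?P<target>.+)$"), "pick_up"),
    (
        re.compile(r"^place (?:the )?(?P<target>.+?) (?:on|in) (?:the )?(?P<destination>.+)$"),
        "place",
    ),
]


def parse_plan(text: str) -> List[Action]:
    """Extract structured actions from free-form planner output, line by line."""
    actions = []
    for raw in text.splitlines():
        # Strip list numbering or bullets, trailing periods, and letter case.
        line = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", raw).strip().rstrip(".").lower()
        # Lines that match no pattern are skipped as non-executable text.
        for pattern, verb in _PATTERNS:
            match = pattern.match(line)
            if match:
                groups = match.groupdict()
                actions.append(Action(verb, groups["target"], groups.get("destination")))
                break
    return actions
```

Skipping unmatched lines, rather than raising an error, is one possible policy; a stricter parser could reject the whole plan so that non-executable outputs are caught early.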

Controller

Maps high-level actions to low-level control primitives and executes them in the simulator

Model or implementation: Off-the-shelf Housekeep controller

Novel Architectural Elements

Iterative planning loop where the LLM re-plans after executing a sequence, updating a dynamic scene graph that starts empty
Integration of an expert Demonstrator module specifically designed to generate 'clean' training data of two kinds (exploration episodes and single-object rearrangement episodes) for the LLM to imitate

Modeling

Base Model: Pre-trained autoregressive LLM. The specific model is not named in the summarized text; the discussion of fine-tuning APIs suggests an API-served model such as GPT-3.5.

Training Method: Imitation Learning followed by Reinforced Self-Training (ReST)

Objective Functions:

Purpose: Supervised fine-tuning on demonstration or filtered self-generated data.

Formally: $\mathcal{L}_{\mathrm{NLL}} = -\mathbb{E}\left[\sum_{\tau} \log P_\theta(y_\tau \mid y_{1:\tau-1}, x)\right]$

Adaptation: Supervised Fine-Tuning (SFT) on filtered datasets
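
To make the NLL objective concrete, the sketch below computes it from per-token probabilities. It assumes those probabilities are already available as plain floats (in practice they would come from the model's output distribution), and the function names are illustrative.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one target sequence.

    token_probs[i] is the probability the model assigns to the i-th target
    token, given the prompt x and the preceding target tokens.
    """
    return -sum(math.log(p) for p in token_probs)


def batch_nll(batch: Sequence[Sequence[float]]) -> float:
    """Sample estimate of the expectation: mean sequence NLL over a batch."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)
```

For a two-token plan whose tokens receive probabilities 0.5 and 0.25, the loss is -(ln 0.5 + ln 0.25) = ln 8, roughly 2.08; pushing both probabilities toward 1 drives the loss toward 0.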

Training Data:

D_demo: Generated by a demonstrator that either visits all rooms (exploration) or rearranges a single discovered object
D_self-train: Collected by the agent interacting with training tasks, filtered for episodes with positive rewards (r > 0)
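
The collection of D_self-train and the subsequent fine-tuning can be sketched as a generate, filter, retrain loop. The code below is a minimal illustration under stated assumptions: `rollout`, `make_rollout`, and `fine_tune` are hypothetical callables standing in for environment interaction and an LLM fine-tuning call, and the sample counts are arbitrary defaults, not values from the paper.

```python
from typing import Any, Callable, List, Tuple

Pair = Tuple[str, str]                        # (task prompt, generated plan)
Rollout = Callable[[str], Tuple[str, float]]  # task -> (plan, reward)


def collect_and_filter(
    rollout: Rollout,
    tasks: List[str],
    samples_per_task: int = 4,
    threshold: float = 0.0,
) -> List[Pair]:
    """Sample plans on training tasks and keep those with reward > threshold."""
    kept: List[Pair] = []
    for task in tasks:
        for _ in range(samples_per_task):
            plan, reward = rollout(task)
            if reward > threshold:  # corresponds to the r > 0 filter
                kept.append((task, plan))
    return kept


def rest(
    policy: Any,
    make_rollout: Callable[[Any], Rollout],
    fine_tune: Callable[[Any, List[Pair]], Any],
    tasks: List[str],
    iterations: int = 2,
) -> Any:
    """Alternate data collection with supervised fine-tuning on the kept pairs."""
    for _ in range(iterations):
        data = collect_and_filter(make_rollout(policy), tasks)
        if data:  # skip the update when no episode passed the filter
            policy = fine_tune(policy, data)
    return policy
```

Because only positively rewarded episodes are kept, the fine-tuning step needs no negative examples, which fits the summary's note that the same supervised NLL loss is used on the filtered data.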

Comparison to Prior Work

vs. Song et al.: Adds personalization via ReST and dynamic scene graph maintenance
vs. Ahn et al.: Handles multi-room, partially observable environments rather than fully visible single-room scenes
vs. Wu et al.: Directly optimizes the planner via fine-tuning rather than inferring preference rules as text summaries

Limitations

Dependency on the quality of the 'Demonstrator' for the bootstrapping phase
Requires ground truth preference data (reward signal) during the Self-Training phase
Evaluation is limited to a simulated environment (Housekeep)
Uses an off-the-shelf low-level controller, which assumes that high-level actions are executed perfectly once planned

Reproducibility

Code availability is not explicitly provided in the text. The method relies on the Housekeep simulator and standard LLM fine-tuning APIs. The paper mentions using an off-the-shelf controller from the simulator.

📊 Experiments & Results

Evaluation Setup

Housekeep benchmark: 3D simulated household environments with misplaced objects

Benchmarks:

Housekeep: long-horizon object rearrangement (Clean Up)

Metrics:

Success Rate (rearrangement success)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

LLM-Personalize outperforms the compared LLM-planner baselines (Song et al.; SayCan, Ahn et al.; SayPlan, Rana et al.) by over 30% in success rate on the Housekeep benchmark.
The two-phase optimization (Imitation Learning + ReST) is crucial: IL ensures plan executability and basic context understanding, while ReST refines alignment with specific user preferences.
Iterative planning combined with a dynamic scene graph effectively handles partial observability in multi-room environments.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically Self-Training)
Large Language Models (In-context learning, Fine-tuning)
Robotic Task Planning (Scene graphs, Affordances)

Key Terms

ReST: Reinforced Self-Training—an algorithm where a model improves by generating its own data, filtering for high-quality samples (based on a reward), and retraining on them

Housekeep: A 3D simulated benchmark for household robotic agents focused on tidying and object rearrangement tasks

Scene Graph: A structured representation of the environment (rooms, receptacles, objects) that updates dynamically as the robot explores

IL: Imitation Learning—training an agent to mimic the behavior of an expert demonstrator

Affordance: The possibility of an action on an object or environment (e.g., a cup 'affords' being picked up)

NLL: Negative Log Likelihood—a loss function used in supervised fine-tuning to maximize the probability of the target tokens

DPO: Direct Preference Optimization—an alternative alignment method requiring paired positive/negative samples, which the authors avoid due to API limitations