TidyBot: Personalized Robot Assistance with Large Language Models

📝 Paper Summary

User-profile based personalization · LLM-based recommendation · Agentic AI

TidyBot uses the few-shot summarization capabilities of Large Language Models to infer generalized, personalized household cleanup rules from a small number of user examples.

Core Problem

Robots need to tidy up rooms according to highly variable, personalized user preferences (e.g., 'shirts in drawer' vs. 'shirts in closet') without requiring large training datasets for each user.

Why it matters:

Classical approaches require tedious manual specification of target locations for every object
Generic rules averaged over many users fail to capture individual cultural or personal storage preferences
Existing personalization methods (collaborative filtering, latent vectors) require large datasets that are expensive to collect and may not generalize well

Concrete Example: A user might want 'yellow shirts' in the drawer and 'dark purple shirts' in the closet. A standard system might put all shirts in one place, whereas TidyBot generalizes this to 'light clothes in drawer, dark clothes in closet'.

Key Novelty

Generalization via LLM Summarization for Robotics

Leverages LLM text summarization to convert a few specific user examples (e.g., specific shirts) into generalized rules (e.g., 'dark clothing')
Uses the summarized nouns as open-vocabulary labels for an image classifier (CLIP), bridging high-level text rules with visual perception
Infers not just placement locations but also manipulation primitives (e.g., 'toss' vs. 'place') from text examples

Architecture

Conceptual illustration of the TidyBot pipeline.

Evaluation Highlights

Achieves 91.2% accuracy on unseen objects in a new text-based benchmark, outperforming baselines such as WordNet (67.5%) and RoBERTa embeddings (77.8%)
Real-world mobile manipulator (TidyBot) successfully puts away 85.0% of objects in physical test scenarios
Demonstrates ability to generalize across diverse sorting criteria including category, attribute, function, and subcategory

Breakthrough Assessment

8/10

Novel application of LLM summarization to solve the specific robotic problem of personalized generalization. Strong real-world deployment and a new benchmark, though limited to specific cleanup tasks.

⚙️ Technical Details

Problem Definition

Setting: Robotic household cleanup where objects must be moved to 'proper places' based on sparse user examples

Inputs: A few text examples of object placements (e.g., 'yellow shirt' -> 'drawer') and a list of unseen objects in the scene

Outputs: Generalized rules (text summaries) and specific placement/manipulation commands for unseen objects
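
The inputs and outputs above can be pictured as plain data records. This is an illustrative sketch only: the class and field names (`PlacementExample`, `CleanupTask`, `PlacementCommand`) and the default primitive are assumptions made for this summary, not names from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacementExample:
    """One user-provided example: an object and where it belongs."""
    obj: str
    receptacle: str


@dataclass
class CleanupTask:
    """System input: a few examples plus the unseen objects in the scene."""
    examples: list[PlacementExample]
    unseen_objects: list[str]


@dataclass(frozen=True)
class PlacementCommand:
    """System output for one unseen object."""
    obj: str
    receptacle: str
    primitive: str = "pick_and_place"  # hypothetical default, not from the paper


task = CleanupTask(
    examples=[
        PlacementExample("yellow shirt", "drawer"),
        PlacementExample("dark purple shirt", "closet"),
    ],
    unseen_objects=["white socks", "black shirt"],
)
command = PlacementCommand("black shirt", "closet")
```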

Pipeline Flow

User Example Input
LLM Summarization
Perception & Grounding
Robotic Execution

System Modules

LLM Summarizer (Reasoning)

Summarize specific user examples into generalized rules (as code comments)

Model or implementation: text-davinci-003 (GPT-3)
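
A minimal sketch of how such a summarization prompt might be assembled, assuming a Python-like prompt that ends in an open comment for the model to complete. The function name `build_summary_prompt` and the exact prompt layout are assumptions; the summary only states that rules are produced as code comments.

```python
import json


def build_summary_prompt(examples, receptacles):
    """Format user examples as code-like lines ending in an open summary comment."""
    objects = [obj for obj, _ in examples]
    lines = [
        f"objects = {json.dumps(objects)}",
        f"receptacles = {json.dumps(receptacles)}",
    ]
    for obj, rec in examples:
        lines.append(f"pick_and_place({json.dumps(obj)}, {json.dumps(rec)})")
    # The LLM is expected to continue this comment with a generalized rule.
    lines.append("# Summary:")
    return "\n".join(lines)


summary_prompt = build_summary_prompt(
    [("yellow shirt", "drawer"), ("dark purple shirt", "closet")],
    ["drawer", "closet"],
)
```

Ending the prompt on the comment marker nudges the model to write the rule as natural-language text rather than more code.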

LLM Planner (Reasoning)

Apply generalized rules to determine placement for unseen objects

Model or implementation: text-davinci-003 (GPT-3)
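
The planning step could be sketched as a second prompt that leads with the rule, followed by parsing the model's completion. Everything here is hypothetical scaffolding: `build_placement_prompt`, `parse_placements`, and the `pick_and_place(...)` line format are assumptions, and `fake_completion` stands in for a real model response.

```python
import json
import re

# Matches lines such as: pick_and_place("white socks", "drawer")
CALL_PATTERN = re.compile(r'pick_and_place\(\s*"([^"]+)"\s*,\s*"([^"]+)"\s*\)')


def build_placement_prompt(summary, unseen_objects, receptacles):
    """Put the rule first as a comment, then list what still needs placing."""
    return "\n".join([
        f"# Summary: {summary}",
        f"objects = {json.dumps(unseen_objects)}",
        f"receptacles = {json.dumps(receptacles)}",
    ]) + "\n"


def parse_placements(completion):
    """Extract object -> receptacle pairs from pick_and_place(...) lines."""
    return dict(CALL_PATTERN.findall(completion))


placement_prompt = build_placement_prompt(
    "Put light-colored clothes in the drawer and dark-colored clothes in the closet.",
    ["white socks", "black shirt"],
    ["drawer", "closet"],
)

# Stand-in for an LLM completion; a real system would send placement_prompt to the model.
fake_completion = (
    'pick_and_place("white socks", "drawer")\n'
    'pick_and_place("black shirt", "closet")\n'
)
placements = parse_placements(fake_completion)
```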

Open-Vocabulary Classifier

Identify objects in the real world using categories extracted from LLM summaries

Model or implementation: CLIP
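
Zero-shot classification with a model like CLIP amounts to picking the text label whose embedding is most similar to the image embedding. The sketch below shows only that scoring step; the three-dimensional vectors are made-up stand-ins for CLIP's image and text encoder outputs, and `classify` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_vec, label_vecs):
    """Return the label whose embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))


# Labels would come from the LLM summary; embeddings here are toy values.
label_vecs = {
    "light-colored clothing": [0.9, 0.1, 0.0],
    "dark-colored clothing": [0.1, 0.9, 0.0],
}
image_vec = [0.2, 0.8, 0.1]  # pretend embedding of a photo of a black shirt
predicted_label = classify(image_vec, label_vecs)
```

Because the label set is just a dictionary of strings, it can be swapped per user without retraining anything.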

Robot Controller

Execute physical cleanup

Model or implementation: Not explicitly named (Mobile Manipulator)
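
To illustrate how a chosen manipulation primitive might map to robot steps, here is a small sketch. The two primitives are the ones named in this summary; the step breakdown and the names `Primitive` and `plan_steps` are assumptions, and no real robot API is involved.

```python
from enum import Enum


class Primitive(Enum):
    PICK_AND_PLACE = "pick_and_place"
    PICK_AND_TOSS = "pick_and_toss"


def plan_steps(obj, receptacle, primitive):
    """Return a human-readable step list for one object (hypothetical breakdown)."""
    steps = [f"navigate to {obj}", f"grasp {obj}", f"navigate to {receptacle}"]
    if primitive is Primitive.PICK_AND_PLACE:
        steps.append(f"lower {obj} into {receptacle} and release")
    elif primitive is Primitive.PICK_AND_TOSS:
        steps.append(f"toss {obj} into {receptacle}")
    else:
        raise ValueError(f"unknown primitive: {primitive!r}")
    return steps


toss_steps = plan_steps("white socks", "laundry basket", Primitive.PICK_AND_TOSS)
```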

Novel Architectural Elements

Pipeline chaining LLM summarization directly into open-vocabulary perception: Nouns from the LLM summary become the dynamic label set for CLIP

Modeling

Base Model: text-davinci-003 (GPT-3 variant)

Training Method: In-context learning (Few-shot prompting)

Adaptation: None (Prompt engineering only)

Trainable Parameters: None (Frozen LLM)

Key Hyperparameters:

temperature: 0
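
Temperature 0 is conventionally read as greedy decoding: the most likely token is always chosen, which makes outputs repeatable. The sketch below shows the textbook temperature-scaled softmax with that convention; how the hosted model handles temperature internally is not described in this summary, so treat this as background rather than the system's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Token probabilities after temperature scaling; 0 means pure argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 2.0]
greedy = softmax_with_temperature(logits, 0)
sampled = softmax_with_temperature(logits, 1.0)
```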

Compute: Not reported in the paper

Comparison to Prior Work

vs. WordNet: TidyBot uses LLM summarization rather than fixed ontology paths, allowing for attribute-based (e.g., color) generalization
vs. Text Embeddings: TidyBot generates explicit interpretable rules rather than relying on latent space proximity, capturing complex user preferences better
vs. SOTA Rearrangement [not cited in paper]: Unlike CLIP-Fields, which builds semantic fields of a scene, TidyBot focuses on few-shot personalization via text

Limitations

Relies on the performance of the underlying LLM; failures in summarization lead to system failure
Perception is limited by the capabilities of CLIP and the quality of nouns extracted from the summary
Physical manipulation primitives are pre-determined (pick-and-place, pick-and-toss) and not learned
Requires users to provide text examples, which might still be tedious compared to purely passive observation

Reproducibility

Code: https://tidybot.cs.princeton.edu

Publicly available: Benchmark dataset and code at https://tidybot.cs.princeton.edu. Missing: Details of the hardware controller code. Relies on a closed-source model (text-davinci-003).

📊 Experiments & Results

Evaluation Setup

Text-based benchmark for preference generalization and real-world robotic cleanup trials

Benchmarks:

Benchmark Dataset (Text-based object placement prediction) [New]

Metrics:

Placement Accuracy (Seen vs. Unseen objects)
Real-world success rate (objects correctly put away)
Statistical methodology: Not explicitly reported in the paper
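
Placement accuracy can be computed as the fraction of objects assigned to the expected receptacle, reported separately for seen and unseen objects. The sketch below uses made-up objects and predictions purely to show the calculation; `placement_accuracy` is a hypothetical helper, not the benchmark's evaluation code.

```python
def placement_accuracy(predicted, expected):
    """Fraction of objects whose predicted receptacle matches the expected one."""
    if not expected:
        raise ValueError("expected placements must not be empty")
    correct = sum(predicted.get(obj) == rec for obj, rec in expected.items())
    return correct / len(expected)


# Toy data: seen objects appeared in the user's examples, unseen ones did not.
expected_seen = {"yellow shirt": "drawer", "dark purple shirt": "closet"}
expected_unseen = {
    "white socks": "drawer",
    "black shirt": "closet",
    "gray hoodie": "closet",
    "beige scarf": "drawer",
}
predicted = {
    "yellow shirt": "drawer",
    "dark purple shirt": "closet",
    "white socks": "drawer",
    "black shirt": "closet",
    "gray hoodie": "closet",
    "beige scarf": "closet",  # one deliberate mistake
}

seen_accuracy = placement_accuracy(predicted, expected_seen)
unseen_accuracy = placement_accuracy(predicted, expected_unseen)
```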

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Benchmark Dataset	Accuracy (Unseen Objects)	78.5 (direct few-shot LLM, no summarization)	91.2	+12.7
Benchmark Dataset	Accuracy (Unseen Objects)	67.5 (WordNet)	91.2	+23.7
Benchmark Dataset	Accuracy (Unseen Objects)	77.8 (RoBERTa embeddings)	91.2	+13.4
Physical Robot Trials	Success Rate	Not reported in the paper	85.0	Not reported in the paper
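
The Δ column is the difference in percentage points between the summarization approach and each baseline. A quick check of the arithmetic, using the numbers from the table; the WordNet and RoBERTa labels follow the Evaluation Highlights, and the 78.5 entry is keyed by its value only.

```python
ours = 91.2  # unseen-object accuracy of the summarization approach, in percent

baselines = {
    "WordNet": 67.5,
    "RoBERTa embeddings": 77.8,
    "baseline at 78.5": 78.5,
}

# Round to one decimal to hide floating-point noise in the subtraction.
deltas = {name: round(ours - score, 1) for name, score in baselines.items()}
```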

Main Takeaways

Summarization is key: Explicitly asking the LLM to summarize examples into a rule before applying it significantly improves accuracy over direct few-shot inference.
LLMs generalize better than taxonomies: Hand-crafted ontologies like WordNet fail on attribute-based or function-based sorting (e.g., 'summer clothes'), where LLMs excel.
Interpretable Perception: Using LLM-generated summaries to drive open-vocabulary perception creates a flexible system that doesn't need re-training for new object categories.
The system handles multiple sorting criteria effectively, including category, attribute, function, and subcategory sorting.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and few-shot prompting
Open-vocabulary image classification (CLIP)
Basic robotic manipulation concepts (pick-and-place)

Key Terms

LLM: Large Language Model, an AI model trained on vast amounts of text and capable of reasoning and summarization

CLIP: Contrastive Language-Image Pre-training, a model that connects text descriptions with images, allowing classification using arbitrary text labels

few-shot summarization: Asking an AI model to create a summary or rule based on a very small number of examples provided in the prompt

manipulation primitive: A basic robotic action, such as 'pick and place' or 'pick and toss'

WordNet: A lexical database of English that groups words into sets of synonyms and links them in a taxonomy, used here as a baseline for object categorization

RoBERTa: A robustly optimized BERT pre-training approach, used here to generate text embeddings for baseline comparisons