Evaluation Setup
The evaluation combines a text-based benchmark for preference generalization with real-world robotic cleanup trials.
Benchmarks:
- Benchmark Dataset (Text-based object placement prediction) [New]
Metrics:
- Placement Accuracy (Seen vs. Unseen objects)
- Real-world success rate (objects correctly put away)
- Statistical methodology: Not explicitly reported in the paper
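The placement accuracy metric, split by seen and unseen objects, can be sketched as below. This is a minimal illustration and not the paper's evaluation code: the function name `placement_accuracy`, the object names, and the receptacle names are all hypothetical, and the toy data is invented only to show the computation.

```python
from typing import Dict, Set


def placement_accuracy(
    predicted: Dict[str, str], truth: Dict[str, str], objects: Set[str]
) -> float:
    """Fraction of `objects` whose predicted receptacle matches the ground truth."""
    if not objects:
        raise ValueError("objects must be non-empty")
    correct = sum(1 for obj in objects if predicted.get(obj) == truth[obj])
    return correct / len(objects)


# Hypothetical toy data (not from the paper).
truth = {
    "yellow shirt": "drawer",
    "white socks": "drawer",
    "dark jeans": "closet",
    "black hoodie": "closet",
}
predicted = {
    "yellow shirt": "drawer",
    "white socks": "closet",  # one mistake on an unseen object
    "dark jeans": "closet",
    "black hoodie": "closet",
}

# "Seen" objects appeared in the user's examples; "unseen" objects did not.
seen = {"yellow shirt", "dark jeans"}
unseen = {"white socks", "black hoodie"}

seen_acc = placement_accuracy(predicted, truth, seen)
unseen_acc = placement_accuracy(predicted, truth, unseen)
print(f"seen: {seen_acc:.2f}, unseen: {unseen_acc:.2f}")
```

Reporting the two splits separately matters because accuracy on unseen objects is what measures generalization of the inferred preferences.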
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Benchmark Dataset | Accuracy (Unseen Objects) | 78.5 | 91.2 | +12.7 |
| Benchmark Dataset | Accuracy (Unseen Objects) | 67.5 | 91.2 | +23.7 |
| Benchmark Dataset | Accuracy (Unseen Objects) | 77.8 | 91.2 | +13.4 |
| Physical Robot Trials | Success Rate | Not reported in the paper | 85.0 | Not reported in the paper |
Main Takeaways
- Summarization is key: Explicitly asking the LLM to summarize examples into a rule before applying it significantly improves accuracy over direct few-shot inference.
- LLMs generalize better than taxonomies: Hand-crafted ontologies like WordNet fail on attribute-based or function-based sorting (e.g., 'summer clothes'), where LLMs excel.
- Interpretable perception: Using LLM-generated summaries to drive open-vocabulary perception creates a flexible system that does not need retraining for new object categories.
- Multiple sorting criteria: The system handles sorting by category, attribute, function, and subcategory.
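The summarize-then-apply idea from the first takeaway can be sketched as a two-step pipeline: one LLM call turns example placements into a rule, and a second call applies that rule to a new object. This is a sketch under stated assumptions, not the paper's implementation: the prompt wording, the function names, and the `fake_llm` stub (which stands in for a real model so the example runs offline) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def build_summary_prompt(examples: Dict[str, str]) -> str:
    """Step 1 prompt: ask for general rules instead of a direct prediction."""
    lines = [f"- {obj} -> {receptacle}" for obj, receptacle in examples.items()]
    return (
        "Placements observed:\n"
        + "\n".join(lines)
        + "\nSummarize these placements as one general rule per receptacle."
    )


def build_placement_prompt(summary: str, obj: str, receptacles: List[str]) -> str:
    """Step 2 prompt: apply the summarized rules to a new object."""
    return (
        f"Rules: {summary}\n"
        f"Receptacles: {', '.join(receptacles)}\n"
        f"Where does '{obj}' go? Answer with one receptacle name."
    )


def place_object(llm: LLM, examples: Dict[str, str], obj: str) -> Optional[str]:
    """Summarize the examples into rules, then use the rules to place `obj`."""
    receptacles = sorted(set(examples.values()))
    summary = llm(build_summary_prompt(examples))
    answer = llm(build_placement_prompt(summary, obj, receptacles)).strip().lower()
    # Reject answers that are not one of the known receptacles.
    return answer if answer in receptacles else None


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model, with hard-coded behavior."""
    if prompt.startswith("Placements observed"):
        return "light-colored clothes go in the drawer; dark-colored clothes go in the closet"
    return "drawer" if "white" in prompt else "closet"


examples = {"yellow shirt": "drawer", "dark jeans": "closet"}
print(place_object(fake_llm, examples, "white socks"))
print(place_object(fake_llm, examples, "black hoodie"))
```

Because the intermediate summary is plain text, it can be inspected by a person, which is one reason a rule-first design is easier to debug than direct few-shot prediction.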