Evaluation Setup
Real-world robotic manipulation with a WidowX arm, evaluated on generalization to new objects, scenes, viewpoints, and instructions.
Benchmarks:
- Generalization Suite (Real-world robot manipulation) [New]
Metrics:
- Success Rate
- Statistical methodology: 314 total trials per approach.
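With 314 trials per approach, a success rate is a binomial proportion, so it can be reported with an uncertainty interval. The sketch below is one way to do that, assuming a Wilson score interval at roughly 95% confidence; the source does not say which interval (if any) the authors used. The function names and the success count of 157 are hypothetical placeholders, not figures from the paper.

```python
import math


def success_rate(successes: int, trials: int) -> float:
    """Fraction of trials that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    return successes / trials


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    p = success_rate(successes, trials)
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 157 successes is a placeholder, NOT a reported result.
TRIALS_PER_APPROACH = 314  # from the evaluation setup
example_successes = 157    # placeholder count for illustration only

rate = success_rate(example_successes, TRIALS_PER_APPROACH)
low, high = wilson_interval(example_successes, TRIALS_PER_APPROACH)
print(f"success rate = {rate:.3f}, interval = [{low:.3f}, {high:.3f}]")
```

The Wilson interval is chosen here because it stays within [0, 1] and behaves reasonably near 0% or 100% success, where the simpler normal approximation does not.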
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ECoT significantly outperforms baselines on generalization tasks involving unseen objects and scenes. |
| Generalization Suite (Avg) | Success Rate | Not explicitly reported as a single average in snippet | Not explicitly reported as a single average in snippet | +28% (absolute) |
| Hardest Tasks Subset | Success Rate | 32% | 80% | +48% |
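The Δ values in the table are differences between two success rates, which can be read either as an absolute gain in percentage points or as a relative ratio. The sketch below works through the hardest-tasks row (32% baseline, 80% for this paper) to show that its +48% is the absolute reading. The helper names are hypothetical and used only for illustration.

```python
def absolute_gain_pp(baseline_pct: float, new_pct: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return new_pct - baseline_pct


def relative_ratio(baseline_pct: float, new_pct: float) -> float:
    """How many times higher the new rate is than the baseline."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive for a relative ratio")
    return new_pct / baseline_pct


# Hardest-tasks subset, values taken from the table above.
hardest_abs = absolute_gain_pp(32.0, 80.0)  # 48.0 percentage points
hardest_rel = relative_ratio(32.0, 80.0)    # 2.5x the baseline rate

print(f"absolute gain: +{hardest_abs:.0f} pp, relative: {hardest_rel:.1f}x")
```

The same distinction applies to the suite-level +28%, which the table marks as absolute. Its relative ratio cannot be computed here because the underlying averages are not reported.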
Main Takeaways
- Embodied reasoning (ECoT) bridges the gap between semantic understanding and low-level control, raising generalization success rate by 28% (absolute).
- Purely semantic 'Naïve CoT' is insufficient; grounding reasoning in visual features like bounding boxes is critical for robot performance.
- A 7B parameter model with ECoT can outperform a 55B parameter model (RT-2-X) that lacks explicit reasoning steps.
- Exposing the reasoning chain allows for effective human-in-the-loop correction via natural language, which is impossible with black-box policies.