Evaluation Setup
P-RAG is evaluated on two embodied AI simulation benchmarks: ALFRED and MINI-BEHAVIOR.
Benchmarks:
- ALFRED: embodied instruction following for household tasks
- MINI-BEHAVIOR: grid-world embodied everyday tasks
Metrics:
- Success Rate (SR)
- Goal Condition Success (GC)
Statistical methodology: not explicitly reported in the paper
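The two metrics above can be illustrated with a minimal sketch. It assumes the usual ALFRED-style definitions: an episode counts as a success only when every goal condition is met, and goal condition success is the per-episode fraction of goal conditions met, averaged over episodes. The `Episode` class, its field names, and the sample numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record of one evaluation episode."""
    goals_satisfied: int  # goal conditions met when the episode ends
    goals_total: int      # goal conditions the task requires


def success_rate(episodes):
    """Fraction of episodes in which every goal condition is met."""
    return sum(e.goals_satisfied == e.goals_total for e in episodes) / len(episodes)


def goal_condition_success(episodes):
    """Mean per-episode fraction of goal conditions met (assumed averaging)."""
    return sum(e.goals_satisfied / e.goals_total for e in episodes) / len(episodes)


# Illustrative data only: two full successes, one partial, one failure.
episodes = [Episode(3, 3), Episode(1, 2), Episode(0, 4), Episode(2, 2)]
sr = success_rate(episodes)
gc = goal_condition_success(episodes)
print(sr, gc)  # 0.5 0.625
```

GC is never lower than SR under these definitions, because a partially completed episode contributes to GC but not to SR.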
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ | Finding |
| --- | --- | --- | --- | --- | --- |
| MINI-BEHAVIOR | Success Rate | 0.20 | 0.55 | +0.35 | P-RAG shows iterative improvement across rounds in MINI-BEHAVIOR, validating the progressive mechanism. |
| ALFRED / MINI-BEHAVIOR | Success Rate | Not reported in the paper | Not reported in the paper | Not reported in the paper | P-RAG outperforms baselines that lack the specific progressive retrieval mechanism. |
Main Takeaways
- P-RAG removes the need for ground-truth action sequences by drawing on the agent's own interaction history.
- The progressive retrieval mechanism lets the agent learn from its own experience (self-iteration) without parameter updates, improving success rates across rounds.
- Incorporating both scene graph similarity and task similarity in retrieval provides more relevant context than task similarity alone.
- The method generalizes across different embodied environments (ALFRED and MINI-BEHAVIOR).
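The retrieval idea in these takeaways can be sketched as follows. This is a toy illustration under loud assumptions, not the paper's implementation: similarity is a plain Jaccard overlap over task words and scene object names, the scene graph is reduced to a set of object names, and `Experience`, `Memory`, `run_rounds`, `placeholder_policy`, and `task_weight` are hypothetical names introduced here. The policy is a stand-in for the planner and always reports failure.

```python
from dataclasses import dataclass, field


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for a learned similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class Experience:
    task: str
    scene_objects: frozenset  # simplified stand-in for a scene graph
    actions: list
    success: bool


@dataclass
class Memory:
    entries: list = field(default_factory=list)

    def add(self, exp):
        self.entries.append(exp)

    def retrieve(self, task, scene_objects, k=1, task_weight=0.5):
        """Rank stored experiences by a weighted mix of task and scene similarity."""
        def score(e):
            task_sim = jaccard(set(task.lower().split()), set(e.task.lower().split()))
            scene_sim = jaccard(set(scene_objects), set(e.scene_objects))
            return task_weight * task_sim + (1 - task_weight) * scene_sim
        return sorted(self.entries, key=score, reverse=True)[:k]


def run_rounds(tasks, policy, memory, rounds=2, k=1):
    """Run each task once per round; store every trajectory for later rounds."""
    context_sizes = []
    for _ in range(rounds):
        for task, scene in tasks:
            context = memory.retrieve(task, scene, k=k)
            actions, success = policy(task, scene, context)
            memory.add(Experience(task, frozenset(scene), actions, success))
            context_sizes.append(len(context))
    return context_sizes


def placeholder_policy(task, scene, context):
    """Hypothetical planner stub: returns no actions and reports failure."""
    return [], False


# Two stored experiences share a task description but differ in scene.
task = "put the apple in the fridge"
kitchen_a = Experience(task, frozenset({"apple", "fridge", "table"}), [], True)
kitchen_b = Experience(task, frozenset({"apple", "fridge", "sink", "counter"}), [], True)
memory = Memory([kitchen_a, kitchen_b])
query_scene = {"apple", "fridge", "sink"}

best = memory.retrieve(task, query_scene)[0]                         # scene breaks the tie
task_only = memory.retrieve(task, query_scene, task_weight=1.0)[0]   # tie, first entry wins

# No parameter updates: only the memory grows between rounds.
fresh = Memory()
sizes = run_rounds([(task, {"apple", "fridge"})], placeholder_policy, fresh)
```

With task similarity alone, the two stored experiences are indistinguishable; adding scene similarity selects the one recorded in the more similar scene. In `run_rounds`, the first round retrieves nothing and the second round retrieves the first round's trajectory, which is the sense in which the context improves without any weight changes.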