Evaluation Setup
Multi-turn agent interaction across five environments: WebShop, SearchQA, TextCraft, AlfWorld, and BabyAI.
Benchmarks:
- WebShop (Web navigation / e-commerce)
- SearchQA (Search-augmented QA)
- TextCraft (Minecraft crafting game)
- AlfWorld (Embodied household tasks)
- BabyAI (Gridworld navigation)
Metrics:
- avg@8 (Success Rate)
- Average Interaction Turns
- Average Generated Tokens
- Statistical methodology: Not explicitly reported in the paper
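The summary does not spell out how avg@8 is computed. The sketch below assumes the common reading: each task is attempted 8 times, the per-task success rate is the fraction of successful attempts, and the reported score is the mean over tasks, in percent. The function name `avg_at_k` and the input layout (one list of 0/1 outcomes per task) are illustrative assumptions, not taken from the paper.

```python
from statistics import mean


def avg_at_k(outcomes_per_task, k=8):
    """Mean success rate (percent) over k rollouts per task, averaged over tasks.

    Assumption: outcomes_per_task is a list with one entry per task, and each
    entry is a list of k binary outcomes (1 = success, 0 = failure).
    """
    per_task_rates = []
    for outcomes in outcomes_per_task:
        if len(outcomes) != k:
            raise ValueError(f"expected {k} rollouts per task, got {len(outcomes)}")
        per_task_rates.append(sum(outcomes) / k)
    return 100.0 * mean(per_task_rates)


# Hypothetical outcomes for two tasks (not data from the paper).
example = [
    [1, 1, 0, 0, 1, 0, 1, 1],  # 5 of 8 rollouts succeed
    [0, 0, 0, 0, 0, 0, 0, 0],  # no rollout succeeds
]
score = avg_at_k(example)
print(score)
```

Average Interaction Turns and Average Generated Tokens could be aggregated the same way, replacing the binary outcome with the per-rollout turn or token count.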
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Intra-environment experiments show RFT generalizes well from easy tasks to hard tasks within the same domain. | | | | |
| WebShop (Hard Tasks) | avg@8 | 17.4 | 77.5 | +60.1 |
| AlfWorld (Hard Tasks) | avg@8 | 14.5 | 49.4 | +34.9 |
| Inter-environment experiments reveal transfer is possible but highly asymmetric and environment-dependent. | | | | |
| AlfWorld (Held-In) | avg@8 | 13.19 | 91.81 | +78.62 |
| Average Held-Out Envs | avg@8 | 29.28 | 34.19 | +4.91 |
| WebShop | avg@8 | 28.59 | 10.25 | -18.34 |
| Sequential training experiments demonstrate that agents can learn new tasks without forgetting old ones. | | | | |
| TextCraft (Downstream) | avg@8 | 80.88 | 82.50 | +1.62 |
| WebShop (Upstream) | avg@8 | 86.50 | 86.32 | -0.18 |
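The Δ column is consistent with a simple difference in percentage points, This Paper minus Baseline. The short check below recomputes it from the reported values. The `rows` layout and the `delta` helper are illustrative assumptions, and rounding to two decimals is a choice made here to absorb floating-point noise.

```python
# (benchmark, baseline avg@8, this-paper avg@8) as reported in the table.
rows = [
    ("WebShop (Hard Tasks)", 17.4, 77.5),
    ("AlfWorld (Hard Tasks)", 14.5, 49.4),
    ("AlfWorld (Held-In)", 13.19, 91.81),
    ("Average Held-Out Envs", 29.28, 34.19),
    ("WebShop", 28.59, 10.25),
    ("TextCraft (Downstream)", 80.88, 82.50),
    ("WebShop (Upstream)", 86.50, 86.32),
]


def delta(baseline, score):
    """Difference in percentage points, rounded to two decimals."""
    return round(score - baseline, 2)


deltas = {name: delta(base, score) for name, base, score in rows}
for name, value in deltas.items():
    print(f"{name}: {value:+.2f}")
```

A negative value, as in the WebShop rows, means the trained agent scores below the baseline on that benchmark.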
Main Takeaways
- RFT significantly improves interaction efficiency (reducing steps and tokens) alongside success rate within the same environment.
- Generalization to unseen environments is strongly correlated with the similarity of action spaces and feedback mechanisms; transfer from sparse-reward environments (SearchQA) is better than from dense-reward ones (BabyAI).
- Sequential training allows agents to accumulate capabilities across environments with minimal forgetting, often matching or exceeding single-task performance.
- Failure mode analysis shows 'Confirmation Bias' (overconfidence without verification) is a persistent error pattern across all trained agents.