Evaluation Setup
Travel planning with 6 tools (Flight, Hotel, Restaurant, Attraction, City, GoogleDistance) and a Notebook tool
Benchmarks:
- TravelPlanner (Constraint-satisfaction planning) [New]
Metrics:
- Delivery Rate (did the agent produce a plan?)
- Commonsense Constraint Pass Rate
- Hard Constraint Pass Rate
- Final Pass Rate (feasible plan meeting ALL constraints)
- Statistical methodology: Not explicitly reported in the paper
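The four metrics above can be sketched as simple aggregations over per-query results. This is a minimal illustration, not the benchmark's own evaluation code: the `PlanResult` record and all function names are hypothetical. It assumes that the "micro" rate counts passed checks over all checks across plans, and that an undelivered plan is recorded with every check failed.

```python
from dataclasses import dataclass, field


@dataclass
class PlanResult:
    """Hypothetical per-query record; not the benchmark's own data format."""
    delivered: bool                                   # did the agent produce a plan?
    commonsense: list = field(default_factory=list)   # one bool per commonsense check
    hard: list = field(default_factory=list)          # one bool per hard-constraint check


def delivery_rate(results):
    """Fraction of queries for which a plan was produced."""
    return sum(r.delivered for r in results) / len(results)


def micro_pass_rate(results, kind):
    """Passed checks over all checks of the given kind, pooled across plans."""
    checks = [ok for r in results for ok in getattr(r, kind)]
    return sum(checks) / len(checks)


def final_pass_rate(results):
    """Fraction of queries whose plan was delivered and meets ALL constraints."""
    passed = [r.delivered and all(r.commonsense) and all(r.hard) for r in results]
    return sum(passed) / len(results)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    sample = [
        PlanResult(True, [True, True], [True]),
        PlanResult(True, [True, False], [True]),
        PlanResult(False, [False, False], [False]),
        PlanResult(True, [True, True], [False]),
    ]
    print("delivery:", 100 * delivery_rate(sample))
    print("commonsense (micro):", 100 * micro_pass_rate(sample, "commonsense"))
    print("hard (micro):", 100 * micro_pass_rate(sample, "hard"))
    print("final pass:", 100 * final_pass_rate(sample))
```

Note that the final pass rate is the strictest of the four: a single failed check of either kind disqualifies the whole plan, which is why it can sit near zero even when the micro rates are moderate.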
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TravelPlanner (Test Set) | Final Pass Rate | 0.0 | 0.6 | +0.6 |
| TravelPlanner (Test Set) | Delivery Rate | 12.8 | 39.5 | +26.7 |
| TravelPlanner (Test Set) | Commonsense Constraint Pass Rate (Micro) | 41.6 | 63.0 | +21.4 |
| TravelPlanner (Validation Set) | Final Pass Rate | 1.1 | 2.8 | +1.7 |
| TravelPlanner (Test Set - Sole-planning) | Final Pass Rate | 0.6 | 4.4 | +3.8 |
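The last column of the table is the difference between the "This Paper" and "Baseline" columns. As a quick sanity check, the short sketch below recomputes it from the values copied out of the table; the `delta` helper and the `rows` layout are illustrative names only.

```python
# Rows copied from the table: (setting, metric, baseline, this_paper, reported_delta)
rows = [
    ("Test Set", "Final Pass Rate", 0.0, 0.6, 0.6),
    ("Test Set", "Delivery Rate", 12.8, 39.5, 26.7),
    ("Test Set", "Commonsense Constraint Pass Rate (Micro)", 41.6, 63.0, 21.4),
    ("Validation Set", "Final Pass Rate", 1.1, 2.8, 1.7),
    ("Test Set - Sole-planning", "Final Pass Rate", 0.6, 4.4, 3.8),
]


def delta(baseline, this_paper):
    """Difference between the two columns, rounded to one decimal to absorb float noise."""
    return round(this_paper - baseline, 1)


if __name__ == "__main__":
    for setting, metric, baseline, this_paper, reported in rows:
        status = "ok" if delta(baseline, this_paper) == reported else "MISMATCH"
        print(f"{setting} | {metric}: {delta(baseline, this_paper):+.1f} ({status})")
```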
Main Takeaways
- State-of-the-art LLMs, including GPT-4, cannot yet plan reliably in complex real-world scenarios: the final pass rate stays below 1%.
- Existing strategies like ReAct and Reflexion do not solve the core difficulty of handling multiple interdependent constraints.
- Primary failure modes: failing to collect the correct information (tool-use errors), losing track of constraints (context limits or reasoning failures), and hallucination.
- Sole-planning mode, where the required information is provided up front, brings only marginal improvement (a final pass rate of 4.4 on the test set). This suggests that the core reasoning itself, not just tool use, struggles with multi-constraint satisfaction.