Evaluation Setup
Comparison against baselines on a collected dataset of 1233 real-world urban itineraries from 4 Chinese cities.
Benchmarks:
- Real-world Urban Itinerary Dataset (Itinerary Generation) [New]
Metrics:
- Recall Rate (RR)
- Average Margin (AM - spatial deviation from TSP)
- Overlaps (OL - route intersections)
- Fail Rate (FR - hallucinated POIs)
- LLM-evaluated Match/Quality
- Statistical methodology: Not explicitly reported in the paper
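The rule-based metrics above can be illustrated with a minimal sketch. The summary does not give the paper's exact formulas, so every definition here is an assumption made for illustration: recall as the share of reference POIs recovered, fail rate as the share of generated POIs missing from the database, margin as the extra length over the shortest visiting order (brute-forced, so small inputs only), and overlaps as the count of crossing route segments. All function names are hypothetical.

```python
from itertools import permutations
from math import dist


def route_length(points):
    """Total length of an open path that visits the points in order."""
    return sum(dist(a, b) for a, b in zip(points, points[1:]))


def margin(points):
    """Assumed AM for one route: its length minus the shortest ordering's length."""
    best = min(route_length(p) for p in permutations(points))
    return route_length(points) - best


def _ccw(a, b, c):
    """Signed area test: positive if a -> b -> c turns counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly intersects segment p3-p4."""
    return (_ccw(p3, p4, p1) * _ccw(p3, p4, p2) < 0
            and _ccw(p1, p2, p3) * _ccw(p1, p2, p4) < 0)


def overlaps(points):
    """Assumed OL: number of pairs of non-adjacent route segments that cross."""
    segs = list(zip(points, points[1:]))
    return sum(
        _segments_cross(*segs[i], *segs[j])
        for i in range(len(segs))
        for j in range(i + 2, len(segs))
    )


def fail_rate(itinerary, database):
    """Assumed FR: share of generated POI names that are not in the database."""
    return sum(name not in database for name in itinerary) / len(itinerary)


def recall_rate(itinerary, reference):
    """Assumed RR: share of reference POIs that appear in the generated itinerary."""
    return len(set(itinerary) & set(reference)) / len(reference)


# A zig-zag over the corners of a unit square: the first and last legs cross.
zigzag = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(overlaps(zigzag), round(margin(zigzag), 3))
```

Averaging `margin` over all evaluated itineraries would give a dataset-level figure under these assumptions; a real evaluation would use geographic distances rather than planar coordinates.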
Key Results
ItiNera demonstrates superior performance in both rule-based metrics (spatial accuracy, recall) and LLM-based metrics (quality, match) compared to purely LLM or purely optimization-based baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Real-world Dataset | Rule-based metrics (Recall, etc.) improvement | 0 | 30 | +30 |
| Real-world Dataset | Spatial Efficiency (Distance Margin vs TSP) | 0 | 100 | +100 |
| Real-world Dataset | Match (LLM-eval) | Qualitative Lower | Qualitative Higher | Positive |
Main Takeaways
- Delegating route ordering to a mathematical TSP solver avoids the 'spaghetti routing' problem common in pure LLM planners.
- The 'User-owned POI Database' approach effectively eliminates hallucinated POIs (measured by Fail Rate), whereas pure LLMs invent venues that do not exist.
- Decomposing user requests into positive/negative embedding queries significantly improves the alignment (Match) of the retrieved POIs with user intent.
- Ablation studies show that removing the Cluster-aware Spatial Optimization (CSO) module forces the LLM to do routing, which degrades spatial coherence metrics.
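The positive/negative query decomposition mentioned in the takeaways can be sketched as follows. This is a hedged illustration only: the summary does not say how the two query sets are combined, so scoring a POI as its summed cosine similarity to positive queries minus its summed similarity to negative queries is an assumption, and the toy two-dimensional vectors, POI names, and function names are all hypothetical stand-ins for real text embeddings.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score(poi_vec, positive, negative):
    """Assumed rule: reward similarity to wanted traits, penalise unwanted ones."""
    wanted = sum(cosine(poi_vec, q) for q in positive)
    unwanted = sum(cosine(poi_vec, q) for q in negative)
    return wanted - unwanted


def retrieve(pois, positive, negative, k):
    """Return the names of the k highest-scoring POIs."""
    ranked = sorted(
        pois,
        key=lambda name: score(pois[name], positive, negative),
        reverse=True,
    )
    return ranked[:k]


# Toy embedding space: axis 0 ~ "quiet", axis 1 ~ "nightlife".
pois = {
    "tea_house": (1.0, 0.0),
    "night_club": (0.0, 1.0),
    "bookshop_bar": (1.0, 1.0),
}
# A request such as "somewhere quiet, no nightlife" split into two query sets.
positive_queries = [(1.0, 0.0)]
negative_queries = [(0.0, 1.0)]

print(retrieve(pois, positive_queries, negative_queries, k=2))
```

In this toy setup the venue that matches only the wanted trait ranks first, the mixed venue second, and the venue matching only the unwanted trait is excluded, which is the kind of intent alignment the Match metric is meant to capture.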