Evaluation Setup
The system is evaluated on five diverse agent workloads, measuring cost, latency, and accuracy.
Benchmarks:
- Five diverse agent workloads (e.g., coding and web navigation, as inferred from the Introduction; specific dataset names are not listed in the available text)
Metrics:
- Cost (USD/tokens)
- Latency (time)
- Accuracy (success rate)
- Statistical methodology: Not explicitly reported in the paper
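The summary does not say how these metrics are aggregated. One plausible reading, sketched here, treats cost and latency reduction as relative savings against a baseline and performance retention as the ratio of accuracies, each in percent. The function names and sample numbers are illustrative assumptions, not values or definitions from the paper.

```python
def reduction_pct(baseline: float, system: float) -> float:
    """Relative saving versus the baseline, in percent (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - system) / baseline


def retention_pct(baseline_acc: float, system_acc: float) -> float:
    """Share of baseline accuracy that the system keeps, in percent (assumed definition)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * system_acc / baseline_acc


if __name__ == "__main__":
    # Hypothetical measurements for a single workload, not data from the paper.
    print(f"cost reduction:        {reduction_pct(2.00, 1.00):.2f}%")
    print(f"latency reduction:     {reduction_pct(40.0, 30.0):.2f}%")
    print(f"performance retention: {retention_pct(0.80, 0.76):.2f}%")
```

Under this reading, a baseline scores 0% on both reduction metrics and 100% on retention by construction.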
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 5 workloads | Cost Reduction | 0.00 | 50.31 | -50.31 |
| Average across 5 workloads | Latency Reduction | 0.00 | 27.28 | -27.28 |
| Average across 5 workloads | Performance Retention | 100.00 | 96.61 | -3.39 |
Main Takeaways
- Query-based similarity matching (standard semantic caching) is sub-optimal for agents: context-specific details in the query cause many false positives and false negatives.
- Small planner LMs struggle with long-context raw execution logs; structured 'plan templates' are necessary for effective reuse.
- The system effectively separates core intent from dynamic context, enabling reuse where traditional caching fails.
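The contrast between query-level matching and intent-level reuse can be illustrated with a toy sketch. It is not the paper's implementation: `extract_intent` is a hypothetical regex stand-in for whatever intent extraction the system uses, `PlanTemplateCache` is an invented name, and `difflib.SequenceMatcher` stands in for an embedding-based similarity score.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def query_similarity(a: str, b: str) -> float:
    """Surface-level similarity of two raw queries (stand-in for semantic caching)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_intent(query: str) -> str:
    """Toy intent key: mask quoted strings and numbers, keep the remaining wording."""
    masked = re.sub(r"'[^']*'|\"[^\"]*\"", "<ARG>", query.lower())
    masked = re.sub(r"\d+(\.\d+)?", "<NUM>", masked)
    return " ".join(masked.split())


class PlanTemplateCache:
    """Stores plan templates under the intent key instead of the raw query."""

    def __init__(self) -> None:
        self._templates: Dict[str, List[str]] = {}

    def put(self, query: str, plan_template: List[str]) -> None:
        self._templates[extract_intent(query)] = plan_template

    def get(self, query: str) -> Optional[List[str]]:
        return self._templates.get(extract_intent(query))


if __name__ == "__main__":
    cache = PlanTemplateCache()
    cache.put(
        "Find flights under 300 dollars to 'Paris'",
        ["open search page", "enter destination <ARG>",
         "filter price below <NUM>", "return top result"],
    )
    # Same intent, different dynamic details: the template is reused.
    print(cache.get("Find flights under 1250 dollars to 'San Francisco'"))
    # Similar wording, different intent: raw-query similarity is high,
    # but the intent key differs, so the lookup misses.
    print(query_similarity("delete the file report.txt", "create the file report.txt"))
    print(cache.get("Cancel flights under 300 dollars to 'Paris'"))
```

Keying on the masked intent lets queries that differ only in their arguments share one template, while queries that look alike on the surface but ask for a different action do not collide.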