
HotelQuEST: Balancing Quality and Efficiency in Agentic Search

Guy Hadad, Shadi Iskander, Oren Kalinsky, Sofia Tolmach, Ran Levy, Haggai Roitman
Ben-Gurion University, Amazon
arXiv (2026)
Agent Benchmark RAG Factuality

📝 Paper Summary

Agentic RAG pipeline Benchmark
HotelQuEST is a benchmark of 214 hotel search queries with ground-truth clarifications. It evaluates agents on both solution quality and computational efficiency, and reveals severe over-computation in current systems.
Core Problem
Existing agentic search benchmarks focus primarily on answer quality, neglecting critical efficiency constraints (cost, latency) and the challenge of underspecified user preferences common in real-world scenarios.
Why it matters:
  • High latency and cost make many high-performing agentic systems impractical for real-world commercial deployment
  • Standard benchmarks fail to capture how agents handle vague constraints (e.g., 'dog-friendly' means different things to different users), leading to inaccurate relevance assessments
  • Current agents lack adaptive routing, applying expensive reasoning even to simple queries where lightweight retrieval would suffice
Concrete Example: For the query 'Hotel for a solo traveler,' the intent is underspecified. Without the hidden clarification (e.g., 'affordable hostels in safe neighborhoods'), an agent might retrieve luxury hotels. Standard evaluation misses this mismatch, while HotelQuEST uses the clarification to penalize the agent.
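The separation described in this example can be sketched in a few lines. This is a minimal illustration and not the benchmark's actual code: the names `BenchmarkItem`, `agent_input`, and `judge_prompt`, the prompt wording, and the 1-5 rating scale are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical record: one query plus its hidden clarification."""
    query: str          # the text the agent receives
    clarification: str  # intended meaning, shown to the judge only


def agent_input(item: BenchmarkItem) -> str:
    """Build the agent's input: the raw query, never the clarification."""
    return item.query


def judge_prompt(item: BenchmarkItem, agent_answer: str) -> str:
    """Build the judge's prompt, which does include the clarification."""
    return (
        f"Query: {item.query}\n"
        f"Intended meaning: {item.clarification}\n"
        f"Agent answer: {agent_answer}\n"
        # The 1-5 scale is an assumption for this sketch.
        "Rate how well the answer satisfies the intended meaning (1-5)."
    )


item = BenchmarkItem(
    query="Hotel for a solo traveler",
    clarification="affordable hostels in safe neighborhoods",
)
print(agent_input(item))
print(judge_prompt(item, "A five-star luxury resort"))
```

Because only the judge sees the clarification, an answer listing luxury hotels can be scored as a mismatch even though it looks plausible for the bare query.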
Key Novelty
HotelQuEST (Hotel Quality & Efficiency Search Testbed)
  • Introduces 'Clarifications': explicit statements of user intent for underspecified queries (e.g., defining 'dog-friendly' as 'no fee'), provided only to the evaluator (judge), not the agent
  • Jointly evaluates Quality (accuracy, factuality) and Efficiency (cost, tokens, latency) to identify trade-offs ignored by accuracy-only leaderboards
  • Proposes 'Budget Oracle' and 'Quality Oracle' metrics to establish theoretical upper bounds on how much efficiency can be gained by optimal model routing
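The summary does not state exactly how the two oracles are defined, so the sketch below is one plausible reading rather than the paper's method. It assumes the 'Quality Oracle' picks, per query, the highest-scoring system (cheapest on ties), and the 'Budget Oracle' picks one system per query to maximize total score under a total cost budget. The function names, the data layout, and the use of integer cents are all hypothetical.

```python
from typing import Dict, Tuple

# results[query_id][system_name] = (cost_in_cents, score); layout is assumed.
Results = Dict[str, Dict[str, Tuple[int, float]]]


def quality_oracle(results: Results) -> Dict[str, str]:
    """Per query, choose the best-scoring system; break ties by lower cost."""
    return {
        qid: min(systems, key=lambda s: (-systems[s][1], systems[s][0]))
        for qid, systems in results.items()
    }


def budget_oracle(results: Results, budget_cents: int) -> Dict[str, str]:
    """Choose one system per query to maximize total score within a budget.

    Exact dynamic program over total spend (a multiple-choice knapsack).
    """
    # best[spend] = (total_score, routing) over the queries processed so far
    best: Dict[int, Tuple[float, Dict[str, str]]] = {0: (0.0, {})}
    for qid, systems in results.items():
        nxt: Dict[int, Tuple[float, Dict[str, str]]] = {}
        for spend, (score, routing) in best.items():
            for name, (cost, s) in systems.items():
                total = spend + cost
                if total > budget_cents:
                    continue  # this choice would exceed the budget
                cand = score + s
                if total not in nxt or cand > nxt[total][0]:
                    nxt[total] = (cand, {**routing, qid: name})
        if not nxt:
            raise ValueError("no routing fits within the budget")
        best = nxt
    # Highest total score wins; among equal scores, prefer the lower spend.
    _, (_, routing) = max(best.items(), key=lambda kv: (kv[1][0], -kv[0]))
    return routing


# Toy numbers, invented for illustration only.
toy: Results = {
    "q1": {"retriever": (0, 4.5), "agent": (50, 4.5)},
    "q2": {"retriever": (0, 2.0), "agent": (50, 4.5)},
}
print(quality_oracle(toy))
print(budget_oracle(toy, budget_cents=50))
```

On the toy data both oracles send the easy query to the retriever and reserve the agent for the query where it actually helps, which is the kind of routing gain such upper bounds are meant to quantify.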
Architecture
Figure 6: The iterative agentic workflow used for the baselines.
Evaluation Highlights
  • The 'Budget Oracle' achieves higher accuracy than the best agent (Sonnet 3.7) at a cost of $1 versus $4.56, a reported 96x reduction (the two quoted dollar figures alone imply about 4.6x)
  • Sonnet 3.7 achieves the highest accuracy (4.44/5.0) but is prohibitively expensive ($4.56 per query) compared to lightweight retrievers (approximately $0)
  • Current agents are significantly inefficient: their cost rises on complex queries without a proportional gain in accuracy, relative to what optimal routing would achieve
Breakthrough Assessment
8/10
Crucial contribution to practical agent deployment. By exposing the massive cost-inefficiency of current agents and introducing 'clarifications' for evaluation, it addresses major blind spots in existing benchmarks.