HotelQuEST: Balancing Quality and Efficiency in Agentic Search

📝 Paper Summary

Agentic RAG pipeline Benchmark

HotelQuEST is a benchmark of 214 hotel search queries, each paired with a ground-truth clarification. It evaluates agents on both solution quality and computational efficiency, revealing severe over-computation in current systems.

Core Problem

Existing agentic search benchmarks focus primarily on answer quality, neglecting critical efficiency constraints (cost, latency) and the challenge of underspecified user preferences common in real-world scenarios.

Why it matters:

High latency and cost make many high-performing agentic systems impractical for real-world commercial deployment
Standard benchmarks fail to capture how agents handle vague constraints (e.g., 'dog-friendly' implies different things to different users), leading to inaccurate relevance assessments
Current agents lack adaptive routing, applying expensive reasoning even to simple queries where lightweight retrieval would suffice

Concrete Example: For the query 'Hotel for a solo traveler,' the intent is underspecified. Without the hidden clarification (e.g., 'affordable hostels in safe neighborhoods'), an agent might retrieve luxury hotels. Standard evaluation misses this mismatch, while HotelQuEST uses the clarification to penalize the agent.

Key Novelty

HotelQuEST (Hotel Quality & Efficiency Search Testbed)

Introduces 'Clarifications': explicit statements of user intent for underspecified queries (e.g., defining 'dog-friendly' as 'no fee'), provided only to the evaluator (judge), not the agent
Jointly evaluates Quality (accuracy, factuality) and Efficiency (cost, tokens, latency) to identify trade-offs ignored by accuracy-only leaderboards
Proposes 'Budget Oracle' and 'Quality Oracle' metrics to establish theoretical upper bounds on how much efficiency can be gained by optimal model routing

Architecture

The iterative agentic workflow used for the baselines.

Evaluation Highlights

The 'Budget Oracle' achieves higher accuracy at $1 cost than the best agent (Sonnet 3.7) while costing 96x less ($1 vs $4.56)
Sonnet 3.7 achieves the highest accuracy (4.44/5.0) but is prohibitively expensive ($4.56 per query) compared to lightweight retrievers (approx $0)
Current agents are significantly inefficient: cost rises with query complexity without a proportional gain in accuracy relative to optimal routing

Breakthrough Assessment

8/10

Crucial contribution to practical agent deployment. By exposing the massive cost-inefficiency of current agents and introducing 'clarifications' for evaluation, it addresses major blind spots in existing benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Agentic information retrieval from a hotel catalog containing structured data and unstructured reviews

Inputs: Natural language query q (potentially underspecified)

Outputs: Ranked list of top-k hotels with grounded evidence justifying the selection

Pipeline Flow

Plan (select source: Descriptions, Reviews, Web)
Retrieve (execute search query)
Filter (prune results, update memory)
Repeat until k hotels found or max turns reached
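
The four steps above can be sketched as a short loop. This is a minimal illustration, not the benchmark's LangGraph implementation: `run_agent`, `Memory`, and the three callables (`plan`, `retrieve`, `keep`) are hypothetical names, and the toy catalog stands in for the real search tools.

```python
from dataclasses import dataclass, field

# The three information sources named in the Plan step.
SOURCES = ("descriptions", "reviews", "web")


@dataclass
class Memory:
    """Running state: hotels accepted so far and the number of turns used."""
    hotels: list = field(default_factory=list)
    turns: int = 0


def run_agent(query, plan, retrieve, keep, k=3, max_turns=5):
    """Plan, retrieve, and filter until k hotels are found or turns run out."""
    memory = Memory()
    while len(memory.hotels) < k and memory.turns < max_turns:
        source, search_query = plan(query, memory)   # Plan: pick a source
        candidates = retrieve(source, search_query)  # Retrieve: run the search
        for hotel in candidates:                     # Filter: prune, update memory
            if hotel not in memory.hotels and keep(query, hotel):
                memory.hotels.append(hotel)
        memory.turns += 1
    return memory.hotels[:k]


# Toy stand-ins (assumptions for illustration only).
def toy_plan(query, memory):
    # Cycle through the sources, reusing the user query as the search query.
    return SOURCES[memory.turns % len(SOURCES)], query


def toy_retrieve(source, search_query):
    catalog = {
        "descriptions": ["Hotel A", "Hotel B"],
        "reviews": ["Hotel B", "Hotel C"],
        "web": ["Hotel D"],
    }
    return catalog[source]


def toy_keep(query, hotel):
    # Pretend the filter judges "Hotel B" irrelevant to the query.
    return hotel != "Hotel B"


result = run_agent("Hotel for a solo traveler", toy_plan, toy_retrieve, toy_keep, k=2)
print(result)
```

In a real agent, `plan` and `keep` would be LLM calls, so each extra turn adds tokens, cost, and latency; the `max_turns` cap is the only stopping rule besides reaching k hotels.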

System Modules

Planner (Agent Core)

Selects information source and generates search queries based on current memory state

Model or implementation: Various LLMs (Claude Sonnet 3.7/4/Haiku, Qwen3-32B)

Retriever Tools

Executes the generated search query against the selected data source

Model or implementation: Search API / Tavily API

Filter & Update (Agent Core)

Prunes irrelevant results and updates the memory state with new findings

Model or implementation: Same LLM as Planner

Novel Architectural Elements

Use of 'Clarification' strictly for the evaluation judge (LLM-as-a-judge) to assess alignment with underspecified user intent, distinct from standard relevance judgments
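
The separation between what the agent sees and what the judge sees can be sketched as follows. The class, the function names, and the prompt wording are assumptions made for illustration; only the query and clarification text come from the example earlier in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    query: str          # shown to the agent
    clarification: str  # hidden intent, shown only to the judge


def agent_input(item):
    """The agent receives the raw query and nothing else."""
    return item.query


def judge_prompt(item, ranked_hotels):
    """Build a judge prompt that includes the hidden clarification.

    The wording is hypothetical; the paper's actual judge prompt may differ.
    """
    lines = [
        "Rate from 1 to 5 how well the results match the user's intent.",
        f"Query: {item.query}",
        f"Clarification (not shown to the agent): {item.clarification}",
        "Results:",
    ]
    lines += [f"{rank}. {hotel}" for rank, hotel in enumerate(ranked_hotels, start=1)]
    return "\n".join(lines)


item = BenchmarkItem(
    query="Hotel for a solo traveler",
    clarification="affordable hostels in safe neighborhoods",
)
print(agent_input(item))
print(judge_prompt(item, ["Hostel X", "Luxury Hotel Y"]))
```

Keeping the clarification out of `agent_input` is what lets the judge penalize an agent that returns luxury hotels for this query.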

Modeling

Base Model: Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3 Haiku, Qwen3-32B

Compute: Not reported in the paper (evaluation only, no training performed)

Comparison to Prior Work

vs. MTEB: Evaluates full agentic loop (tools + reasoning) and efficiency, not just embedding quality
vs. SimpleQA [not cited in paper]: Focuses specifically on commercial search with underspecified constraints rather than factoid QA
vs. existing Agent Benchmarks (GAIA, etc.): Explicitly measures cost/latency trade-offs and includes hidden user clarifications for evaluation

Limitations

Evaluation covers only the hotel domain, so results may not generalize to other search verticals
Relies on LLM-as-a-judge (Sonnet 4.5), which may have inherent biases despite high human agreement
Analysis focuses on specific commercial models (Claude), so results might change with open-weights models
Clarifications are written by annotators and are assumed to capture the 'true' user intent perfectly

Reproducibility

Code: https://github.com/amazon-science/hotel-quest-benchmark

The repository above is publicly available. The dataset includes 214 queries, clarifications, and complexity ratings; the code includes the LangGraph agent implementation.

📊 Experiments & Results

Evaluation Setup

Agentic hotel search over a corpus of ~1M hotel descriptions and ~21M reviews

Benchmarks:

HotelQuEST: Agentic Search (Retrieval + Reasoning) [New]

Metrics:

Accuracy (1-5 Likert scale)
Factuality (1-5 Likert scale)
Cost ($ per query)
Latency (seconds)
Total Tokens
Statistical methodology: Welch's t-test and Spearman correlation for query attribute analysis (p < 0.05)
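
For readers unfamiliar with the two statistical tools, the sketch below computes the Welch t statistic with its Welch-Satterthwaite degrees of freedom, and Spearman's rank correlation, using only the standard library. The sample numbers are made up for illustration and are not results from the paper. Turning the t statistic into a p-value needs a t-distribution CDF, which is omitted here; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.spearmanr(x, y)` return p-values directly.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up accuracy scores for two hypothetical groups of queries.
simple_queries = [4.0, 5.0, 4.0, 5.0]
complex_queries = [3.0, 4.0, 2.0, 3.0]
t_stat, dof = welch_t(simple_queries, complex_queries)

# Made-up complexity ratings against per-query cost.
rho = spearman([1, 2, 3, 4, 5], [0.5, 0.9, 0.8, 2.0, 3.5])
print(t_stat, dof, rho)
```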

Key Results

Performance comparison showing the trade-off between retrieval baselines (cheap, low accuracy) and agentic systems (expensive, high accuracy).
Benchmark	Metric	Baseline	This Paper	Δ
HotelQuEST	Accuracy (1-5)	1.27	4.44	+3.17
HotelQuEST	Cost ($)	0.00	4.56	+4.56
HotelQuEST	Accuracy (1-5)	3.25	4.44	+1.19

Oracle analysis demonstrating the potential for cost optimization via ideal model routing.
Benchmark	Metric	Baseline	This Paper	Δ
HotelQuEST	Accuracy (1-5)	4.44	4.51	+0.07

Experiment Figures

Accuracy vs. Budget curve for the Budget Oracle.

Impact of increasing maximum tool calls on cost, latency, and accuracy.

Main Takeaways

Agents achieve significantly higher accuracy than retrievers but suffer from extreme inefficiency, often over-investing compute in simple queries.
The 'Budget Oracle' demonstrates that intelligent routing could reduce costs by 96x while maintaining or exceeding state-of-the-art accuracy.
Agents frequently engage in redundant tool calls (e.g., repeated searches) without gaining new information, highlighting a lack of cost-aware stopping criteria.
Query complexity affects retrieval models significantly, but capable agents (Sonnet 3.7) are robust to complexity, primarily scaling their cost rather than dropping accuracy.

📚 Prerequisite Knowledge

Prerequisites

Agentic Search
Retrieval-Augmented Generation (RAG)
LLM-as-a-judge evaluation

Key Terms

Agentic Search: Search systems where an LLM agent iteratively plans, executes tool calls (search/filter), and synthesizes answers rather than just retrieving documents

Underspecified queries: Queries where user preferences are vague or implicit (e.g., 'good vibe'), requiring the system to infer intent or general norms

Clarification: A hidden ground-truth note written by the query author explaining their specific intent, used by the judge to evaluate relevance but not shown to the agent

Budget Oracle: A theoretical upper bound metric that selects the best model combination to maximize accuracy under a fixed total monetary budget
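
One way to read the Budget Oracle definition is as a multiple-choice knapsack: pick exactly one system per query so that total accuracy is maximal while total cost stays within the budget. The sketch below follows that reading, which is an assumption about the paper's formulation. The function name, the system labels, and all costs and scores are hypothetical.

```python
def budget_oracle(options, budget_cents):
    """Pick one system per query to maximize total accuracy within a budget.

    options[q] is a list of (system, cost_cents, accuracy) tuples for query q.
    Returns (total accuracy, chosen systems), or None if no choice fits.
    """
    # Maps cents spent so far -> best (total accuracy, systems chosen).
    states = {0: (0.0, [])}
    for per_query in options:
        next_states = {}
        for spent, (acc, picks) in states.items():
            for system, cost, score in per_query:
                total = spent + cost
                if total > budget_cents:
                    continue  # this choice would exceed the budget
                candidate = (acc + score, picks + [system])
                if total not in next_states or candidate[0] > next_states[total][0]:
                    next_states[total] = candidate
        states = next_states
    if not states:
        return None
    return max(states.values(), key=lambda state: state[0])


# Made-up numbers: a free retriever and an agent costing 400 cents per query.
toy_options = [
    [("retriever", 0, 4.0), ("agent", 400, 4.5)],  # easy query, small gain
    [("retriever", 0, 1.0), ("agent", 400, 4.0)],  # hard query, large gain
]
print(budget_oracle(toy_options, budget_cents=400))
```

With a 400-cent budget, the toy oracle spends the agent call on the second query, where it adds the most accuracy, and uses the retriever for the first.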

Quality Oracle: A theoretical upper bound metric that selects the cheapest model capable of achieving the highest possible accuracy for each specific query
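
Under the definition above, the Quality Oracle can be sketched as a per-query selection: find the best accuracy any system reaches on that query, then take the cheapest system that reaches it. The function name, system labels, and numbers below are hypothetical and serve only to illustrate the selection rule.

```python
def quality_oracle(options):
    """For each query, pick the cheapest system that reaches the top accuracy.

    options[q] is a list of (system, cost_cents, accuracy) tuples for query q.
    Returns one chosen tuple per query.
    """
    picks = []
    for per_query in options:
        top = max(score for _, _, score in per_query)
        best = [option for option in per_query if option[2] == top]
        picks.append(min(best, key=lambda option: option[1]))
    return picks


# Made-up numbers: on the first query the free retriever ties the agent.
toy_queries = [
    [("retriever", 0, 5.0), ("agent", 400, 5.0)],
    [("retriever", 0, 2.0), ("agent", 400, 4.0)],
]
chosen = quality_oracle(toy_queries)
total_cost = sum(cost for _, cost, _ in chosen)
mean_accuracy = sum(score for _, _, score in chosen) / len(chosen)
print(chosen, total_cost, mean_accuracy)
```

In this toy case the oracle keeps the highest accuracy on both queries while paying for the agent only once, which is the kind of saving that per-query routing aims for.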

LangGraph: A library for building stateful, multi-actor applications with LLMs, used here to orchestrate the agent's plan-retrieve-filter loop

Reranker: A model that rescores the initial set of retrieved documents to improve precision before the final answer generation

BM25: A probabilistic retrieval function based on term frequency and inverse document frequency, used as a baseline sparse retriever