Generating Query Recommendations via LLMs

📝 Paper Summary

Query Recommendation, Generative Information Retrieval

GQR (Generative Query Recommendation) uses large language models to generate relevant search query suggestions without relying on historical query logs, outperforming traditional log-based commercial systems.

Core Problem

Existing query recommendation systems rely heavily on massive, private query logs to find patterns, making them ineffective for rare (long-tail) queries or for cold-start scenarios where no logs exist.

Why it matters:

Long-tail queries (rare searches) make up a huge portion of search traffic but have insufficient historical data for traditional log-based recommenders.
Query logs are proprietary and privacy-sensitive, preventing many researchers and smaller organizations from building effective recommendation systems.
Commercial systems often fail to generate any suggestions for rare inputs, leaving users without guidance.

Concrete Example: For a rare query appearing only once or twice in logs (e.g., specific long-tail searches in AOL data), commercial systems like 'System 1' and 'System 2' fail to generate suggestions 9-17% of the time. GQR (GPT-3) successfully generates 6 recommendations 100% of the time for these same queries.
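
The failure rate quoted above is the share of queries for which a system returns no suggestions at all. A minimal sketch of that computation follows; the function name `failure_rate` and the input layout (a mapping from query to its list of suggestions) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List


def failure_rate(outputs: Dict[str, List[str]]) -> float:
    """Share of queries for which a system returned zero suggestions."""
    if not outputs:
        return 0.0
    failures = sum(1 for suggestions in outputs.values() if not suggestions)
    return failures / len(outputs)


def coverage(outputs: Dict[str, List[str]]) -> float:
    """Complement of the failure rate: share of queries with >= 1 suggestion."""
    return 1.0 - failure_rate(outputs) if outputs else 0.0
```

Under this definition, a system that fails on 9-17% of tail queries has a coverage of 83-91%, while a system that always returns suggestions has a coverage of 100%.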

Key Novelty

Generative Query Recommendation (GQR)

Replaces the traditional log-mining paradigm with a generative paradigm using Large Language Models (LLMs) like GPT-3.
Uses few-shot prompting to instruct the LLM to generate diverse and disambiguated query variations based solely on the input query, without needing a historical database.

Evaluation Highlights

Outperforms commercial 'System 2' by +23% (Robust04) and +27% (ClueWeb09B) in NDCG@10 when suggestions are used for query expansion.
Achieves ~59% user preference in a human evaluation study, significantly beating two commercial competitors (System 1 at ~26%, System 2 at ~15%).
Reaches a 100% success rate in generating recommendations for long-tail/rare queries, whereas commercial baselines fail up to 17% of the time.

Breakthrough Assessment

7/10

Strong practical contribution demonstrating that LLMs can completely replace log-based systems for query recommendation, with superior performance on rare queries. However, the method relies on standard prompting rather than novel architectural changes.

⚙️ Technical Details

Problem Definition

Setting: Query Suggestion / Recommendation

Inputs: User query q

Outputs: List of recommended queries [q_1, q_2, ..., q_k]

Pipeline Flow

Prompt Construction (Instruction + Examples + Input Query)
LLM Generation (Generates list of suggestions)
Parsing (Extracts individual queries from generated text)
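
The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the instruction wording, the single few-shot example, the numbered-list output format, and the names `build_prompt`, `parse_suggestions`, and `recommend` are all assumptions. The LLM call is passed in as a plain callable so that no specific API is implied. The default of six suggestions mirrors the six recommendations mentioned in the concrete example.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical instruction and example; the paper's actual prompt is not reproduced here.
INSTRUCTION = "Suggest related search queries that refine or disambiguate the input query."
FEW_SHOT_EXAMPLES: List[Tuple[str, List[str]]] = [
    ("jaguar", ["jaguar car price", "jaguar animal habitat", "jaguar football team"]),
]


def build_prompt(query: str,
                 examples: Sequence[Tuple[str, Sequence[str]]] = FEW_SHOT_EXAMPLES,
                 instruction: str = INSTRUCTION) -> str:
    """Step 1: assemble instruction + worked examples + the new input query."""
    parts = [instruction, ""]
    for seed, suggestions in examples:
        parts.append(f"Query: {seed}")
        parts.append("Suggestions:")
        parts.extend(f"{i}. {s}" for i, s in enumerate(suggestions, start=1))
        parts.append("")
    parts.append(f"Query: {query}")
    parts.append("Suggestions:")
    return "\n".join(parts)


def parse_suggestions(generated: str, k: int = 6) -> List[str]:
    """Step 3: extract up to k distinct queries from numbered or bulleted lines."""
    seen, out = set(), []
    for line in generated.splitlines():
        # Drop a leading list marker such as "1.", "2)", "-", "*" or "•".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            out.append(text)
        if len(out) == k:
            break
    return out


def recommend(query: str, generate: Callable[[str], str], k: int = 6) -> List[str]:
    """Steps 1-3 chained; `generate` (step 2) is any function mapping prompt -> text."""
    return parse_suggestions(generate(build_prompt(query)), k)
```

Here `generate` would wrap whichever model is used (GPT-3 or Bloom in the paper). Keeping it as a parameter also makes the parser testable with a stub that returns canned text.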

System Modules

Prompt Constructor

Creates the context for the LLM

Model or implementation: Rule-based

Generator

Generates query recommendations

Model or implementation: GPT-3 (text-davinci-003) or Bloom

Modeling

Base Model: GPT-3 (text-davinci-003) and Bloom

Compute: Not reported in the paper

Comparison to Prior Work

vs. Commercial Systems: GQR requires no query logs, always returns suggestions for long-tail queries (100% coverage vs. 83-91% for the commercial systems), and produces more engaging results according to users.
vs. Log-based methods: GQR is purely generative, so it does not suffer from cold-start problems, but it can hallucinate plausible-sounding suggestions that correspond to no real search trend.
vs. Neural Query Generation [not cited in paper]: Unlike doc2query methods which generate queries from documents, GQR generates suggestions directly from a seed query.

Limitations

Relies on proprietary LLMs (GPT-3) for best performance; the open-source Bloom model performed significantly worse.
Subjectivity in user preference studies (engagingness is subjective).
No latency analysis provided; LLM inference is typically slower than looking up logs in a hash table.

Reproducibility

No specific code repository is provided in the text. The method relies on commercial APIs (GPT-3) or open models (Bloom). The prompt structure is described generally (instruction + examples).

📊 Experiments & Results

Evaluation Setup

Query Suggestion Quality & Downstream Retrieval Performance

Benchmarks:

Robust04 (Ad-hoc Retrieval)
ClueWeb09B (Web Search Retrieval)
AOL Query Log (Query Suggestion, tail queries)

Metrics:

SCS (Simplified Clarity Score)
NDCG@10 (Retrieval effectiveness when suggestions are added)
User Preference (Human evaluation)
Statistical methodology: Not explicitly reported in the paper
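
The two automatic metrics can be sketched in a few lines. This is an illustrative sketch, not the paper's evaluation code. It assumes the standard definition of the Simplified Clarity Score, SCS(q) = Σ_w P(w|q) · log2(P(w|q) / P_coll(w)), with maximum-likelihood estimates, and a linear-gain DCG with a log2 rank discount (some tools use 2^gain − 1 instead). The function names, the argument layout, and the handling of unseen terms are assumptions.

```python
import math
from collections import Counter
from typing import Mapping, Sequence


def simplified_clarity_score(query_terms: Sequence[str],
                             coll_tf: Mapping[str, int],
                             coll_tokens: int) -> float:
    """SCS: divergence of the query term distribution from the collection's."""
    counts = Counter(query_terms)
    qlen = len(query_terms)
    score = 0.0
    for term, qtf in counts.items():
        p_query = qtf / qlen
        # Assumption: terms unseen in the collection are given a count of 1
        # so that the ratio below stays finite.
        p_coll = max(coll_tf.get(term, 0), 1) / coll_tokens
        score += p_query * math.log2(p_query / p_coll)
    return score


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: Sequence[float],
              judged_gains: Sequence[float],
              k: int = 10) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance grades of the retrieved documents, in rank order.
    judged_gains: relevance grades of all judged documents for the topic.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

With these definitions, a query made of rarer terms receives a higher SCS than one made of common terms, which matches the reading of SCS as a measure of specificity.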

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Substitution protocol: recommendations are issued in place of the original query. GQR (GPT-3) shows superior retrieval performance.
Robust04	NDCG@10 improvement (%)	Not reported in the paper	Not reported in the paper	+23%
ClueWeb09B	NDCG@10 improvement (%)	Not reported in the paper	Not reported in the paper	+27%
Concatenation protocol: recommendations are appended to the original query as expansions. Adding GQR suggestions improves retrieval effectiveness.
Robust04	NDCG@10	0.4341	0.4601	+0.0260
ClueWeb09B	NDCG@10	0.1951	0.2048	+0.0097
User study: human annotators preferred GQR suggestions over commercial systems.
AOL Queries	User preference (%)	26.21 (System 1)	59.19	+32.98
Coverage analysis: on rare queries, GQR is more robust than log-based systems.
AOL Tail Queries	Failure rate (%, 0 suggestions)	17.0	0.0	-17.0
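
The two retrieval protocols differ only in which query string is sent to the retriever. The sketch below illustrates that difference and converts the absolute NDCG@10 values of the concatenation rows into relative gains, since the substitution rows report only percentage improvements. The function names and the use of a single space to join query and suggestion are assumptions; the paper's exact expansion procedure is not described in this summary.

```python
from typing import List


def substitution_queries(original: str, suggestions: List[str]) -> List[str]:
    """Substitution: each suggestion is issued on its own, replacing the original."""
    return list(suggestions)


def concatenation_queries(original: str, suggestions: List[str]) -> List[str]:
    """Concatenation: each suggestion is appended to the original as an expansion."""
    return [f"{original} {s}" for s in suggestions]


def relative_gain(baseline: float, system: float) -> float:
    """Relative improvement of `system` over `baseline` (0.05 means +5%)."""
    return (system - baseline) / baseline


# NDCG@10 values taken from the concatenation rows of the table above.
robust04 = relative_gain(0.4341, 0.4601)
clueweb09b = relative_gain(0.1951, 0.2048)
print(f"Robust04: {robust04:+.1%}, ClueWeb09B: {clueweb09b:+.1%}")
```

Read this way, the concatenation gains are roughly 6% on Robust04 and 5% on ClueWeb09B relative to the baseline values in the table.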

Main Takeaways

GQR (GPT-3) generates query recommendations that are less ambiguous (higher SCS) and more useful for retrieval than commercial log-based systems.
The system excels at the 'long tail': it never fails to generate suggestions for rare queries, whereas commercial systems often return nothing.
Prompt engineering robustness: Neither the number of examples in the prompt (1 vs 10) nor the specific choice of examples has a statistically significant effect on performance.
Users find generative suggestions more engaging (59% preference) than those from standard search engines.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Information Retrieval metrics (NDCG, Clarity Score)
Basic knowledge of Large Language Models and prompting
Familiarity with Query Expansion techniques

Key Terms

SCS: Simplified Clarity Score—a metric measuring the lack of ambiguity in a query; higher scores indicate clearer, more specific queries

NDCG@10: Normalized Discounted Cumulative Gain at rank 10—a measure of ranking quality that considers the position of relevant items

Query Expansion: The process of reformulating a seed query to improve retrieval performance, often by appending related terms or suggestions

Long tail queries: Search queries that are very rare or unique, having little to no historical frequency in query logs

Few-shot prompting: Providing a model with a small number of example input-output pairs in the prompt to guide its generation

AOL query log: A widely used public dataset of real user search queries released by AOL in 2006

Robust04: A standard TREC information retrieval test collection consisting of news articles and topics

ClueWeb09B: A large web crawl dataset used for evaluating web search performance