Generative Product Recommendations for Implicit Superlative Queries

📝 Paper Summary

E-commerce Search, LLM-based Ranking

The paper introduces SUPERB, a dataset and four-level relevance schema for 'best X' queries, and demonstrates that LLM-based listwise ranking significantly outperforms traditional retrieval for identifying the best products.

Core Problem

Traditional retrieval systems struggle with 'implicit superlative queries' (e.g., 'best shoes for marathons') because they rely on explicit keyword matching rather than inferring complex, subjective attributes like durability or safety.

Why it matters:

Users frequently search for the 'best' products using vague terms, leading to query-product mismatches in standard systems
Existing relevance labels (like ESCI) capture objective relevance (Exact/Substitute) but fail to capture the subjective quality or superiority required for superlative queries

Concrete Example: For the query 'best toy for a 3 year old girl', a standard system might return any toy matching the keywords. However, a superlative system must infer implicit attributes like 'ASTM F963 safety standards', 'non-toxic materials', and 'engaging colors' to recommend the actual best options.

Key Novelty

SUPERB (Superlatives with Best relevance annotations)

Proposes a four-point relevance schema (Overall Best, Almost Best, Relevant But Not Best, Not Relevant) specifically designed to distinguish top-tier products from merely relevant ones
Introduces 'Deliberated Prompting' for ranking: forcing the LLM to first generate implicit product attributes (reasoning) before assigning a relevance label to reduce bias and improve accuracy
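
The two-step idea behind Deliberated Prompting can be sketched as follows. This is a minimal illustration, not the authors' prompts (the summary notes those are in the paper's Appendix): the prompt wording, the `deliberated_label` function, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable, List

# The four labels of the SUPERB schema, from most to least relevant.
LABELS: List[str] = [
    "Overall Best",
    "Almost Best",
    "Relevant But Not Best",
    "Not Relevant",
]


def deliberated_label(query: str, product: str, llm: Callable[[str], str]) -> str:
    """Label one product in two steps: elicit attributes, then ask for a label."""
    # Step 1 (reasoning): ask which implicit attributes define "best" here.
    # The prompt text is hypothetical, not taken from the paper.
    attributes = llm(
        f"Query: {query}\n"
        "List the implicit attributes a product needs to be the best "
        "answer to this query."
    )
    # Step 2 (labeling): feed the generated attributes back as context.
    answer = llm(
        f"Query: {query}\nAttributes: {attributes}\nProduct: {product}\n"
        f"Choose exactly one label from: {', '.join(LABELS)}."
    )
    # No label is a substring of another, so a containment check is enough.
    for label in LABELS:
        if label.lower() in answer.lower():
            return label
    raise ValueError(f"No known label found in model output: {answer!r}")
```

Passing the model as a plain callable keeps the sketch independent of any particular client library; a real setup would wrap its own API call in a function with this shape.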

Architecture

The Deliberated Prompting workflow for generating relevance annotations

Evaluation Highlights

Listwise re-ranking achieves 0.529 nDCG@10, significantly outperforming the BM25 baseline (0.380) on the SUPERB dataset
Sliding-window listwise ranking on top-100 items yields 0.449 nDCG@10 compared to 0.347 for BM25, indicating that the gain persists on larger candidate sets
Listwise approaches consistently outperform pointwise and pairwise LLM ranking methods for superlative queries

Breakthrough Assessment

7/10

Establishes a necessary formalization for a common but under-studied query type ('implicit superlatives') and provides a dataset/schema (SUPERB) that enables future work, though the modeling techniques (Listwise/CoT) are existing methods applied to this new domain.

⚙️ Technical Details

Problem Definition

Setting: Ranking product candidates for implicit superlative queries

Inputs: Superlative query q (e.g., 'best running shoes') and a list of candidate products

Outputs: Ranked list of products ordered by their degree of satisfying the implicit superlative criteria

Pipeline Flow

Initial Retrieval (BM25 / RM3)
Candidate Selection (Top-K items)
LLM Re-ranking (Listwise or Deliberated Pointwise)
Final Output Generation
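
The four stages above can be sketched as a retrieve-then-rerank function. This is a rough illustration under stated assumptions: the pure-Python BM25 below is a simplified stand-in (whitespace tokenization, a Lucene-style IDF, and the common defaults `k1=1.2`, `b=0.75`), not the PyTerrier retriever the summary mentions, and `bm25_top_k`, `search`, and the `rerank` callable are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def bm25_top_k(query: str, corpus: Dict[str, str], k: int,
               k1: float = 1.2, b: float = 0.75) -> List[str]:
    """Return the ids of the k best documents under a basic Okapi BM25."""
    if not corpus:
        return []
    # Whitespace tokenization is a simplification; real systems stem and filter.
    docs = {doc_id: text.lower().split() for doc_id, text in corpus.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in docs.values() for term in set(toks))
    query_terms = query.lower().split()

    def score(toks: List[str]) -> float:
        tf = Counter(toks)
        total = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)[:k]


def search(query: str, corpus: Dict[str, str],
           rerank: Callable[[str, List[str]], List[str]],
           k: int = 10) -> List[str]:
    """Initial retrieval and top-K selection, then re-ranking of the candidates."""
    candidates = bm25_top_k(query, corpus, k)
    # `rerank` stands in for the LLM re-ranker; its output is the final list.
    return rerank(query, candidates)
```

Because the re-ranker only sees the top-K candidates, a product that the first stage misses cannot be recovered later, which is one reason the choice of K matters.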

System Modules

Retriever

Fetch initial candidate products from the corpus

Model or implementation: BM25 or RM3

Re-ranker

Re-order candidate products based on implicit superlative criteria

Model or implementation: Claude-Haiku

Modeling

Base Model: Claude-Haiku (for ranking experiments), Claude-Sonnet (for data generation)

Training Method: Zero-shot and Few-shot Prompting (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RankGPT: Adopts RankGPT-style sliding window but specifically evaluates on implicit superlative queries using the new SUPERB schema
vs. Standard Pointwise: Introduces 'Deliberated Prompting' to explicate attributes before scoring, addressing the subjectivity of 'best' [not cited in paper]
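
A RankGPT-style sliding window can be sketched as below: a fixed-size window moves from the bottom of the candidate list toward the top, and each window is re-ordered by a listwise ranker so that strong items can move upward across overlapping windows. The function name, the `rank_window` callable (a stand-in for one listwise LLM call), and the `window=20`, `step=10` defaults are assumptions for illustration; the summary does not report the sizes used in the paper.

```python
from typing import Callable, List, Sequence


def sliding_window_rerank(
    query: str,
    candidates: Sequence[str],
    rank_window: Callable[[str, List[str]], List[str]],
    window: int = 20,
    step: int = 10,
) -> List[str]:
    """Re-rank a long list by ordering overlapping windows, back to front."""
    if not 0 < step <= window:
        raise ValueError("step must be positive and no larger than window")
    items = list(candidates)
    if not items:
        return items
    end = len(items)
    while True:
        start = max(0, end - window)
        # One listwise call re-orders only the items inside this window.
        items[start:end] = rank_window(query, items[start:end])
        if start == 0:
            break
        end -= step  # overlap of (window - step) items carries winners upward
    return items
```

With `step` equal to half of `window`, a single back-to-front pass is enough to move the best `step` items to the front when the window ranker is consistent; items further down the list are only partially ordered.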

Limitations

Evaluation is limited to a constrained setting with item descriptions truncated to 512 tokens
Analysis is performed on a single dataset derived from Amazon Shopping Queries
Reliance on proprietary models (Claude family) rather than open weights models
On queries with very specific technical attributes, LLMs may over-generalize where keyword matching would suffice

Reproducibility

Code: https://github.com/emory-irlab/SUPERB

The code is publicly available at the repository above. The SUPERB dataset contains 29,218 triplets. Evaluation uses PyTerrier and PyTerrier-GenRank. Prompts are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Re-ranking task on the SUPERB dataset

Benchmarks:

SUPERB (Product Re-ranking for Superlative Queries) [New]

Metrics:

nDCG
P@k (Precision at k)
Statistical methodology: Paired t-test with Holm-Bonferroni correction
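
The Holm-Bonferroni step mentioned above can be sketched as follows, assuming the p-values have already been obtained from paired t-tests (for example, one per system comparison). The function name and the `alpha=0.05` default are assumptions for the example, not details from the paper.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, in input order, whether each null hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The rank-th smallest p-value is compared against alpha / (m - rank).
        if p_values[i] > alpha / (m - rank):
            break  # step-down rule: stop at the first failure
        reject[i] = True
    return reject
```

For instance, with the made-up p-values `[0.01, 0.04, 0.03, 0.005]`, the two smallest pass their thresholds (0.0125 and about 0.0167), the third fails against 0.025, and the procedure stops there.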

Key Results

Re-ranking experiments on top-10 items show Listwise approaches significantly outperforming baselines and pointwise methods.

Benchmark	Metric	Baseline	This Paper	Δ
SUPERB	nDCG@10	0.380	0.529	+0.149
SUPERB	P@10	0.279	0.385	+0.106
SUPERB	nDCG@10	0.407	0.529	+0.122

Sliding window experiments on larger candidate sets (top-100) confirm the scalability of the listwise approach.

Benchmark	Metric	Baseline	This Paper	Δ
SUPERB	nDCG@10	0.347	0.449	+0.102
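
To make the nDCG@10 figures above concrete, here is a small sketch of how nDCG@k can be computed from the four SUPERB labels. The mapping of labels to gains (3, 2, 1, 0) and the use of linear rather than exponential gain are assumptions; the summary does not state how the paper configures the metric, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

# Assumed gain per label; the paper's exact mapping is not given in this summary.
GAIN = {
    "Overall Best": 3,
    "Almost Best": 2,
    "Relevant But Not Best": 1,
    "Not Relevant": 0,
}


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: items at lower ranks contribute less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels: List[str],
              judged_labels: Optional[List[str]] = None,
              k: int = 10) -> float:
    """nDCG@k of a ranking, normalized by the best ordering of the judged items."""
    gains = [GAIN[label] for label in ranked_labels]
    # If the full set of judged labels is not given, use the ranked items only.
    pool = gains if judged_labels is None else [GAIN[l] for l in judged_labels]
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

A ranking that places an "Overall Best" product second behind a "Not Relevant" one scores 1/log2(3), roughly 0.63, which shows how strongly the metric penalizes burying the best item even by a single position.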

Experiment Figures

Scatter plot comparing nDCG@10 of BM25 vs. Listwise Ranking per query

Main Takeaways

Listwise ranking consistently outperforms pointwise, pairwise, and deliberated pointwise approaches for superlative queries
LLMs excel at queries involving style, versatility, and aesthetics (e.g., 'modern refrigerators for minimalist kitchen') where reasoning over implicit attributes is required
BM25 still performs competitively on queries with very clear, well-defined technical criteria where lexical matching is sufficient (e.g., 'rv caulking sealant')

📚 Prerequisite Knowledge

Prerequisites

Information Retrieval basics (BM25, nDCG)
LLM Prompting strategies (Zero-shot, Chain-of-Thought)
Re-ranking architectures (Pointwise, Listwise)

Key Terms

Implicit Superlative Queries: Search queries seeking the 'best' of a category without explicitly stating the criteria (e.g., 'best shoes for trail running' implies grip, durability, ankle support)

SUPERB: Superlatives with Best relevance annotations—the authors' proposed dataset and 4-level labeling schema for superlative queries

Listwise Ranking: Prompting an LLM with a query and a list of multiple documents, asking it to output a ranked order of those documents

Pointwise Ranking: Prompting an LLM to score or label a single document at a time independently of others

Deliberated Prompting: A two-step prompting strategy where the LLM first generates reasoning (e.g., attributes of a 'best' product) before generating the final score or label

ESCI: Exact, Substitute, Complement, Irrelevant—a standard e-commerce relevance scale used in the source dataset

nDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

RM3: A pseudo-relevance feedback model that expands the original query using terms from the top initially retrieved documents