Evaluation Setup
Retrieval of entities from three domains (Amazon, MAG, Prime) given natural language queries.
Benchmarks:
- STaRK-Amazon (Product Recommendation, E-commerce) [New]
- STaRK-MAG (Academic Paper Search) [New]
- STaRK-Prime (Precision Medicine/Biomedical Inquiry) [New]
Metrics:
- Hit@1
- Hit@5
- Recall@20
- Mean Reciprocal Rank (MRR)
- Statistical methodology: Not explicitly reported in the paper
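
The ranking metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names and the `(ranked, relevant)` input layout are assumptions made for this example, and scores are scaled to percentages to match the tables below.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant entity appears in the top k, else 0.0."""
    return float(any(entity in relevant for entity in ranked[:k]))


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant entities that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant entity, or 0.0 if none is retrieved."""
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Average each metric over all queries and report it as a percentage.

    `runs` holds one (ranked entity ids, set of relevant ids) pair per query;
    this layout is an assumption of the sketch.
    """
    n = len(runs)
    return {
        "Hit@1": 100 * sum(hit_at_k(r, g, 1) for r, g in runs) / n,
        "Hit@5": 100 * sum(hit_at_k(r, g, 5) for r, g in runs) / n,
        "Recall@20": 100 * sum(recall_at_k(r, g, 20) for r, g in runs) / n,
        "MRR": 100 * sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }
```

For two toy queries, `evaluate([(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"x", "q"})])` gives Hit@1 = 50.0, Hit@5 = 100.0, Recall@20 = 75.0, and MRR = 75.0.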
Key Results
Performance of baseline retrievers on synthesized queries shows sparse methods often beating dense ones, with rerankers providing significant boosts.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| STaRK-Amazon | Hit@1 | 29.5 | 17.0 | -12.5 |
| STaRK-Prime | Hit@1 | 8.7 | 18.0 | +9.3 |
| STaRK-MAG | Recall@20 | 32.0 | 49.0 | +17.0 |
| STaRK-Amazon (Human) | Hit@1 | 35.3 | 59.3 | +24.0 |
| STaRK-Prime | MRR | Not reported in the paper | 27.0 | Not reported in the paper |
Main Takeaways
- Sparse retrieval (BM25) is a surprisingly strong baseline, often beating dense retrievers (DPR, ANCE), likely because entities in the semi-structured knowledge base (SKB) have distinct identifiers that exact matching captures better.
- Standard dense embedding models (ada-002) fail to capture fine-grained relational constraints, often retrieving items with correct keywords but wrong relations (e.g., wrong brand).
- LLM rerankers (GPT-4, Claude 3) provide the best performance by far, confirming that complex reasoning is required, but their high latency and cost make them difficult to scale.
- The benchmark is harder than existing ones: even the best systems achieve <20% Hit@1 on the biomedical domain (Prime), highlighting a need for better semi-structured retrieval systems.
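
The cost concern around rerankers can be made concrete with a two-stage sketch: a cheap scorer ranks every entity, and an expensive scorer is called only on the top k candidates. This assumes a generic retrieve-then-rerank setup and is not the paper's pipeline. `retrieve_then_rerank`, `token_overlap`, and the scorer signatures are hypothetical names for this illustration, and the expensive scorer stands in for an LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical scorer signature: (query, entity text) -> relevance score.
Scorer = Callable[[str, str], float]


def token_overlap(query: str, text: str) -> float:
    """Toy lexical scorer: number of distinct query tokens found in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve_then_rerank(
    query: str,
    corpus: Dict[str, str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    k: int = 20,
) -> List[str]:
    """Rank all entities cheaply, then rerank only the top k candidates."""
    # Stage 1: the cheap scorer sees every entity in the corpus.
    first_pass = sorted(
        corpus, key=lambda eid: cheap_score(query, corpus[eid]), reverse=True
    )
    head, tail = first_pass[:k], first_pass[k:]
    # Stage 2: the expensive scorer is called once per candidate in the head,
    # so its cost grows with k rather than with the corpus size.
    head = sorted(
        head, key=lambda eid: expensive_score(query, corpus[eid]), reverse=True
    )
    return head + tail
```

With this structure the number of expensive calls is bounded by k per query, but any relevant entity that the first stage ranks below k cannot be recovered by the reranker.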