LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay

📝 Paper Summary

E-commerce Search & Advertising · Dense Retrieval · Knowledge Distillation

LLMDistill4Ads improves ad keyphrase recommendations by distilling relevance judgments from an LLM teacher through a cross-encoder assistant into a scalable bi-encoder student using Pearson correlation loss.

Core Problem

Click-based training data for ad recommendations is sparse and biased: it only reflects keyphrases previously approved by the search engine ('middleman bias'), and it is subject to ranking-position bias.

Why it matters:

Items ranked lower receive fewer clicks regardless of relevance, making lack of clicks an unreliable negative signal
Training on biased click logs perpetuates existing system limitations, preventing the discovery of new, relevant keyphrases for advertisers
High latency of accurate cross-encoder models prevents their direct use in large-scale retrieval systems with billions of items

Concrete Example: An item might be relevant to 'vintage lamp', but if the current search engine never shows the item for that query, no clicks are generated. A model trained only on clicks will learn that 'vintage lamp' is irrelevant, whereas an LLM or human judge would identify the missed opportunity.

Key Novelty

Two-Stage Multi-Task Distillation (LLM → CE → BE)

Uses a 'Teacher-Assistant' framework where a heavy LLM labels data to train a Cross-Encoder assistant, which then teaches a lightweight Bi-Encoder student
Employs a multi-task objective combining supervised click data (CTR), Search Relevance scores (SR), and Pearson correlation-based distillation from the assistant to calibrate ranking scores
Integrates heterogeneous signals to mitigate 'middleman bias'—allowing the model to learn from accepted, rejected, and unseen keyphrases

Architecture

The multi-task training framework showing the Teacher-Assistant-Student hierarchy and data sources.

Evaluation Highlights

+51.26% increase in Gross Merchandise Bought (GMB) in a 12-day online A/B test compared to a CTR-only baseline
+38.69% improvement in Return on Ad Spend (ROAS) in the same online test
+11.75% increase in average adopted keyphrase count per item, indicating better alignment with seller preferences

Breakthrough Assessment

7/10

Strong industrial application showing significant online business gains. While the teacher-assistant distillation architecture is known, the specific application to mitigating middleman bias in ads with Pearson loss is impactful.

⚙️ Technical Details

Problem Definition

Setting: Extreme Multi-Label Classification (XMC) / Embedding-Based Retrieval (EBR) where items must be mapped to relevant buyer queries (keyphrases)

Inputs: Item title and metadata (category)

Outputs: Ranked list of relevant keyphrases (buyer queries)

Pipeline Flow

Item Encoder (microBERT) → Vector Embedding
ANN Search (vs Keyphrase Index) → Candidate Keyphrases
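
The two-step flow above can be illustrated with a toy retrieval sketch. It assumes item and keyphrase embeddings already exist. The `cosine` and `retrieve` helpers, the 3-dimensional vectors, and the brute-force scan (standing in for a real ANN index) are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(item_vec, keyphrase_index, k=2):
    """Return the k keyphrases most similar to the item embedding.

    A brute-force scan is used only as a stand-in for an ANN index.
    """
    scored = [(cosine(item_vec, vec), kp) for kp, vec in keyphrase_index.items()]
    scored.sort(reverse=True)
    return [kp for _, kp in scored[:k]]

# Made-up embeddings for illustration; a real system would use encoder outputs.
keyphrase_index = {
    "vintage lamp": [0.9, 0.1, 0.0],
    "table lamp": [0.7, 0.3, 0.1],
    "running shoes": [0.0, 0.2, 0.9],
}
item_vec = [0.8, 0.2, 0.0]
top = retrieve(item_vec, keyphrase_index, k=2)
```

Because keyphrase vectors can be indexed offline, only the item (or only the query side) needs encoding at request time, which is what makes the bi-encoder practical at this scale.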

System Modules

Bi-Encoder Student

Generate embeddings for items to retrieve relevant keyphrases efficiently

Model or implementation: microBERT (distilled mobileBERT, ~6 layers)

Novel Architectural Elements

Integration of Matryoshka Loss for embedding truncation to 64 dimensions within a multi-task distillation framework [Architectural choice for inference efficiency]
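
As a rough illustration of what truncation means at inference time, the sketch below keeps the first 64 coordinates of an embedding and rescales them to unit length so cosine similarity stays well defined. The 256-dimensional input and the `truncate_and_normalize` helper are assumptions for illustration; the summary does not state the full embedding size or how truncation is implemented.

```python
import math

def truncate_and_normalize(embedding, dim=64):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]

# Hypothetical 256-dimensional embedding with made-up values.
full = [float(i % 7) - 3.0 for i in range(256)]
short = truncate_and_normalize(full, dim=64)
```

Matryoshka-style training applies the loss to nested prefixes of the embedding, so the leading coordinates remain useful on their own after this kind of truncation.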

Modeling

Base Model: microBERT (distilled version of eBERT/mobileBERT)

Training Method: Multi-task Knowledge Distillation

Objective Functions:

Purpose: Learn from user clicks.
Formally: Multiple Negatives Ranking Loss (MNR) with In-Batch Random Negative Sampling

Purpose: Distill ranking knowledge from Cross-Encoder.
Formally: Pearson Correlation Loss maximizing correlation between CE logits and BE cosine similarities

Purpose: Align with Search Relevance filter.
Formally: Contrastive Loss / Softmax Loss on SR labels
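
The first two objectives can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `1 - ρ` form of the distillation loss, the `scale=20.0` temperature, the `eps` guard, and all function names are choices made here for clarity.

```python
import math

def pearson(xs, ys, eps=1e-12):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (math.sqrt(vx * vy) + eps)  # eps guards constant inputs

def pearson_distill_loss(ce_logits, be_cosines):
    """Assumed form: minimizing 1 - rho maximizes the correlation."""
    return 1.0 - pearson(ce_logits, be_cosines)

def mnr_loss(sim_matrix, scale=20.0):
    """Multiple Negatives Ranking loss with in-batch negatives.

    sim_matrix[i][j] is the cosine similarity of item i and keyphrase j;
    the diagonal holds the positive pairs, other entries act as negatives.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / len(sim_matrix)

# Made-up scores: the student is an exact linear rescaling of the teacher,
# so the correlation is 1 even though the absolute values differ.
ce_logits = [2.0, -1.0, 0.5, 3.0]
be_cosines = [0.7, 0.1, 0.4, 0.9]
distill = pearson_distill_loss(ce_logits, be_cosines)
```

The example shows why a correlation-based loss suits this setting: cross-encoder logits and bi-encoder cosines live on different scales, and the loss does not penalize that difference as long as the scores move together.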

Training Data:

Click Data: 10.7M instances (Positive if CTR > 0.05)
Search Relevance (SR): 18.7M instances (Positive if SR score > threshold)
LLM Labels: 50M instances labeled by Mixtral 8x7B Instruct-v0.1
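
The click-labeling rule can be written as a small helper. Only the 0.05 threshold comes from the summary above; the `ctr_label` name and the choice to return `None` for pairs with no impressions (rather than treating them as negatives) are assumptions made here to reflect the middleman-bias concern.

```python
def ctr_label(clicks, impressions, threshold=0.05):
    """Label an item-keyphrase pair from click logs.

    Returns 1 if CTR exceeds the threshold, 0 otherwise, and None when the
    pair was never shown (assumed here to carry no signal either way).
    """
    if impressions == 0:
        return None
    ctr = clicks / impressions
    return 1 if ctr > threshold else 0

# Made-up counts for illustration.
labels = [ctr_label(6, 100), ctr_label(5, 100), ctr_label(0, 0)]
```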

Key Hyperparameters:

embedding_dimension: 64 (truncated)
batch_construction: each batch is drawn from a single dataset type; dataset types are mixed across batches
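
One way to read the batch-construction setting is sketched below: every batch contains examples from exactly one dataset, and the batches from different datasets are interleaved in the training order. The `homogeneous_batches` helper, the dataset names, and the shuffling scheme are hypothetical; the summary does not describe the actual sampler.

```python
import random

def homogeneous_batches(datasets, batch_size, seed=0):
    """Return (dataset_name, batch) pairs, each batch from one dataset only."""
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        shuffled = examples[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batches.append((name, shuffled[start:start + batch_size]))
    rng.shuffle(batches)  # interleave dataset types across training steps
    return batches

# Tiny made-up datasets standing in for the click, SR, and distillation data.
datasets = {
    "ctr": [("item%d" % i, "kp%d" % i) for i in range(5)],
    "sr": [("item%d" % i, "kp%d" % i) for i in range(10, 14)],
    "distill": [("item%d" % i, "kp%d" % i) for i in range(20, 26)],
}
batches = homogeneous_batches(datasets, batch_size=2)
```

Keeping each batch homogeneous lets the training loop pick the matching loss from the dataset name without per-example branching.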

Compute: Not reported in the paper

Comparison to Prior Work

vs. D2LLM: LLMDistill4Ads adds an 'Assistant' Cross-Encoder step (LLM → CE → BE) rather than direct LLM → BE distillation, and incorporates explicit business signals (CTR, SR) via multi-task learning
vs. Standard Bi-Encoders: Uses Pearson Correlation Loss instead of MSE/KL for distillation, which is shown to better preserve ranking order
vs. XMC approaches: Formulates as dense retrieval with semi-open vocabulary rather than fixed label classification

Limitations

Relies on proprietary eBay datasets, making direct replication impossible
Seller adoption signal is noisy (rejection might be due to bid price, not keyphrase relevance)
Requires maintaining a heavy Cross-Encoder assistant pipeline for training updates

Reproducibility

No code provided. Proprietary eBay datasets (Click logs, SR scores) are not available. Teacher model (Mixtral 8x7B) is public, but fine-tuning data is internal.

📊 Experiments & Results

Evaluation Setup

Offline evaluation on held-out test sets and online A/B testing on eBay US traffic

Benchmarks:

Internal eBay Test Set (Keyphrase Relevance Classification/Ranking) [New]

Metrics:

F1 Score
Pearson Correlation (ρ)
Incremental Keyphrases (KP)
Pass Rate (PR) by LLM judge
Online: GMB, ROAS, Adoption Count
Statistical methodology: Online A/B test results reported with p-values (p=0.01 for GMB, p=0.02 for ROAS)
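
For readers less familiar with the metrics, the sketch below computes F1 from binary labels and ROAS from revenue and spend using their standard definitions. The function names and example numbers are illustrative only and are unrelated to the reported results.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roas(ad_revenue, ad_spend):
    """Return on Ad Spend: revenue attributed to ads per unit of spend."""
    return ad_revenue / ad_spend

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
f1 = f1_score(y_true, y_pred)
```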

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Distillation loss ablation showing Pearson correlation effectively transfers ranking knowledge.
Internal Test Set	Pearson Correlation (ρ)	0.76	0.87	+0.11
Internal Test Set	F1 Score	0.81	0.88	+0.07
Ablation of supervision signals showing the benefit of combining LLM, CTR, and Distillation.
Internal Evaluation	Pass Rate (PR)	60	71	+11
Internal Evaluation	Incremental Keyphrases (KP)	7	12	+5
Online A/B test results demonstrating real-world impact.
eBay Live Traffic	GMB (Gross Merchandise Bought), % lift over CTR-only baseline	0.00	51.26	+51.26%

Experiment Figures

Venn-style motivation diagram showing the misalignment between Advertising, Search, and Seller relevance judgments.

Main Takeaways

Two-stage distillation (LLM → Cross-Encoder → Bi-Encoder) consistently outperforms direct LLM → Bi-Encoder distillation (Pearson ρ 0.87 vs 0.76).
Pearson correlation loss is superior to MSE, KL-divergence, and Contrastive loss for distilling ranking behavior in this domain.
Combining LLM-generated labels with traditional click data (CTR) provides the best balance of diversity (coverage) and precision, mitigating click biases while maintaining relevance.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation (Teacher-Student)
Bi-Encoders vs. Cross-Encoders for information retrieval
Contrastive Learning / Loss functions (InfoNCE, MSE)

Key Terms

Middleman Bias: A form of selection bias where training data only contains samples that passed a previous system's filter (e.g., search engine approval), hiding potential positives that were filtered out

Cross-Encoder: A model that processes query and document simultaneously (full self-attention), offering high accuracy but high computational cost

Bi-Encoder: A model that encodes query and document independently into vectors, allowing fast retrieval via nearest neighbor search but with lower accuracy than cross-encoders

Pearson Correlation Loss: A loss function that maximizes the linear correlation between teacher and student scores. Because the correlation is invariant to shifting and rescaling the scores, it preserves their relative ordering and spacing rather than their absolute values

GMB: Gross Merchandise Bought, the total dollar value of merchandise purchased

ROAS: Return on Advertising Spend—revenue generated for every dollar spent on advertising

MNAR: Missing Not At Random—the pattern of missing data is related to the unobserved data itself (e.g., users don't click relevant items because they are ranked low)