Grocery to General Merchandise: A Cross-Pollination Recommender using LLMs and Real-Time Cart Context

📝 Paper Summary

Cross-Domain Recommendation
E-commerce Search & Discovery

A two-stage recommender system bridges the gap between routine grocery and discretionary merchandise by using LLMs to generate novel associations and a transformer-based ranker to optimize for real-time cart context.

Core Problem

Traditional recommenders struggle to connect high-frequency, low-price Online Grocery (OG) items with low-frequency, high-price General Merchandise (GM) items because of category bias and a lack of historical co-purchase data.

Why it matters:

Shoppers focused on routine groceries often miss relevant general merchandise, leading to lost revenue opportunities
Traditional collaborative filtering reinforces existing behaviors rather than sparking new category discovery (e.g., always recommending milk with cereal, never with frothers)
Offline analysis shows multi-category shoppers (OG + GM) generate 2.5x more revenue than single-category shoppers

Concrete Example: A customer buying milk is typically recommended cereal or cookies (grocery items). Current systems fail to recommend a 'milk frother' (general merchandise) because historical interaction data between these distinct categories is sparse or non-existent.

Key Novelty

Cross-Pollination (XP) Framework

Uses 'Agentic' LLMs to reason about item utility and lifestyle scenarios, generating cross-category connections (e.g., 'milk' -> 'frother') that do not appear in historical purchase graphs
Evaluates candidates using a 'Semantic Evaluation Agent' that acts as a judge, filtering poor matches before they reach the user
Re-ranks candidates using a transformer that encodes the entire real-time shopping cart, capturing dynamic user intent beyond single-item relevance

Architecture

End-to-end framework illustrating the two-stage process: Candidate Generation (Historical + LLM) followed by Real-Time Cart Context Ranking.

Evaluation Highlights

+36% increase in add-to-cart rate using LLM-based retrieval compared to baselines in A/B testing
+27% lift in NDCG@4 using the Cart Context-based Neural Ranker compared to item-only baselines
94% relevancy rate achieved on LLM-generated recommendations based on human evaluation of 200 anchor items

Breakthrough Assessment

7/10

Strong practical application of LLMs for cold-start cross-domain discovery. The combination of generative association with discriminative real-time ranking is effective, though the architecture uses standard components (GPT-4o, Transformer).

⚙️ Technical Details

Problem Definition

Setting: Cross-category item recommendation in an e-commerce setting

Inputs: Anchor grocery item (for candidate gen) and Current Cart Context C_t (for ranking)

Outputs: Ranked list of General Merchandise (GM) items

Pipeline Flow

Candidate Generation Group: Context Generation Agent → Recommendation Agent → Semantic Evaluation Agent
Ranking Group: Cart Encoder → Cross-Attention → Scorer
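
The two groups above can be sketched as a plain function pipeline. This is a minimal illustration, not the paper's implementation: every function name, the stubbed return values, and the 0.5 threshold are hypothetical placeholders standing in for the LLM agents and the trained ranker.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical; each stub stands in for a model call.

def generate_contexts(anchor: str) -> List[str]:
    """Stand-in for the context agent (an LLM call in the paper)."""
    return [f"{anchor}: coffee at home"]

def recommend_products(contexts: List[str]) -> List[str]:
    """Stand-in for the recommendation agent (an LLM call in the paper)."""
    return ["milk frother", "garden hose"]

def evaluate(anchor: str, candidate: str) -> float:
    """Stand-in for the evaluation agent; returns a relevance score in [0, 1]."""
    return 0.9 if "frother" in candidate else 0.1

def generate_candidates(anchor: str, threshold: float = 0.5) -> List[str]:
    """Stage 1: contexts -> candidate products -> filtered candidates."""
    candidates = recommend_products(generate_contexts(anchor))
    return [c for c in candidates if evaluate(anchor, c) >= threshold]

def rank(candidates: List[str], cart: List[str],
         score: Callable[[str, List[str]], float]) -> List[Tuple[str, float]]:
    """Stage 2: score each candidate against the whole cart, best first."""
    scored = [(c, score(c, cart)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping the two stages as separate functions mirrors the split described above: stage 1 depends only on the anchor item, so it could run offline, while stage 2 needs the live cart.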

System Modules

Context Generation Agent (Candidate Generation)

Generate thematic contexts (usage patterns, lifestyle scenarios) for an anchor grocery item

Model or implementation: gpt-4o

Recommendation Agent (Candidate Generation)

Produce specific GM product recommendations based on generated themes

Model or implementation: gpt-4o

Semantic Evaluation Agent (Candidate Generation)

Filter and score candidates using both LLM-as-judge and a Cross-Encoder

Model or implementation: gpt-4o (Judge) + Walmart-specific Cross-Encoder

Cart XP Ranker

Re-rank GM candidates based on the full sequence of items currently in the user's cart

Model or implementation: Transformer-based Neural Network
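
One way to picture how a candidate can be scored against the whole cart is single-head dot-product cross-attention, with the candidate embedding as the query and the cart item embeddings as keys and values. The sketch below is a toy stand-in, not the paper's ranker: it has no learned weights and no transformer encoder layers, and the function names and 2-dimensional vectors in the usage note are made up for illustration.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cart_attention_score(candidate: Vector, cart: List[Vector]) -> float:
    """Attend from the candidate (query) over cart items (keys/values),
    then score the candidate against the attention-pooled cart vector."""
    if not cart:
        raise ValueError("cart must contain at least one item embedding")
    scale = math.sqrt(len(candidate))
    weights = softmax([dot(candidate, item) / scale for item in cart])
    pooled = [sum(w * item[d] for w, item in zip(weights, cart))
              for d in range(len(candidate))]
    return dot(candidate, pooled)
```

With a toy cart of `[[1.0, 0.0], [0.9, 0.1]]`, a candidate `[1.0, 0.0]` that points in the same direction as the cart items scores higher than an unrelated candidate `[0.0, 1.0]`.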

Novel Architectural Elements

Integration of an 'Agentic' LLM pipeline solely for generating cross-category search queries, coupled with a 'Semantic Evaluation Agent' to strictly filter hallucinations before retrieval
Use of a transformer encoder specifically to model the real-time composition of a grocery cart (Cart XP) to condition the ranking of general merchandise items
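
LLM-generated text still has to be matched to real catalog items; this summary names E5 semantic search for that step. A minimal sketch of embedding-based retrieval follows, assuming the query and catalog embeddings have already been computed. The function names and toy vectors are hypothetical, and no E5 model is called here.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_items(query_vec: List[float],
                catalog: Dict[str, List[float]],
                k: int) -> List[Tuple[str, float]]:
    """Return the k catalog items most similar to the query embedding."""
    scored = [(item, cosine(query_vec, vec)) for item, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A production system would typically replace the exhaustive scan with an approximate nearest-neighbor index; the brute-force version is shown only because it makes the matching step explicit.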

Modeling

Base Model: gpt-4o (for Candidate Gen), Custom Transformer (for Ranking)

Training Method: Supervised Learning (List-wise Softmax Loss)

Objective Functions:

Purpose: Optimize ranking to place purchased items higher.

Formally: Negative log-likelihood of the positive items under a softmax taken over the set of positive and negative items in each list.
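
For a list with a single positive item p and model scores s_j, this objective is -log(exp(s_p) / Σ_j exp(s_j)). The sketch below computes it in plain Python. Averaging over several positives in one list is an assumption made here, since the summary does not say how the paper handles that case, and the function name is illustrative.

```python
import math
from typing import List

def listwise_softmax_loss(scores: List[float], labels: List[int]) -> float:
    """Negative log-likelihood of the positive items under a softmax
    taken over all items (positives and negatives) in one list."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("each list needs at least one positive item")
    # Assumption: average over positives so lists of different sizes compare.
    return -sum(s - log_z for s in positives) / len(positives)
```

The loss shrinks as the positive item's score rises relative to the negatives, which is what pushes purchased items toward the top of the ranking.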

Training Data:

Six months of cart interaction logs
Filtered for sessions with OG items in cart and at least one GM item view
Positive labels: Same-session clicks leading to add-to-cart within 7 days

Key Hyperparameters:

embedding_dim_transformed: 128 (reduced from 768)
title_embedding_model: MPNet
product_type_embedding_model: MPNet

Compute: Not reported in the paper

Comparison to Prior Work

vs. Collaborative Filtering: Uses LLMs to generate plausible cross-category connections zero-shot rather than relying on historical co-occurrence
vs. Standard Cross-Domain: Incorporates real-time sequential cart context via Transformers rather than static user/item profiles
vs. OAG-GPT [not cited in paper]: Focuses specifically on Grocery-to-GM transition using chain-of-thought for lifestyle matching rather than general academic graph reasoning

Limitations

Relies on expensive LLM calls (GPT-4o) for candidate generation, which may have latency/cost implications at scale (though offline generation mitigates this)
Success depends heavily on the quality of the Semantic Search (E5) mapping LLM text to catalog items
Evaluation focuses on 'Add-to-cart' and 'NDCG' but does not explicitly report long-term conversion or return rates

Reproducibility

No replication artifacts mentioned in the paper. Code, model weights, and datasets are proprietary to Walmart. The paper refers to prompts in Appendix A, but the appendix content was not available for this summary.

📊 Experiments & Results

Evaluation Setup

Offline historical analysis and Online A/B testing in Walmart's e-commerce environment

Benchmarks:

Walmart Internal Traffic (Real-world E-commerce Recommendation) [New]

Metrics:

Add-to-Cart (ATC) Rate
NDCG@4
Relevancy Rate (Human Eval)
Lift
Revenue
Statistical methodology: Not explicitly reported in the paper

Key Results

Results are reported as relative lifts (%) or qualitative multipliers rather than absolute baseline numbers. Detailed absolute metrics for baselines were not provided in the text.

Benchmark	Metric	Baseline	This Paper	Δ
Walmart Internal Data	Revenue Multiplier	1.0x	2.5x	+1.5x
Manual Review (200 items)	Relevancy Rate	Not reported in the paper	94%	Not reported in the paper
Manual Review	Alignment Score	Not reported in the paper	95%	Not reported in the paper

Main Takeaways

LLM-based candidate generation achieved a 36% increase in add-to-cart rate in A/B testing, suggesting that generative associations can bridge distinct shopping categories such as Grocery and General Merchandise.
The Cart Context-based Neural Ranker provided a 27% lift in NDCG@4 over item-only baselines, indicating that the full composition of a user's current cart carries ranking signal beyond item-level relevance.
The automated 'Semantic Evaluation Agent' (LLM-as-judge + Cross-Encoder) achieved 95% alignment with human reviewers, validating it as a scalable proxy for quality control.
Offline analysis shows cross-category shoppers generate 2.5x more revenue than single-category shoppers, which motivates the added complexity of the cross-pollination framework.
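
The NDCG@4 figure in the takeaways above is a relative lift on a standard ranking metric. A minimal sketch of both calculations follows. It uses linear gain, which is an assumption (the summary does not state the gain function, and linear and exponential gain coincide for binary labels), and the function names are illustrative.

```python
import math
from typing import List

def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def relative_lift(treatment: float, baseline: float) -> float:
    """Relative improvement over a baseline, e.g. 0.27 for a +27% lift."""
    return (treatment - baseline) / baseline
```

`relevances` lists the labels in the order the ranker produced them, so `[0, 1, 0, 0]` means the only relevant item landed in position 2, which costs the ranking part of its score relative to placing it first.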

📚 Prerequisite Knowledge

Prerequisites

Understanding of Recommender Systems (Candidate Gen vs. Ranking)
Basic knowledge of Large Language Models (LLMs)
Familiarity with Transformer architectures

Key Terms

OG: Online Grocery—low-priced, routine purchase items like vegetables and meats

GM: General Merchandise—higher-priced, discretionary items like cookware and electronics

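Lift: A metric quantifying the strength of association between two items, calculated as P(A,B) / (P(A) × P(B)); values above 1 mean the items co-occur more often than independence would predict. The percentage lifts reported in the results (e.g., +27% in NDCG@4) are relative improvements, not this ratio.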

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the list

Cross-Encoder: A model that processes two input texts simultaneously to output a relevance score, capturing deeper semantic interaction than separate embeddings

Market Basket Analysis: A technique (like Apriori) to find associations between items frequently bought together

E5: A text embedding model used for semantic search and retrieval

MPNet: A sentence transformer model used here to generate title and product type embeddings for the ranker
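
The Lift definition in the Key Terms above can be estimated directly from basket counts. The sketch below does this on a list of baskets; the function name and the toy baskets in the usage note are illustrative and are not data from the paper.

```python
from typing import List, Set

def lift(baskets: List[Set[str]], a: str, b: str) -> float:
    """Lift(A, B) = P(A, B) / (P(A) * P(B)), estimated from basket counts."""
    n = len(baskets)
    if n == 0:
        raise ValueError("need at least one basket")
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both items must appear in at least one basket")
    return p_ab / (p_a * p_b)
```

For four toy baskets `{milk, frother}`, `{milk, cereal}`, `{bread}`, `{milk, frother}`, the estimate is 0.5 / (0.75 × 0.5) ≈ 1.33. Item pairs that never co-occur in the logs get a lift of 0, which is the sparsity problem that motivates LLM-generated candidates in this paper.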