Language Technologies Institute, Carnegie Mellon University,
Meta
arXiv
(2025)
Recommendation, Benchmark, P13N
📝 Paper Summary
Recommender Systems Benchmarking, Sequential Recommendation, Privacy-Preserving Data Construction
ORBIT standardizes recommender system evaluation with consistent public benchmarks and ClueWeb-Reco, a hidden test set constructed by soft-matching real user browsing histories to public webpages to preserve privacy.
Core Problem
Existing recommendation datasets rely on unrealistic proxies like reviews rather than actual browsing behavior, lack explicit user consent, and suffer from inconsistent evaluation splits that hinder reproducibility.
Why it matters:
Reviews account for only 1-2% of interactions, so reviewing behavior is both far sparser than viewing behavior and different in kind; benchmarks built on reviews therefore fail to model real user interests
Inconsistent data splits and metric definitions across studies make it impossible to fairly compare state-of-the-art models
Releasing real user browsing history for realistic evaluation poses severe privacy and legal risks regarding Personally Identifiable Information (PII)
Concrete Example: A user's browsing history might include sensitive local school applications or health searches. Releasing this raw sequence violates privacy (PII leakage), but synthetic data generated by LLMs fails to capture the complex, rapid topic shifts of real human surfing.
Key Novelty
Privacy-Preserving Soft-Matching for Realistic Evaluation Data
Collects real browsing history with consent, then replaces each private URL with the most semantically similar public webpage from the ClueWeb22 corpus using dense retrieval
Removes exact URL matches to ensure the dataset is fully synthetic while preserving the semantic trajectory and domain distribution of real user behavior
Architecture
The ClueWeb-Reco dataset construction pipeline, illustrating how private user history is transformed into a public dataset.
Evaluation Highlights
Constructed ClueWeb-Reco from 41,760 raw browsing records, resulting in 1,024 high-quality validation/test sessions after filtering
Achieved moderate inter-annotator agreement (Cohen's kappa 0.372) on the semantic relevance of soft-matched pages, supporting the claim that user intent is largely preserved
Maintained domain consistency: Top domains in the synthetic ClueWeb-Reco dataset (e.g., YouTube) closely mirror the rank distribution of the raw private data
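For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of how Cohen's kappa is computed for two annotators. The binary relevance labels below are invented for illustration; the paper's actual annotation scheme and label set are not described in this summary.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    # Observed agreement: fraction of items where the two labels coincide.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels: 1 = "proxy page is relevant", 0 = "not relevant".
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this toy case the annotators agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5. Kappa discounts agreement that would occur by chance, which is why it can be much lower than raw percent agreement.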
Breakthrough Assessment
8/10
Addresses a crisis in recommendation research (reproducibility and realism) with a novel, privacy-safe method for releasing 'real' user behavior. The hidden test set paradigm is a significant step toward maturity for the field.
Inputs: User history sequence of items (webpages or products)
Outputs: The next item the user will interact with from the candidate pool
Pipeline Flow
Raw Data Collection (User Consent)
Quality Control (Filter Toxic/Scam)
Semantic Embedding (MiniCPM)
Dense Retrieval (DiskANN)
Soft-Matching Selection & Release
System Modules
Quality Control Filter
Remove scams, toxic content, and badly formatted data from raw submissions
Model or implementation: Rule-based and manual filters
Embedding Encoder (Soft Matching)
Encode webpage content into dense vectors for similarity comparison
Model or implementation: MiniCPM-Embedding-Light
Retriever (Soft Matching)
Find the most semantically similar public pages for each private URL
Model or implementation: DiskANN Index
Mapper / Selector (Soft Matching)
Select final proxy page to ensure privacy
Model or implementation: Heuristic Selection
Novel Architectural Elements
Privacy-preserving soft-matching pipeline that deliberately excludes exact matches (Top-1 hits) to guarantee the dataset is fully synthetic yet semantically grounded in real behavior
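The soft-matching idea can be sketched in a few lines. This is an illustrative stand-in, not the authors' pipeline: it uses brute-force cosine similarity over toy 3-dimensional vectors in place of MiniCPM-Embedding-Light embeddings and a DiskANN index, and the function names, URLs, and the simple "skip identical URL" rule are assumptions made for the example. The paper's actual selection heuristic is not specified in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_match(private_url, private_vec, public_corpus):
    """Return (url, score) of the most similar public page, skipping exact URL matches.

    public_corpus: dict mapping public URL -> embedding vector (hypothetical layout).
    """
    best_url, best_score = None, -1.0
    for url, vec in public_corpus.items():
        if url == private_url:
            continue  # assumed rule: never release the page the user actually visited
        score = cosine(private_vec, vec)
        if score > best_score:
            best_url, best_score = url, score
    return best_url, best_score


# Toy corpus with made-up embeddings.
corpus = {
    "https://example.org/a": [1.0, 0.0, 0.0],
    "https://example.org/b": [0.9, 0.1, 0.0],
    "https://example.org/c": [0.0, 1.0, 0.0],
}
proxy, score = soft_match("https://example.org/a", [1.0, 0.0, 0.0], corpus)
print(proxy, round(score, 4))
```

Here the visited page also exists in the public corpus, so the sketch returns the next-closest page (`https://example.org/b`) instead. At the scale of ClueWeb22 a linear scan is impractical, which is where an approximate nearest neighbor index such as DiskANN comes in.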
Comparison to Prior Work
vs. MS MARCO: MS MARCO's k-anonymity approach breaks down for long sequential histories, which are almost always unique; ORBIT instead soft-matches each page to a semantically similar synthetic proxy
vs. TREC CAsT: ORBIT automates the mapping process via dense retrieval rather than manual crafting, scaling to browsing histories
vs. RecBole: ORBIT enforces a fixed, server-side hidden test set evaluation to prevent data split manipulation or inconsistency
Limitations
Soft-matching is imperfect; moderate human agreement (Cohen's kappa 0.372) suggests some nuance of user intent is lost in the mapping
Requires access to the restricted ClueWeb22 dataset to see the actual content of the recommended items (only IDs are open)
Evaluation is currently limited to 5 public domains and 1 hidden web domain, though the benchmark can be extended
The method relies on the existence of a semantic proxy in the public corpus; unique private content may not have a good public match
Benchmark code, leaderboards, and the ClueWeb-Reco dataset are publicly available. Raw user browsing history is NOT released to preserve privacy. Access to full ClueWeb22 content requires a separate license.
📊 Experiments & Results
Evaluation Setup
Leave-one-out sequential prediction: for a session of n items, predict the last item given the preceding n-1 items.
Statistical methodology: Not explicitly reported in the paper
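The leave-one-out setup can be illustrated with a minimal sketch. The function name and the item IDs are hypothetical; ORBIT's own data loaders may structure sessions differently.

```python
def leave_one_out(session):
    """Split a session into (history, target): all items but the last, and the last item."""
    if len(session) < 2:
        raise ValueError("a session needs at least two items to form a history and a target")
    return session[:-1], session[-1]


# Hypothetical session of four page IDs, in visit order.
history, target = leave_one_out(["p1", "p2", "p3", "p4"])
print(history, target)
```

The model sees `history` and is scored on whether it ranks `target` highly within the candidate pool.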
Experiment Figures
Analysis of the soft-matching quality, including retrieval score distributions and human relevance annotations.
Main Takeaways
Content-based models consistently outperform ID-based models on public benchmarks, likely due to better handling of temporal dynamics and item features.
Traditional recommendation models struggle on the ClueWeb-Reco hidden test set due to the massive item candidate pool (87M pages) and high sparsity.
LLM-QueryGen (a baseline framing recommendation as retrieval via generated queries) shows promise on ClueWeb-Reco, suggesting LLMs generalize better to open-web recommendation than fixed-inventory models.
Performance varies significantly across datasets, indicating that data sparsity and training volume are critical factors influencing model ranking.
📚 Prerequisite Knowledge
Prerequisites
Understanding of sequential recommendation tasks
Familiarity with dense retrieval and embedding models
Basic knowledge of evaluation metrics (Recall, NDCG)
Key Terms
soft matching: The process of replacing a private document with a semantically similar public document to preserve privacy while maintaining information utility
ClueWeb22: A large-scale public dataset of web pages used as the target corpus for mapping private browsing histories
DiskANN: A graph-based approximate nearest neighbor search algorithm used for efficient retrieval over large vector indices
MiniCPM-Embedding-Light: The specific dense embedding model used to encode webpage content for similarity matching
NDCG: Normalized Discounted Cumulative Gain—a ranking metric that gives higher scores when relevant items appear earlier in the recommendation list
PII: Personally Identifiable Information—sensitive data that can be used to identify a specific individual
Recall@K: The proportion of relevant items found in the top-K recommendations
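As a concrete reading of the two metrics defined above, the sketch below computes Recall@K and NDCG@K for the leave-one-out case, where each session has exactly one relevant item (the held-out target). Under that assumption the ideal DCG is 1, so NDCG reduces to the discounted gain at the target's rank. The function names and item IDs are illustrative, not taken from the benchmark code.

```python
import math


def recall_at_k(ranked, target, k):
    """1.0 if the single relevant item appears in the top-k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1) if found in the top-k, else 0.0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical ranked list from a model; the true next item is "c" at rank 3.
ranked = ["a", "b", "c", "d"]
print(recall_at_k(ranked, "c", 3), ndcg_at_k(ranked, "c", 3))
print(recall_at_k(ranked, "c", 2), ndcg_at_k(ranked, "c", 2))
```

With the target at rank 3, Recall@3 is 1.0 and NDCG@3 is 1 / log2(4) = 0.5; at K = 2 both are 0.0. Benchmark scores are then averages of these per-session values.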