ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests

📝 Paper Summary

Recommender Systems Benchmarking
Sequential Recommendation
Privacy-Preserving Data Construction

ORBIT standardizes recommender system evaluation with consistent public benchmarks and ClueWeb-Reco, a hidden test set constructed by soft-matching real user browsing histories to public webpages to preserve privacy.

Core Problem

Existing recommendation datasets rely on unrealistic proxies like reviews rather than actual browsing behavior, lack explicit user consent, and suffer from inconsistent evaluation splits that hinder reproducibility.

Why it matters:

Reviewing behavior covers only 1-2% of interactions and differs in kind from viewing behavior, so review-based benchmarks model real user interests poorly
Inconsistent data splits and metric definitions across studies make it impossible to fairly compare state-of-the-art models
Releasing real user browsing history for realistic evaluation poses severe privacy and legal risks regarding Personally Identifiable Information (PII)

Concrete Example: A user's browsing history might include sensitive local school applications or health searches. Releasing this raw sequence violates privacy (PII leakage), but synthetic data generated by LLMs fails to capture the complex, rapid topic shifts of real human surfing.

Key Novelty

Privacy-Preserving Soft-Matching for Realistic Evaluation Data

Collects real browsing history with consent, then replaces each private URL with the most semantically similar public webpage from the ClueWeb22 corpus using dense retrieval
Removes exact URL matches to ensure the dataset is fully synthetic while preserving the semantic trajectory and domain distribution of real user behavior

Architecture

The ClueWeb-Reco dataset construction pipeline, illustrating how private user history is transformed into a public dataset.

Evaluation Highlights

Constructed ClueWeb-Reco from 41,760 raw browsing records, resulting in 1,024 high-quality validation/test sessions after filtering
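Inter-annotator agreement on the semantic relevance of soft-matched pages was Cohen's kappa 0.372, described in the source as moderate (on the widely used Landis and Koch scale, 0.21 to 0.40 is labeled fair); this supports, but does not by itself confirm, preservation of user intent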
Maintained domain consistency: Top domains in the synthetic ClueWeb-Reco dataset (e.g., YouTube) closely mirror the rank distribution of the raw private data
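
The domain-consistency check above can be illustrated by comparing the top-domain rankings of the raw and soft-matched URL lists. This is a minimal sketch, not the paper's procedure: the function names (top_domains, rank_overlap), the overlap measure, and the toy URLs are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, k=3):
    """Return the k most frequent hostnames, most frequent first."""
    counts = Counter(urlparse(u).netloc.lower() for u in urls)
    # Sort by count descending, then hostname, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [host for host, _ in ranked[:k]]

def rank_overlap(raw_urls, matched_urls, k=3):
    """Fraction of the raw top-k domains that also appear in the matched top-k."""
    raw_top = top_domains(raw_urls, k)
    matched_top = set(top_domains(matched_urls, k))
    return sum(d in matched_top for d in raw_top) / len(raw_top)

# Hypothetical toy data: private URLs and their public proxies.
raw = [
    "https://www.youtube.com/watch?v=a",
    "https://www.youtube.com/watch?v=b",
    "https://en.wikipedia.org/wiki/X",
    "https://example.org/private",
]
matched = [
    "https://www.youtube.com/watch?v=c",
    "https://www.youtube.com/watch?v=d",
    "https://en.wikipedia.org/wiki/Y",
    "https://example.com/public",
]
overlap = rank_overlap(raw, matched, k=2)
```

An overlap near 1 means the proxies are drawn from the same dominant sites as the raw histories; a rank correlation would be a stricter alternative.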

Breakthrough Assessment

8/10

Addresses two pressing problems in recommendation research, reproducibility and realism, with a novel privacy-preserving method for releasing data grounded in real user behavior. The hidden test set paradigm is a significant step toward maturity for the field.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation / Next-Item Prediction

Inputs: User history sequence of items (webpages or products)

Outputs: The next item the user will interact with from the candidate pool

Pipeline Flow

Raw Data Collection (User Consent)
Quality Control (Filter Toxic/Scam)
Semantic Embedding (MiniCPM)
Dense Retrieval (DiskANN)
Soft-Matching Selection & Release

System Modules

Quality Control Filter

Remove scams, toxic content, and badly formatted data from raw submissions

Model or implementation: Rule-based and manual filters

Embedding Encoder (Soft Matching)

Encode webpage content into dense vectors for similarity comparison

Model or implementation: MiniCPM-Embedding-Light

Retriever (Soft Matching)

Find the most semantically similar public pages for each private URL

Model or implementation: DiskANN Index
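
The retrieval step can be sketched as a nearest-neighbor search over embedding vectors. The example below is illustrative only: it uses an exact linear scan in place of a DiskANN index, and the vectors are hand-written toy values rather than MiniCPM-Embedding-Light outputs. The names cosine, top_k, and the pub-* IDs are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Exact top-k by cosine similarity.

    corpus maps a public page ID to its embedding. A real system over
    tens of millions of pages would use an approximate index instead
    of this linear scan.
    """
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: -pair[1])  # best match first
    return scored[:k]

# Toy 2-d embeddings standing in for encoded webpage content.
corpus = {"pub-0": [1.0, 0.0], "pub-1": [0.8, 0.6], "pub-2": [0.0, 1.0]}
private_page_vec = [0.9, 0.1]
hits = top_k(private_page_vec, corpus, k=2)
```

The returned list of (page ID, score) pairs is the candidate set from which a proxy page is later selected.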

Mapper / Selector (Soft Matching)

Select final proxy page to ensure privacy

Model or implementation: Heuristic Selection

Novel Architectural Elements

Privacy-preserving soft-matching pipeline that deliberately excludes exact matches (Top-1 hits) to guarantee the dataset is fully synthetic yet semantically grounded in real behavior
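
The exclusion of exact matches can be sketched as a selection rule over ranked retrieval candidates. This is a hedged illustration, not the authors' heuristic: how the paper defines an exact match is not given here, so the sketch assumes URL equality after a light, hypothetical normalization (normalize), and soft_match is likewise an invented name.

```python
from urllib.parse import urlparse

def normalize(url):
    """Hypothetical normalization: drop scheme, leading 'www.', trailing slash.

    Lowercasing the whole URL is a simplification for this sketch.
    """
    parsed = urlparse(url.lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")

def soft_match(private_url, candidates):
    """Pick the best-scoring public page that is not the private page itself.

    candidates is a list of (public_url, score) sorted best first.
    Returns (public_url, score), or None if every candidate is an exact match.
    """
    private_key = normalize(private_url)
    for public_url, score in candidates:
        if normalize(public_url) != private_key:
            return public_url, score
    return None

# Toy candidates: the top hit is the visited page itself and is skipped.
candidates = [
    ("https://www.example.com/a/", 0.99),
    ("https://example.org/b", 0.91),
]
proxy = soft_match("http://example.com/a", candidates)
```

Skipping the identical page means the released sequence never contains a URL the user actually visited, at the cost of a slightly lower similarity score.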

Comparison to Prior Work

vs. MSMARCO: MSMARCO's k-anonymity fails for long sequential histories, which are almost always unique to one user; ORBIT instead soft-matches each private page to a semantically similar public proxy
vs. TREC CAsT: ORBIT automates the mapping process via dense retrieval rather than manual crafting, scaling to browsing histories
vs. RecBole: ORBIT enforces a fixed, server-side hidden test set evaluation to prevent data split manipulation or inconsistency
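
The k-anonymity point can be made concrete with a small count of how many users share the same history prefix. The sketch below illustrates the general argument only; it is not the procedure used by MSMARCO or ORBIT, and the function name and toy histories are hypothetical.

```python
from collections import Counter

def k_anonymous_fraction(histories, k, prefix_len):
    """Fraction of users whose first prefix_len items are shared by at least k users."""
    keys = [tuple(h[:prefix_len]) for h in histories]
    counts = Counter(keys)
    return sum(counts[key] >= k for key in keys) / len(keys)

# Toy histories: short prefixes are shared, full sequences are unique.
histories = [
    ["news", "mail", "bank"],
    ["news", "mail", "school"],
    ["news", "video", "shop"],
    ["news", "mail", "clinic"],
]
share_len1 = k_anonymous_fraction(histories, k=2, prefix_len=1)
share_len2 = k_anonymous_fraction(histories, k=2, prefix_len=2)
share_len3 = k_anonymous_fraction(histories, k=2, prefix_len=3)
```

As the prefix grows, fewer users remain in a group of size k, so a k-anonymity filter would discard most long histories.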

Limitations

Soft-matching is imperfect; the limited human agreement (Cohen's kappa 0.372) suggests some nuance of user intent is lost in the mapping
Requires access to the restricted ClueWeb22 dataset to see the actual content of the recommended items (only IDs are open)
Evaluation is currently limited to 5 public domains and 1 hidden web domain, though expandable
The method relies on the existence of a semantic proxy in the public corpus; unique private content may not have a good public match

Reproducibility

Code: https://www.open-reco-bench.ai

Benchmark code, leaderboards, and the ClueWeb-Reco dataset are publicly available. Raw user browsing history is NOT released to preserve privacy. Access to full ClueWeb22 content requires a separate license.

📊 Experiments & Results

Evaluation Setup

Leave-one-out sequential prediction: predict the last item given the previous n-1 items.
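
The leave-one-out split can be sketched in a few lines. This is a minimal illustration under the description above; the function name is hypothetical, and any additional validation split used by the benchmark is not covered here.

```python
def leave_one_out(sequence):
    """Split one user's ordered history into (input_items, target_item)."""
    if len(sequence) < 2:
        raise ValueError("need at least two interactions to hold one out")
    return sequence[:-1], sequence[-1]

# Toy history of n = 4 items: the model sees 3 and must predict the 4th.
history = ["page-a", "page-b", "page-c", "page-d"]
inputs, target = leave_one_out(history)
```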

Benchmarks:

ClueWeb-Reco (Webpage Recommendation, Hidden Test) [New]
MovieLens-1M (Movie Recommendation)
Amazon Reviews (Beauty, Toys, Sports, Books) (Product Recommendation)

Metrics:

Recall@K (K=1, 10, 50, 100)
NDCG@K (K=1, 10, 50, 100)
Statistical methodology: Not explicitly reported in the paper
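
With a single held-out target per sequence, Recall@K and NDCG@K take a simple form: Recall@K is 1 if the target is in the top K, and NDCG@K is 1 / log2(rank + 1) for a target at 1-indexed rank within the top K (the ideal DCG is 1). The sketch below assumes this single-target setting; the function names are illustrative and not taken from the benchmark code.

```python
import math

def recall_at_k(ranked, target, k):
    """1.0 if the single target appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1), or 0 if outside top k."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-indexed position of the target
    return 1.0 / math.log2(rank + 1)

# Toy ranking: the target sits at rank 3.
ranked = ["x", "y", "t", "z"]
r1 = recall_at_k(ranked, "t", 1)
r10 = recall_at_k(ranked, "t", 10)
n10 = ndcg_at_k(ranked, "t", 10)
```

Benchmark scores are these per-sequence values averaged over all test sequences.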

Experiment Figures

Analysis of the soft-matching quality, including retrieval score distributions and human relevance annotations.
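
For reference, Cohen's kappa on human relevance annotations compares observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below uses made-up labels from two hypothetical annotators; it does not reproduce the paper's annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement is exactly 1,
    for example when both annotators always use the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels: 1 = proxy page judged relevant, 0 = not relevant.
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here observed agreement is 0.75 and chance agreement is 0.5, giving kappa = 0.5.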

Main Takeaways

Content-based models consistently outperform ID-based models on public benchmarks, likely due to better handling of temporal dynamics and item features.
Traditional recommendation models struggle on the ClueWeb-Reco hidden test due to the massive item candidate pool (87M pages) and high sparsity.
LLM-QueryGen (a baseline framing recommendation as retrieval via generated queries) shows promise on ClueWeb-Reco, suggesting LLMs better generalize to open-web recommendation than fixed-inventory models.
Performance varies significantly across datasets, indicating that data sparsity and training volume are critical factors influencing model ranking.

📚 Prerequisite Knowledge

Prerequisites

Understanding of sequential recommendation tasks
Familiarity with dense retrieval and embedding models
Basic knowledge of evaluation metrics (Recall, NDCG)

Key Terms

soft matching: The process of replacing a private document with a semantically similar public document to preserve privacy while maintaining information utility

ClueWeb22: A large-scale public dataset of web pages used as the target corpus for mapping private browsing histories

DiskANN: A graph-based approximate nearest neighbor search algorithm used for efficient retrieval over large vector indices

MiniCPM-Embedding-Light: The specific dense embedding model used to encode webpage content for similarity matching

NDCG: Normalized Discounted Cumulative Gain—a ranking metric that gives higher scores when relevant items appear earlier in the recommendation list

PII: Personally Identifiable Information—sensitive data that can be used to identify a specific individual

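Recall@K: The proportion of relevant items found in the top-K recommendations; with a single held-out target, as in leave-one-out evaluation, it is 1 if the target appears in the top K and 0 otherwise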