Evaluation Setup
Top-k recommendation on implicit-feedback datasets, with each dataset's interactions split 8:1:1 into training, validation, and test sets.
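As a rough illustration of an 8:1:1 split, the sketch below shuffles a list of (user, item) pairs and cuts it by ratio. The summary does not say whether the split is global or per user, nor which seed is used, so the global random shuffle, the `split_interactions` helper, and the toy data are all assumptions made for illustration.

```python
import random


def split_interactions(interactions, ratios=(8, 1, 1), seed=0):
    """Shuffle (user, item) pairs and cut them into three parts by ratio.

    Assumption: a global random split; the paper's exact protocol
    (e.g. per-user splitting) is not stated in this summary.
    """
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n = len(shuffled)
    first_cut = n * ratios[0] // total
    second_cut = n * (ratios[0] + ratios[1]) // total
    return (
        shuffled[:first_cut],
        shuffled[first_cut:second_cut],
        shuffled[second_cut:],
    )


# Toy data: 100 hypothetical (user, item) pairs.
toy = [(u, i) for u in range(10) for i in range(10)]
train, valid, test = split_interactions(toy)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three parts reproducible across runs, which matters when several models are compared on the same held-out interactions.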
Benchmarks:
- Amazon Games (Product Recommendation)
- Amazon Toys (Product Recommendation)
- Amazon Books (Product Recommendation)
Metrics:
- Recall@10
- Recall@20
- NDCG@10
- NDCG@20
Statistical methodology:
- Significance tests conducted between L3AE and non-linear models (p-value < 0.05 implied by asterisks in tables).
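The Recall@k and NDCG@k metrics listed above can be sketched for a single user as follows. This is a minimal version assuming binary relevance and the common definitions (recall normalized by the number of held-out items, NDCG with a log2 discount). The paper's evaluation code may differ in details such as normalization, and the function names and toy ranking are hypothetical.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of a user's held-out items that appear in the top-k list."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: DCG of the top-k list over the ideal DCG."""
    if not relevant_items:
        return 0.0
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(k, len(relevant_items))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Toy example: one user, four ranked items, two held-out items.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d"}
print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=3))
```

Dataset-level scores are typically the mean of these per-user values over all test users.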
Key Results
L3AE consistently outperforms both non-linear and linear baselines across all datasets, with the largest gains on the sparsest datasets (Books).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Amazon Books | Recall@20 | 0.1676 | 0.2409 | +0.0733 |
| Amazon Books | NDCG@20 | 0.0841 | 0.1315 | +0.0474 |
| Amazon Games | Recall@20 | 0.2482 | 0.2737 | +0.0255 |
| Amazon Toys | Recall@20 | 0.2565 | 0.2641 | +0.0076 |
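The absolute differences in the Δ column can also be read as relative gains over the reported baseline, which makes the Books result easier to compare with Games and Toys. The short script below recomputes both from the table values. The `relative_gain` helper and the `results` and `gains` names are illustrative and not part of the paper.

```python
# Values copied from the Key Results table: (benchmark, metric, baseline, this paper).
results = [
    ("Amazon Books", "Recall@20", 0.1676, 0.2409),
    ("Amazon Books", "NDCG@20", 0.0841, 0.1315),
    ("Amazon Games", "Recall@20", 0.2482, 0.2737),
    ("Amazon Toys", "Recall@20", 0.2565, 0.2641),
]


def relative_gain(baseline, ours):
    """Improvement expressed as a fraction of the baseline score."""
    return (ours - baseline) / baseline


# Map each (benchmark, metric) pair to its relative gain.
gains = {
    (benchmark, metric): relative_gain(baseline, ours)
    for benchmark, metric, baseline, ours in results
}

for benchmark, metric, baseline, ours in results:
    delta = ours - baseline
    gain = gains[(benchmark, metric)]
    print(f"{benchmark:13s} {metric:10s} delta={delta:+.4f} relative={gain:+.1%}")
```

Relative to the baseline, the gains are roughly 44% and 56% on Books, against about 10% on Games and 3% on Toys.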
Main Takeaways
- Linear models (L3AE, EASE) generally outperform complex non-linear models (LightGCN, AlphaRec) on these sparse datasets, with the gap widening as sparsity increases.
- Semantic-guided regularization is more effective than naive fusion (Collective/Additive methods) because it respects the different spectral characteristics (rank properties) of semantic vs. interaction matrices.
- Smaller, domain-aligned LLMs (NV-Embed-v2) can outperform larger general-purpose LLMs (LLaMA-3.2-3B) in generating useful item representations for recommendation.