Evaluation Setup
Comparison of Recommendation Performance (AUC/UAUC) between Clean LLMs and Dirty LLMs (fine-tuned on leakage data)
Benchmarks:
- Target Evaluation Datasets (Sequential Recommendation / Top-K Recommendation)
Metrics:
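- AUC (Area Under the ROC Curve), computed over all interactions pooled together
- UAUC (User-Averaged AUC): AUC computed separately for each user, then averaged across users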
- Statistical methodology: Not explicitly reported in the paper
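To make the difference between the two metrics concrete, here is a minimal sketch that computes AUC over all pooled interactions and UAUC as the mean of per-user AUCs. It assumes binary click labels and skips users whose interactions contain only one class when averaging; that is a common convention, not one confirmed by the paper. The function names `auc` and `uauc` and the toy data are hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def uauc(user_ids, labels, scores):
    """Mean of per-user AUCs, skipping users with an undefined AUC (assumption)."""
    by_user = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        by_user[u][0].append(y)
        by_user[u][1].append(s)
    per_user = [auc(ys, ss) for ys, ss in by_user.values()]
    valid = [a for a in per_user if a is not None]
    return sum(valid) / len(valid) if valid else None


# Toy data: the model ranks u1's items perfectly and u2's items in reverse.
users = ["u1", "u1", "u2", "u2"]
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.3, 0.8]

print(auc(labels, scores))          # 0.75 (pooled over all interactions)
print(uauc(users, labels, scores))  # 0.5  (mean of 1.0 and 0.0)
```

In the toy data the pooled AUC looks better than the per-user average because score scales differ between users. This is why the two metrics are usually reported side by side.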
Main Takeaways
- Triple Effect of Leakage: Leakage produces one of three distinct outcomes, namely Spurious Gains (from ID data), Stability, or Degradation (from OOD data).
- In-domain (ID) leakage acts as a 'trap,' creating substantial but spurious performance improvements that mask the model's actual inability to generalize.
- Out-of-domain (OOD) leakage acts as contamination, typically degrading recommendation accuracy by interfering with the learning of item characteristics.
- The 'Dirty LLM' simulation via LoRA isolates the impact of memorization, showing that even lightweight parameter updates can significantly distort benchmark results.