
Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents

H Wang, Y Wu, D Chang, L Wei, L Heldt
YouTube (Google)
arXiv, 2/2026
Recommendation Agent Reasoning RL

📝 Paper Summary

Agent evolution, Reinforcement Learning for Recommendation, Automated Machine Learning (AutoML)
A dual-agent system leveraging LLMs acts as an automated Machine Learning Engineer to autonomously propose, code, and validate novel recommendation model architectures and reward functions for YouTube.
Core Problem
Optimizing industrial recommendation systems requires navigating an intractable search space of architectures and non-differentiable reward functions, a task that exceeds traditional AutoML capabilities and currently relies on unscalable human intuition.
Why it matters:
  • Traditional AutoML (e.g., Bayesian optimization) is limited to numerical parameter tuning and cannot invent new logic or structural designs.
  • Human-driven iteration is slow and scales only linearly with engineering headcount, leaving vast regions of the solution space unexplored.
  • There is a critical alignment gap between differentiable training proxies and long-term user satisfaction, which requires complex semantic reward engineering.
Concrete Example: A standard AutoML system can tune a learning rate but cannot hypothesize that a user slice is under-served and write new reward logic to fix it. Specifically, it cannot invent a 'Gating Path' mechanism to replace embedding lookups or formulate a composite reward blending watch time and survey responses.
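To make the idea of a composite reward concrete, here is a minimal Python sketch of one way watch time and survey responses could be blended. It is not the reward the agents discovered: the `Interaction` type, the 1-to-5 survey scale, the `survey_weight` and `watch_cap_seconds` parameters, and the fallback for unanswered surveys are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One logged user interaction (hypothetical schema)."""
    watch_seconds: float
    survey_score: Optional[float] = None  # assumed 1-5 scale; None if no survey answered


def composite_reward(
    x: Interaction,
    survey_weight: float = 0.5,        # hypothetical blending weight
    watch_cap_seconds: float = 600.0,  # hypothetical cap to limit long-session dominance
) -> float:
    """Blend a capped, normalized watch-time term with a normalized survey term."""
    # Clamp watch time into [0, cap] and scale to [0, 1].
    watch_term = min(max(x.watch_seconds, 0.0), watch_cap_seconds) / watch_cap_seconds
    if x.survey_score is None:
        # Most interactions have no survey; fall back to watch time alone.
        return watch_term
    # Map the assumed 1-5 survey scale onto [0, 1].
    survey_term = (x.survey_score - 1.0) / 4.0
    return (1.0 - survey_weight) * watch_term + survey_weight * survey_term
```

The point of the sketch is that such a function contains branching and clamping logic, which is something an agent has to write as code rather than select as a numerical hyperparameter.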
Key Novelty
Hierarchical MLE Agent Framework (Offline/Online Split)
  • Decouples discovery into an 'Offline Agent' (Inner Loop) for high-throughput hypothesis generation using proxy metrics and an 'Online Agent' (Outer Loop) for low-frequency validation against delayed business metrics.
  • Uses specialized LLM personas (Optimizer, Architecture, Reward) that act as expert engineers: they read production code, reason about past experiments, and write executable code diffs rather than just selecting parameters.
  • Introduces a 'Think-Code-Verify' cycle where agents use tools like `compute_loss` and `run_sql_query` to validate semantic ideas before expensive production deployment.
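The offline/online split and the Think-Code-Verify cycle described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the `Experiment` and `Journal` types, the `propose` and `run_ab_test` callables, and the rule of promoting the candidate with the lowest proxy loss are hypothetical. The `compute_loss` parameter only stands in for the tool of the same name mentioned in the summary; its real signature is not given there.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Experiment:
    hypothesis: str                        # stands in for an agent-written code diff
    proxy_loss: float                      # offline proxy metric
    passed_offline: bool                   # did it beat the offline baseline?
    online_metric: Optional[float] = None  # filled in by the outer loop


@dataclass
class Journal:
    """Shared record of past experiments (hypothetical structure)."""
    entries: List[Experiment] = field(default_factory=list)


def offline_loop(
    propose: Callable[[Journal], str],
    compute_loss: Callable[[str], float],
    journal: Journal,
    baseline_loss: float,
    steps: int,
) -> None:
    """Inner loop: high-throughput propose-and-verify against a proxy metric."""
    for _ in range(steps):
        hypothesis = propose(journal)    # think + code: conditioned on past experiments
        loss = compute_loss(hypothesis)  # verify: cheap offline check
        # Failures are logged too, so later proposals can take them into account.
        journal.entries.append(Experiment(hypothesis, loss, loss < baseline_loss))


def online_step(
    journal: Journal,
    run_ab_test: Callable[[str], float],
) -> Optional[Experiment]:
    """Outer loop: low-frequency validation of the best untested offline candidate."""
    pending = [e for e in journal.entries
               if e.passed_offline and e.online_metric is None]
    if not pending:
        return None
    best = min(pending, key=lambda e: e.proxy_loss)
    best.online_metric = run_ab_test(best.hypothesis)  # slow, delayed business metric
    return best
```

In this sketch `offline_loop` would run many times for each call to `online_step`, which mirrors the high-throughput versus low-frequency split between the two agents.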
Architecture
Figure 1
The Self-Evolving System architecture, illustrating the dual-loop structure with Offline and Online agents sharing an Experiment Journal.
Evaluation Highlights
  • Agents successfully discovered novel architectural components (e.g., 'Gating Path' mechanisms) and multi-objective reward functions that aligned better with long-term satisfaction.
  • Demonstrated success through production launches at YouTube, confirming autonomous evolution can surpass human-engineered baselines.
  • Ablation studies quantify the relationship between model reasoning power (Gemini 2.5 Pro vs. lightweight variants) and discovery performance.
Breakthrough Assessment
9/10
Represents a significant leap from parameter tuning to structural code generation in a massive-scale industrial setting. Successfully automates the highly complex role of a research engineer.