Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents

📝 Paper Summary

Agent evolution, Reinforcement Learning for Recommendation, Automated Machine Learning (AutoML)

A dual-agent system leveraging LLMs acts as an automated Machine Learning Engineer to autonomously propose, code, and validate novel recommendation model architectures and reward functions for YouTube.

Core Problem

Optimizing industrial recommendation systems requires navigating an intractable search space of architectures and non-differentiable reward functions, a task that exceeds traditional AutoML capabilities and currently relies on unscalable human intuition.

Why it matters:

Traditional AutoML (e.g., Bayesian optimization) is limited to numerical parameter tuning and cannot invent new logic or structural designs.
Human-driven iteration is slow and scales only linearly with engineering headcount, leaving vast regions of the solution space unexplored.
There is a critical alignment gap between differentiable training proxies and long-term user satisfaction, which requires complex semantic reward engineering.

Concrete Example: A standard AutoML system can tune a learning rate but cannot hypothesize that a user slice is under-served and write new reward logic to fix it. Specifically, it cannot invent a 'Gating Path' mechanism to replace embedding lookups or formulate a composite reward blending watch time and survey responses.
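To make the idea of a composite reward concrete, here is a minimal sketch that blends watch time with a survey response. The field names, the 1-5 survey scale, the 600-second cap, and the 0.5 weight are all illustrative assumptions; the summary does not describe the reward logic the agents actually wrote.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    watch_seconds: float            # hypothetical field: time spent on the video
    survey_score: Optional[float]   # hypothetical field: 1-5 rating, None if no survey was answered


def composite_reward(x: Interaction, survey_weight: float = 0.5,
                     watch_cap_seconds: float = 600.0) -> float:
    """Blend capped, normalized watch time with a survey response when one exists."""
    # Cap watch time so very long sessions do not dominate the label.
    watch_term = min(x.watch_seconds, watch_cap_seconds) / watch_cap_seconds
    if x.survey_score is None:
        # No survey answer: fall back to the watch-time term alone.
        return watch_term
    survey_term = (x.survey_score - 1.0) / 4.0  # map the assumed 1..5 scale onto 0..1
    return (1.0 - survey_weight) * watch_term + survey_weight * survey_term
```

A reward of this shape is the kind of logic change that numerical tuning alone cannot produce: the weight is tunable, but the decision to include the survey term at all is a structural edit.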

Key Novelty

Hierarchical MLE Agent Framework (Offline/Online Split)

Decouples discovery into an 'Offline Agent' (Inner Loop) for high-throughput hypothesis generation using proxy metrics and an 'Online Agent' (Outer Loop) for low-frequency validation against delayed business metrics.
Uses specialized LLM personas (Optimizer, Architecture, Reward) that act as expert engineers: they read production code, reason about past experiments, and write executable code diffs rather than just selecting parameters.
Introduces a 'Think-Code-Verify' cycle where agents use tools like `compute_loss` and `run_sql_query` to validate semantic ideas before expensive production deployment.
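The 'Think-Code-Verify' cycle can be sketched as a bounded retry loop. This is an illustration under stated assumptions, not the paper's implementation: `propose` stands in for the LLM call, `compute_loss` borrows the tool name from the summary but its signature here is invented, and the acceptance rule (beat a baseline loss) is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Proposal:
    hypothesis: str   # "Think": the agent's natural-language rationale
    code_diff: str    # "Code": the change the agent wrote


def think_code_verify(
    propose: Callable[[List[str]], Proposal],  # hypothetical stand-in for the LLM persona
    compute_loss: Callable[[str], float],      # hypothetical signature for the loss tool
    baseline_loss: float,
    max_rounds: int = 3,
) -> Optional[Proposal]:
    """Return the first proposal whose offline loss beats the baseline, else None."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        proposal = propose(feedback)               # Think + Code
        loss = compute_loss(proposal.code_diff)    # Verify
        if loss < baseline_loss:
            return proposal
        # Feed the failure back so the next round can reason about it.
        feedback.append(
            f"{proposal.hypothesis!r} reached loss {loss:.4f}, "
            f"not better than baseline {baseline_loss:.4f}"
        )
    return None
```

Passing the accumulated feedback into each new proposal is what distinguishes this from blind resampling: a failed verification becomes context for the next hypothesis.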

Architecture

The Self-Evolving System architecture, illustrating the dual-loop structure with Offline and Online agents sharing an Experiment Journal.

Evaluation Highlights

Agents successfully discovered novel architectural components (e.g., 'Gating Path' mechanisms) and multi-objective reward functions that aligned better with long-term satisfaction.
Demonstrated success through production launches at YouTube, confirming autonomous evolution can surpass human-engineered baselines.
Ablation studies quantify the relationship between model reasoning power (Gemini 2.5 Pro vs. lightweight variants) and discovery performance.

Breakthrough Assessment

9/10

Represents a significant leap from parameter tuning to structural code generation in a massive-scale industrial setting. Successfully automates the highly complex role of a research engineer.

⚙️ Technical Details

Problem Definition

Setting: Bi-level optimization. The lower level trains a ranking model to minimize a proxy loss; the upper level optimizes the system configuration (optimizer, architecture, reward) to maximize online North Star metrics.
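One way to write this bi-level setting down is shown here. The notation is an assumption introduced for illustration and is not taken from the paper: c is a system configuration (optimizer, architecture, reward), \mathcal{C} is the space of configurations the agents search, \theta are the ranking model's parameters, \mathcal{L}_{\text{proxy}} is the offline proxy loss, and M is the online North Star metric.

```latex
\begin{aligned}
\max_{c \in \mathcal{C}} \quad & M\bigl(\theta^{*}(c),\, c\bigr) \\
\text{s.t.} \quad & \theta^{*}(c) \in \arg\min_{\theta} \; \mathcal{L}_{\text{proxy}}(\theta;\, c)
\end{aligned}
```

The upper-level objective M is only observable through live experiments and is not differentiable in c, which is why the search over c is delegated to agents rather than to gradient-based methods.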

Inputs: Historical interaction logs, current model codebase, experiment journal

Outputs: Deployable model configuration diffs (architecture code, reward logic, optimizer settings)

Pipeline Flow

Offline Agent (Inner Loop): Hypothesis Generation → Code Implementation → Offline Validation
Online Agent (Outer Loop): Proposal Queue → Safety Validation → Live Experimentation → Feedback Loop
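The hand-off between the two loops can be sketched as a filter feeding a queue. The selection rule below (keep candidates that beat the baseline proxy loss, best first, capped at `top_k`) and every name in it are assumptions for illustration; the summary does not state the actual promotion criterion.

```python
from collections import deque
from typing import Deque, Dict, List


def offline_filter(proxy_losses: Dict[str, float], baseline_loss: float,
                   top_k: int) -> List[str]:
    """Keep candidates whose offline proxy loss beats the baseline, best first."""
    improved = [(loss, name) for name, loss in proxy_losses.items()
                if loss < baseline_loss]
    return [name for _, name in sorted(improved)[:top_k]]


# Inner loop output (hypothetical candidate names and losses).
candidates = {"gating_v1": 0.48, "wider_mlp": 0.52, "new_reward": 0.45, "adam_tweak": 0.49}

# Survivors enter the Online Agent's proposal queue in priority order.
proposal_queue: Deque[str] = deque(offline_filter(candidates, baseline_loss=0.50, top_k=2))
```

Capping the queue at `top_k` reflects the cost asymmetry between the loops: offline checks are cheap and frequent, while each live experiment consumes traffic and days of waiting.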

System Modules

Offline Agent (Hypothesis Generation, Inner Loop)

Generates and filters high-potential model changes using offline proxies

Model or implementation: Gemini 2.5 Pro

Persona: Optimizer (Hypothesis Generation, Inner Loop)

Iteratively proposes changes to optimizer classes and hyperparameters

Model or implementation: Gemini 2.5 Pro

Persona: Architecture (Hypothesis Generation, Inner Loop)

Proposes structural mutations to neural topology (e.g., new layers, activation functions)

Model or implementation: Gemini 2.5 Pro

Persona: Reward (Hypothesis Generation, Inner Loop)

Edits logic defining ground-truth training labels to better align with user satisfaction

Model or implementation: Gemini 2.5 Pro

Online Agent

Orchestrates live A/B testing of candidates and manages safety

Model or implementation: System Logic (State Machine)
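Since the Online Agent is described as a state machine rather than an LLM, a minimal sketch of such a machine may help. The stage names mirror the pipeline flow in this summary (queue, safety validation, live experiment), but the exact states, the terminal outcomes, and the boolean pass/fail signal are assumptions.

```python
from enum import Enum, auto


class Stage(Enum):
    QUEUED = auto()
    SAFETY_CHECK = auto()
    LIVE_EXPERIMENT = auto()
    LAUNCHED = auto()    # terminal: candidate shipped
    REJECTED = auto()    # terminal: candidate dropped


def advance(current: Stage, passed: bool = True) -> Stage:
    """Move a candidate one step through the (assumed) online lifecycle."""
    if current is Stage.QUEUED:
        return Stage.SAFETY_CHECK          # dequeuing never fails
    if current is Stage.SAFETY_CHECK:
        return Stage.LIVE_EXPERIMENT if passed else Stage.REJECTED
    if current is Stage.LIVE_EXPERIMENT:
        return Stage.LAUNCHED if passed else Stage.REJECTED
    raise ValueError(f"{current.name} is a terminal stage")
```

Keeping this layer deterministic means the component that touches live traffic behaves predictably, while the open-ended reasoning stays in the offline personas.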

Novel Architectural Elements

Dual-loop agentic architecture: High-frequency Offline Agent (Inner Loop) for generation vs. Low-frequency Online Agent (Outer Loop) for validation
Specialized Agent Personas (Optimizer, Architecture, Reward) with distinct tooling (loss calculation vs. SQL signal discovery)
Self-contained Experiment Journal acting as shared memory for continuous evolution
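A rough sketch of how an Experiment Journal could serve as shared memory between the loops follows. The entry schema and method names are hypothetical; the summary only says that the journal is shared and that agents reason about past experiments.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class JournalEntry:
    persona: str                          # e.g. "optimizer", "architecture", "reward"
    hypothesis: str                       # what the agent tried and why
    offline_loss_delta: float             # inner-loop result (negative means improvement)
    online_outcome: Optional[str] = None  # filled in later by the outer loop, if tested


@dataclass
class ExperimentJournal:
    entries: List[JournalEntry] = field(default_factory=list)

    def record(self, entry: JournalEntry) -> None:
        self.entries.append(entry)

    def context_for(self, persona: str, limit: int = 5) -> str:
        """Serialize the most recent entries for one persona as prompt context."""
        recent = [e for e in self.entries if e.persona == persona][-limit:]
        return json.dumps([asdict(e) for e in recent], indent=2)
```

Because `online_outcome` starts empty and is written later, the same record carries both the fast offline signal and the delayed online result, which is what lets new hypotheses learn from earlier launches.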

Modeling

Base Model: Gemini 2.5 Pro (for the agents)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Google Vizier/Auto-Sklearn: Can generate novel code structures and logic, not just tune parameters in fixed ranges.
vs. DARTS: Open-ended design space (writing code) rather than selecting from a menu of operations.
vs. Eureka: Handles delayed, sparse, non-differentiable rewards in a production environment without a clear simulation oracle.
vs. OPRO: Hierarchical agent system with specific tooling (SQL, training) for industrial validity, rather than just text-based refinement.
vs. The AI Scientist: Applied to live industrial production with safety guardrails and A/B testing, not just offline academic benchmarks [not cited in paper].

Limitations

Reliance on delayed feedback loops makes the outer optimization cycle slow (days/weeks).
High dependency on the reasoning quality of the underlying LLM (Gemini 2.5 Pro).
Specifics of the discovered architectures and reward functions are proprietary and not detailed.

Reproducibility

No replication artifacts mentioned in the paper. The system is deployed internally at YouTube using proprietary logs and infrastructure.

📊 Experiments & Results

Evaluation Setup

Live production environment at YouTube

Benchmarks:

Live YouTube Traffic (Video Recommendation)

Metrics:

North Star Metrics (e.g., user engagement, satisfaction)
Proxy Loss (Offline)
Development Velocity
Statistical methodology: Standard A/B testing protocols (statistical significance checks mentioned but specific p-values not detailed)
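For readers unfamiliar with what a significance check in an A/B test involves, here is a generic two-sample z-test on summary statistics. It is a textbook procedure offered as an assumption about what "standard protocols" might include; the paper's actual methodology is not detailed.

```python
import math


def two_sided_p_value(mean_a: float, var_a: float, n_a: int,
                      mean_b: float, var_b: float, n_b: int) -> float:
    """Two-sided p-value for a difference in means (large-sample normal approximation)."""
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_b - mean_a) / standard_error
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))
```

A small p-value indicates that the observed gap between control (a) and treatment (b) would be unlikely if the two arms truly had the same mean.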

Main Takeaways

Autonomous LLM-driven evolution effectively accelerates experimental velocity compared to human-bottlenecked workflows.
The system successfully moves beyond parameter tuning to discover semantic structural changes (architectures) and logic changes (rewards).
Reasoning power matters: ablation studies suggest Gemini 2.5 Pro significantly outperforms lightweight variants in discovery quality.
The dual-loop design effectively filters candidates, ensuring expensive online testing is reserved for high-promise hypotheses.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning for Recommendation Systems
Large Language Models (reasoning and code generation)
Industrial experimentation (A/B testing)
Automated Machine Learning (AutoML)

Key Terms

MLE: Machine Learning Engineer—the human role this system aims to automate

North Star Metrics: High-level business goals (e.g., long-term user retention, total watch time) that are often delayed and non-differentiable

Proxy Reward: A differentiable function used during training to approximate the non-differentiable North Star metrics

Inner Loop: The offline phase where agents generate and filter candidates using cheap proxy metrics (e.g., offline loss, SQL analysis) before live testing

Outer Loop: The online phase where surviving candidates are deployed to live traffic to measure actual North Star metrics

DCN: Deep Cross Network—a neural architecture designed to learn explicit feature interactions

RL: Reinforcement Learning—training agents to take actions that maximize cumulative reward

AutoML: Automated Machine Learning—tools that automate parts of the ML pipeline, typically limited to hyperparameter tuning

Gemini 2.5 Pro: A specific multimodal Large Language Model from Google with advanced reasoning capabilities