Sein Kim, Sangwu Park, Hongseok Kang, Wonjoong Kim, Jimin Seo, Yeonjun In, Kanghoon Yoon, Chanyoung Park
Korea Advanced Institute of Science and Technology
arXiv
(2026)
Recommendation, Agent, RAG
📝 Paper Summary
Automated Recommender System Design, LLM-driven Code Evolution
Self-EvolveRec automates recommender system design by coupling a user simulator that provides qualitative critiques with a diagnostic tool that verifies structural failures, guiding an LLM to iteratively evolve the code.
Core Problem
Existing automated design methods such as Neural Architecture Search (NAS) are limited to fixed search spaces, while recent LLM-driven evolution relies on scalar metrics (e.g., NDCG) that fail to explain the root causes of failure.
Why it matters:
Scalar metrics cannot distinguish between different failure modes (e.g., popularity bias vs. lack of diversity), leading to undirected trial-and-error optimization.
Manual refinement of the entire recommendation pipeline is inefficient and costly, while NAS fails to optimize non-architectural components like loss functions.
Without diagnostic feedback, LLM agents cannot generate targeted code fixes for complex structural or behavioral deficiencies.
Concrete Example: If a model's NDCG drops, the scalar metric alone does not reveal why. A user simulator might explain, 'I seek low-cost accessories, not expensive electronics,' pinpointing a semantic mismatch that a single number hides.
Key Novelty
Directional Feedback Loop with Co-Evolution
Integrates a User Simulator for qualitative natural language critiques (e.g., 'too much repetition') with a Model Diagnosis Tool for quantitative verification (e.g., measuring embedding collapse).
Implements a 'Co-Evolution' strategy where the diagnosis tool itself evolves alongside the recommender, generating new metrics to mathematically verify the simulator's subjective complaints.
Architecture
Overview of Self-EvolveRec framework, highlighting the Directional Feedback Generation (User Simulator + Model Diagnosis) and the Co-Evolution process.
Evaluation Highlights
Outperforms state-of-the-art NAS and LLM-driven baselines in recommendation performance and user satisfaction.
Validates that directional feedback leads to deterministic improvements in technical quality of evolved algorithmic logic.
Demonstrates the ability to resolve structural failures like embedding collapse through targeted diagnostic interventions.
Breakthrough Assessment
8/10
Significant step forward in agentic coding for RecSys. Moving from scalar-metric optimization to qualitative/diagnostic feedback loops is a strong methodological contribution.
⚙️ Technical Details
Problem Definition
Setting: Bi-level optimization in an open-ended program space S to find an optimal codebase B*
Inputs: Seed codebase B(0) (including recommender architecture, data loaders, optimization loop) and dataset D
Outputs: Optimal codebase B* that maximizes a recommendation metric M within T iterations
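One plausible way to write the setting above as a bi-level problem is sketched below. The symbols S, B*, B(0), D, M, and T come from the summary; the model parameters θ and the training loss L_B defined by a candidate codebase are assumptions introduced only for illustration, and the paper's own formulation may differ.

```latex
\begin{aligned}
B^{*} &= \arg\max_{B \in S} \; M\bigl(B, \theta^{*}(B); D\bigr)
  && \text{(outer level: search over codebases)} \\
\text{s.t.} \quad \theta^{*}(B) &= \arg\min_{\theta} \; \mathcal{L}_{B}(\theta; D)
  && \text{(inner level: train the model that } B \text{ defines)}
\end{aligned}
```

Under this reading, the search starts from the seed codebase B(0) and is limited to at most T outer iterations.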
User Simulator (Feedback Generation)
Evaluates recommendation lists using diverse user personas to provide natural language critiques
Model or implementation: LLM-based agent
Model Diagnosis Tool (DIAG) (Feedback Generation)
Probes the model's underlying mechanisms (e.g., embeddings, margins) to quantitatively substantiate simulator critiques
Model or implementation: Python code module (evolvable)
Planner & Retriever (Evolution)
Formulates research queries based on feedback and retrieves relevant academic literature
Model or implementation: LLM-based agent
Coder (Evolution)
Implements code modifications based on the development report
Model or implementation: LLM-based agent
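The modules listed above can be wired into an iterative loop roughly as follows. This is a minimal sketch, not the authors' implementation: every function name and signature (`evolve`, `simulate_users`, `diagnose`, `plan`, `code`, `evaluate`) is hypothetical, the LLM agents are represented by plain callables, and the "keep the candidate only if the metric improves" acceptance rule is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Feedback:
    critiques: List[str]            # natural-language critiques from the user simulator
    diagnostics: Dict[str, float]   # quantitative probe results from the diagnosis tool


@dataclass
class EvolutionState:
    codebase: str                   # stand-in for the recommender codebase B
    diag_tool: str                  # stand-in for the evolvable diagnosis code (DIAG)
    score: float
    history: List[float] = field(default_factory=list)


def evolve(
    seed_codebase: str,
    seed_diag: str,
    evaluate: Callable[[str], float],
    simulate_users: Callable[[str], List[str]],
    diagnose: Callable[[str, str], Dict[str, float]],
    plan: Callable[[Feedback], str],
    code: Callable[[str, str, str], Tuple[str, str]],
    iterations: int,
) -> EvolutionState:
    """Run `iterations` rounds of feedback-driven evolution (hypothetical sketch)."""
    state = EvolutionState(seed_codebase, seed_diag, evaluate(seed_codebase))
    state.history.append(state.score)
    for _ in range(iterations):
        # 1. Directional feedback: qualitative critiques plus quantitative probes.
        feedback = Feedback(
            critiques=simulate_users(state.codebase),
            diagnostics=diagnose(state.diag_tool, state.codebase),
        )
        # 2. Planner turns the feedback into a development report.
        report = plan(feedback)
        # 3. Coder rewrites the recommender and the diagnosis tool together.
        new_codebase, new_diag = code(state.codebase, state.diag_tool, report)
        # 4. Assumed acceptance rule: keep the candidate only if the metric improves.
        new_score = evaluate(new_codebase)
        if new_score > state.score:
            state = EvolutionState(new_codebase, new_diag, new_score, state.history)
        state.history.append(state.score)
    return state
```

Passing the agents in as callables keeps the loop independent of any particular LLM client, so each stage can be stubbed out when testing the control flow.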
Novel Architectural Elements
Diagnosis Tool - Model Co-Evolution: The evaluation logic (DIAG) itself is dynamically rewritten by the LLM to align with new model architectures and user feedback
Dual-feedback mechanism coupling qualitative user simulation with quantitative structural probing
Modeling
Base Model: LLM used for the agents. The specific model is not named in the text; a GPT-4-class model is a plausible guess given the complexity of the code generation, but this is unconfirmed.
Comparison to Prior Work
vs. AlphaEvolve/DeepEvolve: Self-EvolveRec uses directional feedback (Simulator + Diagnosis) instead of just scalar metrics
vs. NAS methods: Targets open-ended program space (loss functions, data processing) rather than fixed operator pools
vs. Agent4Rec/RecoWorld: Uses simulators for optimization feedback loops, not just evaluation or environment simulation
Limitations
Reliance on simulation fidelity: if the user simulator is biased, the optimization may drift.
Computational cost: iterative LLM calls and model training are expensive.
Initialization sensitivity: the quality of the seed codebase and initial diagnosis tool affects the trajectory.
Code is publicly available at https://github.com/Sein-Kim/self_evolverec. The paper details the user persona construction and the initial diagnostic probes (embedding collapse, ranking margin).
📊 Experiments & Results
Evaluation Setup
Evolutionary optimization of recommender system codebases
Benchmarks:
General Recommendation (Top-k Item Recommendation)
Metrics:
Hit Ratio (HR)
NDCG
User Satisfaction (Simulated)
Statistical methodology: Not explicitly reported in the paper
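For reference, the two ranking metrics listed above are commonly computed per user as in the sketch below, using the standard binary-relevance definitions. The summary does not state the cutoff k or the evaluation protocol (for example leave-one-out), so those details are assumptions here.

```python
import math
from typing import Sequence, Set


def hit_ratio_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the list divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Usage: the single relevant item "c" sits at position 3 of the ranked list.
ranked_items = ["a", "b", "c", "d"]
print(hit_ratio_at_k(ranked_items, {"c"}, k=3))  # 1.0
print(ndcg_at_k(ranked_items, {"c"}, k=3))       # 0.5
```

Both values are typically averaged over all test users. As the summary's Core Problem notes, such averages indicate how well the model ranks but not why it fails.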