Self-EvolveRec: Self-Evolving Recommender Systems with LLM-based Directional Feedback

📝 Paper Summary

Automated Recommender System Design
LLM-driven Code Evolution

Self-EvolveRec automates recommender system design by coupling a user simulator that provides qualitative critiques with a diagnostic tool that verifies structural failures, guiding an LLM to iteratively evolve the code.

Core Problem

Existing automated design methods (NAS) are limited to fixed search spaces, while recent LLM-driven evolution relies on scalar metrics (e.g., NDCG) that fail to explain root causes of failure.

Why it matters:

Scalar metrics cannot distinguish between different failure modes (e.g., popularity bias vs. lack of diversity), leading to undirected trial-and-error optimization.
Manual refinement of the entire recommendation pipeline is inefficient and costly, while NAS fails to optimize non-architectural components like loss functions.
Without diagnostic feedback, LLM agents cannot generate targeted code fixes for complex structural or behavioral deficiencies.

Concrete Example: If a model's NDCG drops, scalar metrics don't reveal why. A user simulator might explain, 'I seek low-cost accessories, not expensive electronics,' pinpointing a semantic mismatch that a single number hides.

Key Novelty

Directional Feedback Loop with Co-Evolution

Integrates a User Simulator for qualitative natural language critiques (e.g., 'too much repetition') with a Model Diagnosis Tool for quantitative verification (e.g., measuring embedding collapse).
Implements a 'Co-Evolution' strategy where the diagnosis tool itself evolves alongside the recommender, generating new metrics to mathematically verify the simulator's subjective complaints.

Architecture

Overview of Self-EvolveRec framework, highlighting the Directional Feedback Generation (User Simulator + Model Diagnosis) and the Co-Evolution process.

Evaluation Highlights

Outperforms state-of-the-art NAS and LLM-driven baselines in recommendation performance and user satisfaction.
Validates that directional feedback leads to deterministic improvements in the technical quality of the evolved algorithmic logic.
Demonstrates the ability to resolve structural failures like embedding collapse through targeted diagnostic interventions.

Breakthrough Assessment

8/10

Significant step forward in agentic coding for RecSys. Moving from scalar-metric optimization to qualitative/diagnostic feedback loops is a strong methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Bi-level optimization in an open-ended program space S to find an optimal codebase B*

Inputs: Seed codebase B(0) (including recommender architecture, data loaders, optimization loop) and dataset D

Outputs: Optimal codebase B* that maximizes a recommendation metric M within T iterations
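
One way to write this objective down is sketched below. The summary only states that the problem is bi-level, so the inner training loss \mathcal{L}_B, the trained parameters \theta, and the train/validation split of D are assumptions introduced for illustration, not notation confirmed by the paper.

```latex
B^{*} \;=\; \arg\max_{B \in \mathcal{S}} \; M\!\left(B,\ \theta^{*}(B);\ D_{\mathrm{val}}\right)
\qquad \text{s.t.} \qquad
\theta^{*}(B) \;=\; \arg\min_{\theta} \; \mathcal{L}_{B}\!\left(\theta;\ D_{\mathrm{train}}\right)
```

Under this reading, the outer search starts from B(0) and may propose at most T candidate codebases, while the inner problem is whatever training loop the candidate codebase itself defines.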

Pipeline Flow

User Simulator (Qualitative Critique)
Model Diagnosis Tool (Quantitative Verification)
Evolutionary Archive & Retrieval (Context Retrieval)
Code Evolution (Implementation)
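
The four stages above can be read as one loop. The following is a minimal sketch of that loop, assuming each stage is an ordinary callable; the function names (`evolve`, `evaluate`, `simulate`, `diagnose`, `plan`, `write_code`), the greedy best-parent selection, and the toy integer "codebase" are all hypothetical stand-ins, not the paper's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Entry = Tuple[float, Any, Dict[str, Any]]  # (score, codebase, feedback)


def evolve(seed: Any,
           evaluate: Callable, simulate: Callable, diagnose: Callable,
           plan: Callable, write_code: Callable,
           iterations: int) -> Entry:
    """Run a feedback-driven evolution loop and return the best archive entry."""
    archive: List[Entry] = []

    def assess(codebase: Any) -> None:
        # Train/score the candidate, then collect both feedback channels.
        score, recs = evaluate(codebase)
        feedback = {"critiques": simulate(recs),         # qualitative
                    "diagnostics": diagnose(codebase)}   # quantitative
        archive.append((score, codebase, feedback))

    assess(seed)
    for _ in range(iterations):
        # Assumption: always evolve the best-scoring entry seen so far.
        _, parent, feedback = max(archive, key=lambda e: e[0])
        report = plan(feedback, archive)       # planning with archive context
        assess(write_code(parent, report))     # code evolution + re-evaluation
    return max(archive, key=lambda e: e[0])


# Toy stand-ins: the "codebase" is just an integer diversity level.
TARGET = 5

def toy_evaluate(level):
    return -abs(TARGET - level), {"distinct_categories": level}

def toy_simulate(recs):
    if recs["distinct_categories"] < TARGET:
        return ["too much repetition"]
    if recs["distinct_categories"] > TARGET:
        return ["too scattered"]
    return []

def toy_diagnose(level):
    return {"level": level}

def toy_plan(feedback, archive):
    if "too much repetition" in feedback["critiques"]:
        return 1
    if "too scattered" in feedback["critiques"]:
        return -1
    return 0

def toy_write_code(parent, report):
    return parent + report

best_score, best_codebase, best_feedback = evolve(
    0, toy_evaluate, toy_simulate, toy_diagnose, toy_plan, toy_write_code,
    iterations=7)
```

In the toy run the critique text, not the score, decides the direction of each change, which is the point of directional feedback: a scalar score of -3 alone would not say whether to raise or lower the diversity level.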

System Modules

User Simulator (SIM) (Feedback Generation)

Evaluates recommendation lists using diverse user personas to provide natural language critiques

Model or implementation: LLM-based agent

Model Diagnosis Tool (DIAG) (Feedback Generation)

Probes the model's underlying mechanisms (e.g., embeddings, margins) to quantitatively substantiate simulator critiques

Model or implementation: Python code module (evolvable)
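
To make the idea of a diagnostic probe concrete, here is a minimal sketch of two probes in plain Python. Mean pairwise cosine similarity is used as one common proxy for embedding collapse, and the ranking margin is taken as the mean gap between positive and negative item scores; both definitions and the function names are assumptions for illustration, since the summary does not specify how the paper's probes are computed.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings: List[Sequence[float]]) -> float:
    """Average cosine similarity over all item pairs.

    Values near 1.0 suggest the embeddings point in nearly the same
    direction, which is one symptom of collapse.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_ranking_margin(pos_scores: Sequence[float],
                        neg_scores: Sequence[float]) -> float:
    """Average of (positive score - negative score) over paired samples."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("pos_scores and neg_scores must have equal length")
    if not pos_scores:
        return 0.0
    return sum(p - n for p, n in zip(pos_scores, neg_scores)) / len(pos_scores)
```

The pairwise loop is quadratic in the number of items, so on a real catalog one would likely sample item pairs or use a vectorized library instead.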

Planner & Retriever (Evolution)

Formulates research queries based on feedback and retrieves relevant academic literature

Model or implementation: LLM-based agent

Coder (Evolution)

Implements code modifications based on the development report

Model or implementation: LLM-based agent

Novel Architectural Elements

Diagnosis Tool - Model Co-Evolution: The evaluation logic (DIAG) itself is dynamically rewritten by the LLM to align with new model architectures and user feedback
Dual-feedback mechanism coupling qualitative user simulation with quantitative structural probing
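
The co-evolution idea can be pictured as a diagnosis tool whose set of probes grows over time. The sketch below assumes a simple registry design: `DiagnosisTool`, `register`, and `cross_user_repetition` are hypothetical names, and the new probe is hand-written here, whereas in the described system the LLM would generate it in response to a simulator complaint such as 'too much repetition'.

```python
from typing import Callable, Dict, List

Probe = Callable[[List[List[str]]], float]


class DiagnosisTool:
    """Holds named probes that each map recommendation lists to a number."""

    def __init__(self) -> None:
        self.probes: Dict[str, Probe] = {}

    def register(self, name: str, probe: Probe) -> None:
        # Adding (or overwriting) a probe is how the tool "evolves" here.
        self.probes[name] = probe

    def run(self, rec_lists: List[List[str]]) -> Dict[str, float]:
        return {name: probe(rec_lists) for name, probe in self.probes.items()}


def cross_user_repetition(rec_lists: List[List[str]]) -> float:
    """1 - (distinct items / total slots) across all users' lists.

    0.0 means every slot holds a different item; values near 1.0 mean
    the same few items are shown to everyone.
    """
    slots = [item for recs in rec_lists for item in recs]
    if not slots:
        return 0.0
    return 1.0 - len(set(slots)) / len(slots)


diag = DiagnosisTool()
# A critique like 'too much repetition' motivates a matching numeric probe.
diag.register("cross_user_repetition", cross_user_repetition)
report = diag.run([["a", "b"], ["a", "b"], ["a", "c"]])
```

Keeping probes behind a uniform signature means a newly generated metric can be added without touching the rest of the evaluation code.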

Modeling

Base Model: LLM used for the agents (the specific model is not named in the text)

Comparison to Prior Work

vs. AlphaEvolve/DeepEvolve: Self-EvolveRec uses directional feedback (Simulator + Diagnosis) instead of just scalar metrics
vs. NAS methods: Targets open-ended program space (loss functions, data processing) rather than fixed operator pools
vs. Agent4Rec/RecoWorld: Uses simulators for optimization feedback loops, not just evaluation or environment simulation

Limitations

Reliance on simulation fidelity: if the user simulator is biased, the optimization may drift.
Computational cost: iterative LLM calls and model training are expensive.
Initialization sensitivity: the quality of the seed codebase and initial diagnosis tool affects the trajectory.

Reproducibility

Code: https://github.com/Sein-Kim/self_evolverec

The paper details the user persona construction and the initial diagnostic probes (embedding collapse, ranking margin).

📊 Experiments & Results

Evaluation Setup

Evolutionary optimization of recommender system codebases

Benchmarks:

General Recommendation (Top-k Item Recommendation)

Metrics:

Hit Ratio (HR)
NDCG
User Satisfaction (Simulated)

Statistical methodology: Not explicitly reported in the paper
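
For reference, the two ranking metrics listed above can be computed as in the following sketch. It assumes a leave-one-out protocol with a single held-out target item per user, which the summary does not confirm; under that assumption the ideal DCG is 1, so NDCG@k reduces to 1 / log2(rank + 2) for a 0-based rank. The function names are illustrative.

```python
import math
from typing import List, Sequence


def hit_ratio_at_k(ranked_lists: List[Sequence[str]],
                   targets: Sequence[str], k: int) -> float:
    """Fraction of users whose target item appears in their top-k list."""
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


def ndcg_at_k(ranked_lists: List[Sequence[str]],
              targets: Sequence[str], k: int) -> float:
    """Mean NDCG@k with exactly one relevant item per user."""
    if not targets:
        raise ValueError("targets must not be empty")
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            rank = top_k.index(target)            # 0-based position
            total += 1.0 / math.log2(rank + 2)    # DCG; IDCG is 1
    return total / len(targets)
```

Both functions return the same value for two models that miss for different reasons, which is the limitation of scalar metrics that motivates the qualitative feedback channel.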

Experiment Figures

The iterative evolution workflow: Evaluation -> Planning & Retrieval -> Code Evolution.

Main Takeaways

Self-EvolveRec outperforms both NAS and scalar-metric-driven evolution baselines.
Directional feedback enables the system to diagnose and fix specific failure modes like embedding collapse and lack of diversity.
The co-evolution strategy successfully adapts diagnostic tools to new architectures, preventing evaluation obsolescence.
Qualitative feedback from the user simulator provides actionable insights that pure numerical metrics miss.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (collaborative filtering, matrix factorization)
Automated Machine Learning (AutoML) / Neural Architecture Search (NAS)
Large Language Models (LLMs) for code generation
Retrieval-Augmented Generation (RAG)

Key Terms

NAS: Neural Architecture Search—automating the design of neural networks

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality

Hit Ratio (HR): The fraction of users whose ground-truth item appears in the top-k recommendation list

User Simulator: An LLM-based agent that mimics user behavior to provide qualitative feedback on recommendations

Embedding Collapse: A failure mode where item representations degenerate into a narrow subspace, losing discriminative power

Co-Evolution: The simultaneous evolution of the recommendation model and the diagnostic tools used to evaluate it

RAG: Retrieval-Augmented Generation—using external knowledge (e.g., papers) to inform LLM generation