DeepRec: Towards a Deep Dive Into the Item Space with Large Language Model Based Recommendation

📝 Paper Summary

Sequential Recommendation
LLM Agents for Recommendation

DeepRec enables LLMs to autonomously interact with traditional recommendation models via multi-turn reasoning and retrieval to deeply explore the item space before ranking.

Core Problem

Existing LLM-based recommenders perform 'shallow' exploration, either merely enhancing traditional models with features or conducting a single retrieval-then-rank step without iterative reasoning.

Why it matters:

Sequential recommendation requires complex reasoning over evolving user preferences that static traditional models cannot capture
Current LLM approaches do not exploit the model's generative reasoning to refine search queries iteratively, as a human researcher would
Fine-tuning LLMs as standalone recommenders is computationally expensive and struggles to keep up with dynamic item pools

Concrete Example: A traditional model might retrieve items based solely on past clicks. An LLM-enhanced model might just re-rank those fixed candidates. DeepRec, however, acts like an agent: it generates a thought about the user's history, queries the traditional model with a specific preference description, analyzes the returned items, and decides whether to search again or finalize the list.

Key Novelty

Autonomous Reasoning-Retrieval Paradigm

Treats the Traditional Recommendation Model (TRM) as a 'tool' that an LLM agent can invoke multiple times via generated text commands
Uses a 'Preference-Aware' TRM that fuses user history embeddings with the LLM's generated text preference to retrieve more semantically relevant candidates
Employs a two-stage Reinforcement Learning strategy: a Cold-Start stage first teaches the LLM how to interact with the tool, and a Recommendation-Oriented stage then optimizes for recommendation accuracy

Architecture

The overall architecture of DeepRec, illustrating the autonomous multi-turn reasoning-retrieval process.

Breakthrough Assessment

7/10

Novel application of agentic patterns (Deep Research) to recommendation. The multi-turn retrieval with a modified TRM tool is a logical but significant step forward from simple re-ranking.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation: Predict the next item y given a chronological sequence of historical interactions X = [v1, ..., vn].

Inputs: User interaction history X

Outputs: Ranked list of candidate items culminating in the predicted next item y

Pipeline Flow

Reasoning Generation (LLM)
Preference Generation (LLM)
Item Retrieval (Preference-Aware TRM Tool)
Iterative Reasoning loop (optional)
Final Ranking (LLM)
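
The five steps above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the function names (`deep_rec_loop`, `propose_preference`, `retrieve`, `rank`), the `max_turns` cap, and the toy stand-ins for the LLM and the TRM are all hypothetical, and the convention that the LLM returns `None` to stop searching is an assumption made for the example.

```python
from typing import Callable, List, Optional


def deep_rec_loop(
    history: List[str],
    propose_preference: Callable[[List[str], List[str]], Optional[str]],
    retrieve: Callable[[List[str], str], List[str]],
    rank: Callable[[List[str], List[str]], List[str]],
    max_turns: int = 3,  # hypothetical cap on reasoning-retrieval turns
) -> List[str]:
    """Run a multi-turn reasoning-retrieval loop and return a ranked list."""
    candidates: List[str] = []  # items accumulated across turns, de-duplicated
    for _ in range(max_turns):
        # Steps 1-2: the LLM reasons over history and prior results, then
        # either emits a preference description or returns None to finalize.
        preference = propose_preference(history, candidates)
        if preference is None:
            break
        # Step 3: the TRM tool retrieves items for that preference.
        for item in retrieve(history, preference):
            if item not in candidates:
                candidates.append(item)
        # Step 4: the next iteration is the optional repeated search.
    # Step 5: the LLM ranks everything retrieved so far.
    return rank(history, candidates)


# Toy stand-ins (assumptions for illustration only).
CATALOG = {"sci-fi": ["Dune", "Neuromancer"], "space opera": ["Hyperion", "Dune"]}


def toy_propose(history: List[str], candidates: List[str]) -> Optional[str]:
    if not candidates:
        return "sci-fi"
    if "Hyperion" not in candidates:
        return "space opera"
    return None  # enough evidence gathered, stop searching


def toy_retrieve(history: List[str], preference: str) -> List[str]:
    return CATALOG.get(preference, [])


def toy_rank(history: List[str], candidates: List[str]) -> List[str]:
    return sorted(candidates)


ranked = deep_rec_loop(["Foundation"], toy_propose, toy_retrieve, toy_rank)
```

In a real system, `propose_preference` and `rank` would be calls to the same LLM backbone, and `retrieve` would wrap the Preference-Aware TRM.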

System Modules

Reasoning & Preference Generator

Analyzes history and previous tool outputs to generate thoughts and specific user preference descriptions

Model or implementation: Large Language Model (specific architecture not detailed in snippet)

Preference-Aware TRM

Retrieves candidate items from the full item space based on the LLM's generated preference

Model or implementation: Modified TRM (e.g., SASRec)

Ranking Generator

Ranks the accumulated retrieved items to produce the final recommendation

Model or implementation: Large Language Model (same backbone as Generator)

Novel Architectural Elements

Integration of a Preference-Aware TRM where the retrieval query vector is a fusion of the sequence embedding (from TRM) and the text embedding (from LLM preference)
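
The summary does not state which fusion operator is used, so the sketch below assumes the simplest option: a weighted sum of the two embeddings followed by dot-product scoring against item embeddings. The names `fuse_query` and `retrieve_top_k` and the mixing weight `alpha` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence


def fuse_query(
    seq_emb: Sequence[float], pref_emb: Sequence[float], alpha: float = 0.5
) -> List[float]:
    """Blend the TRM sequence embedding with the preference text embedding.

    Assumption: a convex combination; the paper may use a different operator.
    """
    if len(seq_emb) != len(pref_emb):
        raise ValueError("embeddings must share the same dimension")
    return [alpha * s + (1.0 - alpha) * p for s, p in zip(seq_emb, pref_emb)]


def retrieve_top_k(
    query: Sequence[float], item_embs: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Score every item by dot product with the query and keep the top k."""
    scores = {
        item: sum(q * e for q, e in zip(query, emb))
        for item, emb in item_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 2-d embeddings: axis 0 is "behavioural" signal, axis 1 is "textual".
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
query = fuse_query([1.0, 0.0], [0.0, 1.0], alpha=0.5)
top = retrieve_top_k(query, items, k=1)
```

With equal weighting, the item that matches both the interaction history and the stated preference scores highest, which is the intended effect of conditioning retrieval on the LLM's preference text.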

Modeling

Base Model: Large Language Model (specific variant not reported in snippet)

Training Method: Reinforcement Learning (Reinforce++)

Objective Functions:

Purpose: Enforce valid interaction syntax.

Formally: Generation Format Reward (binary checks for tags like <think>, <preference>, <recommendation_list>).

Purpose: Encourage diverse exploration of item space.

Formally: Preference Diversity Reward (1 - cosine similarity between generated preference embeddings).

Purpose: Optimize recommendation accuracy (Point-wise).

Formally: Weighted average of Textual Similarity and Collaborative Similarity between predicted items and ground truth.

Purpose: Optimize recommendation accuracy (List-wise).

Formally: Hit reward (binary) and Rank reward (linear decay based on position).
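
The four reward families can be illustrated with a short sketch. The exact formulas are not given in this summary, so several details are assumptions labelled in the code: averaging cosine similarity over all preference pairs, the weight `w` in the point-wise reward, and the specific linear decay `1 - position / list_length`. All function names are hypothetical.

```python
import math
import re
from itertools import combinations
from typing import List, Sequence, Tuple


def format_reward(text: str, required_tags: Sequence[str]) -> float:
    """Binary check: 1.0 only if every required tag appears as <t>...</t>."""
    ok = all(
        re.search(rf"<{re.escape(t)}>.*?</{re.escape(t)}>", text, re.DOTALL)
        for t in required_tags
    )
    return 1.0 if ok else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversity_reward(pref_embs: List[Sequence[float]]) -> float:
    """1 - mean pairwise cosine similarity (pair averaging is an assumption)."""
    pairs = list(combinations(pref_embs, 2))
    if not pairs:
        return 0.0  # assumption: a single preference earns no diversity reward
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)


def pointwise_reward(text_sim: float, collab_sim: float, w: float = 0.5) -> float:
    """Weighted average of textual and collaborative similarity (w assumed)."""
    return w * text_sim + (1.0 - w) * collab_sim


def listwise_reward(ranked: List[str], target: str) -> Tuple[float, float]:
    """Return (hit, rank) rewards; rank decays linearly with 0-based position."""
    if target not in ranked:
        return 0.0, 0.0
    position = ranked.index(target)
    return 1.0, 1.0 - position / len(ranked)


hit, rank_r = listwise_reward(["a", "b", "c", "d"], "b")
```

Here the ground-truth item sits at position 1 of 4, so the hit reward is 1.0 and the assumed rank reward is 0.75.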

Training Data:

Trajectories collected by letting the LLM interact with the Preference-Aware TRM
Difficulty-based filtering: Samples where the TRM's predicted rank for the ground truth is > 100 are discarded
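
The difficulty filter described above amounts to a single threshold check. In this minimal sketch the record layout (a dict with a `trm_rank` field) and the function name are hypothetical; only the threshold of 100 comes from the summary.

```python
from typing import Dict, List


def filter_by_difficulty(samples: List[Dict], max_rank: int = 100) -> List[Dict]:
    """Keep samples whose ground-truth item the TRM ranks at max_rank or better.

    Samples with trm_rank > max_rank are discarded as too hard.
    """
    return [s for s in samples if s["trm_rank"] <= max_rank]


samples = [
    {"user": "u1", "trm_rank": 3},
    {"user": "u2", "trm_rank": 100},
    {"user": "u3", "trm_rank": 101},
]
kept = filter_by_difficulty(samples)
```

A rank of exactly 100 is kept, since the summary discards only ranks strictly greater than 100.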

Key Hyperparameters:

difficulty_threshold_rank: 100

Comparison to Prior Work

vs. TRM-enhanced LLM: DeepRec uses multi-turn interactions where the LLM actively guides the TRM with generated preferences, rather than a single passive retrieval step.
vs. LLM as recommender system (RS): DeepRec relies on a TRM for retrieval, avoiding the need to fine-tune the LLM on the entire item set and reducing hallucination.

Limitations

Dependency on the quality of the TRM; if the underlying TRM is weak, the agent's tool use is ineffective
Computational cost of multi-turn interactions is likely higher than that of single-pass methods (though specific latency numbers are not reported in the snippet)
Requires complex reward engineering (hierarchical rewards) to stabilize RL training

Reproducibility

Code: https://github.com/RUCAIBox/DeepRec

The snippet provided describes the method (Section 2) but cuts off before Experimental Setup and Hyperparameters (Section 3/4), so specific model sizes (e.g., Llama-2-7B) and training hyperparameters are not available in this summary context.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation predicting the next item in a user's history sequence.

Benchmarks:

Public sequential recommendation datasets (Sequential Item Prediction)

Metrics:

Hit rate
Rank metrics (implied by reward function, likely NDCG/MRR)
Statistical methodology: Not explicitly reported in the paper
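
For reference, the metrics named above can be computed as follows when there is a single ground-truth next item. The summary only infers that a rank metric such as NDCG is used, so `ndcg_at_k` is shown as an assumption rather than as the paper's confirmed metric; both function names are hypothetical.

```python
import math
from typing import List


def hit_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the ground-truth item appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """NDCG@k with one relevant item: 1 / log2(rank + 1), rank is 1-based.

    The ideal DCG is 1.0 in this setting, so no extra normalization is needed.
    """
    if target not in ranked[:k]:
        return 0.0
    rank = ranked.index(target) + 1
    return 1.0 / math.log2(rank + 1)


ranked_list = ["a", "b", "c"]
```

Per-user values are typically averaged over the test set to give the reported score.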

Main Takeaways

The paper claims DeepRec significantly outperforms traditional and existing LLM-based baselines.
The method introduces a two-stage training strategy: Cold-start (to learn interaction formats) and Recommendation-oriented (to optimize ranking metrics).
Process-level rewards (format, count, diversity) are crucial for the stability of the agentic interaction before optimizing for outcome.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation
Large Language Models (LLMs) and Prompting
Reinforcement Learning (Reinforce++)

Key Terms

TRM: Traditional Recommendation Model—classic deep learning models (like SASRec) trained on interaction data to predict user preferences

SASRec: Self-Attentive Sequential Recommendation—a specific TRM architecture that uses self-attention to model user interaction sequences

Preference-Aware TRM: A modified TRM proposed in this paper that combines interaction history embeddings with text embeddings of the LLM-generated user preference

Reinforce++: A reinforcement learning algorithm used here to optimize the LLM's policy for both interaction validity and recommendation accuracy

Point-wise reward: A reward signal evaluating the quality of individual recommended items based on textual and collaborative similarity to the ground truth

List-wise reward: A reward signal evaluating the overall quality of the generated item list, using metrics like Hit rate and Rank position

Cold-Start RL: The first training stage focused on teaching the LLM the correct format and interaction pattern for invoking the TRM tool

Deep Research: An AI agent concept referenced as inspiration, where LLMs autonomously interact with tools (like search engines) to solve complex tasks