Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

📝 Paper Summary

LLM-based Recommender Systems Benchmark Construction

RecBench+ is a benchmark dataset comprising approximately 30,000 complex, conversational queries designed to evaluate Large Language Models' ability to reason and act as personalized recommendation assistants.

Core Problem

Traditional recommender evaluations rely on simple ID-based tasks or rigid prompt templates (e.g., 'Will user like X?'), which fail to test the reasoning and conversational capabilities required for intelligent recommendation assistants.

Why it matters:

Real-world users have complex needs involving multi-hop reasoning (e.g., 'movies by the cinematographer of X') that simple ID matching cannot capture
Existing datasets like MovieLens lack high-quality textual queries, limiting the development of interactive LLM-based assistants
Current evaluation paradigms relying on fixed templates do not assess an agent's ability to handle misleading information or implicit user preferences

Concrete Example: A user asks, 'Recommend movies with the same cinematographer as Stay Hungry.' A traditional model only sees user-item IDs and fails. An LLM assistant must infer the cinematographer (David Worth) and find related items, a capability not tested by current benchmarks.

Key Novelty

RecBench+: A Knowledge Graph-Grounded Benchmark for Complex Recommendation

Categorizes user needs into Condition-based (explicit, implicit, misinformed) and User Profile-based (interest, demographics) queries to test diverse reasoning levels
Leverages Knowledge Graphs (KG) to extract shared relations from user history, ensuring ground-truth accuracy for complex constraints before generating natural language queries

Architecture

The data construction pipeline for Condition-based Queries using a Knowledge Graph

Evaluation Highlights

Evaluation of seven LLMs reveals that models perform better on explicitly stated conditions than on queries requiring multi-hop reasoning or correction of misleading information
Fine-tuning notably improves performance, with a two-stage approach (supervised plus reinforcement fine-tuning) outperforming SFT alone
Models show demographic performance variance, generally performing better for female users and popular interests

Breakthrough Assessment

8/10

Addresses a critical gap in evaluating LLM-based recommenders by moving beyond ID prediction to complex reasoning. The construction methodology using KGs for ground truth is robust.

⚙️ Technical Details

Problem Definition

Setting: Personalized recommendation via natural language interaction

Inputs: User interaction history H_u and a complex natural language query q

Outputs: A list of recommended items satisfying the query constraints and user preferences

Pipeline Flow

Data Generation Pipeline: Item KG Construction → Shared Relation Extraction → Condition Construction → Query Generation

System Modules

Item KG Construction (Data Generation)

Build structured links between items and attributes to serve as ground truth

Model or implementation: Wikipedia Extraction / Amazon Metadata
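
As a rough illustration of what this module produces, the sketch below stores a toy item KG as (item, relation, attribute) triples and builds a lookup from each item to its attributes. The triple layout, the relation names, the `build_item_index` helper, and the sample entries are assumptions made for illustration, not the paper's actual schema or data.

```python
from collections import defaultdict

# Toy (item, relation, attribute) triples; a hypothetical sample, not benchmark data.
TRIPLES = [
    ("Jaws", "directed_by", "Steven Spielberg"),
    ("Jaws", "composed_by", "John Williams"),
    ("Jurassic Park", "directed_by", "Steven Spielberg"),
    ("Jurassic Park", "composed_by", "John Williams"),
]


def build_item_index(triples):
    """Map each item to the set of (relation, attribute) pairs attached to it."""
    index = defaultdict(set)
    for item, relation, attribute in triples:
        index[item].add((relation, attribute))
    return dict(index)


item_index = build_item_index(TRIPLES)
```

Keeping the attributes as (relation, attribute) pairs makes it cheap to intersect or count them across a user's history in the next step.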

Shared Relation Extraction (Data Generation)

Identify common attributes in a user's history to form realistic query constraints

Model or implementation: Rule-based Retrieval function R
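
One simple way to picture this step is to count how many history items carry each (relation, attribute) pair and keep the pairs that recur. The sketch below does that on a toy index. The `shared_relations` function and its `min_items` threshold are hypothetical stand-ins; the summary does not specify the rules inside the retrieval function R.

```python
from collections import Counter

# Hypothetical item -> {(relation, attribute)} index for a toy catalogue.
ITEM_ATTRS = {
    "Jaws": {("directed_by", "Steven Spielberg"), ("composed_by", "John Williams")},
    "Jurassic Park": {("directed_by", "Steven Spielberg"), ("composed_by", "John Williams")},
    "Star Wars": {("directed_by", "George Lucas"), ("composed_by", "John Williams")},
}


def shared_relations(history, item_attrs, min_items=2):
    """Return {(relation, attribute): count} for pairs found on at least `min_items` history items."""
    counts = Counter()
    for item in history:
        # Each item contributes each of its pairs at most once.
        counts.update(item_attrs.get(item, set()))
    return {pair: n for pair, n in counts.items() if n >= min_items}


shared = shared_relations(["Jaws", "Jurassic Park", "Star Wars"], ITEM_ATTRS)
```

Here the director pair is shared by two items and the composer pair by three, while the single George Lucas entry is filtered out.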

Condition Constructor (Data Generation)

Transform shared relations into Explicit, Implicit, or Misinformed constraints

Model or implementation: Algorithmic transformation
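
The three condition types can be pictured as different renderings of the same shared relation. The sketch below is a minimal illustration under assumed rules: the `build_condition` function, its phrasing templates, and the way a misinformed condition is produced (by asserting a wrong attribute for a history item) are hypothetical and may differ from the paper's transformation.

```python
def build_condition(kind, relation, attribute, anchor_item, wrong_attribute=None):
    """Render one shared relation as an explicit, implicit, or misinformed condition.

    `anchor_item` is a history item that carries the (relation, attribute) pair.
    """
    phrase = relation.replace("_", " ")
    if kind == "explicit":
        # The constraint value is stated directly.
        return f"{phrase} {attribute}"
    if kind == "implicit":
        # The value is hidden behind a reference that the model must resolve.
        return f"{phrase} the same person as {anchor_item}"
    if kind == "misinformed":
        # The text asserts a wrong value for the anchor item (assumed scheme).
        if wrong_attribute is None or wrong_attribute == attribute:
            raise ValueError("misinformed conditions need a different, incorrect attribute")
        return f"{phrase} {wrong_attribute}, like {anchor_item}"
    raise ValueError(f"unknown condition kind: {kind}")


relation, attribute = "directed_by", "Steven Spielberg"
explicit = build_condition("explicit", relation, attribute, "Jaws")
implicit = build_condition("implicit", relation, attribute, "Jaws")
misinformed = build_condition(
    "misinformed", relation, attribute, "Jaws", wrong_attribute="George Lucas"
)
```

In all three cases the ground truth stays the original (relation, attribute) pair from the KG; only the surface condition shown to the model changes.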

Query Generator (Data Generation)

Synthesize natural language queries based on constructed conditions

Model or implementation: GPT-4o
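
To make the hand-off to the LLM concrete, the sketch below assembles a text prompt from a list of constructed conditions. The prompt wording and the `build_query_prompt` helper are hypothetical; the paper's actual prompt is not reproduced in this summary, and no model call is made here.

```python
def build_query_prompt(conditions, domain="movie"):
    """Build an instruction asking an LLM to phrase the conditions as one user request."""
    if not conditions:
        raise ValueError("at least one condition is required")
    condition_lines = "\n".join(f"- {condition}" for condition in conditions)
    return (
        f"You are a user looking for {domain} recommendations.\n"
        "Write one natural, conversational request that includes every condition below.\n"
        "Do not name any specific item you want recommended.\n"
        f"Conditions:\n{condition_lines}"
    )


prompt = build_query_prompt(
    ["directed by the same person as Jaws", "composed by John Williams"]
)
```

The resulting string would be sent to the generator model, and the conditions themselves would remain the ground truth for checking any recommendation.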

Novel Architectural Elements

KG-driven query synthesis pipeline that generates complex queries (Implicit/Misinformed) whose answers are backed by KG ground truth, rather than queries that reduce to simple metadata filtering

Modeling

Base Model: Evaluated models include GPT-4o, Gemini-1.5-Pro, DeepSeek-R1, and LLaMA (version not specified)

Training Method: Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT)

Adaptation: Fine-tuning applied to open-source models

Trainable Parameters: Not reported in the paper

Training Data:

RecBench+ dataset (~30,000 queries)
Built from MovieLens-1M and Amazon-Book

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaRA: RecBench+ evaluates complex reasoning and conversational interaction (e.g., implicit conditions, corrections), whereas LLaRA focuses on standard history-to-item prediction
vs. Traditional MF: RecBench+ targets natural language queries with logical constraints, which ID-based MF cannot process directly

Limitations

Relies on GPT-4o for query generation, which may introduce biases or stylistic artifacts
Dataset construction depends on the completeness of the underlying Knowledge Graph (Wikipedia/Amazon metadata)
Evaluation is limited to movie and book domains

Reproducibility

Code: https://github.com/jiani-huang/RecBenchPlus

Dataset available at https://github.com/jiani-huang/RecBenchPlus. The paper describes the data construction process in detail using MovieLens-1M and Amazon-Book sources.

📊 Experiments & Results

Evaluation Setup

LLM-based personalized recommendation assistant responding to complex natural language queries

Benchmarks:

RecBench+ (Complex Query Recommendation) [New]

Metrics:

Not explicitly reported in the paper
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

LLMs excel at explicit condition queries but struggle significantly with implicit reasoning or queries containing misleading information
Fine-tuning (specifically a two-stage SFT + RFT approach) provides the best performance boost for open-source models
Performance is uneven across user demographics; models tend to align better with female users and popular interests, indicating potential bias in profile understanding

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (RecSys) fundamentals
Knowledge Graphs (KG) structure and relations
Large Language Models (LLMs) prompting and fine-tuning

Key Terms

RecSys: Recommender Systems—algorithms designed to suggest relevant items to users

MF: Matrix Factorization—a traditional collaborative filtering technique that decomposes user-item interaction matrices to predict preferences

KG: Knowledge Graph—a structured representation connecting entities (e.g., movies) to attributes (e.g., directors, genres) via relations

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to adapt it to a specific task

RFT: Reinforcement Fine-Tuning—optimizing a model using reinforcement learning signals (rewards) to align it with complex goals

Explicit Condition Query: Queries where constraints are directly stated (e.g., 'Directed by Spielberg')

Implicit Condition Query: Queries requiring reasoning to deduce constraints (e.g., 'Directed by the person who made Jaws')

Misinformed Condition Query: Queries containing factual errors the model must identify and correct before recommending

Multi-hop reasoning: The process of connecting multiple pieces of information (e.g., Movie → Director → Other Movies) to answer a query
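
The Movie → Director → Other Movies chain can be sketched as two lookups over a small graph. The `KG` dictionary layout and the `two_hop` function below are assumptions chosen for illustration, and the three entries are a toy sample rather than benchmark data.

```python
# Toy KG keyed by (item, relation); values are sets of attribute entities.
KG = {
    ("Jaws", "directed_by"): {"Steven Spielberg"},
    ("Jurassic Park", "directed_by"): {"Steven Spielberg"},
    ("Star Wars", "directed_by"): {"George Lucas"},
}


def two_hop(seed_item, relation, kg):
    """Hop 1: seed item -> attribute. Hop 2: attribute -> other items sharing it."""
    attributes = kg.get((seed_item, relation), set())
    return sorted(
        item
        for (item, rel), values in kg.items()
        if rel == relation and item != seed_item and values & attributes
    )


result = two_hop("Jaws", "directed_by", KG)
```

For the query "directed by the person who made Jaws", the first hop resolves the director and the second hop returns the other items with that director, here `["Jurassic Park"]`.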