What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context

📝 Paper Summary

Sequential Recommendation, LLM Preference Alignment

RecPO enhances LLM-based recommendation by replacing binary preference optimization with adaptive reward margins that explicitly model graded preference intensity and temporal recency.

Core Problem

Current LLM recommenders rely on binary preference alignment (DPO), which treats all positive interactions equally and ignores temporal dynamics, failing to capture graded user preferences (e.g., love vs. like) and the priority of immediate satisfaction.

Why it matters:

Binary abstraction discards critical information about the strength of user aversion or affinity, leading to suboptimal ranking
Ignoring temporal context causes models to recommend items that users might like eventually but do not want immediately (delayed gratification vs. immediate relevance)
Standard alignment methods (like DPO) treat historical negatives as noise and filter them out, missing the opportunity to learn from what users explicitly reject

Concrete Example: A user rates a movie 5 stars (Strongly Love) and another 3 stars (Mildly Like). A binary DPO model treats both as 'positive' target items. Furthermore, if the 5-star movie was watched years ago and the 3-star one yesterday, the model might incorrectly prioritize the older interest, ignoring the user's current context.

Key Novelty

RecPO (Recommender Preference Optimization)

Introduces a 'preference intensity' factor into alignment, weighting training examples by structured signals (e.g., star ratings) rather than binary labels alone
Incorporates 'temporal context' by decaying the reward margin for older interactions, forcing the model to prioritize recent relevance over historical preferences
Utilizes a multi-negative ranking objective that preserves negative interaction history (unlike standard S-DPO), allowing the LLM to learn nuanced avoidance behaviors

Architecture

The RecPO framework pipeline, illustrating how user history and candidate items are processed with structured feedback (ratings) and temporal context to compute adaptive reward margins.

Evaluation Highlights

Outperforms S-DPO by 5.49 percentage points in Hit Ratio@1 on MovieLens-1M (0.2902 → 0.3451) using LLaMA3-8B
Improves Hit Ratio@1 over S-DPO by 11.11 percentage points on LastFM, an implicit-feedback dataset (0.5719 → 0.6830), using LLaMA3-8B
Demonstrates superior 'Avoidance Rate' (rejecting low-rated future items), surpassing S-DPO and SFT baselines across MovieLens and Steam datasets

Breakthrough Assessment

7/10

Provides a well-motivated, cognitively grounded improvement to DPO for recommendation. The gains are consistent and the analysis of temporal/intensity factors is insightful, though the core mechanism is a relatively straightforward modification of the reward margin.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation as a Next-Token Prediction task

Inputs: User interaction history sequence containing item titles and structured ratings (explicit or proxied)

Outputs: The title of the next item the user will interact with

Pipeline Flow

Prompt Construction (History + Ratings + Candidates)
LLM Inference (Next Item Prediction)
Ranking (Plackett-Luce based)

System Modules

Prompt Constructor

Formats user history into a text sequence, explicitly including ratings and preserving negative items

Model or implementation: Deterministic rule-based formatter
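A minimal sketch of what such a rule-based formatter could look like. The template wording, the 1-to-5 rating scale, the function name build_prompt, and the example titles are all illustrative assumptions; the summary does not give the paper's actual prompt.

```python
from typing import List, Tuple


def build_prompt(history: List[Tuple[str, float]], candidates: List[str]) -> str:
    """Render a rated interaction history and a candidate list as one prompt.

    Hypothetical template: the real RecPO prompt wording is not shown in
    this summary.
    """
    # Keep every interaction, including low-rated ones, so that negative
    # feedback stays visible to the model instead of being filtered out.
    history_lines = [
        f"{i}. {title} (rating: {rating:g}/5)"
        for i, (title, rating) in enumerate(history, start=1)
    ]
    candidate_lines = [f"- {title}" for title in candidates]
    return "\n".join(
        ["The user has interacted with the following items, in order:"]
        + history_lines
        + ["", "Candidate items:"]
        + candidate_lines
        + ["", "Answer with the title of the one candidate the user will pick next."]
    )


# Illustrative data only.
example = build_prompt(
    history=[("Heat", 5), ("Cats", 1)],
    candidates=["Ronin", "Collateral"],
)
print(example)
```

Because the formatter is deterministic, the same history and candidate set always yield the same prompt, which keeps training and evaluation inputs comparable.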

LLM Backbone

Processes the prompt to predict the next preferred item

Model or implementation: LLaMA3-8B or Qwen-7B

Modeling

Base Model: LLaMA3-8B and Qwen-7B

Training Method: Preference Alignment (modified DPO)

Objective Functions:

Purpose: Dynamically scale the reward margin based on rating and recency.

Formally: Margin γ_r is derived from the utility φ(s, Δt) = s / (Δt)^0.5, where s is the preference score and Δt is the time latency.

Purpose: Align the model to rank items based on the adaptive margins.

Formally: Optimized using a Plackett-Luce (PL) generalized objective over a list of 1 positive and K-1 negative items.
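The two objectives above can be sketched in plain Python. The utility follows the stated formula; everything else is an assumption made for illustration: the summary says only that the margin is "derived from" the utility, so the difference form in adaptive_margin, the softmax-style multi-negative loss, and the names beta and scale are hypothetical and may differ from the paper's definitions.

```python
import math
from typing import Sequence


def utility(score: float, latency: float, alpha: float = 0.5) -> float:
    """phi(s, dt) = s / dt**alpha, with alpha = 0.5 as in the summary."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return score / (latency ** alpha)


def adaptive_margin(pos_utility: float, neg_utility: float, scale: float = 1.0) -> float:
    """ASSUMPTION: margin grows with the utility gap between the two items."""
    return scale * (pos_utility - neg_utility)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_softmax_loss(
    pos_logratio: float,
    neg_logratios: Sequence[float],
    margins: Sequence[float],
    beta: float = 1.0,
) -> float:
    """ASSUMED loss shape for 1 positive and K-1 negatives:

        -log sigmoid( -log sum_d exp(beta * (r_d - r_p) + gamma_d) )

    where r_x is the policy-vs-reference log-probability ratio of item x.
    With a single negative this reduces to the margin form of DPO,
    -log sigmoid(beta * (r_p - r_d) - gamma).
    """
    terms = [beta * (r_d - pos_logratio) + g for r_d, g in zip(neg_logratios, margins)]
    peak = max(terms)  # log-sum-exp trick for stability
    lse = peak + math.log(sum(math.exp(t - peak) for t in terms))
    return -log_sigmoid(-lse)


# A 5-star item at latency 4 versus a 3-star item at latency 1 (arbitrary units).
phi_far = utility(score=5, latency=4)   # 5 / 4**0.5
phi_near = utility(score=3, latency=1)  # 3 / 1**0.5
gamma = adaptive_margin(phi_near, phi_far)
loss = margin_softmax_loss(pos_logratio=1.0, neg_logratios=[0.0], margins=[gamma])
```

Under this sketch the nearer 3-star item has the higher utility, so a pair that treats it as the preferred item gets a positive margin and a correspondingly stricter loss.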

Adaptation: Full fine-tuning (implied by context of SFT/DPO comparisons)

Training Data:

Training: History excluding last 2 items
Validation: Second to last item
Test: Last item
Candidate set: 10 positives + 10 negatives during training; 1 target + 19 random negatives during inference
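A hedged sketch of the split and of the inference-time candidate set described above. The function names are hypothetical, and excluding already-interacted items from the random negatives is an assumption; the summary says only that the 19 negatives are random.

```python
import random
from typing import List, Sequence, Tuple


def leave_last_two_split(items: Sequence[str]) -> Tuple[List[str], str, str]:
    """Return (training history, validation item, test item) for one user."""
    if len(items) < 3:
        raise ValueError("need at least three interactions")
    return list(items[:-2]), items[-2], items[-1]


def inference_candidates(
    target: str,
    catalog: Sequence[str],
    interacted: Sequence[str],
    rng: random.Random,
    size: int = 20,
) -> List[str]:
    """Build a candidate set of 1 target plus (size - 1) random negatives."""
    # ASSUMPTION: negatives are drawn from items the user has not seen.
    seen = set(interacted) | {target}
    pool = [item for item in catalog if item not in seen]
    negatives = rng.sample(pool, size - 1)
    candidates = negatives + [target]
    rng.shuffle(candidates)  # avoid leaking the target through its position
    return candidates


# Illustrative data only.
catalog = [f"item_{i}" for i in range(100)]
sequence = catalog[:6]
train, val, test = leave_last_two_split(sequence)
cands = inference_candidates(test, catalog, sequence, random.Random(0))
```

Passing an explicit random.Random instance keeps the sampled negatives reproducible across evaluation runs.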

Key Hyperparameters:

candidate_set_size_training: 20 (10 positive sequence + 10 random)
candidate_set_size_inference: 20 (1 target + 19 random)
temporal_decay_alpha: 0.5 (in utility function)

Compute: 8 NVIDIA A100 GPUs (80 GiB VRAM each)

Comparison to Prior Work

vs. S-DPO: RecPO uses graded margins (ratings) and temporal decay, whereas S-DPO uses fixed binary margins and filters negative history.
vs. SimPO: RecPO maintains high Valid Ratio (instruction following), whereas SimPO degrades it significantly in recommendation tasks.
vs. Traditional (SASRec/BERT4Rec): RecPO leverages semantic understanding of item titles and nuanced preference signals, outperforming them significantly on explicit feedback datasets.

Limitations

Performance gains over traditional models (like SASRec) are smaller on implicit feedback datasets where preference signals are proxied.
Requires datasets with timestamp information to calculate temporal margins effectively.
Computational cost is higher than traditional lightweight recommenders due to LLM inference.

Reproducibility

Code: https://anonymous.4open.science/r/RecPO-020A/

The code is publicly available at the anonymized repository linked above. Datasets are public benchmarks (MovieLens, Amazon, Steam, BeerAdvocate, LastFM). Exact learning rates and batch sizes are reported in the paper's Appendix E.3 and are not reproduced in this summary.

📊 Experiments & Results

Evaluation Setup

Sequential next-item prediction ranking task

Benchmarks:

MovieLens-1M: movie recommendation (explicit ratings)
Amazon-Books: book recommendation (explicit ratings)
BeerAdvocate: beer recommendation (explicit ratings)
Steam: game recommendation (implicit feedback, play hours)
LastFM: music recommendation (implicit feedback, play counts)

Metrics:

Hit Ratio@1 (HR@1)
Valid Ratio
Statistical methodology: Not explicitly reported in the paper
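Both metrics can be computed from the raw model outputs. The sketch below assumes one generated title per test case and matches titles after whitespace and case normalization; the paper's exact matching rule is not stated in this summary, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def normalize(title: str) -> str:
    """Collapse whitespace and ignore case before comparing titles."""
    return " ".join(title.split()).casefold()


def evaluate(
    outputs: Sequence[str],
    targets: Sequence[str],
    candidate_sets: Sequence[Sequence[str]],
) -> Dict[str, float]:
    """Return Hit Ratio@1 and Valid Ratio over a set of test cases."""
    if not (len(outputs) == len(targets) == len(candidate_sets)):
        raise ValueError("inputs must have the same length")
    if not outputs:
        raise ValueError("need at least one test case")
    hits = 0
    valid = 0
    for output, target, candidates in zip(outputs, targets, candidate_sets):
        prediction = normalize(output)
        # Valid: the output names some item in the candidate set.
        if prediction in {normalize(c) for c in candidates}:
            valid += 1
        # Hit: the output names the ground-truth next item.
        if prediction == normalize(target):
            hits += 1
    total = len(outputs)
    return {"hr@1": hits / total, "valid_ratio": valid / total}


# Illustrative data only: one hit, one valid miss, one invalid output.
scores = evaluate(
    outputs=["Heat", "heat ", "Unknown Movie"],
    targets=["Heat", "Ronin", "Ronin"],
    candidate_sets=[["Heat", "Ronin"]] * 3,
)
```

Reporting the two numbers together matters because a model can only score a hit when its output is valid, so a low Valid Ratio caps the achievable HR@1.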

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison using LLaMA3-8B backbone shows RecPO outperforming baselines across all datasets.
MovieLens-1M	HR@1	0.2902	0.3451	+0.0549
Amazon-Books	HR@1	0.5065	0.5802	+0.0737
LastFM	HR@1	0.5719	0.6830	+0.1111
Main comparison using Qwen-7B backbone confirms method generalizability.
MovieLens-1M	HR@1	0.2706	0.3446	+0.0740
Valid Ratio analysis showing RecPO maintains instruction following better than SimPO.
Amazon-Books	Valid Ratio	0.9564	0.9851	+0.0287
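
The Δ column reports absolute differences in the metric, not relative improvements. A short sketch that recomputes both from the LLaMA3-8B HR@1 rows above (the helper name gains is hypothetical):

```python
from typing import Tuple


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) of `ours` over `baseline`."""
    absolute = ours - baseline
    return absolute, absolute / baseline


# (baseline, RecPO) HR@1 values copied from the table above.
hr1_rows = {
    "MovieLens-1M": (0.2902, 0.3451),
    "Amazon-Books": (0.5065, 0.5802),
    "LastFM": (0.5719, 0.6830),
}

for dataset, (baseline, ours) in hr1_rows.items():
    absolute, relative = gains(baseline, ours)
    print(f"{dataset}: +{absolute:.4f} absolute, +{relative:.1%} relative")
```

The relative gains are larger than the absolute ones because each baseline HR@1 is well below 1.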

Experiment Figures

A proof-of-concept ablation study comparing performance with/without negative items and with/without ratings.

Main Takeaways

Incorporating negative items (comprehensive feedback) and ratings (structured signals) is crucial; removing negatives (as in S-DPO) hurts performance.
RecPO consistently improves temporal adherence, prioritizing immediate next items over future highly-rated items significantly better than SFT or S-DPO.
The method demonstrates strong 'avoidance' capabilities, effectively identifying and down-ranking tempting but ultimately disliked items.
RecPO exhibits higher stability and lower variance in performance across users with varying interaction history lengths compared to S-DPO.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation
Large Language Models (LLMs)
Direct Preference Optimization (DPO)

Key Terms

RecPO: Recommender Preference Optimization, the proposed framework that uses adaptive margins based on preference intensity and time

DPO: Direct Preference Optimization, a method to align LLMs with human preferences without a separate reward model

S-DPO: Softmax-DPO, a prior adaptation of DPO for recommendation that pairs positive items with random negatives

SFT: Supervised Fine-Tuning, the initial phase of training the LLM on the prediction task before preference alignment

Preference Intensity: The graded strength of a user's liking for an item (e.g., a 5-star rating vs. a 3-star rating)

Temporal Context: The relative recency of an interaction, used to weight the relevance of user preferences

Hit Ratio@1: A metric measuring the percentage of test cases where the model's top-ranked item matches the ground truth

Valid Ratio: The proportion of model outputs that follow formatting rules and generate a valid item from the candidate set