ITDR: An Instruction Tuning Dataset for Enhancing Large Language Models in Recommendations

📝 Paper Summary

LLM-based Recommendation Instruction Tuning

ITDR is a large-scale instruction tuning dataset comprising nearly 200,000 instances across seven subtasks, designed to bridge the gap between user behavior data and LLM natural language understanding.

Core Problem

LLMs struggle with recommendation tasks because traditional data (IDs) lacks natural language structure, and existing instruction datasets are too small or lack diverse task descriptions.

Why it matters:

Structural discrepancy between ID-based behavior records and natural language limits LLM effectiveness in modeling user preferences
Existing datasets lack structured task descriptions essential for guiding LLMs, hindering generalization to varied recommendation scenarios
Current methods face a 'data bottleneck' where training data fails to cover multifaceted task scenarios and user behavior patterns

Concrete Example: Traditional datasets provide only ID sequences (e.g., User 123 clicked Item 456), which contain no semantic information for an LLM to reason about, unlike ITDR's natural language instructions that explicitly describe the task and context.

Key Novelty

Standardized Instruction Tuning Dataset for Recommendation (ITDR)

Unifies recommendation data into two root tasks: User-Item Interaction (predicting preferences) and User-Item Understanding (profiling items/users)
Transforms 13 classic benchmarks into standardized natural language templates with specific task descriptions to guide LLM reasoning

Architecture

Taxonomy of the ITDR dataset showing the division into Root Tasks and Subtasks

Evaluation Highlights

Constructed a dataset of 195,065 high-quality instructions across 7 distinct subtasks
Integrates data from 13 diverse public recommendation benchmarks (e.g., MovieLens 32M, Amazon Reviews, PixelRec)
Validates effectiveness on mainstream models including GLM-4, Qwen2.5, and LLaMA-3.2 (a qualitative claim from the abstract; specific numbers are not in the text available for this summary)

Breakthrough Assessment

7/10

Addresses a critical data bottleneck in LLM-RecSys with a large-scale, structured resource. While the method is standard instruction tuning, the dataset scale and taxonomy are significant contributions.

⚙️ Technical Details

Problem Definition

Setting: Instruction tuning for recommendation systems

Inputs: Natural language instruction x containing task description and input data (from converted templates)

Outputs: Target response y (recommendation result or attribute analysis)

Pipeline Flow

Data Collection (13 Benchmarks)
Template Construction (Task Descriptions + Input Formatting)
Ground Truth Generation (via DeepSeek-V3 for UIU tasks)
Instruction Tuning (LLM + LoRA)

System Modules

Template Converter (Data Processing)

Transforms raw ID-based interaction records into natural language input-output pairs (x, y)

Model or implementation: Rule-based templates
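A minimal sketch of what a rule-based template converter could look like is given here. The record layout (`Interaction`), the task description wording, and the function name `to_instruction` are all hypothetical illustrations, not the templates shipped with ITDR.

```python
from dataclasses import dataclass


# Hypothetical record layout; real benchmark schemas differ per dataset.
@dataclass
class Interaction:
    user_id: int
    item_title: str
    rating: float


# Illustrative wording only, not the actual ITDR template text.
TASK_DESCRIPTION = (
    "Task: Rating Prediction. Given a user's rating history, "
    "predict the rating (1-5) the user would give the target item."
)


def to_instruction(history: list[Interaction], target: Interaction) -> tuple[str, str]:
    """Turn ID-based records into a natural language (x, y) pair."""
    lines = [f"- {h.item_title}: {h.rating:g}" for h in history]
    x = "\n".join([
        TASK_DESCRIPTION,
        "Rating history:",
        *lines,
        f"Target item: {target.item_title}",
    ])
    y = f"{target.rating:g}"  # the target response is the held-out rating
    return x, y


history = [Interaction(123, "Toy Story", 5.0), Interaction(123, "Heat", 3.5)]
x, y = to_instruction(history, Interaction(123, "Jumanji", 4.0))
print(x)
print("->", y)
```

The point of the sketch is that the task description travels with every instance, so the model sees what is being asked as well as the data.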

Label Generator (Data Processing)

Generates reference ground-truth labels for tasks whose source data lacks them

Model or implementation: DeepSeek-V3
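One way to picture the label generation step is sketched here, assuming the LLM is reached through some text-in, text-out callable. The prompt wording, the `generate` parameter, and the `label_source` field are hypothetical; no real DeepSeek-V3 API call is shown, and a stub stands in for the model so the example runs offline.

```python
from typing import Callable


def build_label_prompt(item_titles: list[str]) -> str:
    """Compose a prompt asking an LLM to summarize a user's interests (hypothetical wording)."""
    listing = "\n".join(f"- {t}" for t in item_titles)
    return (
        "The user interacted with the following items:\n"
        f"{listing}\n"
        "Summarize the user's main interests in one sentence."
    )


def generate_label(item_titles: list[str], generate: Callable[[str], str]) -> dict:
    """Return an instruction instance whose output comes from the supplied LLM callable."""
    prompt = build_label_prompt(item_titles)
    return {
        "input": prompt,
        "output": generate(prompt).strip(),
        "label_source": "llm",  # marks the label as synthetic rather than observed
    }


# Stand-in for a real model call (assumption: any str -> str function works here).
def fake_llm(prompt: str) -> str:
    return " Enjoys animated family films. "


instance = generate_label(["Toy Story", "Finding Nemo"], fake_llm)
print(instance["output"])
```

Recording the label source on each instance makes it easier to sample synthetic labels later for human review.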

Recommender LLM

Generates recommendation predictions or user analysis based on instructions

Model or implementation: GLM-4 / Qwen2.5 / LLaMA-3.2 (with LoRA adapters)

Novel Architectural Elements

Taxonomy dividing recommendation into User-Item Interaction (UII) and User-Item Understanding (UIU) root tasks
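The taxonomy can be pictured as a two-level mapping, sketched here. The seven subtask names are the ones that appear in this summary; placing Interest Recognition and Target User Identification under UIU follows the Limitations section, while the placement of the remaining five is an assumption inferred from the root task descriptions, not something this summary states.

```python
# Assumed grouping: only the two UIU entries marked below are stated in the summary.
TAXONOMY = {
    "UII": [
        "Rating Prediction",
        "Top-K",
        "Next Item",
        "Cross-Domain",
    ],
    "UIU": [
        "User Attribute Prediction",   # assumed placement (user profiling)
        "Interest Recognition",        # stated as a UIU task
        "Target User Identification",  # stated as a UIU task
    ],
}


def root_task_of(subtask: str) -> str:
    """Look up which root task a subtask belongs to."""
    for root, subtasks in TAXONOMY.items():
        if subtask in subtasks:
            return root
    raise KeyError(subtask)


total = sum(len(subtasks) for subtasks in TAXONOMY.values())
print(total, root_task_of("Interest Recognition"))
```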

Modeling

Base Model: GLM-4, Qwen2.5, Qwen2.5-Instruct, LLaMA-3.2

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Maximize probability of generating target tokens given instruction.

Formally: $\max_{\theta} \sum_{t=1}^{|y|} \log P_{\theta}(y_t \mid x, y_{<t})$
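As a small worked example of this objective, the sketch below sums log-probabilities over the target tokens only, which is how supervised fine-tuning is commonly set up. The per-token probabilities are made-up numbers for illustration, and the function name is hypothetical.

```python
import math


def sft_log_likelihood(token_probs: list[float]) -> float:
    """Sum of log P(y_t | x, y_<t) over the target tokens of one example."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical probabilities the model assigns to each token of a 3-token target y.
probs = [0.5, 0.25, 0.8]
ll = sft_log_likelihood(probs)
loss = -ll / len(probs)  # mean negative log-likelihood, the quantity usually minimized
print(round(ll, 4), round(loss, 4))
```

Because the three probabilities multiply to 0.1, the summed log-likelihood equals log(0.1); maximizing it is the same as minimizing the mean negative log-likelihood.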

Adaptation: LoRA (Low-Rank Adaptation)

Training Data:

195,065 total instructions
Sources include: MovieLens 32M, Amazon Reviews 2023, BookCrossing, Yelp, Anime Dataset, MicroLens, PixelRec, MIND, Last.FM, Steam

Key Hyperparameters:

rank_r: Significantly smaller than d_in/d_out (exact value not reported in text provided)
alpha: Tunable scaling factor (exact value not reported in text provided)
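To make the roles of the rank and the scaling factor concrete, the sketch below applies the standard LoRA update, W + (alpha / r) * B A, to a tiny frozen weight matrix and compares parameter counts. All sizes and values are illustrative assumptions, since the settings used in the paper are not given in this summary.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in the low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * B @ A; W stays frozen, only A and B are trained."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


# Illustrative sizes only (assumption): a 4096 x 4096 projection with r = 8.
d_in, d_out, r = 4096, 4096, 8
print(d_in * d_out, lora_param_count(d_in, d_out, r))

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight
A = [[1.0, 2.0]]              # r x d_in with r = 1
B = [[0.5], [0.0]]            # d_out x r
W_eff = lora_effective_weight(W, A, B, alpha=2.0, r=1)
print(W_eff)
```

With r much smaller than d_in and d_out, the trainable factors hold r * (d_in + d_out) values instead of d_in * d_out, which is where the efficiency comes from.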

Compute: Not reported in the paper

Comparison to Prior Work

vs. Chat-REC/RecPrompt: ITDR uses instruction tuning rather than just prompt engineering (zero/few-shot)
vs. LLMRec: ITDR fine-tunes the LLM itself rather than just extracting features for a traditional recommender
vs. TALLRec (not cited in the available text; inferred as related instruction tuning work): ITDR covers a broader taxonomy (UIU + UII) and uses standardized templates across 13 datasets

Limitations

Reliance on DeepSeek-V3 for generating ground truth in UIU tasks (Interest Recognition, Target User Identification) potentially introduces bias
Effectiveness depends on the quality of manually crafted templates
Quantitative performance gains (e.g., RMSE or accuracy) are not available in the text this summary is based on, so the size of the improvement cannot be assessed here

Reproducibility

Code: https://github.com/hellolzk/ITDR

The dataset and code are publicly available at https://github.com/hellolzk/ITDR. Raw data sources are public benchmarks; synthetic labels were generated by DeepSeek-V3.

📊 Experiments & Results

Evaluation Setup

Instruction tuning evaluation across multiple recommendation subtasks

Benchmarks:

MovieLens 32M (Rating Prediction, Top-K, Next Item)
Amazon Reviews 2023 (Top-K, Cross-Domain, Next Item)
BookCrossing (Rating Prediction, User Attribute Prediction)
Last.FM 1K/360K (User Attribute Prediction)

Metrics:

Not reported in the text available for this summary
Statistical methodology: Not reported in the text available for this summary

Key Results

The provided text does not contain specific performance result tables (e.g., RMSE, Accuracy numbers). The following entries reflect the dataset scale statistics, which are explicitly reported.

Benchmark	Metric	Baseline	This Paper	Δ
ITDR	Total Instructions	0	195,065	+195,065
ITDR	Number of Subtasks	0	7	+7

Main Takeaways

ITDR successfully integrates 13 public datasets into a unified instruction tuning format
The dataset covers two root perspectives: User-Item Interaction (UII) and User-Item Understanding (UIU)
DeepSeek-V3 was effective in generating synthetic ground truth for understanding tasks (UIU), validated by human review of sampled instances

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (Collaborative Filtering, IDs)
Instruction Tuning for LLMs
Parameter-Efficient Fine-Tuning (LoRA)

Key Terms

UII: User-Item Interaction—a root task category focusing on predicting user behavior like ratings, clicks, or next items

UIU: User-Item Understanding—a root task category focusing on analyzing item attributes and inferring user profiles/interests

LoRA: Low-Rank Adaptation—a technique to fine-tune LLMs efficiently by updating only small low-rank matrices instead of all weights

Instruction Tuning: Training LLMs on (instruction, output) pairs to improve their ability to follow specific task directions

DeepSeek-V3: A large language model used in this paper to generate ground-truth labels for tasks where original datasets lacked them (e.g., Interest Recognition)