Does LLM Focus on the Right Words? Mitigating Context Bias in LLM-based Recommenders

📝 Paper Summary

LLM-based Recommendation
Bias and Fairness in Recommender Systems

GDRT is a fine-tuning strategy that uses Group Distributionally Robust Optimization to force LLMs to rely on user interaction history rather than shortcut correlations with auxiliary prompt text.

Core Problem

Supervised fine-tuning causes LLM recommenders to over-rely on static 'auxiliary tokens' (task descriptions, prefixes) instead of user-specific interaction history, leading to 'context bias'.

Why it matters:

Recommendations become non-personalized because the model ignores the user's history in favor of prompt artifacts
Creates unfairness by recommending only items whose titles happen to correlate with the fixed task prompt instructions
Standard fine-tuning amplifies this bias significantly (e.g., the attribution ratio of auxiliary tokens to interaction tokens shifts from 1:1 to 6:1 on Amazon datasets)

Concrete Example: In a dataset where the prompt always contains 'prediction:', the LLM learns to predict items that frequently co-occur with the word 'prediction:' rather than items matching the user's history. As a result, 80% of recommendations might come from just the top 20% of items that have high semantic overlap with the prompt text.

Key Novelty

Group Distributionally Robust Optimization for Tuning (GDRT)

Groups training samples based on how strongly the target item correlates with the auxiliary prompt text (measured by the LLM's probability when history is masked)
Applies Group DRO to dynamically upweight the loss of 'hard' groups (items with weak correlation to the prompt), forcing the model to learn from user history instead of taking the easy shortcut

Architecture

Illustrates the concept of Context Bias in LLM-based recommendation.

Evaluation Highlights

Achieves an average NDCG@10 gain of 24.29% across three public datasets compared to standard SFT
Reduces unfairness significantly: standard deviation of group performance drops from ~0.08 (SFT) to ~0.01 (GDRT) on Amazon Beauty
Outperforms the state-of-the-art bias mitigation method (CFT) by substantial margins (e.g., +0.0345 NDCG@10 on Amazon Beauty)

Breakthrough Assessment

8/10

Identifies a distinct, previously overlooked bias type ('Context Bias') in LLM recommenders and provides a theoretically grounded, highly effective solution that improves both accuracy and fairness simultaneously.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation using LLMs as a generative backbone

Inputs: Prompt x = [task description; user interaction history]

Outputs: Target item description y

Pipeline Flow

Prompt Construction: Combine task description, user history, and item prefix
Group Assignment (before training): Classify items into groups based on relevance to auxiliary tokens
Training (GDRT): Fine-tune LLM using Group DRO objective to balance performance across groups
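
The prompt layout in the first step can be sketched as follows. This is a minimal illustration, not the paper's actual template: the template strings, the `build_prompt` name, and the `mask_history` flag are all assumptions. Only the three-part structure (task description, user history, item prefix) comes from the summary above.

```python
from typing import List

# Hypothetical template strings; the paper's exact wording is not given here.
TASK_DESCRIPTION = "Recommend the next item for the user."
ITEM_PREFIX = "prediction:"


def build_prompt(history: List[str], mask_history: bool = False) -> str:
    """Join task description, user history, and item prefix into one prompt.

    With mask_history=True the interaction tokens are dropped, leaving only
    the auxiliary tokens (task description and item prefix).
    """
    parts = [TASK_DESCRIPTION]
    if not mask_history:
        parts.append("History: " + "; ".join(history))
    parts.append(ITEM_PREFIX)
    return "\n".join(parts)


full_prompt = build_prompt(["Rose Face Cream", "Aloe Hand Lotion"])
masked_prompt = build_prompt(["Rose Face Cream", "Aloe Hand Lotion"], mask_history=True)
```

The masked variant is the kind of input a group assigner could score, since it contains no user-specific information.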

System Modules

LLM Backbone

Generates item recommendations token-by-token

Model or implementation: Llama-3-8B-Instruct (or similar LLMs like Baichuan2-7B, DeepSeek-7B)

Group Assigner

Assigns each training instance to a group based on 'shortcut' strength

Model or implementation: Same LLM (inference only, masked history)
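
A rough sketch of how group assignment might work, assuming each item already has a log-probability computed by the LLM with the user history masked. The equal-size bucketing rule and the `assign_groups` name are assumptions made for illustration; the summary states only that items fall into 5 groups by masked-history probability, with Group 1 being the strongest shortcut.

```python
from typing import Dict, List


def assign_groups(masked_logprob: Dict[str, float], n_groups: int = 5) -> Dict[str, int]:
    """Bucket items into equal-size groups by masked-history log-probability.

    Group 1 holds the items the model finds most likely from auxiliary tokens
    alone (strongest shortcut); group n_groups holds the least likely ones.
    """
    # Rank items from highest to lowest masked-history log-probability.
    ranked: List[str] = sorted(masked_logprob, key=masked_logprob.get, reverse=True)
    n = len(ranked)
    groups: Dict[str, int] = {}
    for rank, item in enumerate(ranked):
        # Integer arithmetic maps rank 0..n-1 onto group ids 1..n_groups.
        groups[item] = rank * n_groups // n + 1
    return groups


# Toy scores (made up): item0 is the most likely without history.
toy_scores = {f"item{i}": -float(i) for i in range(10)}
toy_groups = assign_groups(toy_scores)
```

Because the history is masked, the score depends only on the target item, so one inference pass per item is enough to label every training instance.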

Novel Architectural Elements

Dynamic re-weighting of training samples based on 'context bias' groups derived from non-personalized item probabilities
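
The re-weighting idea can be illustrated with the exponentiated-gradient update used in the standard online Group DRO algorithm. Whether GDRT uses exactly this update is an assumption; the function names and the toy loss values are hypothetical, and `eta_q` is borrowed from the step-size hyperparameter listed in this summary.

```python
import math
from typing import List


def update_group_weights(q: List[float], group_losses: List[float], eta_q: float) -> List[float]:
    """Exponentiated-gradient step: groups with higher loss gain weight."""
    scaled = [w * math.exp(eta_q * loss) for w, loss in zip(q, group_losses)]
    total = sum(scaled)
    # Renormalize so the weights stay on the probability simplex.
    return [s / total for s in scaled]


def robust_loss(q: List[float], group_losses: List[float]) -> float:
    """Weighted sum of group losses that the model parameters minimize."""
    return sum(w * loss for w, loss in zip(q, group_losses))


# Toy values (made up): group 5 has the weakest shortcut and the highest loss.
weights = [0.2] * 5
losses = [0.5, 1.0, 1.5, 2.0, 2.5]
weights = update_group_weights(weights, losses, eta_q=0.1)
step_loss = robust_loss(weights, losses)
```

In a training loop, `losses` would be the per-group mean next-token losses of the current batch, and `step_loss` would be the quantity backpropagated through the LoRA parameters.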

Modeling

Base Model: Llama-3-8B-Instruct (also tested with Llama-3-8B, Baichuan2-7B, DeepSeek-7B)

Training Method: Group Distributionally Robust Optimization for Tuning (GDRT)

Objective Functions:

Purpose: Maximize worst-case performance across groups to prevent reliance on shortcuts.

Formally: min_theta max_g L_g(theta), where L_g(theta) is the expected loss of group g.

Purpose: Standard next-token prediction loss within groups.

Formally: Negative Log-Likelihood of target tokens.
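
Written out, the two objectives combine as below. The notation is an assumption (G groups, group distribution P_g, group weights q on the simplex Δ_G); the paper's own symbols may differ. The second line is the standard equivalent form of the worst-case objective, since a maximum over groups equals a maximum over convex combinations of group losses.

```latex
\min_{\theta} \; \max_{g \in \{1,\dots,G\}} L_g(\theta),
\qquad
L_g(\theta) = \mathbb{E}_{(x,y) \sim P_g}\!\left[ -\sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]

\min_{\theta} \; \max_{q \in \Delta_G} \; \sum_{g=1}^{G} q_g \, L_g(\theta)
```

The weight vector q is what a step size such as eta_q would control in an iterative solver for the second form.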

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters only

Training Data:

Items divided into 5 groups based on relevance to auxiliary tokens (using masked history probability)
Amazon Beauty, Clothing, Toys datasets

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
group_count: 5
step_size_eta_q: Not explicitly reported in the paper

Compute: High efficiency compared to baselines like CFT; requires only one extra inference pass per item for grouping

Comparison to Prior Work

vs. SFT: GDRT explicitly prevents the model from learning shortcuts from prompt templates
vs. CFT: GDRT aligns directly with the target objective without needing complex counterfactual data generation or manual weight tuning; GDRT is computationally cheaper (1x training time vs ~2x for CFT)
vs. MACRec/TALLRec [not cited in paper]: GDRT is a training strategy applicable to any of these architectures, not a new architecture itself

Limitations

Requires pre-computation of groups based on item probabilities (though this step is inexpensive)
Focuses specifically on textual/token-level bias in LLMs and may not address ID-based biases
Experiments limited to Amazon datasets (Beauty, Clothing, Toys)

Reproducibility

Code: https://github.com/WANGBohaO-jpg/GDRT

The code is publicly available at the repository above. Key hyperparameters (learning rate, batch size) are not detailed in the main text, so they would have to be taken from the released code. Base models are open-weight.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation predicting the next item title

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Clothing (Sequential Recommendation)
Amazon Toys (Sequential Recommendation)

Metrics:

NDCG@10
HR@10 (Hit Rate)
Group Performance Standard Deviation (Fairness metric)

Statistical methodology: Not explicitly reported in the paper
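
The metrics listed above can be computed per test instance as sketched here, assuming a single ground-truth next item per user (the usual sequential-recommendation setup, in which the ideal DCG equals 1). The function names are hypothetical, and using the population standard deviation for the fairness metric is an assumption; the summary does not say which variant the paper uses.

```python
import math
from statistics import pstdev
from typing import List, Optional


def hit_rate_at_k(rank: Optional[int], k: int = 10) -> float:
    """1.0 if the target item is in the top k (rank is 1-indexed), else 0.0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def ndcg_at_k(rank: Optional[int], k: int = 10) -> float:
    """NDCG with one relevant item: 1 / log2(rank + 1) inside the top k."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)


def group_std(per_group_scores: List[float]) -> float:
    """Population standard deviation of a metric across item groups."""
    return pstdev(per_group_scores)
```

Here `rank` is the position of the true next item in the model's ranked output, or `None` when the item was not generated at all. Averaging `ndcg_at_k` within each item group and passing the five averages to `group_std` gives one number that shrinks as performance evens out across groups.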

Key Results

Main accuracy comparison shows GDRT consistently outperforming SFT and the robust baseline CFT across all datasets.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	NDCG@10	0.0402	0.0805	+0.0403
Amazon Beauty	NDCG@10	0.0460	0.0805	+0.0345
Amazon Clothing	NDCG@10	0.0416	0.0718	+0.0302
Amazon Toys	NDCG@10	0.0470	0.0617	+0.0147

Fairness analysis showing performance gap reduction between groups.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Std Dev of Group Performance	0.08	0.01	-0.07

Experiment Figures

Feature Ablation Attribution (FAA) ratios showing the relative importance of Auxiliary vs. Interaction tokens.

Percentage of recommended items coming from different 'Shortcut Groups' (Group 1 = strongest shortcut).

Main Takeaways

SFT induces severe context bias: under SFT, the attribution ratio of interaction tokens to auxiliary tokens shifts from 1:1 to 1:6, so the model leans on auxiliary prompt tokens far more than on user history.
GDRT successfully re-aligns attention; after GDRT, the ratio returns closer to 1:1 or favors interaction tokens.
GDRT improves fairness: while SFT draws 80% of its recommendations from the 'high shortcut' group (Group 1), GDRT distributes recommendations more evenly across item groups.
GDRT is efficient: training time is comparable to SFT and much faster than CFT (Counterfactual Fine-Tuning).

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of LLM fine-tuning (SFT)
Sequential recommendation formulation
Concept of Distributionally Robust Optimization (DRO)

Key Terms

Context Bias: A specific bias where the model over-relies on static prompt text (auxiliary tokens) rather than the dynamic user history input

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset using standard log-likelihood maximization

Group DRO: Group Distributionally Robust Optimization—an optimization technique that minimizes the worst-case loss across predefined groups of data to ensure robust performance

Auxiliary Tokens: Fixed parts of the prompt template (e.g., 'Task: Recommend item', 'Output:') that do not carry user-specific information

Interaction Tokens: Tokens representing the user's actual historical behavior (e.g., titles of previously watched movies)

NDCG@10: Normalized Discounted Cumulative Gain at rank 10—a measure of ranking quality where higher positions are worth more

Shortcut learning: When a model solves a task by relying on spurious correlations (like simple word co-occurrence) rather than the intended reasoning path

FAA: Feature Ablation Attribution—a method to measure how much a model relies on specific input tokens by masking them and observing output changes
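
The FAA idea can be sketched as a ratio of two ablation effects. Everything here is illustrative: `faa_ratio` and `toy_score` are hypothetical names, the toy scoring function stands in for a real model's target log-probability, and the summary does not specify how the paper masks tokens or which output it measures.

```python
from typing import Callable, List

ScoreFn = Callable[[List[str], List[str]], float]


def faa_ratio(score: ScoreFn, auxiliary: List[str], interaction: List[str]) -> float:
    """Output change from masking auxiliary tokens, relative to masking interaction tokens."""
    full = score(auxiliary, interaction)
    aux_effect = abs(full - score([], interaction))    # auxiliary tokens masked
    inter_effect = abs(full - score(auxiliary, []))    # interaction tokens masked
    if inter_effect == 0.0:
        return float("inf")
    return aux_effect / inter_effect


def toy_score(auxiliary: List[str], interaction: List[str]) -> float:
    # Made-up stand-in for a model score; each auxiliary token counts 3x as much.
    return 0.3 * len(auxiliary) + 0.1 * len(interaction)


ratio = faa_ratio(toy_score, ["task", "prefix"], ["item_a", "item_b"])
```

A ratio well above 1 indicates that the output depends more on the auxiliary tokens than on the interaction tokens, which is the pattern the summary describes for SFT.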