CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent

📝 Paper Summary

Adversarial Attacks on Recommender Systems, LLM-based Agents

CheatAgent employs a Large Language Model as an autonomous attack agent to generate adversarial textual perturbations that mislead black-box LLM-empowered recommender systems into making incorrect recommendations.

Core Problem

LLM-empowered recommender systems are vulnerable to adversarial attacks, but traditional Reinforcement Learning (RL) attackers fail because they lack the language understanding and reasoning capabilities to manipulate complex textual inputs effectively.

Why it matters:

LLM-empowered RecSys are increasingly deployed in high-stakes environments (finance, healthcare), making safety vulnerabilities critical.
Existing black-box attackers (RL-based) cannot process textual item titles/descriptions or reason about context, leaving a gap in evaluating LLM-RecSys robustness.
Security concerns require testing systems under practical black-box settings where model weights are inaccessible.

Concrete Example: In a recommender system prompt asking for the top item for 'User_637' who liked 'item_1009', an attacker wishes to insert specific words or fake items into the history to force the system to recommend an irrelevant 'item_1072'. Traditional RL agents struggle to craft natural language perturbations to achieve this in a text-based prompt.

Key Novelty

CheatAgent Framework (LLM-as-Attacker)

Replaces traditional RL attack agents with an LLM-based agent that possesses human-like reasoning and open-world knowledge to craft effective textual perturbations.
Introduces an 'Insertion Positioning' strategy to identify optimal locations in the prompt for perturbation with minimal modification.
Utilizes a self-reflection policy optimization (via prompt tuning) to iteratively improve the attack strategy based on feedback from the victim system.

Architecture

Conceptual illustration of the CheatAgent framework attacking an LLM-empowered RecSys.

Breakthrough Assessment

7/10

This is the first work to investigate the safety vulnerability of LLM-empowered RecSys specifically using an LLM-based agent. It addresses the limitations of RL-based attacks in text-heavy contexts.

⚙️ Technical Details

Problem Definition

Setting: Black-box untargeted adversarial attack on LLM-empowered Recommender Systems

Inputs: Input prompt X containing prompt template P, user u, and interaction history V

Outputs: Adversarial perturbation delta (words or items) inserted into X to maximize loss L_Rec

Pipeline Flow

Insertion Positioning (Identify vulnerable locations)
LLM Agent-Empowered Perturbation Generation (Generate attack)
Victim RecSys (Receive attack and output prediction)
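
The three stages above can be sketched as one function that wires them together. This is a minimal illustration, not the paper's implementation: `find_positions`, `generate_perturbation`, and `victim` are hypothetical placeholder callables, and the prompt is assumed to be a simple token list.

```python
from typing import Callable, List, Sequence, Tuple


def run_attack(
    prompt: Sequence[str],
    find_positions: Callable[[Sequence[str]], List[int]],
    generate_perturbation: Callable[[Sequence[str], int], str],
    victim: Callable[[Sequence[str]], str],
) -> Tuple[List[str], str]:
    """Wire the three pipeline stages together (all callables are hypothetical stubs)."""
    adversarial = list(prompt)
    # Stage 1: choose where to insert. Positions are processed right to left
    # so that an insertion never shifts the index of a position still pending.
    for pos in sorted(find_positions(prompt), reverse=True):
        # Stage 2: ask the attack agent for a token to place at this position.
        adversarial.insert(pos, generate_perturbation(prompt, pos))
    # Stage 3: query the black-box victim with the perturbed prompt.
    return adversarial, victim(adversarial)
```

Keeping the victim behind a plain callable mirrors the black-box setting: the attacker only sees what the function returns.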

System Modules

Insertion Positioning

Identify the input positions where inserting perturbations will have maximum impact with minimal modification

Model or implementation: Not explicitly specified in text
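
Because the summarized text does not say how positions are scored, the sketch below shows one common heuristic purely as an assumption: mask each token in turn, query a loss oracle, and rank positions by how much the loss moves. The names `rank_positions`, `loss_fn`, and `mask_token` are hypothetical, and access to a scalar loss is itself an assumption about what the attacker can observe.

```python
from typing import Callable, List, Sequence


def rank_positions(
    tokens: Sequence[str],
    loss_fn: Callable[[Sequence[str]], float],
    top_k: int = 2,
    mask_token: str = "[MASK]",
) -> List[int]:
    """Return the top_k token positions whose masking changes the loss the most."""
    base = loss_fn(tokens)
    impact = []
    for i in range(len(tokens)):
        masked = list(tokens)
        masked[i] = mask_token  # hide one token and observe the effect
        impact.append((abs(loss_fn(masked) - base), i))
    # Highest impact first; ties are broken in favor of the earlier position.
    impact.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in impact[:top_k]]
```

This costs one victim query per token, which matters in a black-box setting where queries may be rate-limited.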

LLM Attack Agent

Generate adversarial perturbations (text or items) to insert at the identified positions

Model or implementation: LLM (Specific architecture not detailed in snippet)

Self-Reflection Optimization

Fine-tune the attack policy using feedback from the victim RecSys, bridging the gap between the LLM agent's general knowledge and the specific recommendation domain

Model or implementation: Policy Optimization via Prompt Tuning

Novel Architectural Elements

Integration of an LLM as the adversarial agent specifically for attacking other LLM-based systems (Agent-vs-Agent)
Two-stage framework combining explicit position identification with generative perturbation crafting

Modeling

Base Model: LLM (Architecture not specified in snippet, likely separate from victim)

Training Method: Prompt Tuning / Self-Reflection Policy Optimization

Objective Functions:

Purpose: Maximize the recommendation loss of the victim system to cause incorrect predictions.

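Formally: delta* = argmax_delta L_Rec(X_hat, Y), where X_hat is the prompt X with delta inserted and Y is the ground-truth target, subject to a bound on the Hamming distance between X and X_hat.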
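
To make the Hamming-distance constraint concrete, the sketch below checks a perturbation budget under one stated assumption: perturbations are insert-only, so the distance reduces to the number of inserted tokens. The function names and the `budget` parameter are hypothetical and do not come from the paper.

```python
from typing import Sequence


def inserted_count(original: Sequence[str], adversarial: Sequence[str]) -> int:
    """Count inserted tokens; raise if adversarial is not original plus insertions."""
    remaining = iter(adversarial)
    # Subsequence check: each original token must appear, in order, in adversarial.
    for tok in original:
        if not any(tok == cand for cand in remaining):
            raise ValueError("adversarial input deletes or reorders original tokens")
    return len(adversarial) - len(original)


def within_budget(original: Sequence[str], adversarial: Sequence[str], budget: int) -> bool:
    """True if the number of inserted tokens does not exceed the budget."""
    return inserted_count(original, adversarial) <= budget
```

A candidate X_hat that fails this check would be rejected before it is ever sent to the victim.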

Adaptation: Prompt Tuning

Trainable Parameters: Prompts / Soft Prompts (implied by 'prompt tuning')

Compute: Not reported in the paper

Comparison to Prior Work

vs. KGAttack/PoisonRec: CheatAgent uses an LLM agent instead of RL, enabling it to handle complex textual inputs and context which RL agents fail to process effectively.
vs. Traditional Methods: CheatAgent incorporates open-world knowledge and reasoning via the LLM, whereas traditional methods optimize from scratch without such priors.

Limitations

The computational cost of using an LLM as an attack agent is likely higher than that of simple RL agents.
The approach assumes the victim's outputs are observable, as in the standard black-box setting; if outputs are hidden, the feedback loop that drives self-reflection breaks.
Success depends on the LLM agent's ability to align its domain knowledge with the specific RecSys domain.

Reproducibility

No replication artifacts are mentioned in the paper snippet. It refers to 'three real-world datasets' but does not name them, and it provides no specific hyperparameters or model weights.

📊 Experiments & Results

Evaluation Setup

Black-box attack on LLM-empowered RecSys

Benchmarks:

Three real-world datasets (Sequential Recommendation / Top-N Recommendation)

Metrics:

Recommendation Loss (Negative Log-Likelihood)
Recommendation Performance (presumably Hit Ratio (HR) and NDCG, though neither is explicitly named in the snippet)
Statistical methodology: Not explicitly reported in the paper
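
If the performance metrics are indeed HR and NDCG, which the snippet does not confirm, their standard single-target forms can be computed as below. The cutoff `k` and the function names are illustrative assumptions, not values or identifiers from the paper.

```python
import math
from typing import Sequence


def hit_ratio_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG@k with a single relevant item: 1 / log2(rank + 2), rank being 0-based."""
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    return 1.0 / math.log2(top.index(target) + 2)
```

A successful untargeted attack shows up as a drop in both values, averaged over users, relative to the unperturbed prompts.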

Main Takeaways

The paper claims that extensive experiments demonstrate the safety vulnerability of LLM-empowered RecSys against the proposed attacks.
The method focuses on 'untargeted attacks', which aim to degrade recommendation quality so that irrelevant items are recommended, rather than to promote one chosen target item.
The framework allows for both prompt perturbations (adding words) and profile perturbations (adding fake item interactions).
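
The two perturbation types can be illustrated on a toy prompt. The template string and helper names below are hypothetical, loosely modeled on the 'User_637' example earlier in this summary, and whitespace tokenization is a simplifying assumption.

```python
from typing import List

# Hypothetical template; the victim system's real template is not given in the snippet.
TEMPLATE = "What is the top recommended item for {user} who interacted with {history}?"


def build_prompt(user: str, history: List[str]) -> str:
    """Fill the template with a user ID and a comma-separated interaction history."""
    return TEMPLATE.format(user=user, history=", ".join(history))


def perturb_prompt(prompt: str, word: str, position: int) -> str:
    """Prompt perturbation: insert one word at a whitespace-token position."""
    tokens = prompt.split()
    tokens.insert(position, word)
    return " ".join(tokens)


def perturb_profile(history: List[str], fake_item: str, position: int) -> List[str]:
    """Profile perturbation: insert a fake interaction; the input list is left unchanged."""
    return history[:position] + [fake_item] + history[position:]
```

For example, `build_prompt("User_637", perturb_profile(["item_1009"], "fake_item_1", 1))` yields a prompt whose history contains one injected interaction, while `perturb_prompt` edits the surrounding wording and leaves the history alone.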

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (RecSys)
Adversarial Attacks (Black-box setting)
Large Language Models (LLMs) and Prompt Tuning

Key Terms

LLM-empowered RecSys: Recommender systems that use Large Language Models to encode textual information (reviews, titles) and reason about user preferences

Black-box attack: An attack setting where the adversary has no access to the victim model's gradients or parameters, only inputs and outputs

Perturbation: Malicious modifications (e.g., inserted words or fake item interactions) added to the input data to deceive the model

RL-based agents: Reinforcement Learning agents used in prior work to attack recommender systems by learning policies to inject malicious items

Hamming distance: The number of positions at which two sequences differ; used here to bound how far the adversarial input may deviate from the original

Self-reflection: A mechanism allowing the LLM agent to analyze feedback from the victim system to refine its future attack strategies