
Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Aneta Zugecova, D. Macko, Ivan Srba, Róbert Móro, Jakub Kopal, Katarina Marcincinova, Matús Mesarcík
Kempelen Institute of Intelligent Technologies, University of Copenhagen, Comenius University in Bratislava
Annual Meeting of the Association for Computational Linguistics (2024)
P13N Factuality Benchmark

📝 Paper Summary

AI Safety & Misuse | Personalized Text Generation
This study shows that instructing large language models to personalize disinformation for specific target groups reduces safety-filter activations, so the personalization request effectively acts as a jailbreak while the models still produce high-quality targeted propaganda.
Core Problem
LLMs include safety filters to prevent the generation of harmful content like disinformation, but it is unclear whether asking models to personalize content for specific demographics bypasses these protections.
Why it matters:
  • Malicious actors could misuse LLMs to micro-target disinformation at scale, making it more persuasive than generic fake news
  • Current safety evaluations mostly focus on generic requests or closed-source models (e.g., those from OpenAI), leaving little data on how open-weights models respond to personalization requests
  • The interaction between personalization capabilities and safety mechanisms acts as a potential 'jailbreak' that developers have not adequately addressed
Concrete Example: When asked to write a generic disinformation article, a model might refuse. However, when asked to write the same article for a specific audience such as 'European conservatives', described with detailed attributes, the model often complies and generates the text.
Key Novelty
PerDisNews Benchmark & Personalization-as-Jailbreak Analysis
  • Creates a new dataset (PerDisNews) of 2,268 disinformation articles covering 6 narratives and 7 target groups (e.g., Seniors, Liberals), generated by 6 state-of-the-art (SOTA) LLMs
  • Demonstrates that providing detailed target group descriptions in prompts functions as a jailbreak, consistently lowering refusal rates compared to non-personalized prompts
  • Validates a scalable meta-evaluation pipeline (using LLMs to judge other LLMs) for assessing personalization quality, showing strong correlation with human annotators
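The dataset counts and the refusal-rate comparison above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the function `refusal_rate_by_level`, the level labels `"none"` and `"detailed"`, and the toy records are all hypothetical. Only the four constants come from the summary. Dividing 2,268 articles by 6 models gives 378 per model, which matches the per-model request count reported in the evaluation highlights, and 378 divided by the 42 narrative-group pairs gives 9 generations per pair (how those 9 are split is not stated here).

```python
from collections import defaultdict

# Figures stated in the summary.
TOTAL_ARTICLES = 2268
NUM_MODELS = 6
NUM_NARRATIVES = 6
NUM_TARGET_GROUPS = 7

# Derived counts (plain arithmetic on the stated figures).
per_model = TOTAL_ARTICLES // NUM_MODELS
per_pair = per_model // (NUM_NARRATIVES * NUM_TARGET_GROUPS)


def refusal_rate_by_level(records):
    """Return the fraction of refused requests per personalization level.

    `records` is an iterable of (level, refused) pairs, where `refused`
    is True when the safety filter was triggered. Hypothetical helper.
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for level, refused in records:
        totals[level] += 1
        refusals[level] += int(refused)
    return {level: refusals[level] / totals[level] for level in totals}


# Hypothetical toy labels, NOT data from the paper.
toy_records = [
    ("none", True), ("none", True), ("none", False), ("none", False),
    ("detailed", True), ("detailed", False),
    ("detailed", False), ("detailed", False),
]

rates = refusal_rate_by_level(toy_records)
print(per_model, per_pair, rates)
```

With real annotations in place of `toy_records`, a lower rate for the detailed level than for the non-personalized level would correspond to the jailbreak effect described above.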
Evaluation Highlights
  • Personalization functions as a jailbreak: Safety filter activation dropped from 5.2% (no personalization) to 3.5% (detailed personalization) across all models
  • Gemma-2-27b was the safest model, refusing 152 out of 378 requests (about 40%), while other models such as Mistral-Nemo and Llama-3.1-70B showed negligible refusals
  • Meta-evaluation of personalization quality using an ensemble of 3 LLMs achieved a strong Spearman correlation (ρ = 0.76) with human judgments
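To make the meta-evaluation metric concrete, the sketch below averages the scores of three judges per item and computes Spearman's ρ against human scores. It is an illustration under assumptions, not the paper's implementation: how the ensemble is actually aggregated is not stated in this summary, so simple averaging is assumed, and all function names and score lists are hypothetical. Spearman's ρ is computed in the standard way, as the Pearson correlation of the rank vectors, with tied values given the mean of their ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def ensemble_scores(judge_scores):
    """Average per-item scores across judges (assumed aggregation rule)."""
    return [mean(item) for item in zip(*judge_scores)]


# Hypothetical scores for five texts, NOT data from the paper.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 3, 5, 4]
judge_c = [1, 3, 2, 4, 5]
human = [1, 2, 3, 5, 4]

rho = spearman(ensemble_scores([judge_a, judge_b, judge_c]), human)
print(round(rho, 2))  # 0.9 for this toy data
```

Rank correlation suits this setting because judge and human scores only need to agree on the ordering of texts, not on the absolute scale.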
Breakthrough Assessment
7/10
Provides critical empirical evidence that personalization bypasses safety filters (a specific type of jailbreak). While not a new model architecture, the findings on safety vulnerabilities in SOTA models are significant.