Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

📝 Paper Summary

AI Safety Survey Adversarial Robustness Model Alignment

This survey establishes a comprehensive taxonomy of safety threats and defenses across six major large model categories—including VFMs, LLMs, and Agents—analyzing 574 papers to identify critical gaps in defense research.

Core Problem

The rapid deployment of large models in critical applications has introduced diverse safety risks (adversarial attacks, jailbreaks, data poisoning) that are currently studied in isolation, lacking a unified perspective.

Why it matters:

Widespread deployment in healthcare and autonomous driving makes vulnerabilities (e.g., unintended behaviors, privacy leakage) physically and ethically dangerous
Current defense research lags significantly behind attack research (~40% vs ~60%), leaving systems exposed
Existing surveys are typically narrow, focusing only on single modalities like LLMs or specific threats like jailbreaking, missing the interconnected risks in multi-modal and agentic systems

Concrete Example: In Vision Foundation Models (VFMs), a 'Patch-Fool' attack can perturb individual image patches to manipulate attention scores and alter decisions. Similarly, in Agents, 'indirect prompt injection' can occur when an agent processes a malicious webpage, causing it to execute harmful instructions unbeknownst to the user.

Key Novelty

Unified Safety Taxonomy across Modalities

Integrates safety research for six distinct model types (VFMs, LLMs, VLPs, VLMs, DMs, Agents) under a single hierarchical framework
Standardizes attack definitions (e.g., identifying 'jailbreak' counterparts in both LLMs and Diffusion Models)
Explicitly categorizes Agent-specific threats (e.g., tool manipulation, memory corruption) which are often overlooked in general model surveys

Architecture

The hierarchical structure (Taxonomy) of the survey, organizing the field into Model Types -> Attack/Defense -> Specific Categories

Evaluation Highlights

Analyzed 574 technical papers, with 71.32% focused on LLMs, DMs, and Agents
Identified that research on attacks (~60%) significantly outweighs research on defenses (~40%)
Taxonomy covers 10 distinct attack types including emerging threats like energy-latency attacks and agent memory injection

Breakthrough Assessment

9/10

A foundational reference work. It is likely the most comprehensive taxonomy to date, unifying disjoint fields (vision, language, agents) and providing a clear structured roadmap for future safety research.

⚙️ Technical Details

Problem Definition

Setting: Systematic literature review and taxonomy construction

Inputs: 574 technical papers published primarily between 2023-2025

Outputs: Hierarchical taxonomy of attacks and defenses for 6 model categories

Pipeline Flow

Scope Definition (6 Model Types)
Threat Identification (10 Attack Types)
Defense Review
Gap Analysis

System Modules

Vision Foundation Models (VFMs) (Model Scope)

Review safety of ViT and SAM architectures

Model or implementation: ViT, SAM

Large Language Models (LLMs) (Model Scope)

Review safety of text-generation models

Model or implementation: GPT-4, Llama, etc.

Agents (Model Scope)

Review safety of autonomous systems using tools/memory

Model or implementation: AutoGPT, Agent-based systems

Novel Architectural Elements

Two-level taxonomy: Category (Threat Model) -> Subcategory (Technique) applied uniformly across all 6 model types
Integration of 'Agent Safety' as a top-level category alongside base models, acknowledging the shift from static models to dynamic systems

Comparison to Prior Work

vs. Slattery et al.: This survey focuses on technical attack/defense implementation details rather than high-level risk frameworks
vs. Zhang et al. & Liu et al.: Broader scope covering 6 model types (including Agents and Diffusion Models) rather than just one modality
vs. Older Adversarial ML Surveys: Shifts focus from pure perturbation robustness to modern semantic threats like jailbreaking and prompt injection

Limitations

Defense gap: The survey highlights a persistent lack of effective defenses compared to the volume of attack research (40% vs 60%)
Rapid obsolescence: The field moves so fast that specific benchmarks listed may become outdated quickly
Focus is technical: Does not deeply cover legal, governance, or policy aspects of AI safety

Reproducibility

Code: https://github.com/xingjunm/Awesome-Large-Model-Safety

The authors provide a GitHub repository (https://github.com/xingjunm/Awesome-Large-Model-Safety) containing the structured list of 574 papers categorized by the taxonomy proposed in the survey.

📊 Experiments & Results

Evaluation Setup

Bibliometric analysis of the field of AI Safety

Metrics:

Paper distribution by Model Type
Paper distribution by Attack/Defense Type
Temporal publication trends
Statistical methodology: Descriptive statistics of the collected literature (counts and percentages)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Analysis of literature distribution reveals where the research community is focusing its efforts.
Surveyed Papers	Percentage of papers on LLMs, DMs, Agents	28.68	71.32	+42.64
Surveyed Papers	Ratio of Attack vs Defense papers	40.0	60.0	+20.0

Experiment Figures

Distribution of the 574 reviewed papers across years (2018-2025), model types, and attack/defense strategies

A cross-view heatmap/chart of temporal trends across model types and attack/defense categories

Main Takeaways

Safety research has surged post-2023 (ChatGPT era), shifting focus from pure VFMs to Generative Models (LLMs, DMs) and Agents
Jailbreak attacks are the most extensively studied threat category for GenAI, reflecting the shift from 'robustness' (error rate) to 'safety' (harmful content)
Agent safety is an emerging critical frontier, introducing unique vectors like tool manipulation and memory poisoning that do not exist in static models
There is a critical need for more scalable and effective defense mechanisms, as current research is heavily skewed toward finding new attacks

📚 Prerequisite Knowledge

Prerequisites

Understanding of large model architectures (Transformers, Diffusion)
Basic knowledge of adversarial machine learning (attacks/defenses)
Familiarity with AI agents and tool use

Key Terms

VFM: Vision Foundation Models—large-scale pre-trained vision models like ViT and SAM

VLP: Vision-Language Pre-training models—models trained on image-text pairs to learn aligned representations (e.g., CLIP)

VLM: Vision-Language Models—models capable of processing and generating both images and text (e.g., GPT-4V)

DM: Diffusion Models—generative models that create data (images/audio) by reversing a noise addition process

Jailbreak: Attacks that bypass a model's safety guardrails (alignment) to elicit prohibited or harmful content

Prompt Injection: Attacks where malicious instructions are inserted into the input context to hijack the model's intended task

Backdoor: A hidden vulnerability injected during training that causes the model to behave maliciously only when a specific 'trigger' is present in the input

Adversarial Attack: Subtle, often imperceptible perturbations to input data designed to cause model error

Indirect Prompt Injection: An attack on agents where the malicious prompt is hidden in external data (e.g., a webpage) that the agent retrieves, rather than being typed by the user

Agentic AI: LLM-based systems that can autonomously plan, reason, and use tools to accomplish complex tasks