Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

📝 Paper Summary

Adversarial Attacks on LLMs, Safety Alignment, Scaling Laws

The success rate of jailbreaking LLMs transitions from slow polynomial growth to fast exponential growth as the length of the injected adversarial prompt increases, explained by a phase transition in a spin-glass theoretical model.

Core Problem

Safety-aligned LLMs can be 'jailbroken' to produce harmful content, but it is unknown how the attack success rate scales with the number of inference-time samples under prompt injection.

Why it matters:

Understanding scaling laws is crucial for predicting the safety risks of future, more capable models deployed with extensive inference budgets
Current defenses often assume polynomial scaling of attacks, potentially underestimating the threat of exponential success rates driven by prompt injection
Reveals a fundamental vulnerability in alignment: strong adversarial perturbations can qualitatively change the model's generation landscape from disordered to ordered unsafe states

Concrete Example: When a user asks for a 'cocktail' recipe, a model refuses. If an attacker injects a long adversarial suffix and samples the model 100 times, the probability of getting at least one harmful recipe approaches 1 exponentially fast in the number of samples rather than polynomially, drastically reducing the cost of a successful attack.

Key Novelty

SpinLLM: A Spin-Glass Model for Jailbreaking Dynamics

Models LLM generation as a spin-glass system where 'safe' and 'unsafe' outputs correspond to low-energy clusters in a rugged energy landscape
Jailbreak prompts act as an external 'magnetic field' tilting the landscape toward unsafe clusters
Predicts a phase transition: weak fields (short prompts) yield polynomial scaling of attack success, while strong fields (long prompts) trigger an ordered phase leading to exponential scaling
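
To make the two regimes concrete, the sketch below contrasts how the failure probability 1 - ASR@k decays in two toy settings. It is not the paper's spin-glass derivation: the per-sample success probability p, the Beta-distributed heterogeneity across prompts, and the function names are illustrative assumptions. With a fixed p, independent samples give 1 - ASR@k = (1 - p)^k, an exponential decay. When p varies across prompts as Beta(a, b), averaging gives B(a, b + k) / B(a, b), which decays like k^(-a) for large k.

```python
import math

def failure_fixed(p: float, k: int) -> float:
    """1 - ASR@k when every sample succeeds independently with probability p."""
    return (1.0 - p) ** k

def failure_beta(a: float, b: float, k: int) -> float:
    """E[(1 - p)^k] for p ~ Beta(a, b), computed with log-gamma for stability."""
    log_val = (math.lgamma(b + k) + math.lgamma(a + b)
               - math.lgamma(b) - math.lgamma(a + b + k))
    return math.exp(log_val)

if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    for k in (1, 10, 100, 1000):
        print(k, failure_fixed(0.05, k), failure_beta(0.5, 5.0, k))
```

On a log-log plot the Beta-averaged curve approaches a straight line with slope -a, while the fixed-p curve bends downward ever faster, which is the qualitative signature that separates polynomial from exponential scaling.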

Architecture

Conceptual illustration of the SpinLLM energy landscape. Safe and unsafe generations correspond to different clusters (valleys) in the energy landscape.

Evaluation Highlights

Empirically confirms the transition from polynomial to exponential scaling of Attack Success Rate (ASR) on Llama-2-7B and Vicuna-7B v1.5
Demonstrates that stronger models like GPT-4.5 Turbo maintain polynomial scaling (slower ASR growth) under attacks that cause exponential scaling in weaker models
Analytically derives the specific power-law exponents and exponential decay rates based on spin-glass parameters (temperature, magnetic field)

Breakthrough Assessment

8/10

Provides a rigorous theoretical grounding (spin-glass physics) for empirical scaling laws in adversarial attacks, successfully predicting a distinct phase transition in safety behavior.

⚙️ Technical Details

Problem Definition

Setting: Adversarial attack on a generative model where an attacker injects a prompt to maximize the probability that at least one of k sampled outputs is unsafe

Inputs: Benign prompt x combined with an adversarial injected prompt (magnetic field h)

Outputs: Sequence of tokens (spins) σ generated from the model's distribution

Pipeline Flow

Input Processing (Prompt + Adversarial Suffix)
Energy Landscape Mapping (Teacher defines safe/unsafe clusters)
Biased Sampling (Student generates samples under magnetic field h)
Safety Evaluation (Check if samples hit unsafe clusters)
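
The four steps above can be mimicked on a toy system small enough to enumerate exactly. The sketch below is a hedged illustration, not the paper's code: the system size, the 3-spin couplings and their normalization, the randomly chosen unsafe center, and the overlap threshold are all assumptions made for the example. It builds a teacher landscape, tilts it with a field h aligned with a hypothetical unsafe cluster center, and reports the Gibbs probability of sampling a state near that center.

```python
import itertools
import numpy as np

def unsafe_mass(h: float, n: int = 10, beta: float = 1.0,
                overlap_min: float = 0.5, seed: int = 0) -> float:
    """Gibbs probability of landing in the 'unsafe' set under field strength h."""
    rng = np.random.default_rng(seed)
    # All 2^n spin configurations with entries in {-1, +1}.
    spins = np.array(list(itertools.product((-1, 1), repeat=n)), dtype=float)
    # Teacher: 3-spin energy with Gaussian couplings (toy normalization).
    triples = list(itertools.combinations(range(n), 3))
    couplings = rng.normal(0.0, np.sqrt(3.0 / n**2), size=len(triples))
    energy = np.zeros(len(spins))
    for (i, j, l), coupling in zip(triples, couplings):
        energy -= coupling * spins[:, i] * spins[:, j] * spins[:, l]
    # Hypothetical unsafe cluster center and each state's overlap with it.
    center = rng.choice((-1.0, 1.0), size=n)
    overlap = spins @ center / n
    # Student: same landscape, tilted by a field aligned with the unsafe center.
    log_w = -beta * (energy - h * n * overlap)
    weights = np.exp(log_w - log_w.max())
    probs = weights / weights.sum()
    # Safety evaluation: total probability of states close to the unsafe center.
    return float(probs[overlap >= overlap_min].sum())

if __name__ == "__main__":
    for h in (0.0, 0.5, 1.0, 2.0):
        print(h, unsafe_mass(h))
```

Because the seed is fixed, every call sees the same landscape, so the only thing that changes between calls is the field strength.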

System Modules

Teacher Model

Defines the ground truth energy landscape and classifies low-energy clusters as safe or unsafe

Model or implementation: p-spin glass model with Gaussian disorder

Student Model

Generates responses under the influence of the adversarial prompt (magnetic field)

Model or implementation: p-spin glass model with added magnetic field h

Novel Architectural Elements

SpinLLM: Mapping LLM token generation to sampling from a p-spin glass model in the replica-symmetry-breaking phase
Modeling prompt injection as an external magnetic field h aligned with unsafe cluster centers
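
For orientation, one standard way to write the teacher and student measures is sketched below. The Ising-spin convention, the coupling normalization, the inverse temperature \beta, and the notation m for the unsafe cluster center are assumptions borrowed from the textbook p-spin model; the paper's exact conventions may differ.

```latex
\begin{align}
H_J(\sigma) &= -\sum_{1 \le i_1 < \dots < i_p \le N}
    J_{i_1 \dots i_p}\, \sigma_{i_1} \cdots \sigma_{i_p},
\qquad J_{i_1 \dots i_p} \sim \mathcal{N}\!\left(0, \frac{p!}{2N^{p-1}}\right), \\
\mu_{\text{teacher}}(\sigma) &= \frac{e^{-\beta H_J(\sigma)}}{Z(\beta, 0)}, \\
\mu_{\text{student}}(\sigma) &=
    \frac{e^{-\beta \left[ H_J(\sigma) - h \sum_{i=1}^{N} m_i \sigma_i \right]}}{Z(\beta, h)}.
\end{align}
```

In this form, setting h = 0 recovers the teacher measure, and increasing h lowers the energy of configurations aligned with m.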

Modeling

Base Model: Llama-2-7B-Chat, Vicuna-7B-v1.5, GPT-4.5 Turbo (for empirical validation)

Training Method: Theoretical modeling verified by inference-time experiments

Adaptation: None (inference-only analysis)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. Hughes et al. (2024): Extends scaling analysis to include prompt injection, identifying a new exponential scaling regime
vs. GCG: Uses GCG as a tool to generate attacks but focuses on the scaling laws of success rather than the attack method itself
vs. Arditi et al. (2024): Provides a statistical physics theoretical framework for why steering works, rather than an empirical activation analysis

Limitations

The spin-glass model is a theoretical abstraction (mean-field approximation) and does not capture the full complexity of transformer attention mechanisms
Analysis relies on the large-N (thermodynamic) limit, while real token sequences are finite
Assumes a well-trained student model whose intrinsic parameters match the teacher's, ignoring modeling errors
Empirical validation is limited to a few specific models and may not generalize to all architectures

Reproducibility

Code: https://github.com/indranilhalder/SpinLLM

The theoretical derivation is fully detailed in the paper. Empirical validation uses open-weight models (Llama-2, Vicuna) and public datasets (AdvBench).

📊 Experiments & Results

Evaluation Setup

Jailbreaking attacks on LLMs using the AdvBench dataset

Benchmarks:

AdvBench (Adversarial Prompting)

Metrics:

Attack Success Rate (ASR) @ k (probability of ≥1 success in k samples)
Statistical methodology: Not explicitly reported in the paper
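
Since the estimation procedure is not reported, the sketch below shows one common option rather than the authors' method: the combinatorial estimator widely used for pass@k in code-generation evaluation. Given n recorded samples for a prompt, of which c were judged unsafe, it gives an unbiased estimate of the probability that k samples contain at least one success. The function name and argument names are illustrative.

```python
from math import comb

def asr_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one success in k draws),
    from n recorded samples of which c were judged unsafe."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that k samples drawn without replacement contain no success.
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    # Illustrative counts only: 200 samples, 3 judged unsafe.
    for k in (1, 10, 50, 100):
        print(k, asr_at_k(200, 3, k))
```

Averaging this quantity over prompts gives a dataset-level ASR@k curve without re-sampling the model separately for every value of k.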

Key Results

Empirical validation of scaling laws on various LLMs shows distinct behaviors for weak vs. strong models.
Benchmark	Metric	Baseline	This Paper	Δ
AdvBench	ASR scaling trend	Linear fit	Non-linear deviation	Exponential correction

Experiment Figures

Scaling of Attack Success Rate (ASR) with number of samples k for GPT-4.5 Turbo vs. Vicuna-7B v1.5.

Main Takeaways

Short injected prompts (weak field) result in polynomial scaling of Attack Success Rate (ASR) with sample count k.
Long injected prompts (strong field) cause a phase transition to exponential scaling of ASR, making attacks significantly cheaper.
Stronger models (like GPT-4.5 Turbo) are more resistant to this phase transition, maintaining polynomial scaling even under attack conditions that break weaker models.
The theoretical SpinLLM model qualitatively matches empirical observations, validating the link between 'ordered' spin phases and jailbreak susceptibility.
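
One simple way to tell the two regimes apart in measured data is to check which transformation makes the failure curve 1 - ASR@k look like a straight line. The sketch below is an illustrative diagnostic under that assumption, not the paper's fitting procedure: it fits log(failure) against log k (a straight line for a power law) and against k (a straight line for an exponential) and returns the label with the smaller squared residual.

```python
import numpy as np

def classify_scaling(k, failure) -> str:
    """Label a failure curve 1 - ASR@k as 'polynomial' or 'exponential'.

    All failure values must be strictly positive so the logarithm is defined.
    """
    k = np.asarray(k, dtype=float)
    log_f = np.log(np.asarray(failure, dtype=float))

    def residual(x: np.ndarray) -> float:
        # Least-squares line through (x, log_f) and its summed squared error.
        slope, intercept = np.polyfit(x, log_f, 1)
        return float(np.sum((log_f - (slope * x + intercept)) ** 2))

    return "polynomial" if residual(np.log(k)) < residual(k) else "exponential"

if __name__ == "__main__":
    ks = np.arange(1, 51, dtype=float)
    # Synthetic curves for demonstration; not data from the paper.
    print(classify_scaling(ks, ks ** -0.7))
    print(classify_scaling(ks, np.exp(-0.1 * ks)))
```

On noisy measurements the two residuals can be close, so a comparison like this is better treated as a first look than as a statistical test.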

📚 Prerequisite Knowledge

Prerequisites

Statistical mechanics (Ising models, spin glasses)
Replica symmetry breaking
Large Language Model safety alignment
Poisson-Dirichlet processes

Key Terms

ASR: Attack Success Rate—the probability that at least one generated response in a batch of k samples violates safety guidelines

Spin-glass: A physics model of disordered magnetic systems where spins have conflicting interactions, creating a rugged energy landscape with many local minima

Replica symmetry breaking: A phase in spin glasses where the system settles into multiple disconnected low-energy states (clusters) rather than a single state

Prompt injection: Inserting a specific sequence of tokens into the model's input to bypass safety filters or steer behavior

GCG: Greedy Coordinate Gradient—an optimization-based attack method for automatically finding adversarial prompt suffixes

Langevin dynamics: A mathematical method for sampling from a complex probability distribution by simulating the movement of particles in an energy field with added noise

Gibbs measure: A probability distribution from statistical mechanics in which the probability of a state with energy E is proportional to exp(-E/T), so lower-energy states are exponentially more likely at temperature T

Magnetic field (h): In this model, an external bias applied to the system; maps to the strength or length of the injected adversarial prompt

Teacher-Student framework: Here, the 'Teacher' defines the ground truth safety landscape, and the 'Student' is the attacked model being biased by the prompt injection

Poisson-Dirichlet: A probability distribution describing the weights of clusters in the replica-symmetry-breaking phase
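
To give a feel for what Poisson-Dirichlet cluster weights look like, the sketch below draws an approximate sample by truncated stick-breaking. It uses the two-parameter construction with the second parameter set to zero, PD(alpha, 0), which is the form usually associated with one-step replica symmetry breaking; whether this parametrization matches the paper's is an assumption, and the function name, truncation level, and seed are illustrative.

```python
import numpy as np

def poisson_dirichlet_weights(alpha: float, n_sticks: int = 1000,
                              seed: int = 0) -> np.ndarray:
    """Approximate PD(alpha, 0) weights via truncated stick-breaking, sorted descending."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    i = np.arange(1, n_sticks + 1)
    # Stick fractions V_i ~ Beta(1 - alpha, i * alpha).
    v = rng.beta(1.0 - alpha, i * alpha)
    # Weight i is V_i times the stick length left after the first i - 1 breaks.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return np.sort(v * remaining)[::-1]

if __name__ == "__main__":
    w = poisson_dirichlet_weights(0.3)
    print(w[:5], w.sum())
```

Truncating at a finite number of sticks leaves a small amount of mass unassigned, so the returned weights sum to slightly less than one. Smaller alpha concentrates the mass on a few dominant clusters, while alpha close to one spreads it over many.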