ASR: Attack Success Rate, the probability that at least one response in a batch of k generated samples violates safety guidelines
Spin glass: A physics model of disordered magnetic systems where spins have conflicting interactions, creating a rugged energy landscape with many local minima
Replica symmetry breaking: A phase in spin glasses where the system settles into multiple disconnected low-energy states (clusters) rather than a single state
Prompt injection: Inserting a specific sequence of tokens into the model's input to bypass safety filters or steer behavior
GCG: Greedy Coordinate Gradient, an optimization-based attack that uses token-level gradients to automatically search for adversarial prompt suffixes
Langevin dynamics: A stochastic process used to sample from a complex probability distribution by simulating particles that drift down the gradient of an energy function while being perturbed by random noise
Gibbs measure: A probability distribution from statistical mechanics where the probability of a state is proportional to exp(-E/T), with E the state's energy and T the temperature, so lower-energy states are exponentially more likely
Magnetic field (h): In this model, an external bias applied to the system; maps to the strength or length of the injected adversarial prompt
Teacher-Student framework: Here, the 'Teacher' defines the ground truth safety landscape, and the 'Student' is the attacked model being biased by the prompt injection
Poisson-Dirichlet: A probability distribution describing the weights of clusters in the replica-symmetry-breaking phase
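
Two of the definitions above reduce to short formulas, and a minimal Python sketch may make them concrete. It is an illustration only, not the source's method. The function names `asr_at_k` and `gibbs_weights` are hypothetical. The ASR formula assumes that the k samples are independent and share the same per-sample violation probability, which the definition above does not state. The inverse temperature `beta` equals 1/T.

```python
import math


def asr_at_k(p_single, k):
    """Chance that at least one of k samples is a violation.

    Assumes independent samples with a shared per-sample violation
    probability p_single (an illustrative simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Complement rule: 1 - P(all k samples are safe).
    return 1.0 - (1.0 - p_single) ** k


def gibbs_weights(energies, beta=1.0):
    """Normalized Gibbs probabilities proportional to exp(-beta * E)."""
    if not energies:
        raise ValueError("energies must be non-empty")
    e_min = min(energies)
    # Subtracting the minimum energy leaves the normalized weights unchanged
    # and, for beta >= 0, keeps every exponent at or below zero.
    unnormalized = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(unnormalized)  # partition function of the shifted energies
    return [w / z for w in unnormalized]


if __name__ == "__main__":
    print(asr_at_k(0.5, 2))  # 0.75
    print(gibbs_weights([0.0, 1.0, 1.0], beta=2.0))
```

Under the independence assumption, ASR rises quickly with k even when a single sample is rarely unsafe. In `gibbs_weights`, raising `beta` (lowering the temperature) concentrates the probability on the lowest-energy states.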