ReGap: Reward Gap, a metric measuring the difference in implicit rewards assigned to a harmless response versus a harmful response; negative values indicate reward misspecification, i.e., the harmful response receives the higher implicit reward
implicit reward: The effective reward a model assigns to a response, derived from the log-ratio of the aligned model's probability to the reference (base) model's probability
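For concreteness, both quantities above can be written as short equations. The notation below ($\pi_\theta$ for the aligned model, $\pi_{\mathrm{ref}}$ for the reference model, $y^{+}$ for the harmless response, $y^{-}$ for the harmful one) is illustrative rather than taken from the source; some formulations also scale the log-ratio by a factor $\beta$.

$$
r_\theta(x, y) \;=\; \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\mathrm{ReGap}(x, y^{+}, y^{-}) \;=\; r_\theta(x, y^{+}) \;-\; r_\theta(x, y^{-})
$$

A negative ReGap thus means the aligned model's implicit reward prefers the harmful response $y^{-}$ over the harmless response $y^{+}$.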
GCG: Greedy Coordinate Gradient—a discrete optimization method for finding adversarial suffixes by swapping tokens to minimize target loss
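As a rough illustration of the token-swapping loop, here is a minimal sketch of a single GCG iteration in PyTorch, assuming a Hugging Face causal language model and 1-D LongTensors of token ids on the model's device. The function names (gcg_step, eval_target_loss), the candidate-sampling scheme, and the omission of batching and invalid-token filtering are simplifications of mine, not the reference implementation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def eval_target_loss(model, prompt_ids, suffix_ids, target_ids):
    """Cross-entropy of the target completion given prompt + adversarial suffix."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logits = model(ids).logits
    start = len(prompt_ids) + len(suffix_ids)
    # logits at position i predict token i+1, hence the -1 shift
    return F.cross_entropy(
        logits[0, start - 1 : start - 1 + len(target_ids)], target_ids
    ).item()


def gcg_step(model, prompt_ids, suffix_ids, target_ids, topk=256, n_trials=64):
    """One greedy coordinate gradient step: propose token swaps in the suffix
    and keep the swap that most reduces the loss on the target completion."""
    embed = model.get_input_embeddings().weight.detach()  # (vocab, dim)

    # One-hot relaxation of the suffix so we can take gradients w.r.t. token choices.
    one_hot = torch.zeros(len(suffix_ids), embed.shape[0],
                          dtype=embed.dtype, device=embed.device)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    inputs = torch.cat([embed[prompt_ids],
                        one_hot @ embed,
                        embed[target_ids]]).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits
    start = len(prompt_ids) + len(suffix_ids)
    loss = F.cross_entropy(
        logits[0, start - 1 : start - 1 + len(target_ids)], target_ids
    )
    loss.backward()

    # Most promising substitutions per suffix position (largest negative gradient).
    candidates = (-one_hot.grad).topk(topk, dim=1).indices  # (suffix_len, topk)

    # Sample random (position, substitution) pairs and keep the best by true loss.
    best_ids, best_loss = suffix_ids, loss.item()
    for _ in range(n_trials):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, torch.randint(topk, (1,)).item()]
        trial_loss = eval_target_loss(model, prompt_ids, trial, target_ids)
        if trial_loss < best_loss:
            best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss
```

Repeating gcg_step for many iterations, each time replacing the suffix with the returned best candidate, gives the basic optimization loop; the target ids are typically the tokenization of an affirmative completion such as "Sure, here is ...".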
AutoDAN: An automated jailbreak generation method that uses a genetic algorithm and hierarchical genetic search
AdvBench: A benchmark dataset of harmful behaviors used to evaluate jailbreaking attacks
ASR: Attack Success Rate—the percentage of malicious prompts for which the model generates a harmful response
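Written out (notation is illustrative):

$$
\mathrm{ASR} \;=\; \frac{\#\{\text{malicious prompts that elicit a harmful response}\}}{\#\{\text{malicious prompts evaluated}\}} \times 100\%
$$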
RLHF: Reinforcement Learning from Human Feedback—a method to align language models using reward models trained on human preferences
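For context, the KL-regularized objective commonly used in RLHF, written in standard notation (the specific symbols are not taken from the source), is:

$$
\max_{\pi_\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[\, r(x, y) \,\big]
\;-\; \beta \, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
$$

Its optimal policy satisfies $r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \text{(a term constant in } y\text{)}$, which is what motivates reading the log-ratio above as an implicit reward.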
perplexity: A measurement of how well a probability model predicts a sample; in this context, used as a proxy for the fluency/readability of the adversarial prompt
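Concretely, for a prompt $x = (x_1, \dots, x_n)$ scored by a language model $p$ (notation ours):

$$
\mathrm{PPL}(x) \;=\; \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \log p\!\left(x_i \mid x_{<i}\right) \right)
$$

Lower perplexity corresponds to text the model finds more predictable, i.e., more fluent and natural-looking.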