hallucination: LLM-generated content that is factually incorrect or unfaithful to the input source, often stated confidently
semantic equivalence: A relationship where two prompts share the exact same meaning, intent, and ground-truth answer, despite lexical differences
semantic coherence: The quality of a text being grammatically correct, fluent, and meaningful to a human reader (measured here via perplexity)
zeroth-order optimization: Optimization methods that do not require gradient information (derivatives), relying instead on function evaluations (e.g., querying the LLM)
GCG: Greedy Coordinate Gradient, a gradient-based attack method that optimizes a sequence of discrete tokens to force a target model behavior, often resulting in gibberish suffixes
ASR@K: Attack Success Rate at K, the percentage of samples for which at least one successful attack is found within K attempts
TTR: Type-Token Ratio, a measure of lexical diversity calculated as the ratio of unique tokens (types) to total tokens
perplexity: A metric measuring how surprised a language model is by a sequence of text, computed as the exponential of the average negative log-likelihood per token; lower perplexity indicates more natural/coherent text
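
The three quantitative terms above (ASR@K, TTR, perplexity) can be illustrated with a minimal Python sketch. The function names, the input layout (a list of per-sample attempt outcomes, a pre-tokenized word list, a list of natural-log token probabilities), and the handling of empty inputs are assumptions made for illustration, not details taken from the source. The source does not say which model or tokenizer supplies the log-probabilities.

```python
import math
from typing import Sequence


def asr_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Percentage of samples with at least one success in the first k attempts.

    `outcomes[i][j]` is True if attempt j on sample i succeeded
    (hypothetical layout, assumed for this sketch).
    """
    if not outcomes:
        return 0.0
    hits = sum(any(attempts[:k]) for attempts in outcomes)
    return 100.0 * hits / len(outcomes)


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Unique tokens divided by total tokens; 0.0 for empty input (assumed)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Three samples, two attempts each; sample 2 never succeeds.
    outcomes = [[False, True], [False, False], [True, False]]
    print(asr_at_k(outcomes, k=1))
    print(asr_at_k(outcomes, k=2))
    print(type_token_ratio("the cat sat on the mat".split()))
    # Four tokens, each assigned probability 0.25 by some model.
    print(perplexity([math.log(0.25)] * 4))
```

Note that TTR depends on sequence length and on the tokenization used, so values are only comparable across texts processed the same way.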