Instruction Hierarchy (IH): A policy defining how LLMs prioritize conflicting instructions based on role authority (System > Developer > User > Tool)
IF-simple: Instruction Following-simple; tasks designed to be easy for a capable model to solve absent adversarial conflicts
Jailbreak: An attack where a user prompts the model to violate its safety guidelines or system instructions
Prompt Injection: An attack where untrusted content (e.g., from a tool output or website) overrides user or system instructions
Overrefusal: A failure mode where the model refuses benign requests due to overly conservative safety heuristics
RL: Reinforcement Learning—training models to maximize a reward signal
OOD: Out-Of-Distribution—tasks or data types not seen during training
System Message: High-priority instructions provided by the model developer/admin
Attacker Model: A frozen LLM used to generate adversarial prompts that attempt to trick the defender model
Defender Model: The model being trained (fine-tuned) to robustly follow the Instruction Hierarchy