prompt injection: An attack where a user inputs malicious instructions that override the AI's programming or safety guidelines
indirect prompt injection: An attack where malicious instructions are hidden in external data (e.g., a webpage or email) that the agent processes, causing it to act maliciously without the user explicitly asking
red-teaming: The practice of simulating adversarial attacks to find vulnerabilities in a system
jailbreak: A specific type of prompt injection designed to bypass safety filters and elicit forbidden content
ASR: Attack Success Rateโthe percentage of adversarial attempts that successfully cause the model to violate its policy
tool call: An action taken by an AI agent to interact with an external API or function (e.g., 'send_email', 'transfer_funds')
system prompt: The initial set of instructions given to an AI model that defines its behavior, persona, and constraints