LLM Agent: A system where an LLM acts as a decision maker to plan, invoke tools, and act in an environment while maintaining state
Red Teaming: Offensive security testing where agents act as attackers to find vulnerabilities in systems
Blue Teaming: Defensive security operations where agents monitor, detect, and respond to threats
Prompt Injection: Attacks that embed malicious instructions in the input to manipulate the model's behavior
Indirect Prompt Injection: Prompt injection delivered through external content the agent consumes (e.g., a webpage) rather than through a direct user prompt
Jailbreak: Techniques to bypass a model's safety alignment and refusal training
RAG: Retrieval-Augmented Generation, which retrieves external data at inference time to ground the model's responses
CoT: Chain-of-Thought, a prompting technique in which the model generates intermediate reasoning steps before its final answer
Goal Hijacking: Attacks that alter the agent's primary objective to serve a malicious secondary goal
Reward Hacking: Exploiting flaws in a reinforcement learning reward function to maximize the score without achieving the intended outcome
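
The gap between a reward function and the intended outcome, as described in the Reward Hacking entry, can be illustrated with a toy sketch. Everything here is hypothetical and chosen only for illustration: the task (sorting a list), the flawed `proxy_reward`, and the two hand-written policies. No learning takes place; the point is only that a degenerate output can earn the same score as a correct one.

```python
# Toy illustration of reward hacking. All names are hypothetical.

def proxy_reward(output: list, expected_length: int) -> float:
    """Flawed reward: checks only ordering and length.

    It never verifies that the output contains the input's elements,
    which is the flaw a policy can exploit.
    """
    is_ordered = all(a <= b for a, b in zip(output, output[1:]))
    return 1.0 if is_ordered and len(output) == expected_length else 0.0


def achieves_intended_outcome(output: list, data: list) -> bool:
    """The real goal: the output is the sorted version of the input."""
    return output == sorted(data)


def honest_policy(data: list) -> list:
    """Solves the task as intended."""
    return sorted(data)


def hacking_policy(data: list) -> list:
    """Ignores the input and emits a constant list, which is trivially ordered."""
    return [0] * len(data)


data = [3, 1, 2]
honest_score = proxy_reward(honest_policy(data), len(data))
hacked_score = proxy_reward(hacking_policy(data), len(data))
honest_ok = achieves_intended_outcome(honest_policy(data), data)
hacked_ok = achieves_intended_outcome(hacking_policy(data), data)
```

Both policies receive the maximum proxy reward, but only `honest_policy` satisfies `achieves_intended_outcome`. An optimizer that sees only `proxy_reward` has no reason to prefer the honest behavior.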