MARL: Multi-Agent Reinforcement Learning—training multiple agents (here, attacker and defender roles) concurrently in a shared environment
Nash Equilibrium: A state in a game where no player can benefit by unilaterally changing their strategy while the other players keep theirs unchanged; in this context, it implies the defender's policy is robust to the attacker's best available strategy
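The equilibrium condition above can be checked mechanically: a strategy pair is a pure-strategy Nash equilibrium if neither player improves their payoff by deviating alone. The sketch below uses a hypothetical 2x2 attacker/defender payoff matrix (not from the source) to illustrate the check:

```python
# Toy illustration with made-up payoffs: payoff[(a, d)] = (attacker_reward,
# defender_reward) in a zero-sum attacker-vs-defender game.
payoff = {
    ("attack_A", "defend_A"): (-1, 1),
    ("attack_A", "defend_B"): (1, -1),
    ("attack_B", "defend_A"): (1, -1),
    ("attack_B", "defend_B"): (-1, 1),
}
attacks = ["attack_A", "attack_B"]
defenses = ["defend_A", "defend_B"]

def is_nash(a, d):
    """True if neither player gains by unilaterally switching strategy."""
    atk_u, def_u = payoff[(a, d)]
    no_better_attack = all(payoff[(a2, d)][0] <= atk_u for a2 in attacks)
    no_better_defense = all(payoff[(a, d2)][1] <= def_u for d2 in defenses)
    return no_better_attack and no_better_defense

# This matching-pennies-style game has no pure-strategy equilibrium; the
# equilibrium is mixed (each action with probability 0.5), which is why
# adversarial training converges to stochastic policies rather than a
# single fixed attack or defense.
pure_equilibria = [(a, d) for a in attacks for d in defenses if is_nash(a, d)]
print(pure_equilibria)  # → []
```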
Hidden Chain-of-Thought: A reasoning process where the model generates a thought trace (e.g., <think>...</think>) that is used for internal planning but masked from the opponent/user
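Masking the thought trace from the opponent is mechanically simple: strip the tagged span before the message is surfaced. A minimal sketch, assuming the `<think>...</think>` tag format named in the entry:

```python
import re

def mask_hidden_cot(response: str) -> str:
    """Remove <think>...</think> spans so the opponent/user sees only the
    final answer; the full trace stays available internally for planning."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

raw = "<think>Plan: probe for a refusal loophole.</think>Sure, here is my reply."
print(mask_hidden_cot(raw))  # → Sure, here is my reply.
```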
SBERT: Sentence-BERT—a modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings
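SBERT embeddings are typically compared with cosine similarity, e.g. to score how semantically close a model response is to a reference. The sketch below uses tiny hand-made vectors in place of real SBERT embeddings (which are several hundred dimensions), purely to show the comparison step:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot product divided by the product of norms.
    This is the standard metric applied to SBERT sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 4-d "embeddings" standing in for real SBERT outputs.
emb_a = [0.9, 0.1, 0.0, 0.4]
emb_b = [0.1, 0.8, 0.5, 0.0]
print(round(cosine_similarity(emb_a, emb_a), 3))  # → 1.0
```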
REINFORCE++: A lightweight, critic-free variant of the PPO algorithm designed for efficiency and stability in LLM training, avoiding the cost of learning a separate value model
SFT: Supervised Fine-Tuning—training a model on labeled examples, used here as an auxiliary loss to maintain conversational quality during RL
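Used as an auxiliary loss, the SFT term is simply added to the RL objective with a weighting coefficient, so the policy is pulled toward high-quality reference text while optimizing reward. A minimal sketch; the weight `sft_weight` and the specific loss values are hypothetical, not from the source:

```python
def combined_loss(rl_loss: float, sft_loss: float, sft_weight: float = 0.1) -> float:
    """Auxiliary-SFT objective: RL loss plus a weighted supervised
    cross-entropy term that anchors conversational quality during RL."""
    return rl_loss + sft_weight * sft_loss

# Example scalar losses (illustrative numbers only).
print(combined_loss(2.0, 1.5))  # → 2.15
```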
ASR: Attack Success Rate—the percentage of adversarial prompts that successfully elicit a harmful response from the target model
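ASR reduces to a simple ratio over judged outcomes: successful attacks divided by total attempts, as a percentage. A sketch with hypothetical judge verdicts:

```python
def attack_success_rate(outcomes) -> float:
    """ASR = (# prompts eliciting a harmful response) / (total prompts) * 100."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical judge verdicts: True = the target produced a harmful response.
verdicts = [True, False, False, True, False]
print(attack_success_rate(verdicts))  # → 40.0
```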
Cold-start: Starting the training process without any prior task-specific fine-tuning, relying on the RL process to discover strategies from scratch