Stackelberg game: A strategic game in which a 'leader' moves first and a 'follower' moves second after observing the leader's action; used here to model the LLM anticipating user reactions
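The leader-follower structure can be sketched with backward induction on a tiny two-move game. The payoff numbers and move names below are purely illustrative assumptions, not taken from the paper:

```python
# Hypothetical 2x2 Stackelberg game (illustrative payoffs, not from the paper).
# Keys are (leader_move, follower_move) pairs.
leader_payoff   = {("safe", "accept"): 3, ("safe", "rephrase"): 1,
                   ("risky", "accept"): 4, ("risky", "rephrase"): 0}
follower_payoff = {("safe", "accept"): 2, ("safe", "rephrase"): 1,
                   ("risky", "accept"): -1, ("risky", "rephrase"): 2}

def best_response(leader_move):
    # The follower observes the leader's move, then maximizes its own payoff.
    return max(["accept", "rephrase"],
               key=lambda f: follower_payoff[(leader_move, f)])

def stackelberg_leader():
    # The leader anticipates the follower's best response (backward induction)
    # and picks the move that maximizes its payoff given that response.
    return max(["safe", "risky"],
               key=lambda l: leader_payoff[(l, best_response(l))])

print(stackelberg_leader())  # -> "safe"
```

Note the leader forgoes the tempting (risky, accept) payoff of 4 because it anticipates the follower would answer "risky" with "rephrase", yielding 0; this anticipation step is what distinguishes a Stackelberg game from a simultaneous-move game.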
Pearl Point: The optimal response strategy that strictly adheres to safety boundaries while maximizing constructive helpfulness for a specific risk context
Lingo-BP: Linguistic Backpropagation—an optimization method that refines the model's reasoning process by propagating feedback from the Pearl Point objective back through the reasoning steps
Constructive Score: A composite metric evaluating an LLM's safety, helpfulness, and quality of guidance, specifically on non-malicious but risky queries
Jailbreak: Adversarial attacks designed to bypass an LLM's safety filters to elicit harmful content
SFT: Supervised Fine-Tuning—training a model on a labeled dataset of inputs and desired outputs
Zero-sum game: A situation in which one participant's gain exactly equals another's loss; traditional safety alignment is often framed this way, trading safety against helpfulness
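The zero-sum framing can be stated as a one-line check: across every outcome, the payoffs sum to zero. The (safety_gain, helpfulness_gain) pairs below are hypothetical numbers chosen only to illustrate the definition:

```python
# Minimal sketch of the zero-sum condition (hypothetical payoffs).
def is_zero_sum(payoff_pairs):
    # A game is zero-sum if every outcome's payoffs sum to zero.
    return all(a + b == 0 for a, b in payoff_pairs)

# (safety_gain, helpfulness_gain) for three candidate responses:
# any helpfulness lost exactly offsets safety gained.
outcomes = [(1, -1), (0, 0), (-2, 2)]
print(is_zero_sum(outcomes))  # -> True
```

The Pearl Point framing above rejects this constraint: a constructive response can improve safety without an equal and opposite loss in helpfulness.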
Prompt injection: Attacks that modify input prompts to manipulate model behavior, often to bypass restrictions
Reasoning trajectory: The sequence of intermediate thought steps or tokens generated by the model before producing the final answer