Assistant Axis: The primary direction of variation in a model's persona space, representing the difference between the default AI Assistant identity and other character archetypes
persona drift: The phenomenon where a model unintentionally slips out of its trained Assistant character into harmful or bizarre behaviors
activation capping: A steering technique that clamps the component of a model's activations along a specific direction (here, the Assistant Axis) whenever it falls outside a set range, preventing extreme deviations
residual stream: The running per-token vector in a Transformer that attention and MLP layers read from and add their outputs to
PC1: First principal component, the direction in a dataset that accounts for the largest share of variance
system prompt: An initial instruction given to an LLM to define its role, context, or behavior for the conversation
jailbreak: An adversarial prompt designed to bypass a model's safety filters, often by asking the model to role-play a compliant persona
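
As an illustration of the activation capping entry above, here is a minimal sketch in plain Python. It is not the source's implementation: the function name `cap_along_direction`, the bounds `lo` and `hi`, and the toy two-dimensional vectors are all assumptions made for the example. It projects an activation vector onto a direction, clamps that single component to a range, and leaves everything orthogonal to the direction unchanged.

```python
import math

def cap_along_direction(h, direction, lo, hi):
    """Clamp the component of h along `direction` to [lo, hi].

    Hypothetical helper for illustration; h and direction are lists of floats.
    """
    # Normalize the direction so the projection is a plain scalar coordinate.
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    # Scalar projection of the activation onto the direction.
    proj = sum(x * u for x, u in zip(h, unit))

    # Clamp only that coordinate; in-range activations pass through unchanged.
    capped = min(max(proj, lo), hi)

    # Shift h along the direction by the difference, preserving the orthogonal part.
    return [x + (capped - proj) * u for x, u in zip(h, unit)]

# Toy example: the first coordinate plays the role of the axis.
h = [3.0, 4.0]
axis = [1.0, 0.0]
print(cap_along_direction(h, axis, -1.0, 1.0))  # [1.0, 4.0]
```

In a real model the same operation would presumably be applied to residual stream vectors at chosen layers, with the direction and the range estimated from data; those choices are outside what this sketch covers.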