linear representation hypothesis: The theory that high-level concepts (like truth or sentiment) are represented as linear directions (vectors) in the model's activation space
residual stream: The running per-token vector in a Transformer model to which each attention and MLP layer adds its output; a common site for probing internal states
margin score: A metric measuring how well a probe separates positive-class (e.g., factual) answers from negative-class answers. A score above zero indicates correct separation; a score below zero indicates inverted separation
CCS: Contrast-Consistent Search, an unsupervised method that finds a truth direction by requiring a statement and its negation to receive consistent (complementary) truth probabilities
on-policy: Data generated by the model itself during interaction
off-policy: Data generated by a different source (human or another model) and fed to the model as context
SAE: Sparse Autoencoder, an interpretability method that decomposes model activations into a sparse set of learned features that are intended to be individually interpretable
jailbreaking: Techniques to bypass model safety filters, often using complex prompts or role-play
ply: A single message from either the user or the model (half a conversation turn)
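
To make the "linear representation hypothesis" and "margin score" entries concrete, here is a minimal sketch of scoring activations along a single direction. The glossary does not give a formula for the margin score, so this assumes one plausible definition: the mean projection of positive-class activations minus the mean projection of negative-class activations. The function names, the toy 4-dimensional activations, and the chosen direction are all hypothetical and stand in for real residual-stream data.

```python
import numpy as np

def probe_scores(activations, direction):
    """Project each activation vector onto a unit-normalized probe direction."""
    unit = direction / np.linalg.norm(direction)
    return activations @ unit

def margin_score(pos_acts, neg_acts, direction):
    """Assumed definition: mean positive projection minus mean negative projection."""
    pos_mean = probe_scores(pos_acts, direction).mean()
    neg_mean = probe_scores(neg_acts, direction).mean()
    return float(pos_mean - neg_mean)

# Toy stand-in for activations: two noisy clusters on opposite sides of one axis.
rng = np.random.default_rng(0)
true_direction = np.array([1.0, 0.0, 0.0, 0.0])
pos = rng.normal(scale=0.1, size=(50, 4)) + true_direction
neg = rng.normal(scale=0.1, size=(50, 4)) - true_direction

margin = margin_score(pos, neg, true_direction)     # probe points the right way
inverted = margin_score(pos, neg, -true_direction)  # probe points the wrong way
print(f"margin={margin:.3f}, inverted={inverted:.3f}")
```

Under this assumed definition, flipping the probe direction flips the sign of the score, which matches the "inverted separation" reading of a negative margin.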