ITI: Inference-Time Intervention, a technique that shifts model activations along chosen directions during inference to steer behavior without changing weights
Attention Head: A component of the Transformer architecture that attends to different parts of the input sequence; ITI targets specific heads
Linear Probe: A simple classifier trained on the internal activations of a neural network to predict a property (here, truthfulness)
Residual Stream: The main vector path in a Transformer that layers read from and write to; ITI modifies attention-head activations before they are written back into this stream
TruthfulQA: A benchmark dataset designed to test whether language models mimic human falsehoods and misconceptions
RLHF: Reinforcement Learning from Human Feedback, a standard method for aligning models; ITI aims to be a cheaper alternative to it
MHA: Multi-Head Attention, the Transformer mechanism that runs several attention heads in parallel so the model can attend to different positions at once
Activation Space: The high-dimensional vector space containing the intermediate outputs of neurons within the model
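
To make the Linear Probe and ITI entries concrete, the following is a minimal sketch, not the method's actual implementation. It trains a logistic-regression probe on synthetic vectors that stand in for one attention head's activations, then shifts an activation along the probe's direction. All names (make_sample, train_linear_probe, shift_activation), the 4-dimensional head size, the synthetic data, and the strength parameter alpha are assumptions made for illustration; a real setup would read activations from a model and pick heads by probe accuracy.

```python
import math
import random

rng = random.Random(0)
DIM = 4  # hypothetical head dimension, kept tiny for readability

def make_sample(label):
    # Synthetic stand-in for a head activation: axis 0 carries the label signal.
    center = 1.0 if label == 1 else -1.0
    return [rng.gauss(center, 0.3)] + [rng.gauss(0.0, 0.3) for _ in range(DIM - 1)]

# Label 1 = "truthful", label 0 = "not truthful" (assumed encoding).
labels = [i % 2 for i in range(200)]
activations = [make_sample(y) for y in labels]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_probe(xs, ys, lr=0.5, epochs=200):
    """Fit logistic-regression weights (w, b) with plain gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            # Prediction error of the sigmoid output for this sample.
            err = 1.0 / (1.0 + math.exp(-(dot(w, x) + b))) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of samples whose predicted class matches the label."""
    hits = sum(int((dot(w, x) + b > 0) == (y == 1)) for x, y in zip(xs, ys))
    return hits / len(ys)

def shift_activation(x, direction, alpha):
    """Add alpha times the unit-normalized direction to the activation x."""
    norm = math.sqrt(dot(direction, direction))
    return [xi + alpha * d / norm for xi, d in zip(x, direction)]

w, b = train_linear_probe(activations, labels)
accuracy = probe_accuracy(w, b, activations, labels)

# Steer one "not truthful" activation along the probe direction.
shifted = shift_activation(activations[0], w, alpha=2.0)
print("probe accuracy:", accuracy)
print("probe score before:", dot(w, activations[0]) + b)
print("probe score after: ", dot(w, shifted) + b)
```

The weights stay untouched throughout: only the activation vector changes, which is the sense in which the intervention happens at inference time. In this sketch the accuracy is measured on the training samples themselves; a held-out split would be the more careful choice.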