Hallucination: The generation of false or misleading content, even when the model may have access to the correct facts.
Refusal: A safety mechanism in which the model declines to answer harmful or sensitive prompts.
ITI: Inference-Time Intervention—a method to improve truthfulness by shifting activations in specific attention heads during inference.
TruthX: A method that learns a truthful latent direction via an autoencoder and applies it at inference to reduce hallucinations.
SAE: Sparse Autoencoder—an unsupervised learning model used here to decompose dense activations into sparse, interpretable features.
Contrastive Influence: A metric measuring how much a model component (such as an attention head) contributes to favoring one output over another (e.g., the correct answer over an incorrect one), obtained by comparing the outputs' log-probabilities when that component is ablated.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.
Subspace Orthogonalization: A technique that constrains optimization so that updates have no effect along a chosen direction or subspace (in this case, the refusal direction).
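
As a rough illustration of the last entry, one common way to keep an update from affecting a direction is to subtract the update's projection onto that direction. The sketch below assumes this projection-based reading; the names `orthogonalize`, `refusal_direction`, and `raw_update` and the toy numbers are hypothetical, not taken from the source, and plain Python lists stand in for real weight or activation tensors.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(update: Vector, direction: Vector) -> Vector:
    """Remove from `update` its component along `direction`.

    Assumption: the protected subspace is one-dimensional and spanned
    by `direction`, so the result is update - proj_direction(update).
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    coeff = dot(update, direction) / norm_sq
    return [u - coeff * d for u, d in zip(update, direction)]


# Hypothetical toy values: a 3-dimensional update and a refusal direction.
refusal_direction = [1.0, 0.0, 0.0]
raw_update = [0.5, -0.2, 0.3]

# The projected update has no component along the refusal direction.
safe_update = orthogonalize(raw_update, refusal_direction)
```

For a subspace spanned by several directions, the same idea would apply with an orthonormal basis of that subspace, subtracting the projection onto each basis vector in turn.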