final hidden state: The hidden state of the last input token in the last transformer layer, which directly influences the next-token prediction.
awareness score: A metric defined as cos(s1, s2) - cos(s1, s3), quantifying how differently the model's state changes when exposed to a hallucinated vs. correct answer.
activation engineering: A technique that adds specific vectors (offsets) to a model's hidden states during inference to steer its behavior (e.g., towards truthfulness).
PCA: Principal Component Analysis, a dimensionality reduction technique used here to find the primary 'direction' of difference between correct and hallucinated hidden state transitions.
teacher-forcing: Feeding the model the actual ground truth (or specific target text) as input history, rather than its own generated output, to inspect its internal reaction to that specific text.
pro-prompting: Using prompts that encourage confidence for correct inputs and discourage confidence for hallucinated inputs to test awareness sensitivity.
anti-prompting: Using prompts that discourage confidence for correct inputs and encourage confidence for hallucinated inputs.
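
The awareness score and the offset addition used in activation engineering can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The glossary does not define s1, s2, and s3 individually, so the code treats them simply as three equal-length vectors. The function names `cosine`, `awareness_score`, and `add_offset`, and the `scale` parameter, are hypothetical and introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def awareness_score(s1, s2, s3):
    """cos(s1, s2) - cos(s1, s3), as given in the glossary entry.

    What s1, s2, and s3 represent is not specified here; they are
    treated as generic vectors (an assumption of this sketch).
    """
    return cosine(s1, s2) - cosine(s1, s3)


def add_offset(hidden, offset, scale=1.0):
    """Add a steering offset to a hidden state, element by element.

    `scale` is a hypothetical strength knob, not a parameter taken
    from the source.
    """
    if len(hidden) != len(offset):
        raise ValueError("hidden state and offset must have the same length")
    return [h + scale * o for h, o in zip(hidden, offset)]


if __name__ == "__main__":
    # Toy 2-D vectors, chosen only to make the arithmetic easy to follow.
    print(awareness_score([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
    print(add_offset([1.0, 2.0], [0.5, -0.5], scale=2.0))
```

Because each cosine lies in [-1, 1], the score lies in [-2, 2]. In a real setting the vectors would come from a model's hidden states rather than hand-written lists.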