Logit Lens: A technique that projects an intermediate hidden state of an LLM into vocabulary space, typically by applying the final layer norm and the unembedding matrix, to see which tokens the model is 'thinking' of at a specific layer.
Dummy Tokens: Non-content control tokens in chat templates (e.g., <|start_header_id|>) at whose positions the model performs internal computation.
Zero Ablation: A causal intervention in which the activation of a specific component (a neuron or attention head) is set to zero to test how much the model's behavior depends on it.
Steering Vector: A direction in the model's activation space that, when added to hidden states, biases the model's behavior (e.g., towards honesty).
Pareto Frontier: The set of optimal trade-offs between two conflicting objectives (here, honesty vs. sales capability), where improving one requires sacrificing the other.
Liar Score: A metric on a 0-10 scale quantifying the degree of deception in a response, distinguishing between gibberish, minor errors, and intentional lies.
White Lie: A lie intended to be helpful or polite.
Lie by Omission: Deception achieved by deliberately leaving out key information.
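
Several of the terms above (Logit Lens, Zero Ablation, Steering Vector, Pareto Frontier) describe concrete operations, so a small sketch may help fix the ideas. The code below is a toy illustration in NumPy, not the setup used in this work: the sizes, the random unembedding matrix W_U, the simplified layer norm without learned parameters, and all function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5  # toy sizes, chosen arbitrarily

# Hypothetical unembedding matrix: maps a hidden state to vocabulary logits.
W_U = rng.normal(size=(d_model, vocab_size))


def layer_norm(h, eps=1e-5):
    """Normalize over the last axis (no learned scale or bias in this toy)."""
    mean = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def logit_lens(hidden, W_U):
    """Logit lens: project an intermediate hidden state into vocabulary space."""
    return layer_norm(hidden) @ W_U


def zero_ablate(hidden, index):
    """Zero ablation: return a copy with one component's activation set to zero."""
    out = hidden.copy()
    out[..., index] = 0.0
    return out


def steer(hidden, direction, alpha=1.0):
    """Steering: add a scaled direction to the hidden state."""
    return hidden + alpha * direction


def pareto_frontier(points):
    """Keep the points that no other point dominates (higher is better on both axes)."""
    frontier = []
    for i, (a, b) in enumerate(points):
        dominated = any(
            a2 >= a and b2 >= b and (a2 > a or b2 > b)
            for j, (a2, b2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((a, b))
    return frontier


# Toy usage: compare logits before and after ablating one component.
hidden = rng.normal(size=d_model)
baseline_logits = logit_lens(hidden, W_U)
ablated_logits = logit_lens(zero_ablate(hidden, 3), W_U)
ablation_effect = np.abs(baseline_logits - ablated_logits).max()

# Made-up (honesty, sales) scores for five hypothetical model variants.
scores = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
frontier = pareto_frontier(scores)
```

In a real model the same three interventions would be applied to activations captured inside the forward pass (for example via hooks) rather than to a random vector, and the steering direction would be derived from data rather than chosen by hand. In the toy scores, (2, 2) and (1, 1) fall off the frontier because another variant is at least as good on both objectives.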