Q-Anchored Pathway: A truthfulness encoding mechanism that relies heavily on information flow from the question's exact tokens to the answer
A-Anchored Pathway: A truthfulness encoding mechanism where signals are derived primarily from the generated answer itself, independent of the question
Exact tokens: Core frame elements in the text, such as the specific subject and property in a question or the critical entity in an answer
Attention knockout: A technique to block information flow by setting specific attention weights to zero during inference
Token patching: Replacing specific tokens in the input with tokens from a different sample to test causal effects on model representations
Saliency analysis: A method to measure how much specific components (such as input tokens or attention weights) contribute to a model's output or loss
Mixture-of-Probes (MoP): A proposed detection method using specialized classifiers (experts) for different truthfulness pathways
Pathway Reweighting (PR): A proposed method that modulates internal activation intensity to emphasize signals most relevant to truthfulness detection
AUC: Area Under the Curve (typically the ROC curve), a performance metric for classification tasks that measures how well a classifier separates the classes; 0.5 corresponds to chance and 1.0 to perfect separation
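
To make the attention knockout entry concrete, here is a minimal sketch on a toy single-query attention step. It is not the implementation behind this glossary: the function names (softmax, attend), the blocked parameter, and the toy scores and values are all hypothetical. It also assumes one common way of zeroing a weight, namely masking the pre-softmax score to negative infinity so that the remaining weights renormalize. The definition above only says the weights are set to zero, so zeroing them after the softmax without renormalizing would be an equally valid reading.

```python
import math


def softmax(scores):
    # Numerically stable softmax; entries equal to -inf get exactly zero weight.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, blocked=()):
    """Single-query attention over `values`, with optional knockout.

    `blocked` (hypothetical parameter) lists the source positions whose
    attention edge to this query is cut.
    """
    if len(set(blocked)) >= len(scores):
        raise ValueError("cannot knock out every source position")
    # Assumption: knock out by masking the score before the softmax.
    masked = [-math.inf if i in blocked else s for i, s in enumerate(scores)]
    weights = softmax(masked)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy example: three source tokens, two-dimensional value vectors.
scores = [2.0, 1.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

base_weights, base_output = attend(scores, values)
ko_weights, ko_output = attend(scores, values, blocked={0})

print("baseline weights:", base_weights)
print("knockout weights:", ko_weights)
print("output shift:", [k - b for k, b in zip(ko_output, base_output)])
```

Comparing base_output with ko_output shows how much the query's representation depended on the blocked position. In a real model the same comparison would be made on downstream hidden states or probe predictions rather than on a single attention output.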