metastable: A state that appears stable but is easily destabilized by small perturbations, eventually settling into a lower-energy (more stable) state
attractor: A set of states toward which a system tends to evolve; here, honesty is framed as a stable attractor in the representation space
SLERP: Spherical Linear Interpolation, a method for interpolating between two vectors on a hypersphere along the great-circle arc that connects them (common in latent space analysis)
token-forcing: Forcing the model to output a specific token (or evaluating the probability of that token) at a specific step, rather than letting it sample freely
DailyDilemmas: An existing dataset of everyday moral scenarios, augmented in this paper to include variable costs
DoubleBind: A novel dataset introduced in this paper featuring realistic moral trade-offs where honesty incurs variable, explicitly stated costs
facsimile problem: The phenomenon where models mimic the appearance of human reasoning (e.g., moral deliberation) without the internal process actually driving the final decision
Jaccard index: The size of the intersection of two sets divided by the size of their union, ranging from 0 (disjoint) to 1 (identical); here used to measure how much the sets of scenarios that benefit from reasoning overlap across models
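
The two quantitative terms above, SLERP and the Jaccard index, can be made concrete with a short sketch. This uses the standard textbook formulas and is not taken from the paper's code; the function names `slerp` and `jaccard`, the fallback threshold, and the convention for two empty sets are illustrative assumptions.

```python
import math
from typing import Hashable, Iterable, List, Sequence


def slerp(p0: Sequence[float], p1: Sequence[float], t: float) -> List[float]:
    """Spherical linear interpolation between two unit vectors.

    Assumes p0 and p1 are already normalized to unit length and are not
    antipodal (for exactly opposite vectors the arc is not unique).
    """
    # Cosine of the angle between the vectors, clamped against rounding error.
    dot = sum(a * b for a, b in zip(p0, p1))
    dot = max(-1.0, min(1.0, dot))
    omega = math.acos(dot)

    # Assumed threshold: for nearly parallel vectors, fall back to plain
    # linear interpolation to avoid dividing by a near-zero sine.
    if omega < 1e-8:
        return [(1.0 - t) * a + t * b for a, b in zip(p0, p1)]

    sin_omega = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(p0, p1)]


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Halfway between two orthogonal unit vectors stays on the unit circle.
    midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
    print(midpoint, math.hypot(*midpoint))

    # Hypothetical scenario IDs that benefit from reasoning in two models.
    print(jaccard({1, 2, 3}, {2, 3, 4}))
```

Unlike straight-line interpolation, the SLERP midpoint keeps unit length, which is why it is preferred when the vectors' norm carries meaning. For the Jaccard example, the two hypothetical sets share two of four distinct scenario IDs, so the index is 0.5.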