PFF: Parametric Faithfulness Framework—the general methodology of intervening on model parameters to test reasoning faithfulness
FUR: Faithfulness by Unlearning Reasoning steps—the specific instance of PFF that uses NPO to unlearn reasoning steps
NPO: Negative Preference Optimization—an unlearning loss function that discourages the model from generating specific 'forget' sequences
CoT: Chain of Thought—intermediate reasoning steps generated by a model before its final answer
Parametric Faithfulness: Whether the reasoning chain accurately reflects the internal computations (parameters) used to derive the answer
Contextual Faithfulness: Whether the model's answer is consistent with the provided reasoning context (measured by editing the prompt)
MCQA: Multiple-choice Question Answering—the task format used for evaluation
ff-hard: A binary metric indicating whether unlearning a reasoning chain causes the model's answer to flip
ff-soft: A continuous metric measuring the probability mass shift away from the original answer after unlearning
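
As an illustration of the two metrics above, here is a minimal sketch that assumes the model's prediction on an MCQA item is available as a probability distribution over answer options before and after unlearning. The function names, the use of the argmax as "the original answer", and the exact form of ff-soft (drop in probability of the original answer) are assumptions made for this example, not definitions taken from the source.

```python
from typing import Sequence


def _argmax(probs: Sequence[float]) -> int:
    """Index of the most likely answer option."""
    return max(range(len(probs)), key=lambda i: probs[i])


def ff_hard(p_before: Sequence[float], p_after: Sequence[float]) -> bool:
    """True if the top-ranked answer option changes after unlearning."""
    return _argmax(p_before) != _argmax(p_after)


def ff_soft(p_before: Sequence[float], p_after: Sequence[float]) -> float:
    """Probability mass that moved away from the original answer.

    Assumption: computed as the drop in probability assigned to the
    option that was most likely before unlearning.
    """
    original = _argmax(p_before)
    return p_before[original] - p_after[original]


if __name__ == "__main__":
    # Hypothetical distributions over three answer options.
    before = [0.70, 0.20, 0.10]
    after = [0.30, 0.60, 0.10]
    print(ff_hard(before, after))            # answer flips from option 0 to 1
    print(round(ff_soft(before, after), 2))  # mass lost by option 0
```

Under this reading, ff-soft can be positive while ff-hard is false: the original answer may lose probability without losing its top rank.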