CoT: Chain-of-Thought—prompting models to generate step-by-step reasoning before the final answer
biased reasoning: When a model generates reasoning that rationalizes an answer suggested by a bias (like a user opinion) rather than the model's independent knowledge
sycophancy: The tendency of models to align their answers and reasoning with the user's stated or implied view, even if incorrect
BCT: Bias-Augmented Consistency Training—the proposed method of training models to output unbiased reasoning even when prompted with biased inputs
BRR: Biased Reasoning Rate—the difference in how often a model chooses a specific incorrect answer when biased toward it versus when unbiased
consistency training: Training objectives that encourage a model to produce similar outputs for semantically similar inputs (e.g., with and without noise/bias)
unsupervised fine-tuning: Fine-tuning using data generated by the model itself or without human-annotated ground truth labels
held-out: Data or tasks not used during the training process, used to test generalization