The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

📝 Paper Summary

Hallucination mitigation, Safety alignment, Mechanistic interpretability

Increasing factual accuracy in LLMs can inadvertently weaken safety refusals because hallucination and refusal behaviors share overlapping representations in specific attention heads. Disentangling these features mitigates the trade-off.

Core Problem

Techniques aimed at improving truthfulness (reducing hallucinations) often inadvertently degrade safety alignment, causing models to answer harmful queries they previously refused.

Why it matters:

Improving model utility (factuality) currently comes at the cost of compromising safety guardrails, creating a dangerous zero-sum game in alignment.
Even fine-tuning on benign datasets can erode refusal mechanisms due to internal feature overlap, making models vulnerable to jailbreaks.
Existing methods treat hallucination and safety as separate optimization problems, ignoring the mechanistic interference between them.

Concrete Example: When a model is steered to be more truthful using methods like ITI or TruthX, it gives more accurate answers on benchmarks like TruthfulQA but also becomes more susceptible to harmful prompts from AdvBench: attack success rates rise because the model provides instructions for illegal acts instead of refusing.

Key Novelty

Disentangled Safety-Truthfulness Fine-Tuning via Sparse Autoencoders

Identifies that 'hallucination heads' and 'refusal heads' significantly overlap; suppressing hallucination via standard methods unintentionally suppresses refusal mechanisms.
Uses Sparse Autoencoders (SAEs) to decompose attention head activations into distinct features, isolating 'refusal' directions from 'hallucination' directions.
Applies subspace orthogonalization during fine-tuning to update model weights for utility/truthfulness while mathematically constraining updates to preserve the refusal subspace.

Architecture

Conceptual illustration of the entanglement problem and the proposed solution. Left: Overlapping heads encode both refusal and hallucination. Right: SAEs disentangle these features, enabling orthogonal fine-tuning.

Evaluation Highlights

Standard truthfulness interventions (ITI, TruthX) increase jailbreak success rates on StrongReject and AdvBench, confirming the negative trade-off.
Steering along a 'negative hallucination direction' (to improve truthfulness) improves TruthfulQA performance but simultaneously increases attack success rates on harmful prompts.
The proposed method preserves refusal behavior while improving task utility, mitigating the trade-off observed in baselines.

Breakthrough Assessment

8/10

Identifies a critical, overlooked mechanism (feature entanglement) connecting two major alignment goals (safety and truthfulness) and proposes a mechanistic solution (SAE-guided disentanglement) to resolve it.

⚙️ Technical Details

Problem Definition

Setting: Aligning LLMs to be both truthful (low hallucination) and safe (high refusal) simultaneously.

Inputs: Prompts $x$ that may be factual queries or harmful requests.

Outputs: Responses $y$ that are factually correct for benign queries and refusals for harmful queries.

Pipeline Flow

Behavior Analysis (Identify Trade-off)
Mechanism Localization (Identify Overlapping Heads)
Feature Disentanglement (Train SAEs)
Constrained Fine-Tuning (Update weights while preserving refusal)

System Modules

Head Contrastive Analysis

Identify attention heads responsible for hallucination and refusal.

Model or implementation: LLaMA-3-8B-Instruct
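
One way to picture the overlap analysis is to rank heads by a per-behavior importance score and intersect the top-ranked sets. The sketch below is illustrative only: the `(layer, head)` keys, the scores, and the cutoff `k` are made-up toy values, and the paper's actual scoring procedure is not reproduced here.

```python
def top_k_heads(scores, k):
    """Return the k (layer, head) pairs with the largest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    """Jaccard similarity between two sets of heads."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical importance scores per (layer, head); not taken from the paper.
halluc_scores = {(12, 3): 0.9, (14, 7): 0.8, (20, 1): 0.4, (5, 0): 0.1}
refusal_scores = {(12, 3): 0.7, (14, 7): 0.2, (20, 1): 0.85, (5, 0): 0.05}

halluc_heads = top_k_heads(halluc_scores, k=2)
refusal_heads = top_k_heads(refusal_scores, k=2)
overlap = halluc_heads & refusal_heads        # heads important for both behaviors
overlap_ratio = jaccard(halluc_heads, refusal_heads)
```

Heads in `overlap` would be the candidates passed on to the disentanglement step.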

Sparse Autoencoder (SAE)

Decompose activations of overlapping heads to isolate refusal features.

Model or implementation: Sparse Autoencoder trained on head activations
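
A minimal sketch of what an SAE over head activations might look like, assuming a ReLU encoder, a linear decoder, and the reconstruction-plus-L1 loss described in this summary. The dimensions, initialization, and `l1_coeff` value are assumptions for illustration; the paper's SAE hyperparameters are not reported in the main text.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU features, shape (batch, n_features)
    x_hat = f @ W_dec + b_dec                # reconstruction, shape (batch, d_head)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_head, n_features = 16, 64                  # hypothetical sizes
x = rng.normal(size=(8, d_head))             # stand-in for head activations
W_enc = rng.normal(scale=0.1, size=(d_head, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_head))
b_dec = np.zeros(d_head)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

In a trained SAE, each row of `W_dec` is a candidate feature direction, and the rows associated with refusal would be the ones to protect during fine-tuning.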

Constrained Fine-Tuning

Fine-tune the model for utility while preventing degradation of refusal features.

Model or implementation: LLaMA-3-8B-Instruct

Novel Architectural Elements

Integration of SAE-derived feature constraints directly into the fine-tuning loop to disentangle specific behavioral subspaces (refusal vs. hallucination).

Modeling

Base Model: LLaMA-3-8B-Instruct

Training Method: SAE-guided constrained fine-tuning

Objective Functions:

Purpose: Minimize reconstruction error of head activations while enforcing sparsity.

Formally: SAE loss (reconstruction + L1 sparsity penalty).

Purpose: Fine-tune the model on benign data while preserving refusal directions.

Formally: Standard cross-entropy loss with gradient projection ensuring updates are orthogonal to the refusal subspace.
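
The gradient projection step can be sketched as follows, assuming the refusal subspace is given by a small set of direction vectors (for example, SAE decoder rows) and the gradient is a flat vector in the same space. The names `refusal_dirs`, `orthonormal_basis`, and `project_out` are hypothetical; how the paper applies the constraint to actual weight matrices is not specified in this summary.

```python
import numpy as np

def orthonormal_basis(directions):
    """Orthonormal basis (as columns) for the span of the given row vectors."""
    q, _ = np.linalg.qr(directions.T)        # reduced QR: q has shape (d, k)
    return q

def project_out(grad, basis):
    """Remove the component of grad that lies in the protected subspace."""
    return grad - basis @ (basis.T @ grad)

rng = np.random.default_rng(0)
d = 32
refusal_dirs = rng.normal(size=(3, d))       # hypothetical refusal directions
Q = orthonormal_basis(refusal_dirs)

grad = rng.normal(size=d)                    # stand-in for a loss gradient
grad_safe = project_out(grad, Q)             # update direction that leaves refusal untouched
```

After projection, `grad_safe` has zero inner product with every refusal direction, so a gradient step along it does not move the weights within that subspace (to first order).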

Adaptation: LoRA (rank=1) used for initial direction finding; Full/LoRA fine-tuning for final mitigation (implied)

Trainable Parameters: Attention head parameters (targeted updates)

Training Data:

TruthfulQA (for truthfulness directions)
AdvBench and StrongReject (for safety evaluation)
Benign datasets for fine-tuning trade-off analysis

Key Hyperparameters:

LoRA_rank: 1
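
To make the rank-1 setting concrete: a LoRA update adds a scaled product `B @ A` to a frozen weight matrix, and with rank 1 that product is a single outer product. The sketch below uses made-up shapes and a scaling `alpha` of 1.0, which are assumptions rather than values from the paper.

```python
import numpy as np

def lora_delta(A, B, alpha, rank):
    """Low-rank weight update: (alpha / rank) * B @ A."""
    return (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 1                  # hypothetical layer shape
W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(size=(rank, d_in))            # trainable, reads one input direction
B = rng.normal(size=(d_out, rank))           # trainable (random here for illustration;
                                             # LoRA usually initializes B to zero)
delta = lora_delta(A, B, alpha=1.0, rank=rank)
W_eff = W + delta                            # effective weight after adaptation

read_direction = A[0] / np.linalg.norm(A[0])
```

Because the update reads along a single input direction and writes along a single output direction, a rank-1 adapter is a natural probe for finding one steering direction.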

Compute: Not reported in the paper

Comparison to Prior Work

vs. ITI/TruthX: These methods improve truthfulness but degrade safety (increase jailbreak success). The proposed method mitigates this trade-off.
vs. RepE [not cited in paper]: RepE typically steers for a single concept. This paper addresses the entanglement of two conflicting concepts (truthfulness and refusal) within the same components.
Novelty: First to mechanistically link hallucination and refusal to shared attention heads and use SAEs to disentangle them during fine-tuning.

Limitations

Analysis primarily focuses on LLaMA-3-8B-Instruct; generalization to other architectures is not extensively tested in the main text.
Relies on the quality of the SAE decomposition; imperfect disentanglement could still lead to safety leaks.
Computational cost of training SAEs for multiple heads could be significant.

Reproducibility

Code: https://github.com/OmarMohammed88/Hall_Refusal

The paper uses standard open models (LLaMA-3-8B-Instruct) and benchmarks (TruthfulQA, AdvBench, StrongReject). Exact hyperparameters for SAE training and for the final constrained fine-tuning are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Measuring the impact of truthfulness interventions on safety and vice versa.

Benchmarks:

TruthfulQA (Factuality/Hallucination)
AdvBench (Safety/Refusal: harmful prompts)
StrongReject (Safety/Refusal: adversarial/strong attacks)

Metrics:

Attack Success Rate (ASR) / Jailbreak Susceptibility
MC1/MC2 scores (TruthfulQA)
Statistical methodology: Not explicitly reported in the paper
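
The metrics above can be computed roughly as follows. This is a sketch under stated assumptions: compliance labels are taken as given booleans (how the paper judges compliance is not described here), and MC1/MC2 follow the usual TruthfulQA multiple-choice definitions over per-answer log-probabilities.

```python
import math

def attack_success_rate(complied):
    """Fraction of harmful prompts on which the model complied instead of refusing."""
    return sum(complied) / len(complied)

def mc1(logprobs, correct_idx):
    """1.0 if the single correct answer has the highest log-probability, else 0.0."""
    best = max(range(len(logprobs)), key=logprobs.__getitem__)
    return float(best == correct_idx)

def mc2(logprobs, is_true):
    """Share of total answer probability mass assigned to the true answers."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, t in zip(probs, is_true) if t)
    return true_mass / sum(probs)

# Toy inputs, not results from the paper.
asr = attack_success_rate([True, False, False, True])
mc1_score = mc1([-1.0, -2.0, -0.5], correct_idx=2)
mc2_score = mc2([math.log(0.2), math.log(0.6), math.log(0.2)], [True, False, True])
```

Benchmark-level MC1 and MC2 are the averages of these per-question scores.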

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Steering towards truthfulness (negative hallucination direction) improves factual accuracy but degrades safety.
TruthfulQA	MC1	0.28	0.33	+0.05
TruthfulQA	MC2	0.43	0.50	+0.07
TruthfulQA	MC1	0.28	0.19	-0.09
Ablating refusal heads (which overlap with hallucination heads) significantly breaks safety mechanisms.
Harmful Benchmarks	Attack Success Rate	0.0	0.95	+0.95

Experiment Figures

Bar charts showing the impact of ITI and TruthX on both TruthfulQA (Factuality) and Attack Success Rate (Safety).

Heatmaps or scatter plots of Attention Head Dynamics before and after LoRA steering.

Main Takeaways

There is a measurable trade-off between truthfulness and safety: interventions that increase TruthfulQA scores (ITI, TruthX, LoRA steering) consistently increase Attack Success Rates on AdvBench and StrongReject.
Hallucination and refusal behaviors are mechanistically entangled: they share a subset of attention heads (overlapping heads).
Suppressing 'hallucination heads' (to improve factuality) inadvertently suppresses 'refusal heads', leading to weakened safety guardrails.
Standard fine-tuning on benign data can degrade safety because updates that reduce hallucination also impact the shared refusal subspace.
Disentangling these features using SAEs allows for improvements in truthfulness without the associated safety penalty.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Attention Mechanisms
Familiarity with Mechanistic Interpretability (Attention Heads, probing)
Knowledge of Safety Alignment (Refusal, Jailbreaking)
Basics of Sparse Autoencoders (SAEs)

Key Terms

Hallucination: The generation of false or misleading content despite the model potentially having access to correct facts.

Refusal: Safety mechanism where the model declines to answer harmful or sensitive prompts.

ITI: Inference-Time Intervention—a method to improve truthfulness by shifting activations in specific attention heads during inference.

TruthX: A method that learns a truthful latent direction via an autoencoder and applies it at inference to reduce hallucinations.

SAE: Sparse Autoencoder—an unsupervised learning model used here to decompose dense activations into sparse, interpretable features.

Contrastive Influence: A metric measuring how much a specific model component (like an attention head) contributes to a specific output (e.g., correct vs. incorrect answer) by comparing log-probabilities when that component is ablated.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.

Subspace Orthogonalization: A technique to constrain optimization so that updates do not affect a specific direction or subspace (in this case, the refusal direction).
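
As an illustration of the ITI entry above, an inference-time shift can be sketched as adding a scaled unit direction to a head's output. The sketch assumes the form `alpha * sigma * direction`, where `sigma` is the standard deviation of activations along the direction; the vectors and the values of `alpha` and `sigma` are placeholders.

```python
import numpy as np

def iti_shift(head_out, direction, alpha, sigma):
    """Shift a head's output by alpha * sigma along the unit steering direction."""
    unit = direction / np.linalg.norm(direction)
    return head_out + alpha * sigma * unit

rng = np.random.default_rng(0)
head_out = rng.normal(size=16)               # stand-in for one head's output
direction = rng.normal(size=16)              # hypothetical "truthful" direction
shifted = iti_shift(head_out, direction, alpha=5.0, sigma=0.5)
```

If the head being shifted also carries refusal features, the same shift moves those features too, which is the entanglement this summary describes.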