One SPACE to Rule Them All: Jointly Mitigating Factuality and Faithfulness Hallucinations in LLMs

📝 Paper Summary

Hallucination suppression | Activation steering / Model editing

SPACE identifies and edits a shared subspace of neural activations where both factuality and faithfulness intersect, allowing simultaneous improvement of both metrics without the trade-offs inherent in single-task optimization.

Core Problem

Existing methods mitigate factuality and faithfulness hallucinations independently, but interventions targeting one type often degrade performance on the other due to distorted activation subspaces.

Why it matters:

LLM reliability is compromised when fixing one error type introduces another (e.g., increasing factual accuracy but causing the model to ignore user instructions)
Current approaches force a trade-off: TruthX improves factuality but hurts faithfulness on PDTB, while CAD improves faithfulness but degrades factuality
Theoretical analysis reveals that divergent optimization directions during training pull the activation patterns for these two tasks apart

Concrete Example: A model optimized for factuality (e.g., with TruthX) correctly states 'Canberra is the capital of Australia'. Asked the trick question 'Who runs faster, the turtle or the rabbit?', however, it might answer 'The cheetah runs fastest', ignoring the fable the question refers to (a faithfulness failure). Conversely, a model optimized for faithfulness might respect the prompt but hallucinate facts.

Key Novelty

SPACE (Spatial Processing for Activated Combined Embeddings)

Identifies a 'shared subspace' in the model's activations where neurons contribute to both factuality and faithfulness, rather than treating them as disjoint tasks
Uses a hybrid probe strategy combining contrastive learning and spectral clustering to pinpoint these intersectional features
Applies targeted editing vectors to specific attention heads during inference to steer the model into this shared optimal state

Architecture

The four-stage SPACE framework: Profiling, Probing, Cluster Fusion, and Editing.

Evaluation Highlights

Outperforms the TruthX and CAD baselines by improving factuality on TruthfulQA and faithfulness on PDTB at the same time (superiority is claimed, but specific numeric deltas are not summarized in the text)
Demonstrates the existence of a theoretical trade-off between factuality and faithfulness in standard models like Llama-2-7b, which SPACE successfully mitigates
Validates the 'killing two birds with one stone' effect, where a single intervention enhances performance across distinct hallucination categories

Breakthrough Assessment

7/10

Novel theoretical framing of the factuality-faithfulness trade-off and a geometric solution via shared subspace editing. While effective, it builds on existing steering concepts.

⚙️ Technical Details

Problem Definition

Setting: Inference-time intervention on pre-trained LLMs to reduce hallucinations

Inputs: Natural language prompt q

Outputs: Generated response with reduced factuality and faithfulness errors

Pipeline Flow

Neural Activation Profiling (extracts activations from correct/incorrect pairs)
Contrastive Neural Probing (identifies relevant neurons/heads)
Semantic Cluster Fusion (merges features via HDBSCAN to find shared direction)
Dynamic Space Editing (applies adjustment vectors during inference)

System Modules

Neural Activation Profiling

Collect activations from attention heads using dataset pairs (question + correct/incorrect answer)

Model or implementation: Llama3.2-1B (target model for analysis)

Contrastive Neural Probing

Train probes to distinguish truthful/faithful representations from hallucinations

Model or implementation: Linear probes with orthogonality constraints
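The probing step can be illustrated with a minimal sketch: fit one linear (logistic-regression) probe per attention head and rank heads by how well the probe separates correct from hallucinated examples. Everything here is an assumption for illustration: the activations are synthetic Gaussian data, the head names are hypothetical, and the orthogonality constraint is left out for brevity. It is not the paper's implementation.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(xs, ys, lr=0.5, epochs=200):
    """Fit a linear probe w.x + b with full-batch gradient descent on log loss."""
    dim = len(xs[0])
    w, b, n = [0.0] * dim, 0.0, len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose sign of w.x + b matches the 0/1 label."""
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(xs)

def make_head_data(rng, separation, n=100, dim=4):
    """Synthetic stand-in for one head's activations (assumption, not real data).

    Label 1 (correct answer) is shifted by +separation on axis 0,
    label 0 (hallucinated answer) by -separation.
    """
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += separation if y == 1 else -separation
        xs.append(x)
        ys.append(y)
    return xs, ys

rng = random.Random(0)
heads = {
    "informative_head": make_head_data(rng, separation=3.0),    # hypothetical name
    "uninformative_head": make_head_data(rng, separation=0.0),  # hypothetical name
}
scores = {}
for name, (xs, ys) in heads.items():
    w, b = train_probe(xs, ys)
    scores[name] = probe_accuracy(w, b, xs, ys)

# Heads whose probes separate the classes best are candidates for editing.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In a real setting the accuracy would be measured on held-out pairs instead of the training set, and the inputs would be activations recorded from the model.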

Semantic Cluster Fusion

Identify the shared direction vector by clustering embeddings from both tasks

Model or implementation: HDBSCAN clustering + Contrastive learning
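As a rough illustration of what a shared direction means, the sketch below pools positive and negative activations from both tasks and takes the difference of the pooled centroids. This is a deliberately simplified stand-in: it replaces HDBSCAN and the contrastive objective with plain centroids, and the 2-D toy points and function names are assumptions made for the example.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def task_direction(pos, neg):
    """Direction from a task's negative centroid to its positive centroid."""
    return subtract(centroid(pos), centroid(neg))

def shared_direction(fact_pos, fact_neg, faith_pos, faith_neg):
    """Centroid-difference direction computed on the pooled data of both tasks."""
    return subtract(centroid(fact_pos + faith_pos), centroid(fact_neg + faith_neg))

# Toy activations (assumption): factuality varies along axis 0, faithfulness along axis 1.
fact_pos = [[2.0, 1.0], [2.0, 0.0]]
fact_neg = [[0.0, 0.0], [0.0, 1.0]]
faith_pos = [[1.0, 2.0], [0.0, 2.0]]
faith_neg = [[0.0, 0.0], [1.0, 0.0]]

fact_dir = task_direction(fact_pos, fact_neg)
faith_dir = task_direction(faith_pos, faith_neg)
shared = shared_direction(fact_pos, fact_neg, faith_pos, faith_neg)

print("task directions:", fact_dir, faith_dir)
print("shared direction:", shared)
print("alignment:", cosine(shared, fact_dir), cosine(shared, faith_dir))
```

In this toy case the two task directions are orthogonal, yet the pooled direction has positive cosine similarity with both, which is the property a shared editing direction needs.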

Dynamic Space Editing

Modify attention head outputs during inference to steer towards the shared subspace

Model or implementation: Parameter adjustment mechanism
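The editing step follows the general pattern of activation steering: shift the output of each selected head along a fixed direction, scaled by a strength coefficient. The sketch below shows that pattern only. The head identifiers, the direction values, and the `alpha` parameter are hypothetical, and the summary does not specify how SPACE sets the edit strength.

```python
import math

def normalize(v):
    """Scale v to unit length; a zero vector has no direction to steer along."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def project(v, direction):
    """Scalar projection of v onto the unit-normalized direction."""
    unit = normalize(direction)
    return sum(a * b for a, b in zip(v, unit))

def edit_head_output(head_output, direction, alpha):
    """Return head_output shifted by alpha along the unit-normalized direction."""
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(head_output, unit)]

def edit_selected_heads(head_outputs, directions, alpha):
    """Edit only heads that have a direction; all other heads pass through unchanged."""
    edited = {}
    for name, output in head_outputs.items():
        if name in directions:
            edited[name] = edit_head_output(output, directions[name], alpha)
        else:
            edited[name] = list(output)
    return edited

# Hypothetical head names and values, for illustration only.
head_outputs = {"L3.H5": [0.2, -0.1, 0.4], "L3.H6": [1.0, 0.0, 0.0]}
directions = {"L3.H5": [0.0, 3.0, 4.0]}

edited = edit_selected_heads(head_outputs, directions, alpha=2.0)
before = project(head_outputs["L3.H5"], directions["L3.H5"])
after = project(edited["L3.H5"], directions["L3.H5"])
print(edited)
print("projection shift:", after - before)
```

Because the direction is normalized, the projection of the edited head onto it grows by exactly `alpha`, which makes the coefficient easy to interpret and tune.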

Novel Architectural Elements

Dual-task feature modeling foundation that mathematically formalizes the intersection of activation subspaces
Hybrid probe strategy combining spectral clustering with attention head saliency scoring

Modeling

Base Model: Llama-2-7b and Llama3.2-1B (used for analysis)

Training Method: Inference-time activation editing (steering)

Objective Functions:

Purpose: Maximize separation between correct and hallucinated representations.

Formally: L_contrast = -y_i log p_theta(x_i, x_j) - (1 - y_i) log(1 - p_theta(x_i, x_j))

Purpose: Ensure probes capture independent features.

Formally: L_orth = sum_{i != j} sigma(<theta_i, theta_j>)

Purpose: Dynamically weight the orthogonality loss.

Formally: lambda_t = p * ||grad L_orth||_max

Purpose: Find direction vector separating shared positive cluster from negatives.

Formally: L_cluster = sum_i max(0, ||f(x_i) - f(x_i^+)||^2 - ||f(x_i) - f(x_j^-)||^2 + margin)
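The four objectives above can be written out as plain functions to make their behavior concrete. This is a sketch under stated assumptions: the summary does not define sigma, so the absolute value is used as a placeholder; `||.||_max` is read as the largest absolute gradient entry; and the pair probability p_theta is passed in directly rather than computed by a model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_loss(p, y, eps=1e-12):
    """L_contrast for one pair: -y log p - (1 - y) log(1 - p).

    p stands in for p_theta(x_i, x_j); it is clipped to avoid log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)

def orthogonality_loss(probes, sigma=abs):
    """L_orth: sum of sigma(<theta_i, theta_j>) over ordered pairs i != j.

    sigma=abs is an assumption; the source does not say what sigma is.
    """
    total = 0.0
    for i, theta_i in enumerate(probes):
        for j, theta_j in enumerate(probes):
            if i != j:
                total += sigma(dot(theta_i, theta_j))
    return total

def dynamic_weight(p, grad_orth):
    """lambda_t = p * ||grad L_orth||_max, read here as the largest absolute entry."""
    return p * max(abs(g) for g in grad_orth)

def cluster_loss(triplets, margin=1.0):
    """L_cluster: hinge over (anchor, positive, negative) embedding triplets."""
    total = 0.0
    for anchor, positive, negative in triplets:
        total += max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
    return total

# Orthogonal probes incur no penalty; identical probes are penalized.
print(orthogonality_loss([[1.0, 0.0], [0.0, 1.0]]))
print(orthogonality_loss([[1.0, 0.0], [1.0, 0.0]]))
# A triplet whose negative is already far away contributes zero loss.
print(cluster_loss([([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])]))
```

The hinge in `cluster_loss` means only triplets that violate the margin produce a gradient, so well-separated examples stop influencing the direction vector.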

Trainable Parameters: Probes and direction vectors (lightweight)

Training Data:

TruthfulQA (factuality)
PDTB (faithfulness)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TruthX: SPACE targets a shared subspace for both factuality and faithfulness, whereas TruthX improves factuality at the cost of faithfulness
vs. CAD: SPACE improves factuality alongside faithfulness, whereas CAD degrades factuality
vs. ITI [not cited in paper]: SPACE uses a more complex selection involving clustering and shared subspaces rather than just accuracy-based probe selection

Limitations

Assumes that a shared subspace exists and is convex (a theoretical proof is provided, but it rests on assumptions of its own)
Performance depends on the quality of the probe training data (TruthfulQA/PDTB)
Requires identifying specific attention heads, which can be computationally intensive during the profiling phase

Reproducibility

Code: https://github.com/chronostesis/1-SPACE-2-Rule-Them-All

Method relies on standard datasets (TruthfulQA, PDTB). Detailed hyperparameters for the clustering and probing phases are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluation of generated text for factuality and faithfulness using standard benchmarks

Benchmarks:

TruthfulQA: factuality evaluation (multiple choice / generation)
PDTB (Penn Discourse Treebank): faithfulness evaluation (discourse coherence)

Metrics:

Factuality metrics (TruthfulQA scores)
Faithfulness metrics (DISQ score for PDTB)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis shows trade-offs in baselines: TruthX improves factuality but hurts faithfulness; CAD does the reverse. SPACE is proposed to solve this.

Experiment Figures

Radar charts illustrating the trade-off between factuality and faithfulness for Base, TruthX, and CAD models.

PCA visualization of activation patterns during training on factual vs faithful datasets.

Main Takeaways

Empirical analysis reveals that optimizing for factuality alone pushes activation patterns away from faithfulness subspaces, creating a trade-off.
Theoretical proof suggests a shared subspace must exist where both properties can be satisfied via convex interpolation.
SPACE successfully identifies this intersection, allowing for simultaneous improvement in both metrics ('killing two birds with one stone').
The method validates that hallucination types are interconnected in the model's parameter space despite appearing distinct conceptually.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (attention heads, activation spaces)
Familiarity with PCA (Principal Component Analysis) for visualization
Knowledge of contrastive learning and clustering algorithms

Key Terms

activation space: The high-dimensional vector space formed by the intermediate outputs (activations) of neurons within a neural network

factuality hallucination: Generating content that contradicts verifiable real-world facts (e.g., 'Sydney is the capital of Australia')

faithfulness hallucination: Generating content that deviates from user intent, context, or internal consistency, even if factually correct in isolation

HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise—a clustering algorithm used here to find dense regions of shared activations

contrastive loss: A loss function that pulls positive pairs (correct/faithful examples) closer and pushes negative pairs (hallucinations) apart in the embedding space

orthogonality constraint: A mathematical condition enforcing that different probe vectors remain perpendicular (uncorrelated) to capture diverse features

spectral clustering: A technique using the eigenvalues of a similarity matrix to partition data into clusters, used here to group semantically related activations