LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

📝 Paper Summary

Hallucination suppression
Inference-time intervention

LLM-CAS trains a hierarchical reinforcement learning agent to dynamically select and apply temporary perturbations to specific neuron activations during inference to correct hallucinations without permanent parameter changes.

Core Problem

Static model editing methods struggle with context-dependent hallucinations and often damage unrelated knowledge (catastrophic forgetting), while existing dynamic methods rely on heuristic rules lacking adaptability.

Why it matters:

Hallucinations prevent reliable deployment of LLMs in mission-critical applications where factual accuracy is paramount
Retraining or full-model fine-tuning (RLHF/SFT) is computationally expensive and data-intensive
Permanent parameter edits are brittle; correcting one fact often breaks others or degrades general capabilities

Concrete Example: When an LLM is asked a question that elicits a hallucination (e.g., a wrong date for an event), static editing permanently changes weights to fix this but might break answers for similar but distinct events. LLM-CAS instead applies a temporary 'patch' to neuron activations only for that specific query context.
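The difference between a permanent edit and a temporary patch can be illustrated with a toy layer. This is a minimal sketch, not the paper's implementation: `ToyLayer` and `temporary_patch` are hypothetical names, and a real system would hook into the target LLM's hidden states instead of a list of floats.

```python
from contextlib import contextmanager


class ToyLayer:
    """Stand-in for one hidden layer (hypothetical, for illustration only)."""

    def __init__(self, weights):
        self.weights = list(weights)  # permanent parameters, never edited here
        self.offset = None            # temporary activation perturbation, if any

    def forward(self, x):
        acts = [w * x for w in self.weights]
        if self.offset is not None:
            # Perturb the activations, not the weights.
            acts = [a + d for a, d in zip(acts, self.offset)]
        return acts


@contextmanager
def temporary_patch(layer, offset):
    """Apply an activation offset only inside the `with` block."""
    previous = layer.offset
    layer.offset = list(offset)
    try:
        yield layer
    finally:
        layer.offset = previous  # always restored, even if generation fails


layer = ToyLayer([1.0, 2.0, 3.0])
before = layer.forward(1.0)
with temporary_patch(layer, [0.0, -0.5, 0.0]):
    patched = layer.forward(1.0)  # only this query sees the perturbation
after = layer.forward(1.0)
```

Because the weights are untouched and the offset is removed on exit, later queries behave exactly as they did before the patch.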

Key Novelty

Hierarchical Reinforcement Learning for Real-Time Neuron Perturbation

Frames hallucination correction as a hierarchical decision process: a high-level policy selects a functional network region, and a low-level policy determines the specific perturbation type and magnitude
Utilizes a 'learnable dynamic mask' combined with input-specific causal tracing to pinpoint exactly which neurons to perturb for a given input, ensuring interventions are sparse and targeted
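The two-level decision described above can be sketched as two categorical policies, where the low-level choice is conditioned on the high-level one. Everything concrete here is an assumption made for illustration: the region and perturbation labels, the toy linear scorers standing in for the MLP policies, and the Gaussian magnitude are not taken from the paper.

```python
import math
import random

REGIONS = ["attention_heads", "mlp_early", "mlp_late"]  # hypothetical labels
PERTURBATIONS = ["scale", "shift", "suppress"]           # hypothetical labels


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def high_level_policy(state, rng):
    # Toy linear scorer with one weight row per region (stands in for an MLP).
    weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return sample(softmax(logits), rng)


def low_level_policy(state, region_idx, rng):
    # Conditioned on the chosen region: the region index biases one logit.
    logits = [
        0.1 * i * state[0] + (0.5 if i == region_idx else 0.0)
        for i in range(len(PERTURBATIONS))
    ]
    kind = sample(softmax(logits), rng)
    magnitude = abs(rng.gauss(0.0, 0.1))  # small non-negative strength
    return kind, magnitude


rng = random.Random(0)
state = [0.8, -0.3]
region = high_level_policy(state, rng)
kind, magnitude = low_level_policy(state, region, rng)
action = (REGIONS[region], PERTURBATIONS[kind], magnitude)
```

Splitting the action this way keeps each policy's output space small: the high-level policy only ranks regions, and the low-level policy only has to choose among perturbations for the region it was given.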

Architecture

The Dynamic Neuron Perturbation framework architecture and data flow.

Evaluation Highlights

+10.98 percentage points accuracy improvement on StoryCloze compared to the base model
+2.71 percentage points accuracy improvement on TriviaQA compared to the base model
+2.06 percentage points improvement on TruthfulQA (MC1 score) compared to the base model

Breakthrough Assessment

7/10

Applies Hierarchical RL to inference-time intervention in a novel way, outperforming static editing and heuristic dynamic methods. The integration of causal tracing with learned policies is promising.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) for real-time hallucination correction during inference

Inputs: Input sequence x producing a potentially hallucinated output

Outputs: Corrected output y_c with minimized hallucination

Pipeline Flow

Input Processing: Encoder → State Construction
Hierarchical Decision: High-Level Policy (Target Selection) → Low-Level Policy (Perturbation Details)
Intervention: Adaptive Masking → Activation Perturbation → Decoder

System Modules

State Encoder

Constructs the state vector from input embeddings and baseline performance scores
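One plausible way to build such a state is to pool the token embeddings and append the baseline scores. The sketch below assumes mean pooling and plain concatenation; the summary does not say how the paper combines the two, and `mean_pool` and `build_state` are hypothetical helper names.

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token embedding vectors."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def build_state(token_embeddings, baseline_scores):
    """Concatenate the pooled embedding with baseline performance scores."""
    return mean_pool(token_embeddings) + list(baseline_scores)


tokens = [[1.0, 0.0], [3.0, 2.0]]          # toy 2-dimensional embeddings
state = build_state(tokens, baseline_scores=[0.25])
```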

Model or implementation: Not specified (likely uses the target LLM's own embeddings)

High-Level Agent (Hierarchical Decision)

Selects the macro-level target functional network (category of neurons) for intervention

Model or implementation: MLP Policy Network

Low-Level Agent (Hierarchical Decision)

Determines the specific perturbation type and magnitude given the high-level target

Model or implementation: MLP Policy Network

Adaptive Perturbation Localization

Computes the final perturbation vector using a learnable mask and causal tracing attributions

Model or implementation: Learnable Mask + Integrated Gradients
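A minimal sketch of how a general learnable mask and input-specific attributions might be combined into one sparse perturbation vector. The specific choices are assumptions, not details from the paper: a sigmoid over mask logits for the general mask, a top-k cut on absolute attribution for the input-specific mask, an elementwise product to merge them, and a hypothetical `direction` vector scaled by the low-level policy's magnitude.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def top_k_indicator(scores, k):
    """1.0 for the k entries with the largest absolute attribution, else 0.0."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def perturbation_vector(mask_logits, attributions, direction, magnitude, k):
    general = [sigmoid(v) for v in mask_logits]   # stage 1: learned, input-agnostic
    specific = top_k_indicator(attributions, k)   # stage 2: per-input attribution
    return [magnitude * g * s * d for g, s, d in zip(general, specific, direction)]


mask_logits = [4.0, -4.0, 0.0, 4.0]
attributions = [0.9, 0.8, 0.05, -0.7]
direction = [1.0, 1.0, 1.0, -1.0]
delta = perturbation_vector(mask_logits, attributions, direction, magnitude=0.5, k=2)
```

In this toy input, a neuron receives a meaningful perturbation only when both stages agree: neuron 1 has high attribution but a near-zero learned mask, and neuron 3 has a high learned mask but falls outside the top-k attributions.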

LLM Response Evaluation

Scores the corrected output to generate rewards for the RL agent

Model or implementation: Llama2-7B-Instruct (as Judge)

Novel Architectural Elements

Hierarchical dual-policy structure (High-Level for target region, Low-Level for perturbation parameters)
Two-stage adaptive masking mechanism combining a learnable general sparse mask with input-specific causal attribution scores

Modeling

Base Model: Evaluated on various target LLMs, including Llama2-7B-Instruct. The paper does not describe a base model for the RL agent separately from the target LLM.

Training Method: Hierarchical Proximal Policy Optimization (PPO)

Objective Functions:

Purpose: Optimize the policy to maximize expected reward while ensuring stability.
Formally: PPO clipped surrogate objective maximizing expected advantage.

Purpose: Train the value functions to estimate expected returns.
Formally: Squared error loss between predicted value and actual return.

Purpose: Encourage exploration.
Formally: Entropy bonus term added to the objective.

Purpose: Learn a sparse, effective mask for neuron selection.
Formally: Loss = -Reward + L1 sparsity penalty + approximate L0 sparsity penalty
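The objective terms listed above follow the standard PPO recipe, which can be sketched in a few lines. The coefficients (`eps=0.2`, `value_coef=0.5`, `entropy_coef=0.01`, `l1_coef=0.01`) are common defaults chosen for illustration, not values from the paper, and the approximate L0 term of the mask loss is omitted because its form is not given in this summary.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, eps=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_loss(batch, value_coef=0.5, entropy_coef=0.01, eps=0.2):
    """Scalar loss to minimize: -policy objective + value error - entropy bonus."""
    n = len(batch)
    policy = sum(
        clipped_surrogate(b["new_logp"], b["old_logp"], b["adv"], eps) for b in batch
    ) / n
    value = sum((b["value"] - b["ret"]) ** 2 for b in batch) / n
    ent = sum(entropy(b["probs"]) for b in batch) / n
    return -policy + value_coef * value - entropy_coef * ent


def mask_loss(reward, mask, l1_coef=0.01):
    """Reward term plus an L1 sparsity penalty on the mask values."""
    return -reward + l1_coef * sum(abs(m) for m in mask)


batch = [{
    "new_logp": math.log(0.6), "old_logp": math.log(0.5),
    "adv": 1.0, "value": 0.4, "ret": 1.0, "probs": [0.6, 0.4],
}]
loss = ppo_loss(batch)
sparse_penalty = mask_loss(reward=1.0, mask=[1.0, 0.0, 0.0, 1.0])
```

In a hierarchical setup, a loss of this shape would presumably be computed once per policy level, each with its own value function.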

Key Hyperparameters:

method: PPO
optimization_algorithm: Adam (implied)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MEMIT/ROME: LLM-CAS uses temporary, context-dependent activation perturbations rather than permanent weight modifications
vs. ITI/CAA: LLM-CAS learns a dynamic policy to adjust interventions per input, whereas ITI/CAA use static vectors
vs. SADI: LLM-CAS employs a formal Hierarchical RL framework to learn a complex policy, whereas SADI relies on simpler optimization or pre-defined rules

Limitations

Reliance on an LLM-based judge (Llama2-7B-Instruct) for rewards may introduce bias or errors into the reward signal that trains the policy
Computational overhead of calculating causal traces (Integrated Gradients) during inference is not explicitly quantified but likely non-negligible
Requires 'bad case' examples for training the correction policy

Reproducibility

Code availability is stated as 'We will open-source our method', implying it is not yet released. The paper uses Llama2-7B-Instruct as a judge. Specific training hyperparameters (learning rates, batch sizes) are deferred to Appendix H of the paper, which was not available for this summary.

📊 Experiments & Results

Evaluation Setup

Hallucination correction across multiple benchmarks assessing accuracy and truthfulness

Benchmarks:

StoryCloze (Commonsense reasoning / Story completion)
TriviaQA (Question Answering)
TruthfulQA (Truthfulness evaluation)

Metrics:

Accuracy
MC1 score (for TruthfulQA)
Statistical methodology: Not explicitly reported in the paper
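For reference, both metrics reduce to a fraction of correctly answered questions, and the reported improvements are differences of such fractions in percentage points. The sketch below scores multiple-choice questions MC1-style by picking the highest-likelihood choice. The log-probabilities are made-up toy values, and `mc1_accuracy` is a hypothetical helper, not the benchmark's official scorer.

```python
def mc1_accuracy(questions):
    """Fraction of questions where the top-scoring choice is the single true answer.

    Each question is {"logprobs": [...], "correct": index_of_true_answer}.
    """
    hits = 0
    for q in questions:
        scores = q["logprobs"]
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        hits += int(predicted == q["correct"])
    return hits / len(questions)


# Toy data: the same two questions scored without and with an intervention.
base_run = [
    {"logprobs": [-1.0, -0.5, -2.0], "correct": 0},
    {"logprobs": [-0.2, -1.5], "correct": 0},
]
patched_run = [
    {"logprobs": [-0.4, -0.9, -2.0], "correct": 0},
    {"logprobs": [-0.2, -1.5], "correct": 0},
]

acc_base = mc1_accuracy(base_run)
acc_patched = mc1_accuracy(patched_run)
delta_pp = 100.0 * (acc_patched - acc_base)  # improvement in percentage points
```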

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (percentage points)
StoryCloze	Accuracy	Not reported in the paper	Not reported in the paper	+10.98
TriviaQA	Accuracy	Not reported in the paper	Not reported in the paper	+2.71
TruthfulQA	MC1 Score	Not reported in the paper	Not reported in the paper	+2.06

Main Takeaways

LLM-CAS improves over the base model on all three benchmarks: +10.98 percentage points on StoryCloze, +2.71 on TriviaQA, and +2.06 on TruthfulQA (MC1). Statistical significance is not reported.
The paper states that the method outperforms static-vector intervention methods (like ITI and CAA) and the dynamic SADI framework; the exact margin over SADI is not available in this summary.
The hierarchical approach allows for flexible addressing of diverse hallucination types while minimizing unintended side effects compared to static editing.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO)
Transformer architecture (neuron activations)
Model editing concepts (locate-then-edit)

Key Terms

HRL: Hierarchical Reinforcement Learning—a learning approach where a 'manager' agent makes high-level decisions (goals) and a 'worker' agent executes low-level actions to achieve them

PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps using a clipped objective function

Neuron Perturbation: Temporarily modifying the activation values of specific neurons in a neural network during a forward pass, without changing the permanent weights

Causal Tracing: A method to identify which specific neurons or layers are causally responsible for a specific model output

SFT: Supervised Fine-Tuning—training a model on labeled examples

RLHF: Reinforcement Learning from Human Feedback—aligning model behavior with human preferences using reward models

ITI: Inference-Time Intervention—a technique that modifies model activations during inference to steer behavior

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

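MC1 score: A TruthfulQA multiple-choice metric in which each question has exactly one correct choice; the model scores a point when it assigns its highest likelihood to that choice, and MC1 is the resulting accuracy over all questions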

Integrated Gradients: An attribution method that explains the relationship between a model's predictions and its input features by integrating gradients along a straight-line path from a baseline input to the actual input
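Integrated Gradients can be illustrated on a toy function with a known gradient. This sketch approximates the path integral with a midpoint Riemann sum. The function `f`, its gradient, and the all-zero baseline are illustrative choices, and a real use would take gradients from the target LLM via automatic differentiation.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG attributions with a midpoint Riemann sum along the path."""
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        totals = [t + g for t, g in zip(totals, grads)]
    # Scale the average gradient by the input's displacement from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(v):
    return v[0] ** 2 + 3.0 * v[1]


def grad_f(v):
    return [2.0 * v[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
completeness_gap = sum(attributions) - (f(x) - f(baseline))
```

The attributions sum to the change in output between the baseline and the input (the completeness property), which is why `completeness_gap` is close to zero here.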