HRL: Hierarchical Reinforcement Learning—a learning approach where a 'manager' agent makes high-level decisions (goals) and a 'worker' agent executes low-level actions to achieve them
PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps using a clipped objective function
Neuron Perturbation: Temporarily modifying the activation values of specific neurons in a neural network during a forward pass, without changing the permanent weights
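Causal Tracing: A method for identifying which neurons or layers are causally responsible for a given model output, typically by corrupting internal activations and measuring how much restoring them recovers the original output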
SFT: Supervised Fine-Tuning—training a model on labeled examples
RLHF: Reinforcement Learning from Human Feedback—aligning model behavior with human preferences using reward models
ITI: Inference-Time Intervention—a technique that modifies model activations during inference to steer behavior
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
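MC1 score: A multiple-choice metric for TruthfulQA in which each question has exactly one correct answer among the choices; the score is the fraction of questions where the model assigns its highest probability to that correct answer
Integrated Gradients: An attribution method that assigns each input feature a share of a model's prediction by integrating the gradient of the output along a straight-line path from a baseline input to the actual input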
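
As an illustration of the Integrated Gradients entry, the path integral can be approximated with a Riemann sum. The sketch below is a minimal NumPy version under stated assumptions: the function name `integrated_gradients`, the toy model `f`, and its hand-written gradient `grad_f` are hypothetical and chosen only for illustration. A real setup would obtain gradients from an autodiff framework.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(z) is assumed to return the gradient of a scalar model
    output with respect to the input z (same shape as z).
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Midpoints of `steps` equal sub-intervals of [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Average the gradient along the straight path from baseline to x.
    grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    avg_grad = np.mean(grads, axis=0)
    # Scale by the input difference to get per-feature attributions.
    return (x - baseline) * avg_grad


# Hypothetical toy model: f(z) = z0^2 + 3 * z1, with its analytic gradient.
def f(z):
    return z[0] ** 2 + 3.0 * z[1]


def grad_f(z):
    return np.array([2.0 * z[0], 3.0])


x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attributions = integrated_gradients(grad_f, x, baseline)
```

For this toy model the attributions sum to `f(x) - f(baseline)`, which is the completeness property that motivates integrating along the whole path rather than using the gradient at the input alone.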