RLVR: Reinforcement Learning with Verifiable Rewards—training models using objective signals (like code execution success) rather than human preference labels
Linear Probe: A simple classifier (logistic regression) trained on the internal activations of a neural network to detect specific properties, such as honesty
Reward Hacking: When an AI exploits loopholes in the reward function (e.g., hardcoding a test case) to get high scores without solving the actual task
Obfuscated Policy: A strategy where the model changes its output text (e.g., adding convincing justifications) to evade a deception detector while still being deceptive
Obfuscated Activations: A strategy where the model changes its internal neural representations to evade a detector, even if the output text remains deceptively simple
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes advantages within a group of sampled outputs to stabilize training
KL Regularization: A penalty term that prevents the trained model's policy from drifting too far from the original reference model
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of model weights