Value Neurons: A small subset of neurons within LLM hidden states that encode the model's internal expectation of the current state's value/correctness
Dopamine Neurons: Neurons that encode Reward Prediction Error (RPE), exhibiting high activation when outcomes are better than expected and low activation when outcomes are worse than expected
RPE: Reward Prediction Error, the actual reward received minus the expected reward (positive when the outcome is better than expected, negative when worse)
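TD Learning: Temporal Difference Learning, a reinforcement learning (RL) method used here to train the probe; it updates each value estimate in proportion to the difference between the current prediction and a target formed from the observed reward plus the next step's prediction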
AUC: Area Under the ROC Curve, a metric used here to measure how well the value probe distinguishes between correct and incorrect generation paths (0.5 corresponds to chance, 1.0 to perfect separation)
IoU: Intersection over Union, a metric measuring the overlap between sets of identified value neurons across different datasets or models, computed as the size of the sets' intersection divided by the size of their union
Sparsity: The property that only a very small percentage (e.g., <1%) of neurons are responsible for a specific function (here, value estimation)
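
The RPE and TD Learning entries above can be illustrated with a minimal Python sketch. It is not the probe-training procedure itself: it assumes a tabular value estimate keyed by hypothetical state ids, a reward given only at the final step (for example 1.0 for a correct generation), and illustrative values for the step size alpha and discount gamma. The one-step TD error computed here is the standard TD form of an RPE.

```python
def td_error(reward, value_now, value_next, gamma=1.0):
    """One-step TD error: (reward + gamma * V(next)) - V(current)."""
    return reward + gamma * value_next - value_now


def td0_update(values, trajectory, final_reward, alpha=0.1, gamma=1.0):
    """Run one TD(0) pass over a trajectory of state ids.

    Assumption for illustration: the reward is zero everywhere except the
    last step, and the value after the last step is zero.
    """
    values = dict(values)  # copy so the caller's estimates are not mutated
    for i, state in enumerate(trajectory):
        is_last = i == len(trajectory) - 1
        reward = final_reward if is_last else 0.0
        value_next = 0.0 if is_last else values.get(trajectory[i + 1], 0.0)
        delta = td_error(reward, values.get(state, 0.0), value_next, gamma)
        # Move the current estimate a fraction alpha toward the TD target.
        values[state] = values.get(state, 0.0) + alpha * delta
    return values


# Two passes over the same two-step trajectory that ends in a reward of 1.0.
first_pass = td0_update({}, ["a", "b"], final_reward=1.0, alpha=0.5)
second_pass = td0_update(first_pass, ["a", "b"], final_reward=1.0, alpha=0.5)
```

After the first pass only the final state has a nonzero estimate; the second pass propagates that estimate back to the earlier state, which is the sense in which TD learning uses the next prediction as part of the target.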
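
The AUC and IoU entries above can likewise be sketched in a few lines of Python. The function names, the probe scores, and the neuron indices below are hypothetical and chosen only for illustration. AUC is computed through its pairwise interpretation (the probability that a randomly chosen correct path receives a higher score than a randomly chosen incorrect one, with ties counted as one half), which is simple but quadratic in the number of samples.

```python
def auc(scores_correct, scores_incorrect):
    """AUC as the fraction of (correct, incorrect) pairs ranked correctly."""
    wins = 0.0
    for p in scores_correct:
        for n in scores_incorrect:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_correct) * len(scores_incorrect))


def iou(neurons_a, neurons_b):
    """Intersection over Union of two collections of neuron indices."""
    a, b = set(neurons_a), set(neurons_b)
    union = a | b
    # Convention assumed here: two empty sets have an IoU of 0.0.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical probe scores for correct and incorrect generation paths.
probe_auc = auc([0.9, 0.8], [0.1, 0.85])

# Hypothetical value-neuron indices identified on two datasets.
overlap = iou([1, 2, 3], [2, 3, 4])
```

In this toy data three of the four (correct, incorrect) pairs are ordered correctly, and the two neuron sets share two of four distinct indices.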