d' type2: A metric from signal detection theory measuring the distance between internal confidence distributions for correct vs. incorrect judgments; quantifies metacognitive sensitivity independent of response bias
Evolution Strategies (ES): A gradient-free optimization technique that refines parameters by evaluating a population of perturbed candidates, allowing optimization of non-differentiable, holistic reward functions
Dual-prompt method: An evaluation framework using Direct Questions (factual answer) and Meta Questions (self-evaluation) in independent contexts to measure knowledge-behavior consistency
ESMA: Evolution Strategy for Metacognitive Alignment, the proposed method for binding a model's internal knowledge to its explicit behavior
Type 2 AUCROC: Area Under the Receiver Operating Characteristic Curve for Type 2 tasks (confidence judgments), measuring the ability to distinguish correct from incorrect responses based on confidence
Raw Alignment: Fraction of items on which the meta-response agrees with the correctness of the direct answer; a simple accuracy measure that is susceptible to response bias
YFR: Yes Failure Ratio, the proportion of cases in which the model claims to know the answer but answers incorrectly
NFR: No Failure Ratio, the proportion of cases in which the model claims not to know the answer but could have answered correctly
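
The behavioral metrics above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function name metacognition_metrics and its inputs are hypothetical. The glossary does not state the denominators for YFR and NFR, so the sketch assumes YFR is taken over all "yes" meta-responses and NFR over all "no" meta-responses. The type 2 d' is computed with the standard z(hit rate) - z(false alarm rate) formula on binary yes/no meta-responses, with a log-linear correction (add 0.5 to each count) as an assumed way to avoid infinite z-scores.

```python
import math
from statistics import NormalDist


def metacognition_metrics(direct_correct, meta_yes):
    """Compute Raw Alignment, YFR, NFR and a type 2 d' estimate.

    direct_correct[i]: True if the Direct Question was answered correctly.
    meta_yes[i]:       True if the Meta Question response was "yes, I know".
    Both arguments are hypothetical inputs for this sketch.
    """
    if len(direct_correct) != len(meta_yes) or not direct_correct:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(direct_correct, meta_yes))
    yes_right = sum(1 for c, y in pairs if y and c)
    yes_wrong = sum(1 for c, y in pairs if y and not c)
    no_right = sum(1 for c, y in pairs if not y and c)
    no_wrong = sum(1 for c, y in pairs if not y and not c)
    n = len(pairs)

    # Raw Alignment: the meta-response matches the correctness of the answer.
    raw_alignment = (yes_right + no_wrong) / n

    # ASSUMPTION: YFR is normalized by the number of "yes" meta-responses,
    # NFR by the number of "no" meta-responses.
    n_yes = yes_right + yes_wrong
    n_no = no_right + no_wrong
    yfr = yes_wrong / n_yes if n_yes else math.nan
    nfr = no_right / n_no if n_no else math.nan

    # Type 2 d': z(P("yes" | correct)) - z(P("yes" | incorrect)).
    # ASSUMPTION: log-linear correction keeps both rates strictly in (0, 1).
    n_correct = yes_right + no_right
    n_incorrect = yes_wrong + no_wrong
    hit_rate = (yes_right + 0.5) / (n_correct + 1)
    false_alarm_rate = (yes_wrong + 0.5) / (n_incorrect + 1)
    z = NormalDist().inv_cdf
    d_prime_type2 = z(hit_rate) - z(false_alarm_rate)

    return {
        "raw_alignment": raw_alignment,
        "yfr": yfr,
        "nfr": nfr,
        "d_prime_type2": d_prime_type2,
    }
```

Calling metacognition_metrics with two equal-length lists of booleans returns a dictionary with one entry per metric. A model that always answers "yes" can still reach a high Raw Alignment when most direct answers are correct, while its d' estimate stays near zero, which is why the bias-independent measure is reported alongside the simple one.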