Evaluation Illusion: A phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality
Shared Illusion: A statistically robust but epistemically shallow consensus where multiple evaluators default to the same heuristic repertoire
MERG: Metacognitive Enhanced Rubric Generation—a framework forcing evaluators to articulate domain knowledge and biases before scoring
RLAIF: Reinforcement Learning from AI Feedback—using LLMs to generate preference labels for training reward models
System 1 vs System 2: A cognitive science distinction: System 1 is fast/heuristic (intuitive), System 2 is slow/deliberative (analytical). MERG forces System 2 processing
Rubric Commensurability Problem: The finding that evaluators using independently generated rubrics show near-random agreement; much of the apparent inter-judge agreement arises simply from sharing the same rubric structure
Resolution Paradox: The gap whereby models are reliably ranked at the macro level (high Spearman correlation over model-level average scores) but individual samples are scored unreliably (lower per-sample Pearson correlation)
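The Resolution Paradox can be illustrated with a toy simulation (synthetic data, not from the source: the model count, quality values, and noise level below are arbitrary assumptions). Per-sample judge noise washes out when averaged over many responses, so model-level rankings agree almost perfectly even while per-sample scores correlate only moderately:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 models with distinct true quality levels,
# 50 responses each, scored independently by two noisy judges.
true_quality = np.repeat([2.0, 4.0, 6.0, 8.0], 50)
judge_a = true_quality + rng.normal(0.0, 2.0, true_quality.size)
judge_b = true_quality + rng.normal(0.0, 2.0, true_quality.size)

# Sample level: Pearson correlation between the judges' raw scores
# (moderate, because per-sample noise is large relative to quality spread).
sample_r = np.corrcoef(judge_a, judge_b)[0, 1]

def ranks(x):
    # Rank transform; Pearson on ranks equals Spearman's rho.
    return np.argsort(np.argsort(x))

# Macro level: Spearman correlation of per-model mean scores.
# Averaging 50 samples shrinks judge noise, so the ranking is stable.
means_a = judge_a.reshape(4, 50).mean(axis=1)
means_b = judge_b.reshape(4, 50).mean(axis=1)
macro_rho = np.corrcoef(ranks(means_a), ranks(means_b))[0, 1]
```

Under these assumptions `macro_rho` comes out essentially 1.0 while `sample_r` is markedly lower, reproducing the macro/micro gap the term describes.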
Intraclass Correlation Coefficient (ICC): A statistic used to describe how strongly units in the same group resemble each other; here used to measure absolute agreement between judges, penalizing systematic scoring offsets
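A minimal sketch of why ICC, unlike Pearson correlation, penalizes systematic offsets. The implementation below is the standard ICC(2,1) formula (two-way random effects, absolute agreement, single rater); the two-judge data is a made-up illustration, not from the source:

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    scores: (n_items, k_raters) matrix of judge scores."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-item means
    col_means = scores.mean(axis=0)   # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between-item MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between-rater MS
    sse = (np.sum((scores - grand) ** 2)
           - k * np.sum((row_means - grand) ** 2)
           - n * np.sum((col_means - grand) ** 2))
    mse = sse / ((n - 1) * (k - 1))                       # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Judge B scores every item exactly 2 points higher than judge A.
judge_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
judge_b = judge_a + 2.0

pearson = np.corrcoef(judge_a, judge_b)[0, 1]          # 1.0: offset invisible
icc = icc2_1(np.column_stack([judge_a, judge_b]))      # ~0.556: offset penalized
```

Pearson is invariant to the constant +2 shift and reports perfect agreement, while ICC(2,1) drops to about 0.56 because the rater-variance term (`msc`) charges the judges for their systematic scoring offset.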
Base Models: Raw pretrained language models without instruction tuning
Instruct Models: Models fine-tuned to follow instructions
Thinking Models: Models trained with chain-of-thought reinforcement learning (e.g., DeepSeek-R1)