LLM-as-a-Judge: Using a Large Language Model to evaluate the outputs of another model (e.g., for correctness or safety)
3D Paradigm: Decompose, Decouple, Detach—the authors' proposed framework for evaluating analytical claims by breaking them down into linguistic components
Interpretive claims: Summaries or insights that infer meaning (sentiment, intent, root cause) rather than just restating explicit facts from the text
Cohen's kappa: A statistic that measures agreement between two annotators on categorical labels, corrected for the agreement expected by chance (1 means perfect agreement, 0 means no better than chance)
F1 score: The harmonic mean of precision and recall, used here to measure how well LLM-judges align with human ground-truth labels
TTC: Test-Time Compute—allowing a model to generate intermediate reasoning tokens (like Chain-of-Thought) before producing a final answer to improve performance
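
Worked example for Cohen's kappa and F1 score: the sketch below computes both from their standard textbook definitions on a small, made-up set of binary labels. It is an illustration only, not the authors' evaluation code; the function names (`cohens_kappa`, `f1_score`) and the `human` and `judge` label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates,
    # summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def f1_score(gold, predicted, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = claim judged supported, 0 = not supported.
human = [1, 1, 0, 0, 1, 0, 1, 0]  # assumed human ground truth
judge = [1, 0, 0, 0, 1, 1, 1, 0]  # assumed LLM-judge verdicts

kappa = cohens_kappa(human, judge)
f1 = f1_score(human, judge)
print(f"kappa={kappa:.2f}  F1={f1:.2f}")
```

On these toy labels the two raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5; precision and recall are both 0.75, so F1 is 0.75.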