Pointwise Grading: Evaluating a single response on a scale (e.g., 1-10) with an explanation
Pairwise Comparison: Comparing two responses to decide which is better (win/tie/loss) with an explanation
Referenced vs. Reference-free: Whether the evaluator has access to a 'gold standard' human-written answer (the reference) or must judge the response given only the input query
Self-Instruct: A method to bootstrap instruction-following data using an LLM to generate inputs and outputs
Pearson correlation: A statistic measuring linear correlation between two variables (here, model scores vs. human scores)
Kendall's tau: A statistic measuring the ordinal association between two measured quantities (ranking correlation)
CoT: Chain-of-Thought, i.e., prompting the model to reason step by step before answering
Pseudo Reference: A high-quality response generated by a strong model (GPT-4) and manually verified, used as a substitute for human ground truth during training data creation
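
The two agreement statistics defined above (Pearson correlation and Kendall's tau) can be illustrated with a minimal pure-Python sketch. The function names and the score lists are hypothetical and chosen only for illustration; the tau variant shown is tau-a, which ignores ties, and a particular evaluation setup may use a different variant.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs tied in either sequence count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five responses (illustrative values only).
model_scores = [7, 5, 9, 3, 6]
human_scores = [8, 4, 9, 2, 7]

print("Pearson:", pearson(model_scores, human_scores))
print("Kendall tau-a:", kendall_tau_a(model_scores, human_scores))
```

In this toy data the model and the human rank the five responses in the same order, so Kendall's tau is exactly 1.0, while Pearson is slightly lower (about 0.96) because the score gaps are not perfectly proportional. This is the practical difference between the two: Pearson is sensitive to the numeric values, Kendall's tau only to the ordering.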