WRVRT: Writing, Reviewing, Validating, Revising, and Testing—a proposed iterative workflow for developing reliable educational prompts
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Binomial/Trinomial Scoring: Scoring systems with two levels (Correct/Incorrect) or three levels (Beginning/Developing/Proficient)
QWK: Quadratic Weighted Kappa—a metric measuring agreement between raters (or AI and human) that penalizes large disagreements more heavily than small ones
Greedy Sampling: A decoding strategy (also called greedy decoding) in which the model always selects the highest-probability next token (equivalent to Temperature=0), making outputs deterministic for a given prompt
Nucleus Sampling: A decoding strategy (Top-p) that samples the next token from the smallest set of tokens whose cumulative probability exceeds a threshold p, allowing for more diverse outputs
Item Stem: The main part of a test question that presents the problem or task to the student
Rubric: A scoring guide used to evaluate the quality of students' constructed responses
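
Three of the terms above (QWK, Greedy Sampling, Nucleus Sampling) are procedural enough that a short sketch may help. The Python below is illustrative only and is not taken from the source: the function names, the integer score encoding (0 = Beginning, 1 = Developing, 2 = Proficient), and the toy token probabilities are all assumptions. The nucleus cutoff uses "reaches or exceeds p", which is a common convention; the definition above says "exceeds".

```python
import random
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK for two equal-length lists of integer scores in 0..num_levels-1.

    Hypothetical helper: 1 minus the ratio of weighted observed disagreement
    to the weighted disagreement expected by chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    if num_levels < 2:
        raise ValueError("need at least two score levels")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    max_dist = (num_levels - 1) ** 2
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic weight: a 2-level gap costs 4x a 1-level gap.
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        raise ValueError("QWK is undefined when both raters use one identical level")
    return 1.0 - weighted_observed / weighted_expected


def greedy_pick(probs):
    """Greedy decoding step: return the single most probable token."""
    return max(probs, key=probs.get)


def nucleus(probs, p):
    """Top-p candidate set: fewest tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept


def nucleus_pick(probs, p, rng=random):
    """Nucleus sampling step: sample from the top-p set, weighted by probability."""
    kept = nucleus(probs, p)
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


# Toy data (invented for illustration): trinomial scores for four responses.
human_scores = [0, 1, 2, 1]
ai_scores = [0, 1, 2, 2]
qwk = quadratic_weighted_kappa(human_scores, ai_scores, num_levels=3)

# Toy next-token distribution (invented for illustration).
next_token_probs = {"Proficient": 0.5, "Developing": 0.3, "Beginning": 0.15, "N/A": 0.05}
greedy_token = greedy_pick(next_token_probs)
top_p_set = nucleus(next_token_probs, p=0.9)
sampled_token = nucleus_pick(next_token_probs, p=0.9, rng=random.Random(0))
```

In this toy data the two raters disagree on one response by a single level, so the QWK stays fairly high; a two-level disagreement on that same response would lower it more. Greedy selection returns the same token on every call, whereas the nucleus step can return any token in the top-p set.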