Applying Large Language Models and Chain-of-Thought for Automatic Scoring

📝 Paper Summary

AI in Education / Automatic Essay Scoring

Combining Chain-of-Thought prompting with specific scoring rubrics and item context significantly improves GPT-4's accuracy in scoring student science explanations compared to standard zero-shot approaches.

Core Problem

Standard automatic scoring models require extensive labeled training data and technical expertise, while generic LLM prompts often fail to grasp the specific nuances of complex scientific scoring rubrics.

Why it matters:

Developing traditional supervised scoring models is labor-intensive and technically inaccessible for many educators and researchers
Standard 'black box' AI scoring lacks transparency, making it difficult for teachers to trust grades or provide actionable feedback to students
Generic LLM scoring often hallucinates criteria or misses specific evidence requirements defined in educational standards

Concrete Example: In Task H4_2 (a trinomial science question), a standard Zero-Shot CoT prompt correctly identified 'Proficient' students only 27.5% of the time. It failed because it generated its own reasoning path rather than following the specific grading rubric. By adding the Context and Rubric (CR) to the CoT prompt, accuracy on 'Proficient' students jumped to 68.33%.

Key Novelty

WRVRT Framework & Context-Aware CoT

Proposes WRVRT (Writing, Reviewing, Validating, Revising, Testing), an iterative prompt engineering workflow specifically designed for educational validity
Demonstrates that Chain-of-Thought (CoT) alone is ineffective for scoring; it requires explicit 'Context and Rubric' (CR) constraints to align model reasoning with pedagogical standards

Evaluation Highlights

Few-shot learning achieved 66.98% average accuracy across six tasks, outperforming zero-shot learning (59.50%) by 7.48 percentage points (a 12.6% relative improvement)
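Adding Context and Rubric (CR) to Zero-Shot Chain-of-Thought prompts increased accuracy from 0.5532 to 0.6831, a gain of 12.99 percentage points (the 13.44-point gain listed under Key Results is measured against the 0.5487 baseline)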
GPT-4 with greedy sampling outperformed GPT-3.5 by 8.64 percentage points (0.6975 vs 0.6111) when using the best-performing Few-Shot CoT + CR prompt
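The highlights above report differences between accuracies, and it is easy to confuse an absolute gain (percentage points) with a relative one (percent of the baseline). The following is a small arithmetic sketch using only the accuracies quoted above; the function names are illustrative and not taken from the paper.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference between two accuracies, on the same 0-1 scale."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a fraction of the baseline accuracy."""
    return (improved - baseline) / baseline


# (baseline, improved) accuracy pairs quoted in the highlights above.
comparisons = {
    "few-shot vs zero-shot": (0.5950, 0.6698),
    "GPT-4 vs GPT-3.5": (0.6111, 0.6975),
}

for name, (base, new) in comparisons.items():
    points = absolute_gain(base, new) * 100
    percent = relative_gain(base, new) * 100
    print(f"{name}: +{points:.2f} points, {percent:.1f}% relative")
# few-shot vs zero-shot: +7.48 points, 12.6% relative
# GPT-4 vs GPT-3.5: +8.64 points, 14.1% relative
```

Read this way, the few-shot figure of 12.6% is a relative improvement, while the GPT-4 figure of 8.64 is an absolute difference in percentage points.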

Breakthrough Assessment

4/10

A solid application paper establishing best practices for prompt engineering in educational assessment. While not architecturally novel, the finding that CoT fails without rubric constraints is practically valuable.

⚙️ Technical Details

Problem Definition

Setting: Multi-class classification of short-answer student responses based on specific science rubrics

Inputs: Student response text, Item Stem (question context), Scoring Rubric, (Optional) Few-shot examples

Outputs: Proficiency Label (Beginning, Developing, Proficient) and Explanation

Pipeline Flow

Prompt Construction (WRVRT) -> API Call (GPT-3.5/4) -> Sampling Strategy (Greedy/Voting) -> Output Label Extraction
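The last stage of this flow, output label extraction, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `extract_label` and the rule of taking the last label mentioned (on the assumption that a Chain-of-Thought completion states its verdict after its reasoning) are hypothetical choices.

```python
import re
from typing import Optional

# The three proficiency labels named in the problem definition.
LABELS = ("Beginning", "Developing", "Proficient")
_LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def extract_label(completion: str) -> Optional[str]:
    """Return the last proficiency label mentioned in the text, or None.

    Assumption: the model reasons first and states its final score last,
    so the final match is treated as the verdict.
    """
    matches = _LABEL_PATTERN.findall(completion)
    if not matches:
        return None
    # Normalise casing so "PROFICIENT" and "proficient" map to "Proficient".
    return matches[-1].capitalize()
```

A caveat with this heuristic: an ordinary use of a label word in the reasoning (for example "at the beginning of the response") would also match, so a stricter output format requested in the prompt would be more robust.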

System Modules

Prompt Constructor

Assembles the prompt components: Role (BasicRole), Context/Rubric (ContRubTEXT), Examples (FewEXAMPLES), and CoT Initiator

Model or implementation: Rule-based string concatenation
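A rule-based concatenation of these components might look like the sketch below. The component order, the "Student response:" prefix, and the default CoT initiator phrase are assumptions made for illustration; the summary names the components (BasicRole, ContRubTEXT, FewEXAMPLES, CoT Initiator) but does not specify how they are joined.

```python
from typing import Sequence


def build_prompt(
    basic_role: str,
    cont_rub_text: str,
    few_examples: Sequence[str],
    student_response: str,
    cot_initiator: str = "Let's think step by step.",  # assumed wording
) -> str:
    """Join the prompt components with blank lines, skipping empty ones."""
    parts = [basic_role]
    if cont_rub_text:              # omitted for prompts without Context/Rubric
        parts.append(cont_rub_text)
    parts.extend(few_examples)     # empty sequence for zero-shot prompts
    parts.append(f"Student response: {student_response}")
    if cot_initiator:              # pass "" for non-CoT prompts
        parts.append(cot_initiator)
    return "\n\n".join(parts)


# Hypothetical usage: a zero-shot CoT prompt with context and rubric.
prompt = build_prompt(
    basic_role="You are a science teacher scoring student explanations.",
    cont_rub_text="Item stem: <question text>\nRubric: <scoring criteria>",
    few_examples=[],
    student_response="<student answer>",
)
```

Leaving `cont_rub_text` empty or `few_examples` empty yields the simpler prompt variants (zero-shot, no CR) from the same function, which keeps the compared conditions structurally identical apart from the component under test.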

Inference Engine

Generates the score and reasoning

Model or implementation: GPT-4 or GPT-3.5-turbo

Voter/Aggregator

Determines final label when using ensemble/nucleus sampling

Model or implementation: Majority Vote Logic
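Majority voting over several sampled completions can be sketched as below. The tie-breaking rule is an assumption: with three calls and three possible labels, all three votes can differ, and the summary does not say how such ties are resolved. This sketch falls back to the first sampled label.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; on a tie, return the first sampled label."""
    if not labels:
        raise ValueError("need at least one label to vote")
    counts = Counter(labels)
    best_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == best_count]
    if len(winners) == 1:
        return winners[0]
    # Assumed tie-break: trust the first call's label.
    return labels[0]
```

For example, `majority_vote(["Proficient", "Developing", "Proficient"])` returns `"Proficient"`, while a three-way split returns whichever label the first call produced.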

Modeling

Base Model: GPT-4 and GPT-3.5-turbo

Key Hyperparameters:

greedy_temperature: 0.0
greedy_top_p: 0.01
nucleus_temperature: 0.9
nucleus_top_p: 0.95
voting_ensemble_size: 3 calls
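The two sampling strategies can be captured as configuration objects, as in the sketch below. The class and constant names are hypothetical, and pairing nucleus sampling with the three-call voting ensemble (and greedy sampling with a single call) is an assumption based on the module descriptions above. `temperature` and `top_p` are standard sampling parameters of chat-completion APIs; no API call is made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float
    n_calls: int  # how many completions are requested per student response


# Values taken from the hyperparameter list above.
GREEDY = SamplingConfig(temperature=0.0, top_p=0.01, n_calls=1)
NUCLEUS_VOTING = SamplingConfig(temperature=0.9, top_p=0.95, n_calls=3)


def request_kwargs(cfg: SamplingConfig) -> dict:
    """Sampling arguments to pass along with each model request."""
    return {"temperature": cfg.temperature, "top_p": cfg.top_p}
```

Keeping the call count next to the decoding parameters makes the cost difference explicit: the voting configuration issues three requests per response where greedy issues one.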

Compute: Not reported in the paper

Comparison to Prior Work

vs. Supervised Learning (BERT/ML): Does not require model training or large labeled datasets; relies on prompt engineering [not cited in paper]
vs. Standard Zero-Shot: Incorporates specific educational rubrics and item stems directly into the prompt context
vs. Zero-Shot CoT (Generic): Adds domain constraints (Rubrics) preventing the model from hallucinating irrelevant reasoning paths

Limitations

Performance varies significantly by item type; trinomial tasks remain challenging (accuracy of roughly 0.60)
Relies on proprietary models (GPT-4), raising cost and privacy concerns for schools
Few-shot examples require expert selection and CoT annotation, which is not fully automated
No direct comparison to fine-tuned open-source models (like Llama 2 or Mistral) in the experiments

Reproducibility

No code repository is provided. Prompt templates are described in detail in the text (Figure 3) and in Appendix 1, which lists the prompt components. The study is a secondary analysis of the dataset from Zhai, He, & Krajcik (2022), but the specific subsets used are not explicitly released.

📊 Experiments & Results

Evaluation Setup

Automatic scoring of middle school science written responses (short answer)

Benchmarks:

NGSS Assessment Dataset (Scientific Argumentation/Explanation; Binomial and Trinomial tasks)

Metrics:

Accuracy
Quadratic Weighted Kappa (QWK)
Precision
Recall
F1 Score
Statistical methodology: Not explicitly reported in the paper
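Of the metrics listed, Quadratic Weighted Kappa is the least familiar, so a self-contained sketch of the standard definition follows. It assumes labels are encoded as integers 0..k-1 in rubric order (for example Beginning=0, Developing=1, Proficient=2); this encoding and the function name are illustrative, and the paper's own computation may differ in detail.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    human: Sequence[int], model: Sequence[int], n_classes: int
) -> float:
    """QWK = 1 - sum(w * observed) / sum(w * expected), w_ij = (i-j)^2 / (k-1)^2."""
    if len(human) != len(model) or not human:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(human)
    # Confusion matrix of human (rows) against model (columns) scores.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for h, m in zip(human, model):
        observed[h][m] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count if the two raters were independent.
            denominator += weight * hist_h[i] * hist_m[j] / n
    if denominator == 0.0:
        return 1.0  # both raters used one and the same class throughout
    return 1.0 - numerator / denominator
```

Because the weight grows with the squared distance between labels, scoring a Proficient response as Beginning costs four times as much as scoring it as Developing, which is why QWK suits ordered rubric levels better than plain accuracy.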

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of different prompt engineering strategies using GPT-4 (Greedy Sampling). Showing that Few-Shot generally beats Zero-Shot, and Context+Rubric is crucial for CoT.
Average across 6 Tasks	Accuracy	0.5487	0.6604	+0.1117
Average across 6 Tasks	Accuracy	0.5487	0.6831	+0.1344
Average across 6 Tasks	Accuracy	0.5532	0.6831	+0.1299
Comparison of Model Performance (GPT-4 vs GPT-3.5) using the best prompt strategy (Few-Shot CoT with Context/Rubric) and Greedy Sampling.
Average across 6 Tasks	Accuracy	0.6111	0.6975	+0.0864

Main Takeaways

Chain-of-Thought (CoT) is ineffective for automatic scoring unless paired with explicit scoring rubrics and item stems (Context); without them, the model 'reasons' its way to incorrect criteria.
CoT with Rubrics acts as a balancer: it significantly improves classification of minority or difficult classes (e.g., 'Beginning' proficiency) where standard prompts often default to the majority class.
Few-shot learning consistently outperforms Zero-shot, suggesting that even with powerful models like GPT-4, providing examples is critical for nuanced educational assessment.
The 'Voting' strategy (Ensemble of 3 calls) did not consistently outperform single-call Greedy sampling for GPT-4, suggesting Greedy is more efficient for this specific task.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Zero-shot vs. Few-shot prompting
Familiarity with Chain-of-Thought (CoT) reasoning
Basic knowledge of educational assessment rubrics

Key Terms

WRVRT: Writing, Reviewing, Validating, Revising, and Testing—a proposed iterative workflow for developing reliable educational prompts

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

Binomial/Trinomial Scoring: Scoring systems with two levels (Correct/Incorrect) or three levels (Beginning/Developing/Proficient)

QWK: Quadratic Weighted Kappa—a metric measuring agreement between raters (or AI and human) that penalizes large disagreements more heavily than small ones

Greedy Sampling: A decoding strategy where the model always selects the highest-probability next token (Temperature=0), which is intended to make outputs deterministic

Nucleus Sampling: A decoding strategy (Top-p) that selects from the smallest set of tokens whose cumulative probability exceeds a threshold p, allowing for diversity
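The nucleus (Top-p) definition above can be illustrated with a toy next-token distribution. This is a generic sketch of the technique, not the decoding code of any particular model; the token names and probabilities are made up for illustration.

```python
import random
from typing import Dict


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches or exceeds p, then renormalise the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_token(probs: Dict[str, float], p: float, rng: random.Random) -> str:
    """Draw one token from the renormalised nucleus."""
    nucleus = nucleus_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


# Made-up distribution over four candidate next tokens.
toy = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
wide = nucleus_filter(toy, 0.9)     # keeps A, B and C
narrow = nucleus_filter(toy, 0.01)  # keeps only A
```

With a very small threshold such as 0.01 the nucleus collapses to the single most likely token, so sampling from it behaves like greedy selection, whereas a threshold such as 0.95 leaves room for varied completions.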

Item Stem: The main part of a test question that presents the problem or task to the student

Rubric: A scoring guide used to evaluate the quality of students' constructed responses