
Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai
University of Georgia
Computers and Education: Artificial Intelligence (2023)
Reasoning Benchmark QA

📝 Paper Summary

AI in Education · Automatic Essay Scoring
Combining Chain-of-Thought prompting with specific scoring rubrics and item context significantly improves GPT-4's accuracy in scoring student science explanations compared to standard zero-shot approaches.
Core Problem
Standard automatic scoring models require extensive labeled training data and technical expertise, while generic LLM prompts often fail to grasp the specific nuances of complex scientific scoring rubrics.
Why it matters:
  • Developing traditional supervised scoring models is labor-intensive and technically inaccessible for many educators and researchers
  • Standard 'black box' AI scoring lacks transparency, making it difficult for teachers to trust grades or provide actionable feedback to students
  • Generic LLM scoring often hallucinates criteria or misses specific evidence requirements defined in educational standards
Concrete Example: In Task H4_2 (a trinomial science item, i.e. one scored on three levels), a standard Zero-Shot CoT prompt correctly identified 'Proficient' responses only 27.5% of the time. It failed because the model generated its own reasoning path instead of following the grading rubric. Adding the Context and Rubric (CR) to the CoT prompt raised accuracy on 'Proficient' responses to 68.33%.
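The per-level figure quoted above is the share of responses at one gold score level that the model also assigned to that level. A minimal sketch of that calculation follows; the function name, the level labels other than 'Proficient', and the four-response toy data are assumptions made up for illustration, not data from the paper.

```python
from collections import Counter

def per_class_accuracy(gold, predicted):
    """Fraction of responses at each gold score level that were scored correctly."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: correct[label] / total[label] for label in total}

# Toy data (illustrative only): human scores vs. model scores.
gold = ["Proficient", "Proficient", "Developing", "Beginning"]
predicted = ["Proficient", "Developing", "Developing", "Beginning"]

for level, acc in per_class_accuracy(gold, predicted).items():
    print(f"{level}: {acc:.2%}")
```

Reporting accuracy per level, rather than only overall, is what exposes cases like the one above, where a prompt does acceptably on average but misses most responses at one level.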
Key Novelty
WRVRT Framework & Context-Aware CoT
  • Proposes WRVRT (Writing, Reviewing, Validating, Revising, Testing), an iterative prompt engineering workflow specifically designed for educational validity
  • Demonstrates that Chain-of-Thought (CoT) alone is ineffective for scoring; it requires explicit 'Context and Rubric' (CR) constraints to align model reasoning with pedagogical standards
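The prompt variants compared in the paper (zero-shot or few-shot, with or without CoT, with or without Context and Rubric) can be thought of as one template with optional parts. The sketch below is one hypothetical way to assemble such a prompt; the function name, parameter names, section labels, and the wording of the CoT instruction are all assumptions and do not reproduce the authors' actual prompts.

```python
# Hypothetical CoT instruction; the paper's exact wording is not reproduced here.
COT_INSTRUCTION = (
    "Let's think step by step, stating which rubric criterion each step relies on, "
    "then give the final score."
)

def build_scoring_prompt(response, item_context=None, rubric=None,
                         examples=(), use_cot=True):
    """Assemble a scoring prompt from optional context, rubric, and worked examples.

    examples: iterable of (student_response, reasoning, score) tuples;
              an empty iterable gives a zero-shot prompt.
    """
    parts = []
    if item_context:
        parts.append(f"Item context:\n{item_context}")
    if rubric:
        parts.append(f"Scoring rubric:\n{rubric}")
    for ex_response, ex_reasoning, ex_score in examples:
        parts.append(
            f"Student response: {ex_response}\n"
            f"Reasoning: {ex_reasoning}\n"
            f"Score: {ex_score}"
        )
    parts.append(f"Student response: {response}")
    parts.append(COT_INSTRUCTION if use_cot else "Score:")
    return "\n\n".join(parts)

# Zero-shot CoT with Context and Rubric (placeholder strings, not real item text).
prompt = build_scoring_prompt(
    "The gas particles move faster when heated.",
    item_context="<item stem goes here>",
    rubric="<scoring rubric goes here>",
)
print(prompt)
```

Dropping `item_context` and `rubric` gives the plain Zero-Shot CoT condition, which leaves the model to invent its own scoring criteria.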
Evaluation Highlights
  • Few-shot learning achieved 66.98% average accuracy across six tasks, outperforming zero-shot learning (59.50%) by 7.48 percentage points (a 12.6% relative improvement)
  • Adding Context and Rubric (CR) to Zero-Shot Chain-of-Thought prompts increased accuracy by 13.44% (from 0.5532 to 0.6831)
  • GPT-4 with greedy sampling outperformed GPT-3.5 by 8.64 percentage points (0.6975 vs 0.6111) when using the best-performing Few-Shot CoT + CR prompt
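A gain between two accuracy figures can be stated as an absolute difference in percentage points or as a change relative to the baseline, and the two readings give different numbers. The sketch below computes both for the accuracy pairs quoted above; the helper name and the dictionary labels are assumptions introduced for illustration.

```python
def accuracy_gain(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent).

    Both arguments are accuracies expressed as proportions in [0, 1].
    """
    absolute_pp = (improved - baseline) * 100
    relative_pct = (improved - baseline) / baseline * 100
    return round(absolute_pp, 2), round(relative_pct, 2)

# Accuracy pairs taken from the highlights above: (baseline, improved).
pairs = {
    "few-shot vs zero-shot": (0.5950, 0.6698),
    "zero-shot CoT: with CR vs without": (0.5532, 0.6831),
    "GPT-4 vs GPT-3.5": (0.6111, 0.6975),
}

for label, (base, new) in pairs.items():
    points, relative = accuracy_gain(base, new)
    print(f"{label}: +{points} points, +{relative}% relative")
```

On these inputs the few-shot gain is 7.48 points (12.57% relative) and the GPT-4 gain is 8.64 points (14.14% relative). The two zero-shot CoT endpoints differ by 12.99 points, so the 13.44% figure is kept as reported and not derived from them.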
Breakthrough Assessment
4/10
A solid application paper establishing best practices for prompt engineering in educational assessment. While not architecturally novel, the finding that CoT fails without rubric constraints is practically valuable.