Mark H. Liffiton, Brad E. Sheese, Jaromir Savelka, Paul Denny
Illinois Wesleyan University,
Carnegie Mellon University,
The University of Auckland
European Conference on Modelling and Simulation
(2023)
Agent, QA, Reasoning, Factuality
📝 Paper Summary
AI for Education, Human-Computer Interaction (HCI)
CodeHelp employs a multi-stage LLM prompting pipeline with explicit guardrails to provide on-demand programming assistance while actively preventing the generation of direct solution code.
Core Problem
Standard LLMs (like ChatGPT) and code generators often provide direct solutions to programming assignments, leading to student over-reliance and hindering learning.
Why it matters:
Scalability limits of human TAs/instructors in large classes prevent timely support for all students
Students using raw LLMs may bypass the productive struggle required for learning CS concepts
Static hint systems are labor-intensive to author and rarely cover all possible student errors
Concrete Example: When a student asks an LLM 'Write a while loop that starts at the last character...', a standard model outputs the exact code. CodeHelp instead provides a conceptual explanation of `len()` and `range()` without writing the solution block.
Key Novelty
3-Stage Guardrailed Pipeline
Decomposes the help process into three distinct LLM calls: Sufficiency Check, Main Response Generation, and Code Removal
Uses a 'Code Removal' agent specifically prompted to rewrite responses if the main agent violates instructions and leaks code (a failure mode common in standard instruction-tuned models)
Scores multiple generated completions against an instructor-defined 'avoid set' (forbidden keywords) to select the most pedagogically appropriate response
Architecture
The logic flow of the CodeHelp response pipeline.
Evaluation Highlights
95% of surveyed students (n=45) agreed they would like to use CodeHelp in future Computer Science courses
80% of students agreed or strongly agreed that the tool helped them complete their work successfully
Cost-effective deployment at roughly $0.002 per query, estimated under $10 for a 50-student class per semester
Breakthrough Assessment
7/10
While not a fundamental architectural advance in ML, the 'Code Removal' pipeline is a practical, effective pattern for enforcing negative constraints (guardrails) where standard prompting often fails.
⚙️ Technical Details
Problem Definition
Setting: Real-time automated tutoring for novice programmers
Outputs: Natural language guidance/explanation without solution code
Pipeline Flow
Sufficiency Check (LLM)
Main Response Generation (LLM - Parallel)
Scoring & Selection
Code Removal (LLM - Conditional)
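The four steps above can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: the `Stages` container, the `answer` function, and every field name are hypothetical, and each field stands in for an LLM call or heuristic that the summary only describes at a high level. Returning early with the clarification request is one possible reading of the Sufficiency Check; candidates are assumed to be already generated by the time they are scored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stages:
    """Hypothetical bundle of the pipeline steps (names are assumptions)."""
    # Returns a clarification request, or None when the query has enough context.
    check_sufficiency: Callable[[str], Optional[str]]
    # Returns several candidate responses for the same query.
    generate: Callable[[str], List[str]]
    # Lower score means fewer constraint violations.
    score: Callable[[str], int]
    # True when the response still contains a code block.
    contains_code: Callable[[str], bool]
    # Rewrites a response so that it no longer contains code.
    remove_code: Callable[[str], str]


def answer(query: str, stages: Stages) -> str:
    # Step 1: Sufficiency Check.
    clarification = stages.check_sufficiency(query)
    if clarification is not None:
        return clarification

    # Step 2: Main Response Generation (multiple candidates).
    candidates = stages.generate(query)

    # Step 3: Scoring and Selection (pick the least-violating candidate).
    best = min(candidates, key=stages.score)

    # Step 4: Code Removal, triggered only when code leaked through.
    if stages.contains_code(best):
        best = stages.remove_code(best)
    return best
```

Keeping each stage behind a plain callable makes it easy to swap a real API client for a stub when testing the control flow.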
System Modules
Sufficiency Check
Determine if the student provided enough context to answer; if not, generate a clarification request
Model or implementation: gpt-3.5-turbo-0301
Main Response Generator
Generate educational explanation while attempting to avoid solution code and instructor-defined forbidden keywords
Model or implementation: gpt-3.5-turbo-0301
Code Remover
Rewrites the selected response if it contains code blocks (markdown), ensuring no solution leaks
Model or implementation: text-davinci-003
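The trigger for the Code Remover, a response containing a markdown code block, can be approximated with a simple pattern check. The sketch below assumes that "code block" means a fenced block delimited by triple backticks; `contains_code_block`, `maybe_remove_code`, and the `remover` argument are hypothetical names, and the actual detection logic in CodeHelp may differ.

```python
import re
from typing import Callable

# Triple-backtick fence marker, built from a single backtick for easy embedding.
FENCE = "`" * 3

# Matches an opening fence, any content (including newlines), and a closing fence.
FENCED_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def contains_code_block(response: str) -> bool:
    """Return True if the response includes at least one fenced code block."""
    return FENCED_BLOCK.search(response) is not None


def maybe_remove_code(response: str, remover: Callable[[str], str]) -> str:
    """Call the (assumed) code-removal LLM wrapper only when a block is present."""
    return remover(response) if contains_code_block(response) else response
```

Inline mentions such as `len()` do not match this pattern, so conceptual explanations that name a function pass through without a second LLM call.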
Novel Architectural Elements
Conditional repair step: A specialized 'Code Removal' LLM call is triggered only when the selected primary response violates the negative constraint (it contains code blocks)
Parallel generation and heuristic scoring of candidate responses based on constraint violation (keywords/code blocks)
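One way to picture the heuristic scoring is as a violation count per candidate. The sketch below is an assumption about how such a score could work, not the paper's formula: `violation_score` and `select_response` are hypothetical names, keyword matching is a plain case-insensitive substring test, and a code block is weighted the same as one forbidden keyword.

```python
from typing import List, Sequence

# Triple-backtick fence marker, built from a single backtick for easy embedding.
FENCE = "`" * 3


def violation_score(response: str, avoid_set: Sequence[str]) -> int:
    """Count forbidden keywords present, plus one if a code fence appears."""
    lowered = response.lower()
    keyword_hits = sum(1 for kw in avoid_set if kw.lower() in lowered)
    code_penalty = 1 if FENCE in response else 0
    return keyword_hits + code_penalty


def select_response(candidates: List[str], avoid_set: Sequence[str]) -> str:
    """Pick the candidate with the fewest violations (first one wins ties)."""
    return min(candidates, key=lambda r: violation_score(r, avoid_set))


# Example: the instructor wants students to write a loop, so "sum" is avoided.
candidates = [
    "You can call sum() on the list.",
    "Add each item to a running total inside a loop.",
]
chosen = select_response(candidates, avoid_set=["sum"])
```

Substring matching is deliberately crude here; a real deployment would likely need word-boundary handling so that an avoided keyword such as `sum` does not penalize words like "summary".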
Modeling
Base Model: OpenAI GPT-3.5 Family (gpt-3.5-turbo-0301 and text-davinci-003)
Compute: Inference only. Avg cost ~$0.002 per query (OpenAI API pricing circa June 2023).
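The cost figures quoted in this summary imply a rough per-student query budget, which the arithmetic below works out. Only the three constants come from the summary; the function name and the assumption that every query costs the same average amount are illustrative.

```python
COST_PER_QUERY_USD = 0.002  # average cost per query, from the summary
CLASS_SIZE = 50             # students, from the summary
BUDGET_USD = 10.0           # upper bound per semester, from the summary


def semester_cost(queries_per_student: float,
                  class_size: int = CLASS_SIZE,
                  cost_per_query: float = COST_PER_QUERY_USD) -> float:
    """Total API cost for one class, assuming a flat average cost per query."""
    return queries_per_student * class_size * cost_per_query


# Query volume at which the $10 bound would be reached.
max_total_queries = BUDGET_USD / COST_PER_QUERY_USD      # about 5,000 queries
max_queries_per_student = max_total_queries / CLASS_SIZE  # about 100 per student
```

Under these assumptions, a class stays within the quoted bound as long as students average fewer than roughly 100 queries each over the semester.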
Comparison to Prior Work
vs. Coding Steps: CodeHelp explicitly hides code solutions via post-processing, whereas Coding Steps provides code directly
vs. Hellas et al. (Standard GPT-3.5): CodeHelp adds a 'Code Removal' agent to fix the 'blurting out answers' behavior inherent in raw models
vs. Pyo: CodeHelp uses LLMs for open-ended queries rather than rule-based flows, allowing support for any language or topic
vs. Standard ChatGPT [not cited in paper]: CodeHelp incorporates instructor-specific 'avoid sets' to prevent using advanced concepts (e.g., asking for a loop but getting a `sum()` function)
Limitations
Hallucinations: Responses can be factually incorrect (mitigated by warning banners)
One-shot interaction: No dialogue history or follow-up capability (design choice to prevent jailbreaking/over-reliance)
Context limit: Students sometimes provide insufficient info, and the tool can't 'see' their full environment like a human TA
Model dependency: The 'Code Removal' step relies on `text-davinci-003`, which is more expensive, because the cheaper `gpt-3.5-turbo` model failed at this task
Statistical methodology: Descriptive statistics (percentages, means) and qualitative thematic analysis
Key Results
Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp)
Classroom Deployment | Future Use Interest | 5 | 95 | +90
Classroom Deployment | Perceived Utility (Success) | 20 | 80 | +60
Classroom Deployment | Perceived Utility (Learning) | 37 | 63 | +26
Experiment Figures
Percentage of students using CodeHelp each week over the 12-week semester.
Heatmap of query volume by hour and day of week.
Main Takeaways
Students used the tool consistently throughout the semester (roughly 50% of the class active each week) without being forced, indicating genuine utility.
The 'Code Removal' guardrail effectively prevented solution leakage in most cases, though students occasionally received answers using concepts not yet covered in class (mitigated by 'avoid sets').
Qualitative feedback highlighted 'Availability' (24/7 support) as the primary benefit, followed by help with 'Fixing Errors'.
Usage heatmaps showed activity clustered around class times but also significant usage during nights/weekends when human staff were unavailable.
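A usage heatmap of the kind described above can be built by bucketing query timestamps by weekday and hour. The sketch below is a generic illustration; the function name and the input format (a list of `datetime` objects) are assumptions, since the summary does not describe how the logs were stored or plotted.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def query_heatmap(timestamps: Iterable[datetime]) -> Dict[Tuple[int, int], int]:
    """Count queries per (weekday, hour) cell; weekday 0 is Monday."""
    counts = Counter((ts.weekday(), ts.hour) for ts in timestamps)
    return dict(counts)


# Hypothetical log entries: two Monday-afternoon queries and one late on Saturday.
sample_log = [
    datetime(2023, 3, 6, 14, 5),
    datetime(2023, 3, 6, 14, 40),
    datetime(2023, 3, 11, 23, 0),
]
grid = query_heatmap(sample_log)
```

Cells with no queries are simply absent from the dictionary, so a plotting step would need to fill the remaining cells of the 7 by 24 grid with zeros.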
📚 Prerequisite Knowledge
Prerequisites
Basic understanding of Large Language Models (LLMs) and prompting
Familiarity with Introductory Computer Science (CS1) pedagogy
Key Terms
CS1: Computer Science 1—the standard introductory programming course for undergraduates
Guardrails: Mechanisms (logic or prompts) designed to restrict model outputs to safe, pedagogical bounds (e.g., preventing solution disclosure)
Chain of Thought: A prompting technique where the model is instructed to articulate its reasoning steps before producing a final answer
LTI: Learning Tools Interoperability—a standard protocol that allows learning tools (like CodeHelp) to integrate with Learning Management Systems (like Canvas/Moodle)
Hallucination: The generation of text by an LLM that is factually incorrect or nonsensical