CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes

📝 Paper Summary

AI for Education, Human-Computer Interaction (HCI)

CodeHelp employs a multi-stage LLM prompting pipeline with explicit guardrails to provide on-demand programming assistance while actively preventing the generation of direct solution code.

Core Problem

Standard LLMs (like ChatGPT) and code generators often provide direct solutions to programming assignments, leading to student over-reliance and hindering learning.

Why it matters:

Scalability limits of human TAs/instructors in large classes prevent timely support for all students
Students using raw LLMs may bypass the productive struggle required for learning CS concepts
Static hint systems are labor-intensive to author and rarely cover all possible student errors

Concrete Example: When a student asks an LLM 'Write a while loop that starts at the last character...', a standard model outputs the exact code. CodeHelp instead provides a conceptual explanation of `len()` and `range()` without writing the solution block.

Key Novelty

3-Stage Guardrailed Pipeline

Decomposes the help process into three distinct LLM calls: Sufficiency Check, Main Response Generation, and Code Removal
Uses a 'Code Removal' agent specifically prompted to rewrite responses if the main agent violates instructions and leaks code (a failure mode common in standard instruction-tuned models)
Scores multiple generated completions against an instructor-defined 'avoid set' (forbidden keywords) to select the most pedagogically appropriate response

Architecture

The logic flow of the CodeHelp response pipeline.

Evaluation Highlights

95% of surveyed students (n=45) agreed they would like to use CodeHelp in future Computer Science courses
80% of students agreed or strongly agreed that the tool helped them complete their work successfully
Cost-effective deployment at roughly $0.002 per query, estimated under $10 for a 50-student class per semester

Breakthrough Assessment

7/10

While not a fundamental architectural advance in ML, the 'Code Removal' pipeline is a practical, effective pattern for enforcing negative constraints (guardrails) where standard prompting often fails.

⚙️ Technical Details

Problem Definition

Setting: Real-time automated tutoring for novice programmers

Inputs: Structured student query containing: Programming Language, Code Snippet (optional), Error Message (optional), and Issue Description

Outputs: Natural language guidance/explanation without solution code
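
The four input fields above map naturally onto a small record type. The following is a minimal sketch, not the tool's actual data model: the class name `StudentQuery`, its field names, and the `to_prompt_text` helper are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudentQuery:
    """Hypothetical container for the four input fields listed above."""
    language: str                  # programming language (required)
    issue: str                     # the student's issue description (required)
    code: Optional[str] = None     # optional code snippet
    error: Optional[str] = None    # optional error message

    def to_prompt_text(self) -> str:
        """Render only the fields the student actually filled in."""
        parts = [f"Programming language: {self.language}"]
        if self.code:
            parts.append(f"Code:\n{self.code}")
        if self.error:
            parts.append(f"Error message:\n{self.error}")
        parts.append(f"Issue: {self.issue}")
        return "\n\n".join(parts)


query = StudentQuery(language="Python", issue="My loop never stops.")
print(query.to_prompt_text())
```

Keeping the optional fields explicit lets a later sufficiency check see which fields were left empty instead of guessing from free text.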

Pipeline Flow

Sufficiency Check (LLM)
Main Response Generation (LLM - Parallel)
Scoring & Selection
Code Removal (LLM - Conditional)
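
The four steps can be sketched as a single function that takes each stage as a callable. This is an illustrative outline under stated assumptions, not the authors' implementation: `run_pipeline` and its parameter names are hypothetical, the candidates are produced sequentially rather than in parallel, and returning the clarification request on its own is an assumption (the summary does not say whether a main response is still shown in that case).

```python
from typing import Callable, List

FENCE = "`" * 3  # markdown code-fence marker


def run_pipeline(
    query_text: str,
    check_sufficiency: Callable[[str], str],
    generate_candidates: Callable[[str], List[str]],
    score: Callable[[str], int],
    remove_code: Callable[[str], str],
) -> str:
    """Run the four steps in order; every callable is a hypothetical stand-in."""
    # 1. Sufficiency check: an empty string means enough context was given.
    clarification = check_sufficiency(query_text)
    if clarification:
        return clarification  # assumption: stop early and ask the student

    # 2. Main response generation: several candidate completions.
    candidates = generate_candidates(query_text)

    # 3. Scoring and selection: a lower score means fewer guardrail violations.
    best = min(candidates, key=score)

    # 4. Code removal, only if the selected response still has a code block.
    if FENCE in best:
        best = remove_code(best)
    return best


# Stubbed usage: no real LLM is called here.
leaky = f"Try this:\n{FENCE}python\nfor ch in s: print(ch)\n{FENCE}"
answer = run_pipeline(
    "Python: how do I loop over a string backwards?",
    check_sufficiency=lambda q: "",
    generate_candidates=lambda q: [leaky],
    score=lambda r: r.count(FENCE),
    remove_code=lambda r: "Think about which index the last character has.",
)
print(answer)
```

Passing the stages in as callables keeps the control flow testable without network access; in a real deployment each callable would wrap an LLM request.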

System Modules

Sufficiency Check

Determine if the student provided enough context to answer; if not, generate a clarification request

Model or implementation: gpt-3.5-turbo-0301

Main Response Generator

Generate educational explanation while attempting to avoid solution code and instructor-defined forbidden keywords

Model or implementation: gpt-3.5-turbo-0301

Code Remover

Rewrites the selected response if it contains code blocks (markdown), ensuring no solution leaks

Model or implementation: text-davinci-003
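
The trigger for this module is the presence of a markdown code block in the selected response. One way to detect that is sketched below, assuming that "code block" means a fenced block delimited by triple backticks; the function names and the `rewrite` callable (standing in for the LLM rewrite call) are hypothetical.

```python
import re
from typing import Callable

FENCE = "`" * 3  # markdown code-fence marker

# Matches an opening fence, any text (including newlines), and a closing fence.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def contains_code_block(response: str) -> bool:
    """True if the response holds at least one fenced markdown code block."""
    return CODE_BLOCK.search(response) is not None


def maybe_remove_code(response: str, rewrite: Callable[[str], str]) -> str:
    """Call the (hypothetical) rewrite step only when a code block is present."""
    if contains_code_block(response):
        return rewrite(response)
    return response


inline_only = "Use `len()` to find the length of the string."
fenced = "Here you go:\n" + FENCE + "python\nprint(len(s))\n" + FENCE

print(contains_code_block(inline_only))
print(contains_code_block(fenced))
```

Under this assumption, inline mentions such as `len()` pass through untouched, so conceptual explanations that name a function are not sent for rewriting.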

Novel Architectural Elements

Conditional repair step: A specialized 'Code Removal' LLM call is triggered only when the primary generation violates a negative constraint (the selected response contains code blocks)
Parallel generation and heuristic scoring of candidate responses based on constraint violation (keywords/code blocks)
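
The heuristic scoring of candidates can be illustrated as follows. The summary does not give the exact scoring rule, so this sketch assumes a simple one: one point per avoid-set keyword found, plus one point if a fenced code block is present, with the lowest score winning. The names `violation_score` and `select_best` are hypothetical, and keywords are assumed to be plain identifiers such as `sum`.

```python
import re
from typing import List, Set

FENCE = "`" * 3  # markdown code-fence marker


def violation_score(response: str, avoid_set: Set[str]) -> int:
    """Count constraint violations under the assumed scoring rule."""
    text = response.lower()
    keyword_hits = sum(
        1
        for kw in avoid_set
        # Whole-word match so that "sum" does not fire on "summary".
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
    )
    code_penalty = 1 if FENCE in response else 0
    return keyword_hits + code_penalty


def select_best(candidates: List[str], avoid_set: Set[str]) -> str:
    """Pick the candidate with the fewest violations (first one on ties)."""
    return min(candidates, key=lambda r: violation_score(r, avoid_set))


candidates = [
    "You can call sum(values) to total the list.",
    "Keep a running total and add each item inside a for loop.",
]
print(select_best(candidates, {"sum"}))
```

With `sum` in the avoid set, the second candidate is selected because it explains the loop-based approach without naming the forbidden built-in.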

Modeling

Base Model: OpenAI GPT-3.5 Family (gpt-3.5-turbo-0301 and text-davinci-003)

Compute: Inference only. Avg cost ~$0.002 per query (OpenAI API pricing circa June 2023).
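
As a back-of-the-envelope check, the per-query cost can be combined with the figure of under $10 for a 50-student class quoted in the Evaluation Highlights. The numbers derived below are upper bounds implied by those two figures, not usage statistics reported in the summary; the variable and function names are illustrative.

```python
COST_PER_QUERY_USD = 0.002   # average cost per query, from the summary
CLASS_SIZE = 50              # students, from the summary
BUDGET_USD = 10.0            # semester cost ceiling, from the summary

# Largest number of queries that fits in the budget (rounded to absorb float error).
max_queries = round(BUDGET_USD / COST_PER_QUERY_USD)

# Implied ceiling on queries per student over the semester.
max_queries_per_student = max_queries // CLASS_SIZE


def semester_cost(students: int, queries_per_student: int) -> float:
    """Estimated API cost in USD for a class, at the quoted per-query rate."""
    return students * queries_per_student * COST_PER_QUERY_USD


print(max_queries, max_queries_per_student)
print(round(semester_cost(50, 40), 2))
```

The $10 ceiling therefore leaves room for up to 5,000 queries in total, or 100 per student, before the estimate would be exceeded.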

Comparison to Prior Work

vs. Coding Steps: CodeHelp explicitly hides code solutions via post-processing, whereas Coding Steps provides code directly
vs. Hellas et al. (Standard GPT-3.5): CodeHelp adds a 'Code Removal' agent to fix the 'blurting out answers' behavior inherent in raw models
vs. Pyo: CodeHelp uses LLMs for open-ended queries rather than rule-based flows, allowing support for any language or topic
vs. Standard ChatGPT [not cited in paper]: CodeHelp incorporates instructor-specific 'avoid sets' to prevent using advanced concepts (e.g., asking for a loop but getting a `sum()` function)

Limitations

Hallucinations: Responses can be factually incorrect (mitigated by warning banners)
One-shot interaction: No dialogue history or follow-up capability (design choice to prevent jailbreaking/over-reliance)
Context limit: Students sometimes provide insufficient info, and the tool can't 'see' their full environment like a human TA
Model dependency: The 'Code Removal' step relies on `text-davinci-003`, which is more expensive; `gpt-3.5-turbo` failed at this task

📊 Experiments & Results

Evaluation Setup

In-classroom deployment study (12 weeks)

Benchmarks:

Classroom Deployment (Introductory CS (Python/Pandas) support) [New]

Metrics:

Student Survey Agreement (Likert Scale)
Usage Statistics (Queries per student/week)
Qualitative Thematic Analysis
Statistical methodology: Descriptive statistics (percentages, means) and qualitative thematic analysis

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
Classroom Deployment	Future Use Interest	5	95	+90
Classroom Deployment	Perceived Utility (Success)	20	80	+60
Classroom Deployment	Perceived Utility (Learning)	37	63	+26

Experiment Figures

Percentage of students using CodeHelp each week over the 12-week semester.

Heatmap of query volume by hour and day of week.

Main Takeaways

Students used the tool consistently throughout the semester (roughly 50% of the class active each week) even though use was not required, which suggests genuine utility.
The 'Code Removal' guardrail effectively prevented solution leakage in most cases, though students occasionally received answers using concepts not yet covered in class (mitigated by 'avoid sets').
Qualitative feedback highlighted 'Availability' (24/7 support) as the primary benefit, followed by help with 'Fixing Errors'.
Usage heatmaps showed activity clustered around class times but also significant usage during nights/weekends when human staff were unavailable.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and prompting
Familiarity with Introductory Computer Science (CS1) pedagogy

Key Terms

CS1: Computer Science 1—the standard introductory programming course for undergraduates

Guardrails: Mechanisms (logic or prompts) designed to restrict model outputs to safe, pedagogical bounds (e.g., preventing solution disclosure)

Chain of Thought: A prompting technique where the model is instructed to articulate its reasoning steps before producing a final answer

LTI: Learning Tools Interoperability—a standard protocol that allows learning tools (like CodeHelp) to integrate with Learning Management Systems (like Canvas/Moodle)

Hallucination: The generation of text by an LLM that is factually incorrect or nonsensical