How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging

📝 Paper Summary

LLMs for Education (EdTech) · Human-AI Collaboration · Intelligent Tutoring Systems

HypoCompass reverses the traditional roles by having students act as Teaching Assistants to debug imperfect LLM-generated code, improving their own hypothesis construction and testing skills.

Core Problem

Novice programmers lack sufficient practice in debugging and hypothesis construction because creating specialized debugging exercises is time-consuming for instructors.

Why it matters:

LLMs (like Copilot) are now ubiquitous 'AI pair programmers' but frequently make subtle mistakes (error rates of up to 17% on basic tasks), so students need strong code-evaluation skills
Debugging is often overlooked in CS1 curricula due to the high logistical cost of creating materials
Students currently learn debugging inefficiently by struggling with their own code, mixing hypothesis formation with the cognitive load of code writing

Concrete Example: A student struggling to debug their own code must simultaneously understand the logic, write the syntax, and hypothesize bugs. In HypoCompass, the student delegates code writing to the LLM and focuses purely on creating test cases (hypotheses) to identify why the LLM's code fails.
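
To make the idea of test cases as hypotheses concrete, here is a minimal sketch. The problem (`count_vowels`), the buggy program, and the hypothesis labels are all hypothetical illustrations, not materials from the paper. Each test input stands for one guess about where the code might break, and comparing against the reference solution shows which guesses actually expose the bug.

```python
# Hypothetical CS1 problem: count the vowels in a string, ignoring case.

def reference_count_vowels(text: str) -> int:
    """Reference solution (assumed correct)."""
    return sum(1 for ch in text.lower() if ch in "aeiou")


def buggy_count_vowels(text: str) -> int:
    """Stand-in for LLM-generated buggy code: it forgets to ignore case."""
    return sum(1 for ch in text if ch in "aeiou")


# Each test case encodes one hypothesis about where the code might fail.
hypotheses = {
    "handles empty input": "",
    "handles lowercase vowels": "banana",
    "handles uppercase vowels": "APPLE",
    "handles mixed case": "Education",
}


def failing_hypotheses(candidate, reference, cases):
    """Return the hypothesis labels whose test input exposes a mismatch."""
    return [
        label
        for label, test_input in cases.items()
        if candidate(test_input) != reference(test_input)
    ]


exposed = failing_hypotheses(buggy_count_vowels, reference_count_vowels, hypotheses)
print(exposed)
```

Here only the two case-related hypotheses expose the bug, which is the kind of evidence a student acting as TA would then turn into an explanation for the agent.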

Key Novelty

LLM as a Teachable Agent (Role Reversal)

Simulates a 'reverse' classroom where the LLM plays a confused student and the human user plays the Teaching Assistant (TA)
Uses 'over-generate-then-select' prompting to create diverse, naturally buggy programs from a single problem description
Disentangles learning objectives: students focus on high-level hypothesis testing while the LLM handles low-level code completion and bug fixing based on student instructions

Architecture

The pipeline for generating practice materials using LLMs.

Evaluation Highlights

HypoCompass generates high-quality training materials (bugs, fixes, tests) more than 4x faster than human Teaching Assistants (about 15 vs. 71 minutes per problem set)
Students using HypoCompass improved their debugging performance by 12% from pre-test to post-test
Students reduced their task completion time by 14% after training with the system

Breakthrough Assessment

7/10

Strong application of LLMs to solve a specific pedagogical bottleneck (debugging practice). The role-reversal design is clever and the efficiency gains over human material generation are significant.

⚙️ Technical Details

Problem Definition

Setting: Interactive environment for practicing hypothesis construction in debugging

Inputs: Programming problem description, reference solution, and student-provided test cases/explanations

Outputs: LLM-generated buggy code, immediate feedback on student hypotheses, and revised code

Pipeline Flow

Material Generation (Offline): Problem Description → LLM Over-generation → Selection Algorithm → Buggy Code/Hints
Student Interaction (Online): Student writes Test Cases → LLM Agent presents Buggy Code → Student explains Bug → LLM Agent attempts fix

System Modules

Material Generator

Create buggy programs, explanations, and test hints from a reference solution

Model or implementation: gpt-3.5-turbo (general), gpt-4 (explanations)

Office Hour Queue Simulator (Interaction Interface)

Interface where students select 'students' (agents) to help

Model or implementation: N/A (UI Component)

LLM-Agent (Student Simulator) (Interaction Interface)

Simulate a novice student responding to the user's debugging advice

Model or implementation: gpt-3.5-turbo

Novel Architectural Elements

Integration of an 'over-generate-then-select' pipeline for educational material generation that uses clustering (Agglomerative Hierarchical Clustering) and distance metrics to ensure pedagogical diversity
Translation-based prompting for bug fixing: reframing bug fixing as 'old code → new code' translation to prevent the LLM from over-fixing code beyond the specific user instruction
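
The selection half of 'over-generate-then-select' can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: each candidate is summarized by a hypothetical pass/fail vector over the test suite, the distance metric is Hamming distance, the linkage is average linkage, and one representative is kept per cluster. This summary names Agglomerative Hierarchical Clustering but does not specify the metric, linkage, or representative choice, so those are placeholders. The clustering is written in plain Python to keep the example self-contained.

```python
from itertools import combinations

# Hypothetical pass/fail vectors: one entry per test case (1 = pass, 0 = fail).
# In a real pipeline these would come from running each over-generated
# candidate against the reference test suite.
candidates = {
    "bug_a": (1, 0, 0, 1),
    "bug_b": (1, 0, 0, 1),  # behaves exactly like bug_a
    "bug_c": (0, 1, 1, 1),
    "bug_d": (0, 1, 1, 0),
    "bug_e": (1, 1, 0, 0),
}


def hamming(u, v):
    """Number of test cases on which two candidates behave differently."""
    return sum(a != b for a, b in zip(u, v))


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between members of two clusters."""
    total = sum(hamming(vectors[a], vectors[b]) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(vectors, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in sorted(vectors)]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


def select_diverse(vectors, n_clusters):
    """Keep one representative (alphabetically first) per cluster."""
    return sorted(min(cluster) for cluster in agglomerate(vectors, n_clusters))


selected = select_diverse(candidates, n_clusters=3)
print(selected)
```

Candidates that fail the same tests end up in the same cluster, so keeping one per cluster yields a small set of behaviorally distinct bugs for students to practice on.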

Modeling

Base Model: gpt-3.5-turbo for most tasks; gpt-4 for explanation generation

Compute: Not reported in the paper

Comparison to Prior Work

vs. Manual Authoring: HypoCompass is 4.67x faster at generating materials
vs. Standard LLM usage: HypoCompass constrains the LLM to be a 'confused student' rather than an expert tutor, forcing the human to do the cognitive work of evaluation

Limitations

Relies on the availability of a correct reference solution and test suite
Behaviorally distinct bugs are selected via test outputs, which may not capture all semantic nuances
Evaluation was conducted with a relatively small sample size (19 students)
Limited to Python programming tasks in the current implementation

Reproducibility

Code: http://tinyurl.com/hypocompass-sup

Prompts are provided in the supplemental materials (Table 3), and the annotation codebook is discussed. The code link above is a tinyurl that points to the supplements.

📊 Experiments & Results

Evaluation Setup

Pre-test / Post-test user study with novice programmers

Benchmarks:

Custom Python Problems (Debugging and Test Case Generation) [New]

Metrics:

Test Score (Comprehensive & Accurate Hypothesis Construction)
Completion Time
Material Generation Success Rate
Material Generation Time

Statistical methodology:

Inter-rater reliability (IRR) using Cohen's Kappa for material quality
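
For readers unfamiliar with the statistic, Cohen's Kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators. The ratings are hypothetical and only illustrate the calculation; they are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical quality ratings of ten generated materials by two annotators.
annotator_1 = ["ok", "ok", "ok", "bad", "ok", "ok", "bad", "ok", "ok", "ok"]
annotator_2 = ["ok", "ok", "ok", "bad", "ok", "bad", "bad", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this made-up example the annotators agree on 9 of 10 items (p_o = 0.9) while chance agreement is 0.62, giving a kappa of about 0.74.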

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Material Audit	Success Rate	100%	90%	-10%
Time Log	Minutes per Problem Set	71	15	-56
Pre/Post Test	Score Improvement	Not reported in the paper	Not reported in the paper	+12%
Pre/Post Test	Completion Time	Not reported in the paper	Not reported in the paper	-14%
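
The time row of the table can be checked with a few lines of arithmetic. The two minute values are taken from the table; everything derived from them is a plain calculation on those rounded figures, so it may differ slightly from speedup figures quoted elsewhere in this summary, which presumably rest on unrounded timings (an assumption).

```python
# Minutes per problem set, as listed in the Key Results table.
human_minutes = 71
system_minutes = 15

minutes_saved = human_minutes - system_minutes       # absolute difference
speedup = human_minutes / system_minutes             # how many times faster
relative_reduction = minutes_saved / human_minutes   # fraction of time saved

print(minutes_saved, round(speedup, 2), round(relative_reduction, 2))
```

On the rounded values this gives 56 minutes saved, a speedup of roughly 4.7x, and about 79% less time per problem set.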

Experiment Figures

Visualization of the 'Over-generate-then-select' strategy.

Main Takeaways

LLMs can take over much of the human effort in generating debugging exercises when using 'over-generate-then-select' strategies.
The 'Teachable Agent' role (the human teaches the LLM) significantly improved students' debugging scores (by 12%) and reduced their completion time (by 14%) from pre-test to post-test.
Framing bug fixing as a translation task (Old->New) mitigates the LLM's tendency to 'over-fix' code, reducing errors by 70%.
Hierarchical disentanglement in prompts (prioritizing bug extraction over word limits) improves generation success by over 40%.
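
The translation framing mentioned above can be illustrated with a small prompt builder. The function name, prompt wording, and example inputs are hypothetical; the actual prompts are in the paper's supplemental materials and may look quite different. The point of the sketch is only the shape of the request: the model receives old code plus one instruction and is asked to emit new code that changes nothing else.

```python
def build_translation_prompt(old_code: str, instruction: str) -> str:
    """Frame a bug fix as an 'old code -> new code' translation.

    The wording below is an illustrative assumption, not the paper's prompt.
    """
    return (
        "Translate the old code into new code by applying ONLY the change "
        "described in the instruction. Leave every other line unchanged, "
        "even if it looks wrong.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Old code:\n{old_code}\n\n"
        "New code:\n"
    )


# Hypothetical buggy snippet and a student's (TA's) fix instruction.
old_code = "def count_vowels(text):\n    return sum(ch in 'aeiou' for ch in text)"
instruction = "Lowercase the input before checking for vowels."
prompt = build_translation_prompt(old_code, instruction)
print(prompt)
```

Ending the prompt with the `New code:` cue nudges the model to produce a rewritten program rather than an open-ended repair, which is the behavior the translation framing is meant to encourage.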

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of CS1 (Introductory Computer Science) concepts
Familiarity with Unit Testing and Test Cases
Understanding of Large Language Models as generative tools

Key Terms

CS1: Introductory Computer Science course (Computer Science 1)

Teachable Agent: An educational technology framework where the student learns by teaching a computer agent

Hypothesis Construction: The cognitive process of formulating explanations for why a program is behaving incorrectly

LLM-chain: Decomposing a complex LLM task into sequential sub-tasks handled by separate prompts

Over-generate-then-select: A prompting strategy where the LLM generates many candidates (e.g., buggy programs), and an algorithm selects the best subset based on diversity or quality metrics

Test Suite: A collection of test cases (inputs and expected outputs) used to verify that a program functions correctly