CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

📝 Paper Summary

Code Generation Hallucination Detection

The paper introduces the concept of code hallucination—code that is syntactically correct but functionally flawed—and provides a taxonomy, a dynamic detection algorithm, and a benchmark to evaluate LLMs.

Core Problem

LLMs often generate code that is syntactically correct and semantically plausible but fails to execute as expected or meet requirements, a phenomenon distinct from simple syntax errors.

Why it matters:

Code value is only realized upon successful execution and testing; NL hallucination definitions do not directly apply to executable code
Hallucinated code can trigger runtime errors or functional defects, hindering reliable deployment in automated software development
Current benchmarks focus on pass rates (performance) rather than systematically categorizing and quantifying the *types* of non-syntax errors (hallucinations)

Concrete Example: A model might generate code that repeats itself indefinitely (e.g., calling the same function over and over) because its logic has collapsed, until it eventually hits the token limit. Although this surfaces as a SyntaxError or a timeout, the root cause is a 'Logic Hallucination', not a simple slip such as an undefined variable name.
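A simpler case of the same gap between syntax and behavior can be sketched in Python. The function `running_max`, its test case, and the helper names are hypothetical and not taken from the paper; the point is only that a syntax check passes while an execution check fails.

```python
import ast

# Hypothetical model output: it parses cleanly, but the logic is wrong
# (the maximum is seeded with 0, so all-negative inputs return 0).
GENERATED = '''
def running_max(values):
    best = 0
    for v in values:
        if v > best:
            best = v
    return best
'''

def is_syntactically_valid(source: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def passes_test(source: str, func_name: str, args, expected) -> bool:
    """Execute the source and compare one call's result to the expected value."""
    namespace = {}
    exec(source, namespace)  # acceptable only for trusted, illustrative code
    return namespace[func_name](*args) == expected

syntax_ok = is_syntactically_valid(GENERATED)
test_ok = passes_test(GENERATED, "running_max", ([-3, -1, -7],), -1)
print(syntax_ok, test_ok)  # True False
```

A static check alone would accept this output; only running it against a test case exposes the defect.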

Key Novelty

Execution-based Code Hallucination Taxonomy & Detection

Defines 'code hallucination' specifically as code that may be syntactically correct but fails verification, distinguishing it from simple code errors
Categorizes hallucinations into four main types (Mapping, Naming, Resource, Logic) using a two-stage heuristic approach
Uses a statistical induction method (CodeHalu) that executes code across multiple iterations to identify persistent failure patterns

Architecture

The CodeHalu detection process: Validation -> Identification -> Construction.

Evaluation Highlights

Average cross-task occurrence rate of hallucination categories is low (2.04%), which supports treating the categories in the proposed taxonomy as largely independent
Even for Gemma-7B (which exhibited severe hallucinations), only 1.07% of task samples showed cross-task hallucinations
Evaluation of 16 LLMs on 105,958 samples showed an exceptionally low average syntax error rate of 0.0020, indicating that nearly all generated code is syntactically valid and that failures stem mainly from other causes, i.e., potential hallucinations
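As a rough illustration of how a syntax error rate like the one above could be measured, the sketch below compiles each sample without running it. The function name and the four sample strings are assumptions for illustration; the paper's own measurement procedure is not described in this summary.

```python
def syntax_error_rate(samples):
    """Fraction of samples that fail to compile as Python source."""
    if not samples:
        return 0.0
    failures = 0
    for source in samples:
        try:
            # Compiling checks syntax only; nothing is executed here.
            compile(source, "<generated>", "exec")
        except SyntaxError:
            failures += 1
    return failures / len(samples)

samples = [
    "print('ok')",
    "def f(x):\n    return x + 1",
    "def g(:\n    pass",        # malformed signature: a syntax error
    "y = undefined_name + 1",   # compiles fine, fails only at run time
]
rate = syntax_error_rate(samples)
print(rate)  # 0.25
```

The last sample shows why a low syntax error rate says little about correctness: it compiles but would raise NameError when executed.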

Breakthrough Assessment

7/10

Establishes a necessary formal definition and taxonomy for code hallucinations, distinct from NLP hallucinations. The benchmark and detection algorithm are valuable tools, though the approach is primarily diagnostic.

⚙️ Technical Details

Problem Definition

Setting: Code generation from natural language descriptions with execution-based verification

Inputs: Problem description Q and test cases (input ip, expected output op)

Outputs: Generated code solution GC

Pipeline Flow

Instruction & Problem -> Code Generation (via LLM)
Execution Verification (run against test cases)
State Detection (identify stuttering, loops, errors)
Pattern Induction (frequency analysis across iterations)
Classification (categorize into 4 types: Mapping, Naming, Resource, Logic)
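The execution and state-detection steps of this flow can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `solve` entry point, the state labels (`pass`, `wrong_output`, exception names), and the canned candidates standing in for LLM output are all assumptions.

```python
def detect_states(source, test_cases, entry_point="solve"):
    """Run `source` on each (input, expected) pair; return one state label per test."""
    states = []
    for test_input, expected in test_cases:
        namespace = {}
        try:
            exec(source, namespace)  # trusted, illustrative code only
            actual = namespace[entry_point](test_input)
        except Exception as exc:
            # Record the exception class name as the execution state.
            states.append(type(exc).__name__)
            continue
        states.append("pass" if actual == expected else "wrong_output")
    return states

# Canned strings standing in for code generated by an LLM (hypothetical).
candidates = {
    "correct": "def solve(x):\n    return x * 2",
    "wrong": "def solve(x):\n    return x + 2",
    "crash": "def solve(x):\n    return undefined_helper(x)",
}
tests = [(1, 2), (2, 4), (3, 6)]

report = {name: detect_states(src, tests) for name, src in candidates.items()}
for name, states in report.items():
    print(name, states)
```

Note that the `wrong` candidate passes one of the three tests by coincidence, which is one reason to aggregate states over several test cases and generations before drawing conclusions.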

System Modules

Generator

Generate code candidates based on problem description

Model or implementation: Various LLMs (e.g., GPT-3.5, Llama-2, etc.)

Executor

Execute generated code against test cases to detect failures

Model or implementation: Python execution environment

Detector (CodeHalu)

Aggregates execution results to identify persistent hallucination patterns

Model or implementation: Statistical induction algorithm
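One way to picture the aggregation step is a frequency count over the states observed across repeated generations. The sketch below is an assumption-laden illustration: the 50% threshold, the state labels, and the function name are hypothetical, and the paper's actual induction rule is not specified in this summary.

```python
from collections import Counter

def induce_patterns(runs, min_share=0.5):
    """Return failure states seen in at least `min_share` of the runs, with their share."""
    total = len(runs)
    if total == 0:
        return {}
    counts = Counter(state for state in runs if state != "pass")
    return {
        state: n / total
        for state, n in counts.items()
        if n / total >= min_share
    }

# Sixteen hypothetical generation attempts for one problem.
runs = ["IndexError"] * 9 + ["pass"] * 4 + ["wrong_output"] * 3
patterns = induce_patterns(runs)
print(patterns)  # {'IndexError': 0.5625}
```

Here `IndexError` counts as a persistent pattern, while the occasional wrong output falls below the assumed threshold and is treated as noise.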

Novel Architectural Elements

Active hallucination detection pipeline: integrates an execution feedback loop directly into the analysis process rather than relying on passive post-hoc text analysis
Two-stage heuristic classification system mapping execution states to 4 specific hallucination categories
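A mapping from execution states to the four categories could look like the lookup below. This is only an illustration built from the examples given in this summary (index access, missing modules, memory exhaustion, wrong results); the specific exception-to-category assignments are assumptions, and the sketch does not reproduce the paper's two-stage procedure.

```python
from collections import Counter

# Illustrative lookup only; the paper's full mapping may differ.
CATEGORY_BY_STATE = {
    "IndexError": "Mapping",
    "KeyError": "Mapping",
    "NameError": "Naming",
    "ModuleNotFoundError": "Naming",
    "MemoryError": "Resource",
    "RecursionError": "Resource",
    "wrong_output": "Logic",
}

def classify(state):
    """Map one execution state to a category; passing runs have none."""
    if state == "pass":
        return None
    return CATEGORY_BY_STATE.get(state, "Unclassified")

def category_shares(states):
    """Share of each category among the failing states."""
    labels = [classify(s) for s in states if s != "pass"]
    counts = Counter(labels)
    return {category: n / len(labels) for category, n in counts.items()}

states = ["pass", "IndexError", "NameError", "wrong_output",
          "IndexError", "FloatingPointError"]
shares = category_shares(states)
print(shares)
```

States that the lookup does not cover are reported as `Unclassified` rather than forced into a category, which keeps edge cases visible for manual review.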

Modeling

Base Model: 17 popular LLMs evaluated (the full list is not enumerated in this summary; it includes Gemma-7B)

Compute: Not reported in the paper

Comparison to Prior Work

vs. HumanEval/APPS: Focuses on classifying *why* code fails (hallucinations) rather than only whether it passes
vs. NLP methods: Uses 'active' strategy (execution verification) rather than passive text analysis, acknowledging that code must be executable
vs. SWE-bench [not cited in paper]: Focuses on atomic code generation hallucinations rather than repository-level issue resolving

Limitations

Analysis primarily focused on Python programming language
Relies on the availability of test cases for execution verification
Classification is heuristic-based and may require consensus for edge cases

Reproducibility

Code: https://github.com/yuchen814/CodeHalu

The benchmark and code are publicly available in the repository above. The paper details the classification taxonomy and the definition of hallucinations.

📊 Experiments & Results

Evaluation Setup

Execution-based verification of generated code against test cases with resource constraints
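A time limit is the simplest resource constraint to enforce. The sketch below runs a candidate in a fresh interpreter with a timeout; the function name, status labels, and limits are assumptions, and memory limits (which the paper's setup may also apply) are not covered here.

```python
import subprocess
import sys

def run_with_limits(source, stdin_text, timeout_s=5.0):
    """Run `source` in a separate Python process; return (status, stdout)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "runtime_error"
    return status, proc.stdout

ok = run_with_limits("print(int(input()) * 2)", "21\n")
hung = run_with_limits("while True:\n    pass", "", timeout_s=1.0)
crashed = run_with_limits("import module_that_does_not_exist_xyz", "")
print(ok[0], hung[0], crashed[0])
```

Running each candidate in its own process keeps an infinite loop or crash in the generated code from taking down the evaluation harness itself.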

Benchmarks:

CodeHaluEval (Code Generation & Hallucination Detection) [New]

Metrics:

Hallucination Occurrence Rate
Syntax Error Rate
Cross-task Occurrence Rate

Statistical methodology: Statistical induction based on frequency of patterns across >15 generation attempts per sample

Key Results

Benchmark	Metric	This Paper
Internal Test (105,958 samples)	Average Syntax Error Rate	0.0020
CodeHaluEval	Average Cross-task Occurrence Rate	2.04%
CodeHaluEval	Survey Rationality Rating	91.08

Experiment Figures

Taxonomy of Code Hallucinations

Distinction between Code Error and Code Hallucination

Main Takeaways

Code generated by LLMs is almost always syntactically correct (0.0020 error rate), meaning failures are usually 'hallucinations' (mapping, naming, resource, or logic issues) rather than simple grammar errors.
Code hallucinations can be distinctly categorized into Mapping, Naming, Resource, and Logic types, with low overlap between categories.
The proposed active detection strategy (execution-based) is necessary because code plausibility (looking correct) does not equal code correctness (running correctly).

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM code generation
Basic software testing concepts (unit tests, execution constraints)
Familiarity with hallucination in NLP

Key Terms

Code Hallucination: Code generated by LLMs that is syntactically correct or semantically plausible but fails to execute as expected or meet requirements

Code Error: Specific subset of hallucinations referring to issues that cause a program to stop executing (e.g., NameError)

Mapping Hallucinations: Ambiguity in mapping data types/values (e.g., accessing non-existent array indices)

Naming Hallucinations: Memory-related issues regarding variable/module names (e.g., importing non-existent modules)

Resource Hallucinations: Lack of perception regarding resource consumption (e.g., memory overflow, infinite loops)

Logic Hallucinations: Discrepancies between expected/actual results or logical breakdown (e.g., generating chaos/gibberish)

CodeHalu: A dynamic detection algorithm that uses statistical induction based on execution validation to identify hallucination patterns

CodeHaluEval: A benchmark proposed in this paper containing 8,883 samples from 699 tasks to evaluate code hallucinations