We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs

📝 Paper Summary

Code Generation Security Factuality and Hallucination

The paper characterizes 'package hallucinations'—where code-generating LLMs recommend non-existent software packages—as a critical security threat enabling supply chain attacks, and demonstrates effective mitigation via RAG and fine-tuning.

Core Problem

Code-generating LLMs frequently hallucinate non-existent package names in their output. Attackers can exploit this by publishing malicious packages with these exact names to open-source repositories.

Why it matters:

Creates a direct vector for 'package confusion attacks' where developers inadvertently install malware recommended by trusted AI tools
Traditional typosquatting defenses fail because the erroneous name comes from the user's own tool (the LLM) rather than from a mistyped version of an existing package name
Existing research focuses on natural language hallucinations, leaving the specific security implications of code dependency hallucinations underexplored

Concrete Example: A user asks an LLM for Python code to solve a specific task. The LLM generates code importing a fictitious library `ar-python-tools`. An attacker, having predicted this hallucination, has already published a malicious package named `ar-python-tools` on PyPI. The user runs `pip install ar-python-tools`, compromising their system.

Key Novelty

Systematic Characterization and Mitigation of Package Hallucinations

Conducts the first large-scale measurement study (576,000 code samples) of package hallucinations across 16 commercial and open-source models
Identifies specific hallucination behaviors, such as 'cross-language hallucinations' (recommending a JS package for Python code) and the impact of 'deleted packages'
Evaluates mitigation strategies specifically for this threat model, finding that Retrieval-Augmented Generation (RAG) significantly outperforms self-correction methods
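
The 'cross-language' behavior listed above can be illustrated with a small classifier. This is a sketch under stated assumptions, not the paper's method: the `PYPI` and `NPM` sets are hypothetical miniature stand-ins for registry snapshots, and `classify` is an illustrative helper name.

```python
# Hypothetical miniature registry snapshots; real ones hold far more names.
PYPI = {"requests", "numpy", "flask"}
NPM = {"express", "lodash", "axios"}


def classify(name: str, target: str) -> str:
    """Classify a package name recommended for `target` ("python" or "javascript")."""
    own, other = (PYPI, NPM) if target == "python" else (NPM, PYPI)
    if name in own:
        return "valid"
    if name in other:
        # The name is real, but only in the other ecosystem's registry.
        return "cross-language hallucination"
    return "hallucination"


for pkg in ("requests", "lodash", "ar-python-tools"):
    print(pkg, "->", classify(pkg, "python"))
```

A name that exists only in the other ecosystem is still unsafe to install in the target one, since an attacker could register it there.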

Architecture

Overview of the Package Hallucination Attack Scenario

Evaluation Highlights

Commercial models hallucinate packages in at least 5.2% of generated code, while open-source models hallucinate in 21.7% of cases on average
Identified 205,474 unique hallucinated package names across the study
RAG mitigation reduces package hallucinations by 64.9% (relative improvement) while maintaining code quality, outperforming supervised fine-tuning and self-detection

Breakthrough Assessment

7/10

While the concept of package hallucination was previously known, this is the first rigorous, large-scale empirical study quantifying the threat across many models and evaluating specific mitigations. It establishes a critical baseline for code generation security.

⚙️ Technical Details

Problem Definition

Setting: Code generation given a natural language prompt, specifically analyzing the validity of external software dependencies (packages) imported in the output

Inputs: Natural language programming question or prompt (from Stack Overflow or similar)

Outputs: Source code (Python or JavaScript) containing import statements

Pipeline Flow

Prompt Generation (Stack Overflow + Curated)
Code Generation (LLM Inference)
Hallucination Detection (Repository Cross-referencing)

System Modules

Prompt Generator

Create diverse coding prompts

Model or implementation: Script-based extraction

Code Generator

Generate code solutions from prompts

Model or implementation: 16 different LLMs (GPT-4, GPT-3.5, Claude, CodeLlama, DeepSeek Coder, etc.)

Hallucination Detector

Identify non-existent packages

Model or implementation: Rule-based script

Novel Architectural Elements

Pipeline specifically designed for large-scale stress-testing of package hallucinations using real-world Stack Overflow data
Integration of temporal analysis (questions post-dating training data) to isolate recency bias in hallucinations
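
The temporal split described above can be sketched as a simple partition by posting date. The cutoff date, the sample prompts, and the helper name `split_by_cutoff` are all hypothetical; the real cutoff depends on the model under test.

```python
from datetime import date

# Hypothetical training cutoff; the real value depends on the model under test.
TRAINING_CUTOFF = date(2023, 1, 1)

# Hypothetical prompts as (question text, posting date) pairs.
prompts = [
    ("How do I parse a YAML file?", date(2021, 6, 3)),
    ("How do I call a newly released API?", date(2023, 9, 14)),
]


def split_by_cutoff(items, cutoff):
    """Partition prompts into those posted before the cutoff and those on or after it."""
    before = [q for q, posted in items if posted < cutoff]
    after = [q for q, posted in items if posted >= cutoff]
    return before, after


before, after = split_by_cutoff(prompts, TRAINING_CUTOFF)
print(len(before), "prompts pre-date the cutoff;", len(after), "post-date it")
```

Comparing hallucination rates between the two groups is what would separate recency effects from general hallucination behavior.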

Modeling

Base Model: Evaluation covers 16 models including GPT-4, GPT-3.5-Turbo, Claude-3, CodeLlama (7B-70B), DeepSeek Coder, StarCoder2

Training Method: Supervised Fine-Tuning (SFT) and Retrieval Augmented Generation (RAG) applied as mitigation strategies

Adaptation: LoRA used for SFT experiments (rank=32, alpha=64)

Trainable Parameters: Not reported for the commercial baselines; SFT applied to open-source models (e.g., CodeLlama)

Training Data:

SFT Dataset: 5,000 samples of 'clean' code (verified valid packages) generated by GPT-4

Key Hyperparameters:

temperature: Evaluated range [0.1, 1.0]
top_p: 0.95 (common default)
learning_rate: 2e-4 (for SFT)
batch_size: 128 (for SFT)
epochs: 3 (for SFT)
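
For reference, the reported SFT settings can be gathered into one configuration object. This is a sketch: the class name `SFTSettings` and its field names are illustrative, `batch_size` is assumed to be the effective batch size, and the step count assumes the 5,000-sample SFT dataset with the final partial batch kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTSettings:
    """Values as reported in the summary; field names are illustrative."""
    lora_rank: int = 32
    lora_alpha: int = 64
    learning_rate: float = 2e-4
    batch_size: int = 128
    epochs: int = 3
    dataset_size: int = 5000

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def steps_per_epoch(self) -> int:
        # Assumes the last, smaller batch is not dropped.
        return math.ceil(self.dataset_size / self.batch_size)


cfg = SFTSettings()
steps_per_epoch = cfg.steps_per_epoch
total_steps = steps_per_epoch * cfg.epochs
print(cfg.lora_scaling, steps_per_epoch, total_steps)
```

Under these assumptions the LoRA scaling factor is 2, and training runs for 40 optimizer steps per epoch, 120 in total.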

Compute: Not reported in the paper

Comparison to Prior Work

vs. Lanyado: Much larger scale (576k samples vs. a small test), broader model selection (open-source and closed), and rigorous mitigation evaluation. Lanyado estimated rates roughly 5x higher, likely due to a smaller sample size or different prompting.
vs. Liu et al.: Focuses specifically on the security threat of *package* hallucinations rather than general code syntax/logic errors.
vs. He et al. (Package Hunter) [not cited in paper]: Focuses on how LLMs generate hallucinated package names rather than on detecting malicious packages already in the registry.

Limitations

Relies on the existence of packages in PyPI/npm at the time of scanning; packages created/deleted during the study window could introduce noise.
Does not execute the code to verify functional correctness beyond package existence.
Mitigation strategies evaluated primarily on open-source models due to cost/access constraints of fine-tuning commercial APIs.

Reproducibility

Code: https://github.com/Spracks/PackageHallucination

The code is publicly available at the repository above, which contains the generated datasets (Stack Overflow prompts), the code for the hallucination detection pipeline, and the generated code samples. Weights for the commercial models are not available.

📊 Experiments & Results

Evaluation Setup

Generation of Python and JavaScript code from natural language prompts derived from Stack Overflow questions.

Benchmarks:

Stack Overflow Dataset (Real-world Q&A Code Generation) [New]
Augmented Stack Overflow (Temporal) (Code Generation with Recency Split) [New]

Metrics:

Hallucination Rate (HR)
Pass@1 (Code Quality/Executability - implied via BLEU/CodeBLEU in mitigation section)
Package Recurrence
Statistical methodology: Not explicitly reported in the paper
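
The Hallucination Rate metric above is the share of generated samples that reference at least one non-existent package. A minimal sketch, assuming each sample has already been reduced to the set of package names it references; the function name, the `known` set, and the sample data (including the made-up name `fastjsonx`) are hypothetical:

```python
def hallucination_rate(samples: list[set[str]], known: set[str]) -> float:
    """Percentage of samples referencing at least one package not in `known`."""
    if not samples:
        return 0.0
    # A sample counts once, however many unknown packages it references.
    flagged = sum(1 for pkgs in samples if pkgs - known)
    return 100 * flagged / len(samples)


known = {"requests", "numpy"}
samples = [{"requests"}, {"numpy", "ar-python-tools"}, set(), {"fastjsonx"}]
print(hallucination_rate(samples, known))  # 50.0
```

Because the metric is counted per sample, it can differ substantially from the fraction of all recommended package names that are hallucinated.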

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Prevalence of package hallucinations across model types (Baseline: commercial-model average; This Paper: open-source-model average).
Stack Overflow Dataset	Hallucination Rate (Avg, %)	5.2	21.7	+16.5
Stack Overflow Dataset	Unique Hallucinations (count)	0	205,474	+205,474
Effectiveness of mitigation strategies (tested on CodeLlama-7B; Baseline: unmitigated model; one row per strategy).
Stack Overflow Dataset	Hallucination Rate (%)	25.6	9.0	-16.6
Stack Overflow Dataset	Hallucination Rate (%)	25.6	18.4	-7.2
Stack Overflow Dataset	Hallucination Rate (%)	25.6	22.1	-3.5
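
The Δ column above reports absolute changes in percentage points, whereas the headline RAG figure is a relative reduction. The conversion is simple arithmetic, sketched here with an illustrative helper name:

```python
def relative_reduction(baseline: float, mitigated: float) -> float:
    """Relative reduction in percent: the drop as a share of the baseline."""
    return (baseline - mitigated) / baseline * 100


baseline = 25.6  # unmitigated hallucination rate from the table, in percent
for mitigated in (9.0, 18.4, 22.1):
    pct = relative_reduction(baseline, mitigated)
    print(f"{baseline} -> {mitigated}: {pct:.1f}% relative reduction")
```

The three rows work out to roughly 64.8%, 28.1%, and 13.7%. The first is consistent with the reported 64.9% once rounding of the table values is taken into account.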

Experiment Figures

Hallucination rates across different temperatures for various models.

Main Takeaways

Package hallucinations are a systemic issue, appearing in both commercial (5.2%) and open-source (21.7%) models, posing a real security risk.
RAG is the most effective mitigation strategy (reducing hallucinations by ~65%), suggesting that lack of up-to-date knowledge is a primary driver of these errors.
High temperature settings increase the frequency of package hallucinations, confirming a trade-off between creativity and security.
Models often exhibit 'cross-language' hallucinations, recommending valid Python packages when generating JavaScript code and vice-versa.
Simple self-correction prompting is largely ineffective for this specific problem, likely because the model lacks the external knowledge to verify existence.
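
To make the RAG takeaway concrete, the sketch below prepends retrieved valid package names to the prompt. It is not the paper's retriever: simple word overlap stands in for whatever retrieval method the authors used, and `PACKAGE_INDEX`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import re

# Hypothetical index of valid package names with one-line descriptions.
PACKAGE_INDEX = {
    "requests": "send http requests and handle responses",
    "beautifulsoup4": "parse html and xml documents",
    "pandas": "load and analyze tabular data in dataframes",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank packages by word overlap between the question and each description."""
    q = tokenize(question)
    ranked = sorted(
        PACKAGE_INDEX,
        key=lambda name: len(q & tokenize(PACKAGE_INDEX[name])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Prepend retrieved package names so the model can ground its imports."""
    packages = ", ".join(retrieve(question))
    return f"Known valid packages: {packages}\nQuestion: {question}"


print(build_prompt("How do I parse HTML from a page?"))
```

The point of the design is that the model sees names known to exist in the registry, supplying the external knowledge that self-correction prompting lacks.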

📚 Prerequisite Knowledge

Prerequisites

Understanding of software dependency management (pip, npm)
Basic knowledge of LLM code generation
Familiarity with software supply chain attacks

Key Terms

package hallucination: When an LLM generates code importing a software library or package that does not actually exist in the official repository (e.g., PyPI, npm)

package confusion attack: A supply chain attack where a user is tricked into installing a malicious package that has a name similar to or identical to a legitimate or expected package

typosquatting: Registering a package name that is a common misspelling of a popular package to capture accidental downloads

RAG: Retrieval-Augmented Generation—providing the LLM with relevant, retrieved external data (here, valid package lists or documentation) to ground its generation

SFT: Supervised Fine-Tuning—retraining a model on a specific dataset to improve its performance on a target task

hallucination rate: The percentage of generated code samples that contain at least one non-existent package reference

PyPI: Python Package Index—the official third-party software repository for Python

npm: Node Package Manager—the default package manager and repository for the JavaScript runtime environment Node.js