
We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs

Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, Murtuza Jadliwala
University of Texas at San Antonio, University of Oklahoma
arXiv (2024)
Factuality Benchmark RAG

📝 Paper Summary

Code Generation Security | Factuality and Hallucination
The paper characterizes 'package hallucinations', in which code-generating LLMs recommend non-existent software packages, as a critical security threat that enables supply chain attacks, and shows that RAG and fine-tuning can mitigate it.
Core Problem
Code-generating LLMs frequently hallucinate non-existent package names in their output. Attackers can exploit this by publishing malicious packages with these exact names to open-source repositories.
Why it matters:
  • Creates a direct vector for 'package confusion attacks' where developers inadvertently install malware recommended by trusted AI tools
  • Traditional typosquatting defenses fail because the 'typo' is generated by the user's own tool (the LLM), not the attacker
  • Existing research focuses on natural language hallucinations, leaving the specific security implications of code dependency hallucinations underexplored
Concrete Example: A user asks an LLM for Python code to solve a specific task. The LLM generates code importing a fictitious library `ar-python-tools`. An attacker, having predicted this hallucination, has already published a malicious package named `ar-python-tools` on PyPI. The user runs `pip install ar-python-tools`, compromising their system.
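One defensive reading of the example above is to check every package name an LLM asks you to install before running the command. The following is a minimal sketch, not the paper's method. It assumes install commands appear on their own line in the LLM output, and it uses a hypothetical in-memory set `KNOWN_PACKAGES` as a stand-in for a real snapshot of the registry index. The helper names are illustrative. Note that a name being present on the registry does not prove it is safe, since the attack works precisely by registering the hallucinated name.

```python
import re

# Hypothetical stand-in for a vetted snapshot of registry package names.
# A real check would load this from a trusted source; that step is not modeled.
KNOWN_PACKAGES = {"requests", "numpy", "pandas"}

# Assumption: the install command sits on its own line in the LLM output.
PIP_INSTALL = re.compile(r"pip3?\s+install\s+([^\n]+)")


def normalize(name):
    """Lowercase and collapse runs of '-', '_', '.' into one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requested_packages(llm_output):
    """Collect package names from `pip install ...` lines in the text."""
    names = []
    for match in PIP_INSTALL.finditer(llm_output):
        for token in match.group(1).split():
            token = token.strip("`'\"")
            if token.startswith("-"):
                continue  # skip flags; flags that take arguments are not handled
            # Drop version specifiers, extras, and environment markers.
            base = re.split(r"[=<>!~\[;]", token, maxsplit=1)[0]
            if base:
                names.append(normalize(base))
    return names


def unknown_packages(llm_output, known=KNOWN_PACKAGES):
    """Return requested names that are absent from the known set."""
    known_norm = {normalize(k) for k in known}
    return [n for n in requested_packages(llm_output) if n not in known_norm]


sample = "pip install requests ar-python-tools==1.0\nimport requests"
flagged = unknown_packages(sample)
```

Here `flagged` contains only `ar-python-tools`, the name from the example, because `requests` is in the known set. Names flagged this way would be held for manual review rather than installed automatically.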
Key Novelty
Systematic Characterization and Mitigation of Package Hallucinations
  • Conducts the first large-scale measurement study (576,000 code samples) of package hallucinations across 16 commercial and open-source models
  • Identifies specific hallucination behaviors, such as 'cross-language hallucinations' (recommending a JS package for Python code) and the impact of 'deleted packages'
  • Evaluates mitigation strategies specifically for this threat model, finding that Retrieval-Augmented Generation (RAG) significantly outperforms self-correction methods
Evaluation Highlights
  • On average, at least 5.2% of the packages recommended by commercial models are hallucinated, compared with 21.7% for open-source models
  • Identifies 205,474 unique hallucinated package names across the study
  • RAG mitigation reduces package hallucinations by 64.9% (relative improvement) while maintaining code quality, outperforming supervised fine-tuning and self-detection
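The two kinds of figures above, a hallucination rate and a relative reduction, can be made concrete with a short sketch. This is an illustration of the arithmetic only, under the assumption that the rate is the share of recommended package names missing from a reference set. The function names and all input values below are hypothetical and are not data from the paper.

```python
def hallucination_rate(recommended, known):
    """Share of recommended package names that are absent from `known`."""
    if not recommended:
        return 0.0
    missing = sum(1 for name in recommended if name not in known)
    return missing / len(recommended)


def relative_reduction(baseline_rate, mitigated_rate):
    """Relative drop from baseline: (baseline - mitigated) / baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - mitigated_rate) / baseline_rate


# Hypothetical inputs, chosen only to show the calculation.
known = {"requests", "numpy"}
recommended = ["requests", "ar-python-tools", "numpy", "fake-pkg"]
rate = hallucination_rate(recommended, known)   # 2 of 4 names are unknown
reduction = relative_reduction(0.20, 0.05)      # roughly 0.75
```

A relative reduction compares the mitigated rate to the baseline rate, so a 64.9% relative reduction means the mitigated rate is 35.1% of the baseline rate, not that the rate falls by 64.9 percentage points.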
Breakthrough Assessment
7/10
While the concept of package hallucination was previously known, this is the first rigorous, large-scale empirical study quantifying the threat across many models and evaluating specific mitigations. It establishes a critical baseline for code generation security.