
Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities

Arjun Krishna, Erick Galinkin, Leon Derczynski, Jeffrey Martin
NVIDIA
arXiv (2025)
Factuality Benchmark

📝 Paper Summary

Hallucination suppression · Factuality in Code Generation · Software Supply Chain Security
The paper measures package hallucination, in which LLMs generate code dependencies that do not exist, across multiple languages and models. It finds that larger models and models with higher coding benchmark scores hallucinate less.
Core Problem
LLMs frequently hallucinate software package names that do not exist, creating security vulnerabilities where attackers can register these names to distribute malware (supply chain attacks).
Why it matters:
  • Attackers can register hallucinated package names with malicious code, compromising developers who blindly trust LLM suggestions
  • Millions of developers use open-source repositories (NPM, PyPI, crates.io), making these repositories high-value targets for supply chain attacks via typosquatting or hallucination exploitation
  • Current coding models are not optimized for security, often prioritizing plausible-sounding code over factual dependency verification
Concrete Example: A developer asks for Python code to store passwords safely. The LLM suggests `import securehashlib` (a non-existent package). A malicious actor, having anticipated this, has already registered `securehashlib` on PyPI with malware. When the developer runs `pip install securehashlib`, their system is compromised.
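The scenario above suggests a simple pre-install check: list the modules that generated code imports and flag any that are not already known. The following is a minimal sketch of that idea, not a method from the paper. The names `imported_top_level_names`, `unknown_imports`, and the `KNOWN` allowlist are hypothetical, and the allowlist stands in for a real lookup against a registry or a vetted dependency list.

```python
import ast


def imported_top_level_names(source: str) -> set[str]:
    """Return the top-level module names imported by Python source code."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as `from . import x`
            names.add(node.module.split(".")[0])
    return names


def unknown_imports(source: str, known: set[str]) -> set[str]:
    """Return imported names that are absent from the `known` allowlist."""
    return imported_top_level_names(source) - known


# Hypothetical allowlist (assumption): in practice this would come from a
# vetted dependency list or a registry query, not a hard-coded set.
KNOWN = {"hashlib", "os", "bcrypt"}

generated = "import os\nimport securehashlib\nfrom hashlib import sha256\n"
flagged = unknown_imports(generated, KNOWN)
print(sorted(flagged))  # ['securehashlib']
```

One caveat: an import name is not always the name of the distribution installed with `pip` (for example, `yaml` is provided by `PyYAML`), so a real check would need a mapping between the two. A name existing on the registry also does not make it safe, since in the scenario above the attacker has already registered it.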
Key Novelty
Cross-language Package Hallucination Analysis & Defensive Heuristics
  • Systematic measurement of 'Package Hallucination Rate' (PHR) across Python, JavaScript, and Rust for both general-purpose and coding-specialized LLMs
  • Identification of an inverse correlation between scores on a standard coding benchmark (HumanEval) and Package Hallucination Rate, proposing benchmark scores as a proxy for security risk
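This summary does not give the exact definition of PHR. As an illustration only, the sketch below assumes PHR is the fraction of model responses that reference at least one package missing from a known-package set. The function name, the toy responses, and the package names `fastjsonx` and `tokio-lite-utils` are hypothetical.

```python
def package_hallucination_rate(responses: list[set[str]], known: set[str]) -> float:
    """Fraction of responses that mention at least one package not in `known`.

    Assumption: this per-response definition is illustrative and may differ
    from the definition used in the paper.
    """
    if not responses:
        return 0.0
    hallucinated = sum(
        1 for packages in responses if any(p not in known for p in packages)
    )
    return hallucinated / len(responses)


# Toy data (made up): each set holds the packages named in one model response.
known_packages = {"requests", "numpy", "serde"}
responses = [
    {"requests"},
    {"numpy", "fastjsonx"},   # hypothetical non-existent package
    {"serde"},
    {"tokio-lite-utils"},     # hypothetical non-existent package
]
phr = package_hallucination_rate(responses, known_packages)
print(phr)  # 0.5
```

Counting per response rather than per package means a response with several hallucinated names counts once, which is a design choice of this sketch.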
Evaluation Highlights
  • Large models (≥70B parameters) show statistically significantly lower hallucination rates than smaller models (p=0.00028)
  • Strong inverse correlation found between HumanEval benchmark performance and Package Hallucination Rate (ρ=-0.7887), offering a heuristic for model selection
  • JavaScript consistently exhibits lower hallucination rates (mean PHR ~13-20%) compared to Python (mean ~26%) and Rust (mean ~25%) across tested models
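To make the correlation highlight concrete, the sketch below computes a rank correlation between benchmark scores and PHR values. It assumes ρ denotes a rank correlation such as Spearman's, which this summary does not state. The five data points are invented for illustration and are not results from the paper.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up illustrative values, NOT data from the paper.
benchmark_scores = [30, 45, 60, 75, 90]
phr_values = [0.40, 0.35, 0.20, 0.25, 0.10]
rho = spearman(benchmark_scores, phr_values)
print(round(rho, 4))
```

A strongly negative value on real data would mean that models ranked higher on the benchmark tend to rank lower on PHR, which is the relationship the highlight describes.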
Breakthrough Assessment
7/10
Provides the first comprehensive analysis of package hallucination across multiple languages and models. While the defensive strategies are heuristics rather than architectural fixes, the correlation with HumanEval is a valuable practical finding.