Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities

📝 Paper Summary

Hallucination suppression, Factuality in Code Generation, Software Supply Chain Security

The paper detects and measures package hallucination (LLMs generating non-existent code dependencies) across multiple languages and models, finding that larger models and those with better coding benchmark scores hallucinate less.

Core Problem

LLMs frequently hallucinate software package names that do not exist, creating security vulnerabilities where attackers can register these names to distribute malware (supply chain attacks).

Why it matters:

Attackers can register hallucinated package names with malicious code, compromising developers who blindly trust LLM suggestions
Millions of developers use open-source repositories (NPM, PyPI, crates.io), making them high-value targets for supply chain attacks via typosquatting or hallucination exploitation
Current coding models are not optimized for security, often prioritizing plausible-sounding code over factual dependency verification

Concrete Example: A developer asks for Python code to store passwords safely. The LLM suggests `import securehashlib` (a non-existent package). A malicious actor, having anticipated this, has already registered `securehashlib` on PyPI with malware. When the developer runs `pip install securehashlib`, their system is compromised.

Key Novelty

Cross-language Package Hallucination Analysis & Defensive Heuristics

Systematic measurement of 'Package Hallucination Rate' (PHR) across Python, JavaScript, and Rust for both general-purpose and coding-specialized LLMs
Identification of an inverse correlation between standard coding benchmarks (HumanEval) and package hallucination, proposing benchmark scores as a proxy for security risk

Evaluation Highlights

Large models (≥70B parameters) show statistically significantly lower hallucination rates than smaller models (p=0.00028)
Strong inverse correlation found between HumanEval benchmark performance and Package Hallucination Rate (ρ=-0.7887), offering a heuristic for model selection
JavaScript consistently exhibits lower hallucination rates (mean PHR ~13-20%) compared to Python (mean ~26%) and Rust (mean ~25%) across tested models

Breakthrough Assessment

7/10

Provides the first comprehensive analysis of package hallucination across multiple languages and models. While the defensive strategies are heuristics rather than architectural fixes, the correlation with HumanEval is a valuable practical finding.

⚙️ Technical Details

Problem Definition

Setting: Code generation tasks where the output includes import statements or dependency references

Inputs: Natural language prompt requesting code for a specific task (t) in a specific programming language (p)

Outputs: Generated code snippet containing package imports

Pipeline Flow

Prompt Generation (Task + Language Stub)
Model Inference (Generate Code)
Output Parsing (Extract Imports)
Verification (Check against Known-Good Package Lists)

System Modules

Prompt Generator

Combine request stubs (R) and coding tasks (T) for target languages (P)

Model or implementation: Deterministic template engine
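The combination step can be sketched as a Cartesian product over stubs, tasks, and languages. This is only an illustration under the assumption that a prompt is a stub with the language and task substituted in; the stub and task strings below are hypothetical placeholders, not the entries from the paper's Tables 6 and 7.

```python
from itertools import product

# Hypothetical stubs and tasks for illustration only; the paper's actual
# templates are listed in its Tables 6 and 7.
REQUEST_STUBS = [
    "Generate some {language} code to {task}",
    "Give examples of five {language} libraries which {task}",
]
TASKS = ["parse a YAML file", "store passwords safely"]
LANGUAGES = ["Python", "JavaScript", "Rust"]


def build_prompts(stubs, tasks, languages):
    """Return one prompt per (stub, task, language) combination."""
    return [
        stub.format(language=language, task=task)
        for stub, task, language in product(stubs, tasks, languages)
    ]


prompts = build_prompts(REQUEST_STUBS, TASKS, LANGUAGES)
print(len(prompts), "prompts generated")
```

Because the generator is deterministic, the same prompt set can be replayed against every model, which keeps per-model rates comparable.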

Target LLM

Generate code based on the input prompt

Model or implementation: Various (e.g., GPT-4o, Llama-3.1-70B, Qwen2.5-Coder)

Package Verifier

Check extracted imports against official repository databases

Model or implementation: Lookup table (Scraped PyPI, NPM, crates.io data)
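For Python output, the parse-and-verify step might look like the sketch below. It assumes the generated text is syntactically valid Python and that the known-good list is a plain set of names; `KNOWN` is a tiny hypothetical stand-in for a scraped registry snapshot plus standard-library module names, and the function names are illustrative rather than taken from the paper or from garak.

```python
import ast


def extract_imports(source: str) -> set[str]:
    """Collect top-level module names from import statements in Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which never name a registry package
            modules.add(node.module.split(".")[0])
    return modules


def find_hallucinated(source: str, known_packages: set[str]) -> set[str]:
    """Return imported names that are absent from the known-good list."""
    return extract_imports(source) - known_packages


# Hypothetical stand-in for a registry snapshot plus standard-library names.
KNOWN = {"hashlib", "os", "requests"}
generated = (
    "import os\n"
    "import securehashlib\n"
    "from requests.adapters import HTTPAdapter\n"
)
missing = find_hallucinated(generated, KNOWN)
print(missing)
```

Two caveats apply to this sketch: `ast.parse` raises `SyntaxError` when the model wraps code in prose, so real output usually needs the code block isolated first, and an import name does not always match its registry name (for example, `import yaml` is provided by the `PyYAML` distribution).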

Modeling

Base Model: Evaluation of 11 models including GPT-4o, Llama-3.1 (8B/70B), Qwen2.5-Coder, Granite-3.0, Mistral-Nemo

Limitations

Cannot analyze training data composition for closed models to confirm why certain languages hallucinate less
Sample size of 11 models may not be large enough to statistically confirm the difference between coding and general-purpose models
Analysis relies on scraped package data which captures a specific snapshot in time
The distinction between induced and natural hallucination is based on a limited number of explicit attempts

Reproducibility

No code is released with the paper. All prompt templates (request stubs and tasks) are listed in Tables 6 and 7. The evaluation relies on the public 'garak' framework, but the specific scripts for this paper are not linked. Package data was scraped from public repositories (PyPI, NPM, crates.io).

📊 Experiments & Results

Evaluation Setup

Generation of code for specific tasks across 3 languages (Python, JS, Rust), checked against package registries

Benchmarks:

Custom Package Hallucination Test (Code Generation & Dependency Verification) [New]
HumanEval (Python Code Generation)
MBPP (Python Code Generation)

Metrics:

Package Hallucination Rate (PHR)
Statistical methodology: Pearson correlation coefficient (ρ) and p-values reported for correlations between model size/benchmarks and PHR. T-test used for coding vs. general-purpose model comparison.
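As a reminder of what the reported coefficient measures, the Pearson correlation can be computed directly from paired observations. The sketch below uses made-up (HumanEval score, PHR) pairs purely for illustration; they are not the paper's measurements, and the function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: one (HumanEval score, PHR in percent) pair per model.
humaneval_scores = [30, 50, 70, 90]
phr_percent = [40, 30, 25, 5]
rho = pearson(humaneval_scores, phr_percent)
print(f"rho = {rho:.3f}")
```

A value near -1 means that higher benchmark scores go with lower hallucination rates; with only 11 models, as in the paper, a single outlier can shift the coefficient noticeably.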

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results showing PHR variations across different programming languages.
Custom Package Hallucination Test	Mean PHR (Python)	13.63	46.15	+32.52
Custom Package Hallucination Test	Mean PHR (Rust)	Not reported in the paper	24.74	Not reported in the paper
Results correlating model characteristics (size, benchmark performance) with hallucination rates.
HumanEval vs PHR	Pearson Correlation (ρ)	0	-0.7887	-0.7887
Model Size vs PHR	Pearson Correlation (ρ)	0	-0.542	-0.542
Custom Package Hallucination Test	PHR (Nemotron-Llama-3.1-70B)	24.40	0.22	-24.18

Experiment Figures

Box and whisker plot of Package Hallucination Rate (PHR) by programming language.

Scatter plot of Model Size (log scale) vs Package Hallucination Rate.

Scatter plot of HumanEval Score vs Package Hallucination Rate.

Main Takeaways

All models tested were vulnerable to package hallucination across all languages, though larger models (≥70B) are significantly more resistant
JavaScript prompts result in fewer hallucinations than Python or Rust, likely due to the massive size of the NPM ecosystem (3.4M packages) reducing the space of unregistered names
There is no statistically significant difference in hallucination rates between specialized 'coding' models and general-purpose models (after removing outliers like GPT-4o)
Coding benchmark scores (specifically HumanEval) serve as a strong proxy for package security; if a model scores well on HumanEval, it likely has a lower Package Hallucination Rate

📚 Prerequisite Knowledge

Prerequisites

Understanding of software supply chain attacks (typosquatting)
Familiarity with package managers (pip, npm, cargo)
Basic knowledge of LLM hallucination

Key Terms

PHR: Package Hallucination Rate, the proportion of prompts that result in at least one hallucinated package import
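Read literally, this definition counts prompts rather than packages: a response with three hallucinated imports contributes the same as a response with one. A minimal sketch, assuming each prompt's result has already been reduced to the set of hallucinated names found in its output (the package names here are invented):

```python
def package_hallucination_rate(per_prompt_hallucinations):
    """Fraction of prompts whose output contained at least one hallucinated package.

    per_prompt_hallucinations: list of sets, one per prompt, each holding the
    hallucinated package names found in that prompt's generated code.
    """
    if not per_prompt_hallucinations:
        raise ValueError("at least one prompt result is required")
    flagged = sum(1 for names in per_prompt_hallucinations if names)
    return flagged / len(per_prompt_hallucinations)


# Invented results for four prompts: two clean, two with hallucinations.
results = [set(), {"securehashlib"}, set(), {"fastjsonx", "rustyaml"}]
phr = package_hallucination_rate(results)
print(f"PHR = {phr:.0%}")
```

Multiplying the returned fraction by 100 gives the percentage form used in the results above.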

Package Hallucination: Instances where generated code imports external dependencies that do not exist in the official package repository or were registered after the model's knowledge cutoff
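The second clause of this definition means a plain membership test is not enough: a name that exists today still counts as hallucinated if it was registered after the model's knowledge cutoff. One way to express that check, assuming the registry snapshot records a first-registration date per package (the names and dates below are invented):

```python
from datetime import date


def is_hallucinated(name, registry, knowledge_cutoff):
    """True if the package is unknown or was first registered after the cutoff.

    registry: dict mapping package name -> date of first registration.
    """
    registered_on = registry.get(name)
    return registered_on is None or registered_on > knowledge_cutoff


# Invented registry snapshot and cutoff for illustration.
registry = {
    "alpha_pkg": date(2019, 5, 1),
    "beta_pkg": date(2024, 6, 1),
}
cutoff = date(2023, 12, 31)

for candidate in ("alpha_pkg", "beta_pkg", "gamma_pkg"):
    print(candidate, is_hallucinated(candidate, registry, cutoff))
```

Under these invented dates, `beta_pkg` is flagged even though it exists in the registry, because the model could not have seen it during training.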

Induced Hallucination: When an LLM is explicitly asked to generate code using a non-existent package or API, forcing it to hallucinate

Natural Hallucination: When an LLM spontaneously produces a non-existent package import without being specifically asked for it in the prompt

Typosquatting: A cyberattack method where attackers register domain names or packages with names very similar to popular ones, hoping users make a typo

HumanEval: A benchmark dataset for evaluating code generation capabilities of LLMs, consisting of Python coding problems

garak: An LLM vulnerability scanning framework used in this paper to orchestrate the generation and checking of prompts