Library Hallucinations in LLMs: Risk Analysis Grounded in Developer Queries

📝 Paper Summary

Code Generation, Hallucination Detection, Software Security

The paper investigates how realistic developer prompt variations—specifically temporal constraints and minor typos—trigger LLMs to confidently hallucinate non-existent software libraries, exposing users to security risks.

Core Problem

LLMs frequently hallucinate external libraries and APIs during code generation, but the impact of realistic user behaviors (natural language variation, typos, time-based requests) on these error rates is unknown.

Why it matters:

Hallucinated imports break builds and waste developer time on debugging dependencies that do not exist
Malicious actors can exploit frequent hallucinations via 'slopsquatting' (registering fake packages) to compromise software supply chains
Existing evaluations use broad aggregate metrics, missing fine-grained triggers like typos that could amplify exposure to typosquatting attacks

Concrete Example: When a developer asks for a library 'from 2025', models like GPT-4o-mini hallucinate invalid libraries in 53.79% of responses. Similarly, a simple typo like 'numpio' instead of 'numpy' causes GPT-5-mini to use the fake library in 25.86% of tasks instead of correcting it.

Key Novelty

Systematic Stress-Testing of Library Hallucinations under Realistic Noise

Simulates authentic developer intent by extracting common descriptors (e.g., 'lightweight', 'modern') from StackExchange to test their effect on model fidelity
Evaluates robustness to user errors by injecting controlled misspellings and fake names, revealing that models often 'sycophantically' accept fake libraries rather than correcting the user
Identifies a specific vulnerability to time-based prompts, where requesting recent libraries triggers massive hallucination spikes even within the model's knowledge window
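
The controlled misspellings mentioned above can be illustrated with a minimal sketch. The summary does not state how the paper generates its typos, so the single-letter substitution rule and the function name `one_char_typos` are assumptions made for illustration only.

```python
import string


def one_char_typos(name: str) -> set[str]:
    """Return every variant of `name` with exactly one letter substituted.

    Assumption: a "one-character typo" is modelled as a single lowercase
    substitution; insertions, deletions and swaps are not covered here.
    """
    variants = set()
    for i, original in enumerate(name):
        for letter in string.ascii_lowercase:
            if letter != original:
                variants.add(name[:i] + letter + name[i + 1:])
    return variants


typos = one_char_typos("pandas")
print("panfas" in typos)  # the summary's own example is one substitution away
print(sorted(typos)[:5])
```

Enumerating every variant, rather than sampling one at random, keeps the sketch deterministic; a real benchmark would presumably pick a small, fixed subset.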

Architecture

Figure: the evaluation pipeline, showing how prompts are varied and how outputs are verified.

Evaluation Highlights

Time-related prompts are highly dangerous: requesting a library 'from 2025' triggers hallucinations in up to 84.74% of tasks (GPT-4o-mini)
Models are vulnerable to sycophancy: fake library names are accepted and used in up to 99% of tasks by GPT-5-mini
Simple one-character typos (e.g., 'panfas') cause hallucinations in up to 25.86% of tasks (GPT-5-mini), amplifying typosquatting risks

Breakthrough Assessment

7/10

Provides crucial empirical evidence connecting LLM behavior to security risks (slopsquatting). While not proposing a new architecture, the systematic analysis of prompt fragility is highly valuable for safety research.

⚙️ Technical Details

Problem Definition

Setting: Python code generation given a natural language prompt containing a specific library directive or constraint

Inputs: Natural language prompt P containing a task description and a library directive (e.g., 'using a fast library', 'using [FakeLib]')

Outputs: Python code block C containing import statements and function calls

Pipeline Flow

Prompt Generation (templates + user constraints)
LLM Inference (generate 3 responses per prompt)
Code Extraction (Regex & AST parsing)
Hallucination Detection (PyPI lookup & Documentation verification)
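
The code extraction step in the flow above could look roughly like the following. The repository ships its own regex patterns, which are not reproduced in this summary, so the pattern below and the name `extract_code_blocks` are hypothetical stand-ins.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence

# Hypothetical pattern: an opening fence with an optional language tag,
# then the shortest possible body up to the next closing fence.
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(response: str) -> list[str]:
    """Return the body of every fenced code block in an LLM response."""
    return [body.strip() for body in CODE_BLOCK.findall(response)]


response = f"Here you go:\n{FENCE}python\nimport numpy as np\nprint(np.pi)\n{FENCE}\nDone."
for block in extract_code_blocks(response):
    print(block)
```

Responses that contain no fenced block yield an empty list, so a caller would need to decide whether to treat the whole response as code or to discard it.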

System Modules

Prompt Generator

Constructs queries by combining BigCodeBench tasks with specific variations (StackExchange descriptors, typos, fake names)

Model or implementation: Template-based injection
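
A minimal sketch of template-based injection, assuming a single sentence template. The template wording, the task text, and the function name `build_prompts` are illustrative assumptions; the directive phrases echo examples given in this summary.

```python
def build_prompts(task: str, directives: dict[str, str]) -> dict[str, str]:
    """Pair one task description with each library directive.

    The template sentence is hypothetical, not the paper's exact wording.
    """
    return {
        label: f"{task.strip()} Write the solution in Python using {directive}."
        for label, directive in directives.items()
    }


directives = {
    "descriptor": "a lightweight library",
    "temporal": "a library from 2025",
    "typo": "the panfas library",
    "fake": "the [FakeLib] library",  # placeholder kept as written in the summary
}

for label, prompt in build_prompts("Compute the mean of a CSV column.", directives).items():
    print(f"{label}: {prompt}")
```

Holding the task fixed and varying only the directive is what lets any change in hallucination rate be attributed to the directive itself.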

Code Generator

Generates Python code solutions based on the prompt

Model or implementation: 7 evaluated LLMs (GPT-4o-mini, GPT-5-mini, Claude-4.5-Haiku, Llama-3.3, etc.)

AST Parser (Analysis)

Extracts library imports and member calls from generated code

Model or implementation: Python ast module
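
As a rough illustration of import extraction with the standard `ast` module, the sketch below collects top-level module names only. The function name `imported_libraries` is hypothetical, and member-call extraction, which the paper also performs, is left out.

```python
import ast


def imported_libraries(code: str) -> set[str]:
    """Return the top-level names of all modules imported by `code`."""
    libraries = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import os.path" refers to the top-level package "os"
                libraries.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level > 0 marks a relative import, which is not an external library
            libraries.add(node.module.split(".")[0])
    return libraries


code = "import numpy as np\nfrom sklearn.linear_model import LinearRegression\n"
print(sorted(imported_libraries(code)))
```

`ast.parse` raises `SyntaxError` on malformed code, so a pipeline built this way would need a fallback (for example a regex) for responses that do not parse.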

Validator (Analysis)

Verifies existence of libraries and members

Model or implementation: PyPI API + Documentation Scraper
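
One way a library-existence check might be written is against PyPI's JSON endpoint, which returns HTTP 404 for unknown projects. This is a sketch under assumptions: the summary does not describe the paper's validator in detail, and the helper names `http_status` and `exists_on_pypi` are hypothetical. The status lookup is injectable so the example runs offline.

```python
import urllib.error
import urllib.request
from typing import Callable

PYPI_URL = "https://pypi.org/pypi/{name}/json"


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def exists_on_pypi(name: str, get_status: Callable[[str], int] = http_status) -> bool:
    """True if PyPI serves metadata for a project called `name`."""
    return get_status(PYPI_URL.format(name=name)) == 200


# Offline stand-in for the network call (an assumption for demonstration only).
known = {PYPI_URL.format(name="numpy")}
offline = lambda url: 200 if url in known else 404
print(exists_on_pypi("numpy", offline), exists_on_pypi("numpio", offline))
```

Two caveats apply to any check of this kind: an import name can differ from the PyPI project name (`sklearn` is distributed as `scikit-learn`), and standard-library modules are not on PyPI, so they would need to be excluded first, for example via `sys.stdlib_module_names` on Python 3.10 and later.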

Modeling

Base Model: Evaluation of 7 models: GPT-4o-mini, GPT-5-mini, Ministral-8B, Qwen2.5-Coder, Llama-3.3, DeepSeek-V3.1, Claude-4.5-Haiku

Training Method: No training performed (inference-only evaluation)

Comparison to Prior Work

vs. Spracklen et al.: Focuses on *triggers* (typos, time, descriptors) rather than just base rates
vs. Krishna et al.: Analyzes impact of user error/misspelling, revealing sycophancy risks
vs. Park: Provides empirical data on how LLMs amplify slopsquatting/typosquatting via compliance with bad prompts

Limitations

Limited to Python libraries only; results might differ for JavaScript (npm) or Java (Maven)
Focuses only on library name/member existence, not semantic correctness of usage
Member hallucination detection is conservative (ignores responses with valid version strings to avoid false positives from deprecated APIs)
Prompt engineering strategies tested are simple baselines, not exhaustive optimizations

Reproducibility

Code: https://github.com/itsluketwist/realistic-library-hallucinations (publicly available)

The repository includes the benchmark suite (LibraryHalluBench) derived from BigCodeBench, evaluation scripts, and regex patterns for extraction. Exact model API versions used are listed in Appendix A.

📊 Experiments & Results

Evaluation Setup

Zero-shot code generation on 356 Python tasks derived from BigCodeBench (ODEX-based), filtered to ensure external library usage.

Benchmarks:

LibraryHalluBench (Python code generation with specific constraints) [New]

Metrics:

Response Hallucination Rate (RHR): % of responses with a hallucination
Task Hallucination Rate (THR): % of tasks with at least one hallucinated response
Response Usage Rate (RUR): % of responses utilizing the suggested (potentially fake/typo'd) library
Statistical methodology: Not explicitly reported in the paper
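
The response-level and task-level rates defined above can be computed as in the following sketch. The input layout (task id mapped to one boolean per response) and the function name `hallucination_rates` are assumptions; the toy data is invented purely to show the arithmetic.

```python
def hallucination_rates(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (RHR, THR) as percentages.

    `results` maps a task id to one flag per response (True = hallucinated).
    """
    responses = [flag for flags in results.values() for flag in flags]
    rhr = 100 * sum(responses) / len(responses)
    thr = 100 * sum(any(flags) for flags in results.values()) / len(results)
    return rhr, thr


# Toy data: 4 tasks with 3 responses each, matching the 3-responses-per-prompt setup.
toy = {
    "task_1": [True, False, False],
    "task_2": [False, False, False],
    "task_3": [True, True, True],
    "task_4": [False, False, False],
}
rhr, thr = hallucination_rates(toy)
print(f"RHR = {rhr:.2f}%, THR = {thr:.2f}%")  # 4 of 12 responses, 2 of 4 tasks
```

When every task has the same number of responses, THR can never be lower than RHR, because a single hallucinated response is enough to count the whole task. RUR follows the same response-level formula with the flag meaning "used the suggested library".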

Key Results

All values are percentages; Δ is the difference in percentage points.

Experiment 1 tests how realistic user descriptions (extracted from StackExchange) affect hallucination rates. Adjectives have little effect, but temporal prompts ('from 2025') cause massive failures.
Benchmark	Metric	Baseline	This Paper	Δ
LibraryHalluBench	Task Hallucination Rate (THR)	0.00	84.74	+84.74
LibraryHalluBench	Task Hallucination Rate (THR)	0.00	0.62	+0.62

Experiment 2 tests robustness to user errors (typos and fake names). Models show high sycophancy, accepting fake libraries instead of correcting them.
Benchmark	Metric	Baseline	This Paper	Δ
LibraryHalluBench	Task Usage Rate (TUR)	100.00	25.86	-74.14
LibraryHalluBench	Task Usage Rate (TUR)	100.00	99.22	-0.78
LibraryHalluBench	Task Usage Rate (TUR)	99.38	20.72	-78.66

Experiment 3 tests mitigation via prompt engineering. Reasoning prompts often backfire.
Benchmark	Metric	Baseline	This Paper	Δ
LibraryHalluBench	Response Hallucination Rate (RHR)	53.79	60.33	+6.54
LibraryHalluBench	Response Hallucination Rate (RHR)	53.79	40.60	-13.19

Main Takeaways

Models exhibit 'sycophancy' towards library names: they overwhelmingly accept fake library names (up to 99%) rather than correcting the user, amplifying supply chain risks.
Temporal grounding is broken: asking for libraries 'from 2025' causes massive hallucination spikes (in up to 84.74% of tasks), likely because models fabricate plausible-sounding 'modern' libraries when they lack recent knowledge.
Reasoning prompts are not a silver bullet: Chain-of-Thought and Step-Back prompting often increased hallucination rates (e.g., +15% for Qwen), suggesting they reinforce incorrect premises rather than verifying facts.
Code-specific models (Qwen2.5-Coder) are sometimes less sycophantic than larger general-purpose models (GPT-5-mini), rejecting fake library names more often.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Python packaging (PyPI, pip)
Understanding of LLM code generation
Basic knowledge of software supply chain security

Key Terms

Hallucination: In this context, generating code that imports a non-existent library (name hallucination) or calls a non-existent function from a valid library (member hallucination)

Slopsquatting: A supply chain attack where malicious actors register packages with names frequently hallucinated by LLMs to compromise developers who copy-paste code

Typosquatting: Registering packages with names very similar to popular libraries (e.g., 'requests' vs 'request') to catch users who make typing errors
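
To make "very similar" concrete, name similarity is commonly measured with Levenshtein edit distance. The sketch below is a generic illustration, not something taken from the paper; the function names and the distance threshold of 1 are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def near_misses(candidate: str, popular: list[str], max_distance: int = 1) -> list[str]:
    """Popular names that `candidate` almost, but not exactly, matches."""
    return [name for name in popular if 0 < edit_distance(candidate, name) <= max_distance]


print(near_misses("request", ["requests", "numpy", "pandas"]))
print(near_misses("panfas", ["requests", "numpy", "pandas"]))
```

An exact match is deliberately excluded (distance 0), since a correctly spelled name is not a near miss.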

Sycophancy: The tendency of an LLM to agree with or adopt the user's premise (even if incorrect), such as using a fake library name just because the user requested it

Chain-of-thought: A prompting strategy instructing the model to generate intermediate reasoning steps before the final answer ('Let's think step by step')

RAG: Retrieval-Augmented Generation—enhancing model outputs by retrieving relevant documents (e.g., documentation) from an external knowledge base

AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code, used here to parse imports and function calls