Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities

📝 Paper Summary

Software Vulnerability Analysis, LLM Prompt Engineering

VSP is a prompting strategy that guides LLMs to reason about vulnerabilities by explicitly mapping 'vulnerability semantics'—critical statements and their data/control flow context—into chain-of-thought steps.

Core Problem

Deep learning approaches for vulnerability analysis suffer from data scarcity and poor generalization, while standard LLM prompting fails because models get distracted by irrelevant code or lack structured reasoning.

Why it matters:

Software vulnerabilities cause critical financial losses and data breaches, with over 22,000 reported in 2023 alone
Existing DL tools perform poorly on real-world code (low recall/precision) due to lack of high-quality labeled datasets
Purely static analysis suffers from high false alarms, while dynamic analysis is limited by input coverage

Concrete Example: In a Use-After-Free scenario, a standard model might miss the vulnerability if it simply scans code line-by-line. VSP forces the model to first locate the pointer usage (line 8), then trace back the specific data flow (line 2) and control flow (line 6) that cause the error, ignoring irrelevant lines.
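
A minimal Python sketch of this idea follows; it is not taken from the paper. The C function `process` is a hypothetical snippet arranged so that its line numbers match the description above, and `semantic_slice` is an illustrative helper name. It only shows what isolating the relevant lines looks like, not how an LLM or analyzer would find them.

```python
# Hypothetical C snippet (not from the paper), arranged so the line numbers
# match the Use-After-Free description: use on line 8, data flow from line 2,
# control flow through line 6.
C_SNIPPET = """\
void process(int flag) {
    char *buf = malloc(16);
    if (buf == NULL)
        return;
    strcpy(buf, "data");
    if (flag)
        free(buf);
    printf("%s\\n", buf);
}"""


def semantic_slice(code: str, line_numbers: list[int]) -> list[str]:
    """Return only the requested 1-indexed lines, each prefixed with its number."""
    lines = code.splitlines()
    return [f"{n}: {lines[n - 1].strip()}" for n in line_numbers]


# Order mirrors the reasoning: the use site first, then the data flow and
# control flow that lead to it.
for entry in semantic_slice(C_SNIPPET, [8, 2, 6]):
    print(entry)
```

In this hypothetical snippet the branch on line 6 guards the `free` on line 7, so when `flag` is set the pointer defined on line 2 is already freed by the time line 8 uses it. The other lines are the kind of context the reasoning is meant to skip.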

Key Novelty

Vulnerability-Semantics-guided Prompting (VSP)

Redefines Chain-of-Thought for code by using 'vulnerability semantics'—the specific subset of code (vulnerable statements + relevant control/data dependencies) that accounts for the flaw
Constructs few-shot exemplars where the 'reasoning' step explicitly isolates these semantic elements before outputting a verdict, preventing the LLM from processing the entire code block uniformly

Architecture

The unified workflow of VSP showing how vulnerability semantics guide the prompting for three different tasks

Evaluation Highlights

+553.3% relative improvement in F1 score for vulnerability identification on the real-world CVE dataset compared to the best baseline
Found 22 true zero-day vulnerabilities in real-world software with 40.00% accuracy, compared to only 9 found by standard prompting
Achieved 97.65% F1 on synthetic patching tasks and 20.00% on real-world patching, outperforming all non-CoT baselines

Breakthrough Assessment

8/10

Specializing generic CoT to code semantics raises identification F1 on the real-world CVE dataset from 8.96 to 58.48. The identification of 22 zero-days validates practical utility.

⚙️ Technical Details

Problem Definition

Setting: Software Vulnerability Analysis across three sub-tasks: Identification, Discovery, and Patching

Inputs: Source code snippet (text)

Outputs: Binary label (Identification), CWE ID (Discovery), or Patched code snippet (Patching)
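
The three output types could be handled by a small parser like the one below. This is a sketch under stated assumptions: the task names mirror the sub-tasks above, but the answer formats (a leading "yes" for a vulnerable verdict, a `CWE-<number>` token in the text) are illustrative and not the paper's exact output format. `parse_answer` is a hypothetical helper.

```python
import re


def parse_answer(task: str, text: str):
    """Map a raw model answer to the output type of each sub-task.

    The answer formats handled here are assumptions for illustration.
    """
    text = text.strip()
    if task == "identification":
        # Binary label: treat a leading "yes" as "vulnerable".
        return text.lower().startswith("yes")
    if task == "discovery":
        # CWE ID such as "CWE-416"; None when no ID is present.
        match = re.search(r"CWE-\d+", text)
        return match.group(0) if match else None
    if task == "patching":
        # The patched code snippet is returned unchanged.
        return text
    raise ValueError(f"unknown task: {task}")
```

For example, `parse_answer("discovery", "The code has CWE-416 (use after free).")` returns the string `"CWE-416"`.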

Pipeline Flow

Prompt Construction (Few-shot selection + VSP Formatting)
LLM Inference (Reasoning generation + Final Answer)
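
The two stages above can be sketched as follows. This is an illustration, not the paper's implementation: the `Exemplar` fields, the `Q:`/`A:`/`Therefore,` layout, and the function names are assumptions, and `llm` stands for any text-in/text-out callable (for instance a wrapper around a hosted or local model).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Exemplar:
    code: str       # source code sample
    reasoning: str  # step-by-step analysis over the vulnerability semantics
    answer: str     # final verdict for the sample


def build_prompt(exemplars: list[Exemplar], question: str, target_code: str) -> str:
    """Stage 1: format the few-shot exemplars, then append the unanswered target."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {question}\n{ex.code}\nA: {ex.reasoning}\nTherefore, {ex.answer}"
        )
    parts.append(f"Q: {question}\n{target_code}\nA:")
    return "\n\n".join(parts)


def run_pipeline(llm: Callable[[str], str], exemplars: list[Exemplar],
                 question: str, target_code: str) -> str:
    """Stage 2: send the prompt to the model and return its reasoning and answer."""
    return llm(build_prompt(exemplars, question, target_code))
```

Passing the model as a callable keeps the sketch independent of any particular API, which matches the summary's point that the same prompts are used with several LLMs.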

System Modules

Prompt Constructor

Retrieves few-shot exemplars and structures the input prompt

Model or implementation: N/A (Rule-based)

Vulnerability Reasoner

Generates the chain of thought and final prediction

Model or implementation: GPT-3.5-Turbo / Llama-2-7b-chat-hf / Falcon-7b-instruct

Novel Architectural Elements

Integration of static analysis concepts (control/data flow dependencies) directly into the natural language 'thought' structure of CoT prompts

Modeling

Base Model: GPT-3.5-Turbo (primary), Llama-2-7b-chat-hf, Falcon-7b-instruct

Compute: Experiments run on AMD Ryzen Threadripper Pro 5995WX (64 Cores), 4x Nvidia GeForce RTX A6000 GPU, 512GB memory

Comparison to Prior Work

vs. Deep Learning: VSP does not require training/fine-tuning or large datasets
vs. Standard Prompting: VSP forces the model to identify specific code dependencies before answering, reducing hallucinations
vs. Naive CoT: VSP focuses only on relevant slices (vulnerability semantics) rather than explaining every line, reducing context noise
vs. PEARL [not cited in paper]: PEARL also uses CoT for vulnerability repair but focuses on performance-guided refinement rather than semantics-guided initial analysis

Limitations

Performance drops significantly when code context is insufficient (e.g., missing external function definitions)
LLMs still occasionally miss important control flow facts or data flow facts during reasoning
Manual construction of high-quality VSP exemplars requires expert knowledge
Evaluation limited to C/C++ and 5 specific CWE types

Reproducibility

Code and datasets are stated to be available on Figshare (the URL is not explicitly listed in the text). The method uses public models (OpenAI API, HuggingFace). The 20 exemplar pairs are manually constructed.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting (20 exemplars) on three distinct vulnerability tasks

Benchmarks:

SARD (Synthetic vulnerability dataset)
CVE Dataset (Fan et al.) (Real-world vulnerability dataset)

Metrics:

F1 score
Accuracy
Statistical methodology: Not explicitly reported in the paper
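
For reference, the two metrics follow their standard definitions, sketched below. The function names and the example counts are made up for illustration and are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(correct: int, total: int) -> float:
    """Fraction of predictions that are correct."""
    return correct / total if total else 0.0


# Made-up counts: 8 true positives, 2 false positives, 2 false negatives.
print(round(f1_score(8, 2, 2), 2))
print(accuracy(9, 10))
```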

Key Results

Vulnerability identification: VSP's largest F1 gain over the baseline is on real-world CVE data.
Benchmark	Metric	Baseline	This Paper	Δ
CVE Dataset	F1	8.96	58.48	+49.52
SARD Dataset	F1	56.28	65.29	+9.01

Vulnerability discovery, patching, and zero-day discovery: VSP also leads the baseline on real-world cases.
Benchmark	Metric	Baseline	This Paper	Δ
CVE Dataset	F1 (Discovery)	33.16	45.25	+12.09
CVE Dataset	F1 (Patching)	15.29	20.00	+4.71
Zero-day Discovery	Count of True Positives	9	22	+13
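
The Δ column reports absolute gains in F1 points, while the evaluation highlights quote a relative improvement. The short sketch below recomputes both from the reported F1 scores; the `deltas` helper is illustrative. From these rounded values the CVE identification gain works out to about 552.7% relative, slightly below the 553.3% quoted in the highlights; the gap may reflect rounding in the reported scores.

```python
# (baseline, VSP) F1 scores copied from the results reported above.
results = {
    "CVE identification": (8.96, 58.48),
    "SARD identification": (56.28, 65.29),
    "CVE discovery": (33.16, 45.25),
    "CVE patching": (15.29, 20.00),
}


def deltas(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = ours - baseline
    return round(absolute, 2), round(100 * absolute / baseline, 1)


for name, (baseline, ours) in results.items():
    absolute, relative = deltas(baseline, ours)
    print(f"{name}: +{absolute} points, +{relative}% relative")
```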

Experiment Figures

Concrete examples of VSP prompts for different tasks, illustrating how reasoning steps are formatted

Main Takeaways

VSP outperforms both standard prompting and naive CoT prompting, particularly on real-world CVE data where context is complex
The method generalizes across different LLMs (GPT-3.5-Turbo, Llama-2, Falcon), though GPT-3.5-Turbo generally performs best
Qualitative analysis shows 'insufficient code context' is the primary failure mode for both identification and patching
The approach is capable of discovering zero-day vulnerabilities, highlighting its utility beyond just reproducing known benchmarks

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Basic knowledge of software vulnerabilities (CWEs)
Familiarity with static analysis concepts (control flow, data flow)

Key Terms

VSP: Vulnerability-Semantics-guided Prompting—the paper's proposed method that structures prompts around code dependencies relevant to vulnerabilities

CWE: Common Weakness Enumeration—a community-developed list of software and hardware weakness types

CVE: Common Vulnerabilities and Exposures—a list of publicly disclosed computer security flaws

vulnerability semantics: The subset of program dependencies (data flow and control flow) that causally account for a specific vulnerable behavior

CoT: Chain-of-Thought—a prompting technique where the model is shown intermediate reasoning steps to improve complex problem solving

SARD: Software Assurance Reference Dataset—a synthetic dataset commonly used for testing vulnerability detection tools

zero-day: A vulnerability that is unknown to the software vendor or public, meaning no patch exists yet