Smart-LLaMA: Two-Stage Post-Training of Large Language Models for Smart Contract Vulnerability Detection and Explanation

📝 Paper Summary

Smart Contract Security | Vulnerability Detection | Explainable AI (XAI) for Code

Smart-LLaMA adapts a general LLM for smart contract security via domain-specific pre-training and fine-tuning on a novel dataset of code-explanation pairs to detect and explain vulnerabilities.

Core Problem

Existing smart contract vulnerability detection methods suffer from poor dataset quality (lacking explanations/locations), general LLMs' inability to grasp contract-specific semantics, and a lack of explainability in detection results.

Why it matters:

Smart contract vulnerabilities (e.g., reentrancy) cause massive financial losses (e.g., the $60M DAO incident), making accurate detection critical
Developers cannot easily fix bugs if tools only output binary labels without explaining 'why' or 'where', slowing down remediation
Current datasets are limited in scope and detail, causing models to overfit simplistic features rather than learning true vulnerability patterns

Concrete Example: When analyzing a contract whose `gotake()` function makes an external call, a general LLaMA-3.1-8B-Instruct model incorrectly flags it as a reentrancy vulnerability. It fails to recognize that no state changes occur after the call, so re-entering the function gives an attacker nothing to exploit. Smart-LLaMA correctly identifies the contract as safe because it has learned the state-change prerequisite for reentrancy.
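The distinction this example turns on can be illustrated with a toy Python model. This is only a sketch: it is not Solidity, it is not the paper's `gotake()` contract, and every name in it (`Vault`, `withdraw_unsafe`, `withdraw_safe`, `simulate`) is hypothetical. It shows why an external call followed by a state update is exploitable, while an external call with no state change after it is not.

```python
class Vault:
    """Toy stand-in for a contract that holds user balances (hypothetical)."""

    def __init__(self, deposits):
        self.balances = dict(deposits)
        self.paid_out = 0

    def withdraw_unsafe(self, user, on_receive):
        amount = self.balances.get(user, 0)
        if amount == 0:
            return
        self.paid_out += amount      # funds leave the vault
        on_receive()                 # external call: the callee may re-enter
        self.balances[user] = 0      # state change AFTER the call (exploitable)

    def withdraw_safe(self, user, on_receive):
        amount = self.balances.get(user, 0)
        if amount == 0:
            return
        self.balances[user] = 0      # state change BEFORE the call
        self.paid_out += amount
        on_receive()                 # nothing is written after this call


def simulate(method_name, max_reentries=2):
    """Run one withdrawal with a callback that re-enters the same method."""
    vault = Vault({"mallory": 10, "alice": 20})
    calls = {"n": 0}

    def on_receive():
        # Attacker callback: call back into the vault a bounded number of times.
        if calls["n"] < max_reentries:
            calls["n"] += 1
            getattr(vault, method_name)("mallory", on_receive)

    getattr(vault, method_name)("mallory", on_receive)
    return vault.paid_out


unsafe_total = simulate("withdraw_unsafe")  # 30: three payouts of 10 each
safe_total = simulate("withdraw_safe")      # 10: the re-entry sees a zero balance
```

In the unsafe variant the attacker is paid three times from a deposit of 10, because each re-entry still sees the old balance. In the safe variant the re-entry finds a zero balance and returns immediately, which is the situation a detector should label as not vulnerable.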

Key Novelty

Two-Stage Post-Training with Explanation-Guided Dataset

Constructs a high-quality dataset by using large teacher LLMs (Qwen2, Mistral) to generate explanations, filtering them via an LLM judge, and verifying with human experts
Applies 'Smart Contract-Specific Continual Pre-Training' on raw contract code to adapt the model to Solidity syntax and semantics before fine-tuning
Uses 'Explanation-Guided Fine-Tuning' where the model learns to output both a detection label and a detailed reasoning chain, bridging the gap between detection and understanding

Architecture

The overall workflow of Smart-LLaMA, illustrating the data collection, annotation, and two-stage training process

Evaluation Highlights

Outperforms state-of-the-art methods by +7.35% F1 score on Reentrancy vulnerabilities and +9.55% F1 on Delegatecall vulnerabilities
Achieves superior accuracy across all 4 vulnerability types, improving by +4.14% to +5.53% over previous best performers
Human experts rated Smart-LLaMA's explanations as 'perfect' (4/4) for correctness in 69.5% of cases, significantly higher than the baseline LLaMA-3.1-8B-Instruct

Breakthrough Assessment

8/10

Significant contribution in dataset quality (explanations+locations) and a robust pipeline combining domain adaptation with instruction tuning. Strong empirical gains on difficult vulnerability types.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of smart contracts for specific vulnerability types, accompanied by natural language generation of explanations

Inputs: Smart contract source code

Outputs: Label y (1=vulnerable, 0=safe) and a text explanation detailing vulnerability type, location, and impact

Pipeline Flow

Input Processing (Smart Contract Code)
Smart-LLaMA Inference (Vulnerability Detection & Explanation Generation)
Output Formatting (Label + Explanation)
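The three pipeline steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the prompt wording, the `Label:` / `Explanation:` output format, and the names `build_prompt`, `parse_output`, `detect`, and `fake_generate` are all hypothetical and are not taken from the paper. The `generate` argument stands in for whatever function calls the fine-tuned model.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class DetectionResult:
    label: int        # 1 = vulnerable, 0 = safe
    explanation: str  # free-text reasoning produced by the model


def build_prompt(source_code: str, vuln_type: str) -> str:
    # Hypothetical template; the paper's actual prompt is not reproduced here.
    return (
        f"Analyze the following smart contract for {vuln_type} vulnerabilities.\n"
        "Answer with 'Label: 0' or 'Label: 1', then 'Explanation: ...'.\n\n"
        f"{source_code}"
    )


def parse_output(text: str) -> DetectionResult:
    # Assumed output format: a 'Label: <0|1>' line and an 'Explanation:' section.
    label_match = re.search(r"Label:\s*([01])", text)
    if label_match is None:
        raise ValueError("no label found in model output")
    exp_match = re.search(r"Explanation:\s*(.*)", text, flags=re.DOTALL)
    explanation = exp_match.group(1).strip() if exp_match else ""
    return DetectionResult(int(label_match.group(1)), explanation)


def detect(source_code: str, vuln_type: str,
           generate: Callable[[str], str]) -> DetectionResult:
    # Input processing -> model inference -> output formatting.
    return parse_output(generate(build_prompt(source_code, vuln_type)))


def fake_generate(prompt: str) -> str:
    # Stub standing in for the fine-tuned model, used only for this demo.
    return "Label: 0\nExplanation: No state changes occur after the external call."


result = detect("contract C { }", "reentrancy", fake_generate)
```

Passing the model call in as a plain function keeps the parsing logic testable without loading any weights.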

System Modules

Smart-LLaMA

Analyze code semantics to detect vulnerabilities and generate human-readable explanations

Model or implementation: LLaMA-3.1-8B (Fine-tuned)

Modeling

Base Model: LLaMA-3.1-8B

Training Method: Two-Stage Training: Continual Pre-Training + Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Learn domain syntax during pre-training.

Formally: Standard Causal Language Modeling loss (minimize negative log-likelihood of next token prediction)

Purpose: Learn detection and explanation during fine-tuning.

Formally: Standard Cross-Entropy Loss on the generated sequence (explanation + label)
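Written out, the two objectives take the usual forms below. The notation is an assumption made for illustration and is not taken from the paper: x_1..x_T are the tokens of a raw contract, c is the fine-tuning prompt including the contract, and y is the target sequence (explanation plus label). Restricting the fine-tuning sum to target tokens only is a common convention, not something the summary confirms.

```latex
% Stage 1: continual pre-training (causal language modeling on raw contract code)
\mathcal{L}_{\mathrm{CPT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)

% Stage 2: explanation-guided fine-tuning (cross-entropy over the generated sequence)
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid c,\, y_{<t}\right)
```

Both losses are token-level negative log-likelihoods; they differ only in the data they are computed on and in what the model is conditioned on.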

Training Data:

Dataset curation: 3-step pipeline (Generation -> LLM Judge -> Human Verify)
Teacher Models: Qwen2-72B-Instruct and Mistral-Large-Instruct-2407-123B used to generate initial explanations
Judge Model: Llama-3.1-70B-Instruct used to score explanations (1-10)
Final SFT Dataset: 3,382 Reentrancy, 1,165 Timestamp, 1,005 Overflow, 697 Delegatecall samples
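The generate, judge, and verify steps can be sketched as a small filter in Python. This is an illustration only: the function `curate`, the score threshold of 8, and the choice to keep the single highest-scoring explanation per sample are assumptions, not details reported in this summary. The teacher, judge, and human-review callables are stubs standing in for the real models and annotators.

```python
def curate(samples, teachers, judge, human_ok, min_score=8):
    """Keep one vetted explanation per sample.

    samples:  iterable of (code, label) pairs
    teachers: list of callables (code, label) -> explanation text
    judge:    callable (code, label, explanation) -> score in 1..10
    human_ok: callable (code, label, explanation) -> bool
    min_score is an assumed cutoff, not a value from the paper.
    """
    kept = []
    for code, label in samples:
        # Step 1: every teacher model proposes an explanation.
        candidates = [teach(code, label) for teach in teachers]
        # Step 2: the judge scores each candidate; keep the best one.
        scored = [(judge(code, label, text), text) for text in candidates]
        score, best = max(scored, key=lambda pair: pair[0])
        # Step 3: low scores are dropped, the rest go to human verification.
        if score >= min_score and human_ok(code, label, best):
            kept.append({"code": code, "label": label, "explanation": best})
    return kept


# Demo with stub components (all hypothetical).
demo_teachers = [
    lambda code, label: "short",
    lambda code, label: "a longer explanation",
]
demo_judge = lambda code, label, text: min(10, len(text))   # toy scoring rule
demo_human = lambda code, label, text: label == 1            # toy reviewer

curated = curate([("A", 1), ("B", 0)], demo_teachers, demo_judge, demo_human)
```

In the demo, both samples pass the judge, but only sample "A" survives the stub human check, so `curated` holds a single record.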

Key Hyperparameters:

context_loss_type: Context-based language model loss

Compute: Not reported in the paper

Comparison to Prior Work

vs. Static Tools (Slither, Oyente): Smart-LLaMA handles complex logic better and provides reasoning, whereas static tools rely on rigid patterns and produce many false positives
vs. Peculiar/PSCVFinder: Smart-LLaMA provides detailed explanations and points to the vulnerable location, whereas these methods only output binary labels
vs. GPT-4/General LLMs [not cited in paper]: Smart-LLaMA is specifically adapted via continual pre-training to understand Solidity semantics, reducing domain-specific hallucinations common in general models

Limitations

Dataset size for some vulnerability types (e.g., Delegatecall with 697 samples) is relatively small
The approach relies on the quality of teacher LLMs (Qwen2/Mistral) for initial training data generation
Computational cost of the two-stage training process is likely higher than simple fine-tuning

Reproducibility

Code: https://zenodo.org/records/13860344

The artifact is publicly available (https://zenodo.org/records/13860344). The repository contains models, datasets, and code. The paper details the prompt engineering process for dataset creation.

📊 Experiments & Results

Evaluation Setup

Vulnerability detection on a dataset of Ethereum smart contracts

Benchmarks:

Custom Dataset (derived from SmartBugs, Etherscan) (Binary Classification + Explanation Generation) [New]

Metrics:

F1 Score
Accuracy
Human Evaluation (Likert scale 1-4 for Correctness, Completeness, Conciseness)
LLM Evaluation (Likert scale 1-4)
Statistical methodology: Not explicitly reported in the paper
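For reference, the two detection metrics and a simple summary of Likert ratings can be computed as in the Python sketch below. The helper names and the toy labels are illustrative and are not data from the paper.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = vulnerable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def share_of_top_ratings(ratings, top=4):
    """Fraction of items that received the top Likert score."""
    return sum(1 for r in ratings if r == top) / len(ratings)


# Toy data: 8 contracts, 1 = vulnerable, 0 = safe.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

acc = accuracy(y_true, y_pred)                  # 0.75 (3 TP + 3 TN out of 8)
f1 = f1_score(y_true, y_pred)                   # 0.75 (3 TP, 1 FP, 1 FN)
top_share = share_of_top_ratings([4, 4, 3, 2])  # 0.5
```

The last helper mirrors how a figure such as "rated 4/4 in 69.5% of cases" would be derived from per-item expert ratings.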

Key Results

Smart-LLaMA consistently outperforms state-of-the-art baselines in F1 score across all four tested vulnerability types.

Benchmark	Metric	Baseline	This Paper	Δ
Custom Dataset	F1 improvement (Reentrancy)	Not explicitly reported in the paper	Not explicitly reported in the paper	+7.35
Custom Dataset	F1 improvement (Delegatecall)	Not explicitly reported in the paper	Not explicitly reported in the paper	+9.55
Custom Dataset	F1 improvement (Integer Overflow)	Not explicitly reported in the paper	Not explicitly reported in the paper	+7.82
Custom Dataset	Accuracy improvement (Reentrancy)	Not explicitly reported in the paper	Not explicitly reported in the paper	+4.14

Experiment Figures

A comparison case study between LLaMA-3.1-8B-Instruct and Smart-LLaMA on a Reentrancy vulnerability example

Main Takeaways

Domain-specific continual pre-training significantly helps the model understand contract logic, reducing false positives caused by misunderstanding external calls
Explanation-guided fine-tuning enables the model to not just detect but explain vulnerabilities, with human experts rating the explanations as highly correct and complete
The rigorous dataset construction pipeline (Teacher LLM -> Judge LLM -> Human) is effective for creating high-quality training data in specialized domains where data is scarce

📚 Prerequisite Knowledge

Prerequisites

Smart Contract Basics (Solidity, Ethereum)
Common Vulnerabilities (Reentrancy, Timestamp Dependency)
Large Language Model Training (Pre-training vs. Fine-tuning)

Key Terms

Reentrancy: A vulnerability where an external contract call interrupts execution and calls back into the original function before state updates are finalized

Delegatecall: A low-level Solidity function that executes code from another contract in the context of the calling contract, risking storage manipulation
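The storage risk can be illustrated with a toy Python model. It is a sketch, not Solidity, and the names `Library`, `Wallet`, and `set_owner` are hypothetical. The point is that the callee's code runs against the caller's storage, so a write inside the callee changes the caller's state.

```python
class Library:
    """Callee code: writes to whatever storage it is executed against."""

    @staticmethod
    def set_owner(storage, sender):
        storage["owner"] = sender


class Wallet:
    """Caller contract with its own storage."""

    def __init__(self, owner):
        self.storage = {"owner": owner}

    def delegatecall(self, code, sender):
        # Run the callee's code in the CALLER's storage context.
        code(self.storage, sender)


wallet = Wallet("alice")
wallet.delegatecall(Library.set_owner, "mallory")
new_owner = wallet.storage["owner"]  # "mallory": the wallet's owner was overwritten
```

If an untrusted party can choose which code is delegate-called, or can reach a function like `set_owner` through it, they can rewrite the caller's state.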

SFT: Supervised Fine-Tuning, i.e., training a model on labeled input-output pairs

Continual Pre-Training: Further training a pre-trained model on domain-specific raw text (here, smart contracts) to improve domain adaptation

Explanation-Guided Fine-Tuning: Fine-tuning the model to generate both a classification label and a reasoning explanation, supervising the 'thought process'

SmartBugs: A widely used dataset and framework for smart contract vulnerability analysis