From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

📝 Paper Summary

LLM Post-training, AI for Software Engineering

This paper establishes a comprehensive post-training pipeline for vulnerability detection. It demonstrates that on-policy RL with fine-grained root-cause rewards significantly outperforms static supervision (SFT and off-policy preference optimization), and that binary rewards and binary evaluation metrics are misleading.

Core Problem

Existing vulnerability detection approaches rely on static supervision (SFT/DPO) that limits self-exploration and use coarse-grained binary rewards that fail to capture reasoning quality.

Why it matters:

Binary outcome rewards often mislead training by crediting correct guesses derived from flawed logic (reward hacking)
Static datasets for SFT and DPO become outdated as the policy improves, preventing the model from learning from its own successful rollouts
Lack of context in function-level datasets leads to semantic blindness, causing hallucinations about missing sanitizers or trigger conditions

Concrete Example: A student model trained via rationalization (where the teacher sees the ground truth) may hallucinate irrelevant CVE IDs because it learns to conflate external knowledge with internal reasoning, unlike a model trained via rejection sampling.

Key Novelty

Full-Pipeline Optimization for Vulnerability Detection (SFT → RL)

First systematic investigation of the complete post-training pipeline for vulnerability detection, comparing SFT, off-policy preference optimization (DPO/ORPO), and on-policy RL (GRPO)
Introduction of difficulty-aware data curation (filtering and curriculum scheduling) to address reward sparsity in reinforcement learning
Development of multi-granularity reward systems ranging from binary detection to fine-grained, specification-based root cause analysis using LLM-as-a-Judge

Architecture

The on-policy RL pipeline using GRPO with difficulty-aware data curation mechanisms.

Breakthrough Assessment

8/10

Comprehensive analysis of the SFT-to-RL pipeline for a specific domain (security). The shift from binary matching to root-cause rewards and the application of GRPO to vulnerability detection are significant methodological advancements.

⚙️ Technical Details

Problem Definition

Setting: Vulnerability reasoning. Given a code snippet and its context, determine whether it is vulnerable and provide a chain-of-thought analysis

Inputs: Vulnerability query q containing code and inter-procedural context

Outputs: Response o containing vulnerability detection label and reasoning

Pipeline Flow

Data Curation: Context Extraction (Joern) → SFT Data Generation
Cold-Start: SFT Training (Rejection Sampling)
RL Training: Difficulty-aware Scheduling → GRPO Optimization

System Modules

Context Extractor

Enrich function-level code with inter-procedural context

Model or implementation: Joern (Static Analysis Tool)

SFT Trainer

Establish fundamental vulnerability detection capabilities

Model or implementation: Student LLM

RL Policy Optimizer (On-Policy RL)

Optimize policy using online feedback

Model or implementation: GRPO (Group Relative Policy Optimization)

Reward Model (On-Policy RL)

Evaluate generated reasoning chains

Model or implementation: LLM-as-a-Judge

Novel Architectural Elements

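Integration of difficulty-aware data filtering (removing samples with pass@1 of 0 or 1) into the GRPO loop: when every rollout in a group receives the same reward, the group-relative advantages are all zero and the sample contributes no reward-driven gradient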
Multi-granularity reward system combining detection, prediction, reasoning, and specification-based signals
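The difficulty-aware filtering and curriculum scheduling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the pass rate is estimated from a list of per-rollout correctness flags, and the easiest-first ordering is an assumed curriculum, since the summary does not state the schedule direction.

```python
def pass_rate(correct_flags: list[bool]) -> float:
    """Fraction of rollouts for one query that were judged correct."""
    return sum(correct_flags) / len(correct_flags)


def filter_and_schedule(rollout_results: dict[str, list[bool]]) -> list[str]:
    """Drop queries the policy always or never solves, then order the rest.

    Queries with pass rate 0 or 1 produce identical rewards within a group,
    so they are removed. The easiest-first ordering is an assumption.
    """
    rates = {query: pass_rate(flags) for query, flags in rollout_results.items()}
    kept = [query for query, rate in rates.items() if 0.0 < rate < 1.0]
    # Higher pass rate means easier; sort easiest first (assumed curriculum).
    return sorted(kept, key=lambda query: rates[query], reverse=True)


if __name__ == "__main__":
    results = {
        "always_solved": [True] * 8,
        "never_solved": [False] * 8,
        "medium": [True, False] * 4,
        "hard": [True] + [False] * 7,
    }
    print(filter_and_schedule(results))
```

Only the two queries with mixed outcomes survive the filter, which is the property that keeps every remaining group informative for the policy update.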

Modeling

Base Model: Not reported in the paper

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy based on group-relative advantages without a value network.

Formally: E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) - beta * KL]

Purpose: Calculate advantage by normalizing rewards within a sampled group.

Formally: A_i = (r_i - mean(r_group)) / std(r_group)
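The two formulas above can be made concrete with a small pure-Python sketch. It is a simplification under stated assumptions: it works at the sequence level rather than per token, uses the population standard deviation, adds a small `stability_eps` to avoid division by zero, and picks `clip_eps=0.2` and `beta=0.04` as placeholder values that do not come from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], stability_eps: float = 1e-8) -> list[float]:
    """A_i = (r_i - mean(r_group)) / std(r_group), with a small stabilizer."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the estimator choice is an assumption
    return [(r - mu) / (sigma + stability_eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one rollout."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(ratios, advantages, kls, clip_eps=0.2, beta=0.04) -> float:
    """Average of the clipped surrogate minus the KL penalty over one group."""
    terms = [
        clipped_term(ratio, adv, clip_eps) - beta * kl
        for ratio, adv, kl in zip(ratios, advantages, kls)
    ]
    return mean(terms)


if __name__ == "__main__":
    rewards = [1.0, 1.0, -1.0, -1.0]
    advantages = group_advantages(rewards)
    print(advantages)
    print(grpo_objective([1.1, 0.9, 1.3, 0.7], advantages, [0.01] * 4))
```

Note that a group with identical rewards yields all-zero advantages, which is why samples the policy always or never solves carry no learning signal.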

Training Data:

OpenVul Dataset: Context-aware dataset enriched via Joern
SFT Data: Curated via Rejection Sampling (8 samples per query, filtered by Judge)
RL Data: Filtered to remove trivial and impossible samples (pass@1 of 1 and 0, respectively), keeping only those whose pass@1 is neither 0 nor 1
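The rejection-sampling step used to build the SFT data can be sketched as below. The `generate` and `judge` callables are hypothetical stand-ins for the teacher model and the LLM judge; their signatures and the prompt/completion record layout are assumptions made for illustration only.

```python
from typing import Callable


def rejection_sample(
    query: str,
    ground_truth: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], bool],
    num_samples: int = 8,
) -> list[dict[str, str]]:
    """Sample several teacher responses and keep only the ones judged correct."""
    kept = []
    for _ in range(num_samples):
        # The teacher sees only the query, never the ground truth.
        response = generate(query)
        # The ground truth is used solely to filter, not to write the reasoning.
        if judge(response, ground_truth):
            kept.append({"prompt": query, "completion": response})
    return kept


if __name__ == "__main__":
    canned = iter(["vulnerable: CWE-787", "safe", "vulnerable: CWE-787", "safe"])
    data = rejection_sample(
        "int f(char *p) { ... }",
        "CWE-787",
        generate=lambda q: next(canned),
        judge=lambda response, truth: truth in response,
        num_samples=4,
    )
    print(len(data))
```

Keeping the ground truth out of `generate` is the point of contrast with rationalization, where the teacher writes its reasoning with the answer in view.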

Key Hyperparameters:

num_generations_per_prompt: 8 (for rejection sampling and GRPO groups)
reward_value_correct: +1.0
reward_value_incorrect: -1.0
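To illustrate why the granularity of the reward matters, the sketch below contrasts a binary detection reward with a root-cause reward using the +1.0 / -1.0 values listed above. The `Rollout` fields and the exact rule for combining the label check with the judge verdict are assumptions; the summary does not specify how the paper maps judge verdicts to reward values.

```python
from dataclasses import dataclass

REWARD_CORRECT = 1.0
REWARD_INCORRECT = -1.0


@dataclass
class Rollout:
    predicted_vulnerable: bool
    root_cause_matches: bool  # verdict of a hypothetical LLM judge


def detection_reward(rollout: Rollout, label_vulnerable: bool) -> float:
    """Binary reward: only the final vulnerable/safe label is checked."""
    if rollout.predicted_vulnerable == label_vulnerable:
        return REWARD_CORRECT
    return REWARD_INCORRECT


def root_cause_reward(rollout: Rollout, label_vulnerable: bool) -> float:
    """Finer reward (assumed rule): a vulnerable verdict must also name the right root cause."""
    if rollout.predicted_vulnerable != label_vulnerable:
        return REWARD_INCORRECT
    if label_vulnerable and not rollout.root_cause_matches:
        return REWARD_INCORRECT
    return REWARD_CORRECT


if __name__ == "__main__":
    lucky_guess = Rollout(predicted_vulnerable=True, root_cause_matches=False)
    print(detection_reward(lucky_guess, True), root_cause_reward(lucky_guess, True))
```

A correct label reached through flawed reasoning is credited by the binary reward but penalized by the root-cause reward, which is the failure mode the paper attributes to coarse-grained rewards.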

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard SFT (e.g., VulDe-style): Introduces on-policy RL (GRPO) to allow self-exploration beyond static datasets
vs. DPO/ORPO approaches: Demonstrates that off-policy methods are limited by static supervision, whereas GRPO enables finding novel reasoning paths
vs. Binary-label classifiers: Uses LLM-as-a-Judge for reasoning-level rewards, so that correct guesses reached through wrong logic are no longer credited

Limitations

Specification-based rewards require significant manual effort and inference overhead to generate sample-specific rubrics
Filtering hard-to-detect samples improves efficiency but results in model performance loss on those specific edge cases
LLM-as-a-Judge accuracy varies depending on the judge model's security expertise

Reproducibility

Code: https://github.com/youpengl/OpenVul

Code and datasets are publicly available at the repository above and on Hugging Face. The base model architecture and training compute resources are not detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

Evaluation of vulnerability detection and reasoning accuracy using multiple granularity levels.

Benchmarks:

OpenVul (Context-aware) (Vulnerability Detection & Root Cause Analysis) [New]

Metrics:

Detection-level Match (Binary Accuracy)
Prediction-level Match (CWE Hierarchy)
Reasoning-level Match (LLM-as-a-Judge Root Cause Analysis)

Statistical methodology: Not explicitly reported in the paper
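As a rough illustration of the Prediction-level Match metric listed above, the sketch below treats a predicted CWE as a match when it equals the ground truth or lies on the same ancestor chain. Both the matching rule and the tiny `PARENT` excerpt are assumptions for illustration; the paper's exact rule is not given in this summary, and the parent links should be checked against the official CWE hierarchy.

```python
# Illustrative excerpt of CWE parent links (assumed; verify against the CWE list).
PARENT = {
    "CWE-787": "CWE-119",  # Out-of-bounds Write
    "CWE-125": "CWE-119",  # Out-of-bounds Read
    "CWE-119": "CWE-118",  # Improper restriction of operations within buffer bounds
}


def ancestors(cwe: str) -> set[str]:
    """All ancestors of a CWE ID reachable through the PARENT map."""
    found = set()
    while cwe in PARENT:
        cwe = PARENT[cwe]
        found.add(cwe)
    return found


def prediction_match(predicted: str, truth: str) -> bool:
    """Match if the IDs are equal or one is an ancestor of the other (assumed rule)."""
    return (
        predicted == truth
        or predicted in ancestors(truth)
        or truth in ancestors(predicted)
    )


if __name__ == "__main__":
    print(prediction_match("CWE-119", "CWE-787"))
    print(prediction_match("CWE-125", "CWE-787"))
```

Under this rule a coarser parent category counts as a match while a sibling weakness does not, which is stricter than binary detection but more lenient than exact ID equality.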

Main Takeaways

SFT based on Rejection Sampling significantly outperforms Rationalization; the latter causes hallucinations (e.g., inventing CVE IDs) because the student model conflates external ground-truth knowledge with internal reasoning.
On-policy RL (GRPO) significantly outperforms SFT and off-policy methods (DPO, ORPO), as well as zero-shot SOTA LLMs, by allowing the model to explore complex reasoning paths.
A 'Goldilocks' zone exists for SFT intensity: sufficient SFT helps off-policy optimization, but excessive SFT hinders on-policy RL by limiting self-exploration.
Coarse-grained binary rewards are misleading; fine-grained root-cause rewards are essential for reliable credit assignment in reinforcement learning.
Traditional binary evaluation metrics overestimate performance; LLM-as-a-Judge provides a more robust assessment of whether the model actually understands the vulnerability root cause.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Vulnerability Detection (VD) concepts
Static Analysis basics

Key Terms

SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality instruction-response pairs

GRPO: Group Relative Policy Optimization—an on-policy RL algorithm that normalizes rewards within a group of outputs for the same query, removing the need for a separate value function

DPO: Direct Preference Optimization—an off-policy method that optimizes a model to prefer chosen responses over rejected ones without an explicit reward model

ORPO: Odds Ratio Preference Optimization—a method combining SFT and preference optimization into a single objective to eliminate the reference model

CWE: Common Weakness Enumeration—a community-developed list of software and hardware weakness types

CPG: Code Property Graph—a graph representation of code combining abstract syntax trees, control flow graphs, and program dependence graphs

Joern: A static analysis tool used to generate Code Property Graphs (CPGs) for extracting code context

Rationalization: A data curation method where a teacher model generates reasoning while having access to the ground truth answer (prone to shortcuts)

Rejection Sampling: A data curation method where a teacher model generates multiple responses without access to the ground truth, and only the correct ones are kept (avoids leaking ground-truth knowledge into the reasoning, which reduces hallucinations)