Evaluation Setup
SQL bug fixing on a data development platform (task-level SQL)
Benchmarks:
- Internal Evaluation Dataset (SQL Bug Fixing) [New]
Metrics:
- Fixing Accuracy (Manual Evaluation)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Internal Evaluation Dataset | Accuracy | 28.5 | 43.8 | +15.3 |
| Internal Evaluation Dataset | Accuracy | 43.8 | 49.8 | +6.0 |
| Internal Evaluation Dataset | Accuracy | 42.6 | 49.3 | +6.7 |
| Internal Evaluation Dataset | Accuracy | 43.9 | 49.7 | +5.8 |

Note: PDC data significantly improves performance over base models, and DM-SFT further improves over standard SFT.
Main Takeaways
- PDC (Diverse Collecting + Oriented Generation) provides a large boost (~+50% relative) over zero-shot baselines by aligning training data with real-world user error patterns.
- DM-SFT consistently outperforms standard SFT (~10% relative improvement) across multiple base models (DeepSeek-Coder, CodeQwen, DeepSeek-V2), demonstrating the effectiveness of masking unchanged code.
- Higher mask ratios (p=0.4 to 0.7) allow the model to converge faster by increasing the per-token loss weight of the actual bug fixes (diff lines).
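The masking idea in the takeaways above can be sketched as a per-token loss mask: tokens on changed (bug-fix) lines always contribute to the loss, while tokens on unchanged lines are dropped with probability p. This is a minimal illustrative sketch, not the paper's implementation; the function name, the token-level (rather than line-level) granularity, and the `mask_ratio` handling are assumptions for clarity.

```python
import torch
import torch.nn.functional as F

def dm_sft_loss(logits, labels, is_diff_token, mask_ratio=0.5):
    """Sketch of a diff-mask SFT loss (names and granularity are hypothetical).

    logits:        (seq_len, vocab) model outputs
    labels:        (seq_len,) target token ids
    is_diff_token: (seq_len,) bool, True for tokens on changed (bug-fix) lines
    mask_ratio:    probability p of excluding an *unchanged* token from the loss
    """
    per_token = F.cross_entropy(logits, labels, reduction="none")
    # Always keep diff-line tokens; randomly drop unchanged-line tokens
    # with probability p, which raises the relative weight of the fix itself.
    keep = is_diff_token | (torch.rand(labels.shape) >= mask_ratio)
    kept = per_token[keep]
    # Fall back to the full loss if the random mask dropped every token.
    return kept.mean() if kept.numel() > 0 else per_token.mean()
```

Averaging only over the kept tokens means a higher p concentrates the gradient signal on the diff lines, which is one way to read the faster-convergence claim above.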