Evaluation Setup
SQL bug fixing on a data development platform (task-level SQL)
Benchmarks:
- Internal Evaluation Dataset (SQL Bug Fixing) [New]
Metrics:
- Fixing Accuracy (Manual Evaluation)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Internal Evaluation Dataset | Accuracy | 28.5 | 43.8 | +15.3 |
| Internal Evaluation Dataset | Accuracy | 43.8 | 49.8 | +6.0 |
| Internal Evaluation Dataset | Accuracy | 42.6 | 49.3 | +6.7 |
| Internal Evaluation Dataset | Accuracy | 43.9 | 49.7 | +5.8 |

Note: PDC data significantly improves performance over base models, and DM-SFT further improves over standard SFT.
Main Takeaways
- PDC (Diverse Collecting + Oriented Generation) provides a large boost (~+50% relative) over zero-shot baselines by aligning training data with real-world user error patterns.
- DM-SFT consistently outperforms standard SFT (~10% relative improvement) across multiple base models (DeepSeek-Coder, CodeQwen, DeepSeek-V2), demonstrating the effectiveness of masking unchanged code.
- Higher mask ratios (p=0.4 to 0.7) allow the model to converge faster by increasing the per-token loss weight of the actual bug fixes (diff lines).
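The masking idea in the takeaways above can be sketched as a per-token loss mask: tokens on changed (bug-fix) lines always contribute to the loss, while tokens on unchanged lines are dropped with probability p. This is a minimal illustrative sketch, not the paper's implementation; the function name, the token-level (rather than line-level) granularity, and the `mask_ratio` handling are assumptions for clarity.

```python
import torch
import torch.nn.functional as F

def dm_sft_loss(logits, labels, is_diff_token, mask_ratio=0.5):
    """Sketch of a diff-mask SFT loss (names and granularity are hypothetical).

    logits:        (seq_len, vocab) model outputs
    labels:        (seq_len,) target token ids
    is_diff_token: (seq_len,) bool, True for tokens on changed (bug-fix) lines
    mask_ratio:    probability p of excluding an *unchanged* token from the loss
    """
    per_token = F.cross_entropy(logits, labels, reduction="none")
    # Always keep diff-line tokens; randomly drop unchanged-line tokens
    # with probability p, which raises the relative weight of the fix itself.
    keep = is_diff_token | (torch.rand(labels.shape) >= mask_ratio)
    kept = per_token[keep]
    # Fall back to the full loss if the random mask dropped every token.
    return kept.mean() if kept.numel() > 0 else per_token.mean()
```

Averaging only over the kept tokens means a higher p concentrates the gradient signal on the diff lines, which is one way to read the faster-convergence claim above.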