
CodeSimpleQA: Scaling Factuality in Code Large Language Models

J Yang, W Zhang, Y Li, S Guo, H Wang, A Liu, G Zhang…
Zhejiang University, Alibaba Group, University of Waterloo
arXiv, 12/2025
Factuality Benchmark RL QA

📝 Paper Summary

Factual Knowledge Evaluation, Code Question Answering, Post-training for Factuality
CodeSimpleQA introduces a bilingual (English/Chinese) benchmark and a 66.9M-sample synthetic instruction dataset to evaluate and improve the factual accuracy of code LLMs, and shows that RL post-training improves factuality: F-score on Chinese tasks rises from 37.1% (Qwen2.5-Coder-32B-Instruct) to 45.2% (CodeSimpleQA-RL).
Core Problem
Current code LLM benchmarks focus on code execution correctness but overlook the factual accuracy of programming concepts, leading to plausible-sounding but incorrect technical answers.
Why it matters:
  • Factual inaccuracies in coding assistants can lead to bugs, security vulnerabilities, or inefficient implementations when developers rely on them for technical concepts
  • Existing factuality benchmarks (e.g., SimpleQA) focus on general world knowledge, leaving a gap in evaluating specialized software development knowledge across diverse languages and domains
  • Even frontier models like GPT-4o struggle with precise technical facts, often hallucinating deprecated APIs or incorrect syntax details
Concrete Example: When asked about a specific Android lifecycle method like 'onRestart()', a model might hallucinate details from a deprecated version or confuse it with 'onResume()', whereas the benchmark requires an answer grounded in official documentation.
Key Novelty
CodeSimpleQA Benchmark & CodeSimpleQA-RL Framework
  • Creates a rigorous bilingual benchmark (English/Chinese) where every QA pair is grounded in official documentation and verified by human experts, moving beyond heuristic evaluation
  • Develops a 66.9M-sample synthetic instruction dataset via a structured pipeline: document recall → knowledge clustering → QA generation → LLM-as-a-Judge verification
  • Applies Group Relative Policy Optimization (GRPO) specifically for factuality, using an LLM-based reward signal to align model outputs with ground-truth facts
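The group-relative step in GRPO can be illustrated with a minimal sketch. It assumes the common formulation in which each sampled answer's reward is normalized by the mean and standard deviation of its own group; the paper's exact reward and normalization are not given in this summary. The `judge_reward` function is a hypothetical stand-in for the LLM-based judge (a simple normalized string match here), and the sampled answers are made up for illustration.

```python
from statistics import mean, pstdev


def judge_reward(answer: str, reference: str) -> float:
    """Hypothetical stand-in for an LLM judge: 1.0 if the answer matches
    the reference after trimming and lowercasing, else 0.0."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four sampled answers to one question (made-up data).
reference = "onRestart()"
sampled_answers = ["onRestart()", "onResume()", " onrestart() ", "onStart()"]

rewards = [judge_reward(a, reference) for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(sampled_answers, advantages):
    print(f"{answer!r}: advantage {adv:+.3f}")
```

When every answer in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the quality of the judge matters in this setup.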
Evaluation Highlights
  • CodeSimpleQA-RL achieves a 45.2% F-score on Chinese tasks, 8.1 points above its base model Qwen2.5-Coder-32B-Instruct (37.1%)
  • DeepSeek-V3 leads open-source models with a 49.3% F-score in Chinese, 12.2 points above Qwen2.5-Coder-32B-Instruct (37.1%)
  • Proprietary model GPT-5 achieves the highest F-score of 62.9% on the English split, demonstrating a continued gap between open and closed models
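This summary does not define the F-score. The sketch below assumes CodeSimpleQA follows the original SimpleQA convention, where each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. The counts in the usage example are hypothetical, not results from the paper.

```python
def f_score(n_correct: int, n_incorrect: int, n_not_attempted: int) -> float:
    """SimpleQA-style F-score (assumed here, not confirmed for CodeSimpleQA).

    overall_correct         = correct / all questions
    correct_given_attempted = correct / (correct + incorrect)
    F-score                 = harmonic mean of the two
    """
    total = n_correct + n_incorrect + n_not_attempted
    attempted = n_correct + n_incorrect
    if total == 0 or attempted == 0 or n_correct == 0:
        return 0.0
    overall_correct = n_correct / total
    correct_given_attempted = n_correct / attempted
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Hypothetical grading counts for one model on 100 questions.
score = f_score(n_correct=40, n_incorrect=40, n_not_attempted=20)
print(f"F-score: {score:.1%}")
```

Under this definition a model cannot raise its score simply by declining to answer, since abstentions lower overall accuracy, nor by guessing on everything, since wrong guesses lower accuracy on attempted questions.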
Breakthrough Assessment
8/10
Addresses a critical, under-explored gap in code LLM evaluation (factuality vs. execution). The release of a 66.9M-sample dataset and the evidence that RL improves factuality are significant contributions.