CodeSimpleQA: Scaling Factuality in Code Large Language Models

📝 Paper Summary

Factual Knowledge Evaluation | Code Question Answering | Post-training for Factuality

CodeSimpleQA introduces a bilingual benchmark and a massive instruction dataset to evaluate and improve the factual accuracy of code LLMs, demonstrating that RL post-training significantly enhances factuality.

Core Problem

Current code LLM benchmarks focus on code execution correctness but overlook the factual accuracy of programming concepts, leading to plausible-sounding but incorrect technical answers.

Why it matters:

Factual inaccuracies in coding assistants can lead to bugs, security vulnerabilities, or inefficient implementations when developers rely on them for technical concepts
Existing factuality benchmarks (e.g., SimpleQA) focus on general world knowledge, leaving a gap in evaluating specialized software development knowledge across diverse languages and domains
Even frontier models like GPT-4o struggle with precise technical facts, often hallucinating deprecated APIs or incorrect syntax details

Concrete Example: When asked about a specific Android lifecycle method like 'onRestart()', a model might hallucinate details from a deprecated version or confuse it with 'onResume()', whereas the benchmark requires an answer grounded in official documentation.

Key Novelty

CodeSimpleQA Benchmark & CodeSimpleQA-RL Framework

Creates a rigorous bilingual benchmark (English/Chinese) where every QA pair is grounded in official documentation and verified by human experts, moving beyond heuristic evaluation
Develops a massive 66.9M sample synthetic instruction dataset via a structured pipeline: document recall → knowledge clustering → QA generation → LLM-as-a-Judge verification
Applies Group Relative Policy Optimization (GRPO) specifically for factuality, using an LLM-based reward signal to align model outputs with ground-truth facts

Evaluation Highlights

CodeSimpleQA-RL achieves 45.2% F-score on Chinese tasks, significantly outperforming the base Qwen2.5-Coder-32B-Instruct (37.1%)
DeepSeek-V3 leads open-source models with 49.3% F-score in Chinese, surpassing Qwen2.5-Coder-32B-Instruct by a large margin
Proprietary model GPT-5 achieves the highest F-score of 62.9% on the English split, demonstrating a continued gap between open and closed models

Breakthrough Assessment

8/10

Addresses a critical, under-explored gap in code LLM evaluation (factuality vs. execution). The release of a 66.9M-sample dataset and the evidence that RL improves factuality are significant contributions.

⚙️ Technical Details

Problem Definition

Setting: Short-form factual question answering in the domain of computer science and software engineering

Inputs: Natural language question q about a programming concept or fact

Outputs: Short factual response a, strictly evaluated against a ground truth reference

Pipeline Flow

Data Construction: Crawl → Filter/Cluster → QA Generation → Verification
Post-Training: SFT → RL (GRPO)

System Modules

Document Filter (Data Construction)

Identify code-related documents from Common Crawl and filter based on domain score and code-text ratio

Model or implementation: fastText classifier + rule-based filters
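
A minimal sketch of how such a filter might combine the two signals, under stated assumptions: the domain score is taken as a number already produced by a classifier (fastText in the paper), while the `Document` type, the marker list in `code_text_ratio`, and every threshold are hypothetical choices made for illustration and are not values reported by the paper.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    domain_score: float  # assumed: classifier probability that the page is code-related


def code_text_ratio(text: str) -> float:
    """Fraction of non-empty lines that look like code (crude heuristic, an assumption)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    code_markers = ("def ", "class ", "import ", "#include", "{", "}", ";", "=>", "()")
    code_lines = sum(1 for ln in lines if any(m in ln for m in code_markers))
    return code_lines / len(lines)


def keep_document(doc: Document,
                  min_domain_score: float = 0.5,  # hypothetical threshold
                  min_ratio: float = 0.1,         # hypothetical threshold
                  max_ratio: float = 0.9) -> bool:
    """Keep pages that are code-related and mix code with explanatory text."""
    if doc.domain_score < min_domain_score:
        return False
    return min_ratio <= code_text_ratio(doc.text) <= max_ratio


sample = Document(text="Use the os module.\nimport os\nprint(os.getcwd())", domain_score=0.9)
print(keep_document(sample))
```

Bounding the ratio on both sides is one possible design choice: it drops pure prose as well as raw code dumps that contain no explanation to turn into a question.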

QA Generator (Data Construction)

Synthesize factual QA pairs from document clusters

Model or implementation: DeepSeek-V3.1 (low temperature 0.1)

Judge/Verifier (Data Construction)

Verify factual correctness of generated QA pairs against source documents

Model or implementation: LLM-as-a-Judge (DeepSeek-V3.1)

Policy Model

Generate answers to code questions

Model or implementation: Qwen2.5-Coder-32B-Instruct

Novel Architectural Elements

Integration of LLM-as-a-Judge verification directly into the GRPO reward loop for factual code knowledge alignment
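
One way to picture a judge-driven reward is the sketch below. It is illustrative only: the `Judge` callable signature, the three verdict labels (chosen to mirror the Correct / Incorrect / Not Attempted metrics), the numeric reward values, and the string-matching `stub_judge` that stands in for a real LLM call are all assumptions, not details taken from the paper.

```python
from typing import Callable

# A judge takes (question, model_answer, reference_answer) and returns a verdict label.
Judge = Callable[[str, str, str], str]

# Assumed reward values; the paper's actual reward shaping is not given in this summary.
REWARDS = {"CORRECT": 1.0, "NOT_ATTEMPTED": 0.0, "INCORRECT": -1.0}


def factuality_reward(question: str, answer: str, reference: str, judge: Judge) -> float:
    """Map the judge's verdict on one sampled answer to a scalar reward."""
    verdict = judge(question, answer, reference)
    if verdict not in REWARDS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return REWARDS[verdict]


def stub_judge(question: str, answer: str, reference: str) -> str:
    """Stand-in for an LLM judge: case-insensitive substring match (assumption)."""
    if not answer.strip():
        return "NOT_ATTEMPTED"
    return "CORRECT" if reference.lower() in answer.lower() else "INCORRECT"


print(factuality_reward("Which method runs after onStop()?",
                        "It is onRestart().", "onRestart()", stub_judge))
```

In an actual training loop the stub would be replaced by a call to the judge model, and the returned rewards would feed the group-relative advantage computation.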

Modeling

Base Model: Qwen2.5-Coder-32B-Instruct

Training Method: SFT followed by GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize the policy to maximize reward relative to the group average.

Formally: The GRPO objective maximizes the expected advantage, where each advantage is normalized within the group of K outputs sampled for the same question.

Purpose: Maintain proximity to the reference policy.

Formally: KL divergence penalty D_KL(π_θ || π_ref) included in the loss.
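
The group normalization described above can be sketched as follows. This assumes the common GRPO formulation in which each reward is centered on the group mean and divided by the group standard deviation; the summary only states that advantages are normalized within the group, so the exact normalization, the `eps` term, and the example rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of outputs sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every output in the group gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 8 sampled answers (1.0 = judged correct).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Correct answers in a mostly incorrect group receive positive advantages and incorrect ones negative, so no separate value network is needed to provide a baseline. When all rewards in a group are equal, every advantage is zero and that group contributes no policy gradient.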

Adaptation: Full fine-tuning

Training Data:

66.9 Million samples in CodeSimpleQA-Instruct (53.6M English, 13.4M Chinese)
1,498 curated samples in CodeSimpleQA Test Set

Key Hyperparameters:

learning_rate_sft: 6e-5
learning_rate_rl: 5e-7
batch_size_sft: 1024
batch_size_rl: 1024 queries
rl_group_size: 8 trajectories per group
warmup_steps: 100
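
For readers who want to see the scale these settings imply, the listed values can be collected into a small config object. The `TrainConfig` class and the derived `rollouts_per_rl_step` property are hypothetical, and the arithmetic assumes every query in an RL batch receives a full group of trajectories, which the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter list above.
    learning_rate_sft: float = 6e-5
    learning_rate_rl: float = 5e-7
    batch_size_sft: int = 1024
    batch_size_rl: int = 1024   # queries per RL step
    rl_group_size: int = 8      # trajectories sampled per query
    warmup_steps: int = 100

    @property
    def rollouts_per_rl_step(self) -> int:
        """Sampled answers to generate and score per RL step (assumes full groups)."""
        return self.batch_size_rl * self.rl_group_size


cfg = TrainConfig()
print(cfg.rollouts_per_rl_step)  # 1024 * 8 = 8192
```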

Compute: SFT: 32 NVIDIA H20 GPUs. RL: 64 GPUs in FSDP mode.

Comparison to Prior Work

vs. SimpleQA: Focuses strictly on computer science/coding domain rather than general knowledge
vs. Chinese SimpleQA: Provides bilingual (En/Zh) support and specific code-domain coverage
vs. Existing Code Benchmarks (HumanEval, MBPP) [not cited in paper]: Focuses on factual knowledge (APIs, concepts) rather than code generation/execution correctness

Limitations

Evaluation set (1,498 samples) is relatively small compared to the training set size
Limited to English and Chinese, excluding other languages
Focuses on short-form QA, ignoring complex reasoning or long-form code generation
Relies on LLM-as-a-Judge, which may inherit biases from the judge model

Reproducibility

Code: https://github.com/microsoft/CodeSimpleQA

The authors promise to release the CodeSimpleQA benchmark and the dataset construction methodology. The CodeSimpleQA-Instruct dataset (66.9M samples) is described as part of the contribution. The base model, Qwen2.5-Coder-32B-Instruct, has open weights.

📊 Experiments & Results

Evaluation Setup

Short-form QA evaluation where models generate answers constrained to <64 words. Evaluated against human-verified reference answers.

Benchmarks:

CodeSimpleQA (Factual Question Answering) [New]

Metrics:

F-score (Harmonic mean of Correct and Correct Given Attempted)
Correct (CO)
Not Attempted (NA)
Incorrect (IN)
Correct Given Attempted (CGA)
Statistical methodology: Not explicitly reported in the paper
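
The metric definitions above can be made concrete with a short sketch. The function name and the example counts are made up for illustration; only the relationships between the metrics (CGA as correct over attempted, F-score as the harmonic mean of CO and CGA) follow the definitions listed.

```python
def simpleqa_style_scores(correct: int, incorrect: int, not_attempted: int) -> dict[str, float]:
    """Compute CO, NA, IN, CGA, and F-score from per-category answer counts."""
    total = correct + incorrect + not_attempted
    if total == 0:
        raise ValueError("at least one graded answer is required")
    attempted = correct + incorrect
    co = correct / total
    cga = correct / attempted if attempted else 0.0
    f_score = 2 * co * cga / (co + cga) if (co + cga) > 0 else 0.0
    return {
        "CO": co,
        "NA": not_attempted / total,
        "IN": incorrect / total,
        "CGA": cga,
        "F": f_score,
    }


# Hypothetical counts: 50 correct, 30 incorrect, 20 not attempted.
scores = simpleqa_style_scores(correct=50, incorrect=30, not_attempted=20)
print({name: round(value, 3) for name, value in scores.items()})
```

Because CGA ignores unanswered questions while CO does not, the harmonic mean rewards a model that abstains when unsure without letting it score well by abstaining on everything.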

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results on the Chinese split of CodeSimpleQA show proprietary models leading, with RL post-training providing significant boosts to open models.
CodeSimpleQA (Chinese)	F-score	37.1	45.2	+8.1
CodeSimpleQA (Chinese)	F-score	61.3	45.2	-16.1
Results on the English split generally show higher scores but similar ranking trends, with GPT-5 dominating.
CodeSimpleQA (English)	F-score	32.6	41.0	+8.4
CodeSimpleQA (English)	F-score	62.9	41.0	-21.9
Analysis of RAG vs. RL shows trade-offs between dynamic and static knowledge.
CodeSimpleQA (Post-2024 subset)	Score	24.0	68.0	+44.0

Experiment Figures

Test-time scaling (Pass@k) performance as inference budget increases from 1 to 100 attempts.

Scaling laws comparing 'Thinking' (reasoning) models vs. 'Chat' (standard) models across parameters.
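
Pass@k curves like the one in the first figure are often computed with the unbiased estimator popularized by code-generation benchmarks: draw n samples per question, count the c correct ones, and estimate the chance that a random subset of k contains at least one correct answer. Whether the paper uses this estimator is an assumption; the sketch below only shows the standard formula.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 100 samples, 12 judged correct.
for k in (1, 10, 100):
    print(k, round(pass_at_k(n=100, c=12, k=k), 4))
```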

Main Takeaways

Proprietary models (GPT-5, o3) currently dominate factual code QA, but open-weight models like DeepSeek-V3 are competitive.
Reinforcement Learning (RL/GRPO) significantly improves factual accuracy over Supervised Fine-Tuning (SFT) alone, supporting the value of factuality-aware alignment.
Thinking models (with reasoning traces) scale logarithmically better with size than standard chat models, suggesting reasoning helps factuality.
RAG is superior for rapidly changing knowledge (libraries, APIs), while SFT/RL is better for stable core concepts; a hybrid approach is likely optimal.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and instruction tuning
Familiarity with Reinforcement Learning from Human Feedback (RLHF)
Basic knowledge of RAG (Retrieval-Augmented Generation)

Key Terms

GRPO: Group Relative Policy Optimization—a policy optimization algorithm that normalizes advantages within a group of sampled outputs for the same prompt, removing the need for a separate value function critic

SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs

LLM-as-a-Judge: Using a strong LLM (DeepSeek-V3.1 in this paper's data pipeline) to evaluate the quality or correctness of another model's output

Pass@k: An evaluation metric measuring the probability that at least one of k sampled outputs for a question is correct

fastText: A library for efficient text classification and representation learning, used here to identify code-related documents

RAG: Retrieval-Augmented Generation—enhancing model responses by retrieving relevant documents from an external knowledge base

FSDP: Fully Sharded Data Parallel—a memory optimization technique for training large models by sharding model parameters across GPUs

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs