Quantum Verifiable Rewards for Post-Training Qiskit Code Assistant

📝 Paper Summary

AI for Quantum Computing, LLM Post-training for Code

This paper trains a specialized Large Language Model (LLM) for quantum computing by using pass/fail execution feedback from real quantum hardware and simulators as a reinforcement learning reward signal.

Core Problem

General-purpose coding models often generate quantum code (Qiskit) that is syntactically correct but fails to execute on actual quantum hardware due to deprecated APIs or violations of physical constraints.

Why it matters:

Quantum SDKs evolve rapidly, causing models trained on stale data to produce deprecated, non-executable code
Correct quantum programming requires adhering to strict physical hardware constraints (e.g., qubit connectivity) that standard language modeling objectives do not enforce
Existing execution-based feedback methods focus on CPU execution, missing the specific nuances of Quantum Processing Unit (QPU) compilation and execution
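
To make the connectivity constraint concrete, here is a minimal sketch in plain Python (no Qiskit required). The names `violating_gates`, `coupling_edges`, and the three-qubit line topology are illustrative assumptions, not taken from the paper. A real transpiler would repair such violations by inserting SWAPs rather than merely reporting them.

```python
from typing import Iterable

Gate = tuple[str, tuple[int, ...]]  # (gate name, qubit indices)


def violating_gates(
    gates: Iterable[Gate], coupling_edges: set[tuple[int, int]]
) -> list[Gate]:
    """Return the two-qubit gates that act on physically unconnected qubits.

    Edges are treated as undirected here, which is a simplifying assumption.
    """
    undirected = {frozenset(edge) for edge in coupling_edges}
    bad = []
    for name, qubits in gates:
        if len(qubits) == 2 and frozenset(qubits) not in undirected:
            bad.append((name, qubits))
    return bad


# Hypothetical 3-qubit device wired in a line: 0 - 1 - 2
line_device = {(0, 1), (1, 2)}

# GHZ-style circuit: the second CX couples qubits 0 and 2, which are not adjacent.
ghz = [("h", (0,)), ("cx", (0, 1)), ("cx", (0, 2))]

print(violating_gates(ghz, line_device))  # [('cx', (0, 2))]
```

Code like `ghz` is valid Python and a valid abstract circuit, yet it cannot run on this device as written. A next-token training objective alone never sees this kind of failure.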

Concrete Example: A user asks for a circuit to run on a specific backend. A standard model might import `qiskit.providers.aer` (deprecated) or fail to transpile the circuit for the device's coupling map. The proposed model, optimized with hardware feedback, correctly identifies the backend and uses the modern `SamplerV2` primitive.
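
The deprecated-import failure in this example can be caught even before execution. The sketch below is a static pre-check built on Python's standard `ast` module. It is not the paper's method, which relies on actually running the code. The `DEPRECATED_MODULES` list and the function name are hypothetical and cover only two module paths for illustration.

```python
import ast

# Illustrative list only: module paths removed or relocated in newer Qiskit
# releases. A real checker would derive this from the SDK's release notes.
DEPRECATED_MODULES = ("qiskit.providers.aer", "qiskit.opflow")


def deprecated_imports(source: str) -> list[str]:
    """Return the deprecated module paths imported by `source`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            for old in DEPRECATED_MODULES:
                # Match the module itself or any submodule of it.
                if module == old or module.startswith(old + "."):
                    found.append(module)
    return found


stale = "from qiskit.providers.aer import AerSimulator\n"
modern = "from qiskit_aer import AerSimulator\n"

print(deprecated_imports(stale))   # ['qiskit.providers.aer']
print(deprecated_imports(modern))  # []
```

A check like this only flags known patterns. Execution feedback also catches failures that no static list anticipates, such as a circuit that does not match the target device.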

Key Novelty

Quantum-Verifiable Reinforcement Learning

Integrates a 'Quantum Verification' loop into the training pipeline, where generated code is executed on quantum simulators or hardware to generate a binary pass/fail reward
Uses Group Relative Policy Optimization (GRPO) to optimize the model specifically for this quantum execution reward, ensuring code isn't just plausible but physically executable
Distills quantum physics reasoning from a larger teacher model (DeepSeek V3) into the code assistant to improve problem understanding before code generation

Architecture

The GRPO training loop where the LLM interacts with a 'Quantum Sandbox' environment.

Evaluation Highlights

Achieves 28.48% pass@1 on the Qiskit-HumanEval-hard benchmark, the highest score among all evaluated models, including the much larger Qwen3-Coder-480B-Instruct
Outperforms the base Qwen2.5-Coder-14B-Instruct model by 12 to 16 percentage points on Qiskit-HumanEval-hard
Comes within 1.32 points of the roughly 34x larger Qwen3-Coder-480B on the standard Qiskit-HumanEval benchmark (50.33% vs 51.65%)

Breakthrough Assessment

8/10

Novel application of physical hardware verification (QPUs) as an RL reward signal. Demonstrates that domain-specific verification can allow small models to outperform massive generalist models in specialized fields.

⚙️ Technical Details

Problem Definition

Setting: Code generation for Quantum Computing (Qiskit framework)

Inputs: Natural language prompt describing a quantum problem

Outputs: Executable Python code using the Qiskit SDK

Pipeline Flow

Input Prompt Processing
LLM Inference (Qiskit Code Assistant)
Code Output

System Modules

Qiskit Code Assistant

Generate Qiskit code from natural language prompts

Model or implementation: Qwen2.5-Coder-14B (Fine-tuned + RL aligned)

Novel Architectural Elements

Training pipeline integrates a 'Quantum Sandbox' that executes generated code on simulators/QPUs to provide ground-truth rewards (not used at inference time, but structural to the training loop)
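
A binary execution reward of this kind can be sketched with the standard library alone. The version below is an assumption-laden stand-in: it runs a candidate plus its unit test in a fresh Python interpreter and maps the exit code to 1.0 or 0.0. In the paper's setting the test would involve transpiling and running on a simulator or QPU. Here the candidate is ordinary Python, and `execution_reward` and its parameters are hypothetical names. A subprocess gives process isolation, not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(candidate_code: str, unit_test: str, timeout_s: float = 30.0) -> float:
    """Return 1.0 if candidate + test run cleanly in a fresh interpreter, else 0.0."""
    program = candidate_code + "\n\n" + unit_test + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # hangs count as failures
    return 1.0 if result.returncode == 0 else 0.0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
test = "assert add(2, 3) == 5"

print(execution_reward(good, test), execution_reward(bad, test))  # 1.0 0.0
```

Because the reward is computed by running code rather than by a learned reward model, it cannot be satisfied by output that merely looks plausible.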

Modeling

Base Model: Qwen2.5-Coder-14B

Training Method: Hybrid approach: EPT -> Weight Merge -> SFT -> DPO -> GRPO

Objective Functions:

Purpose: Optimize the policy to maximize quantum execution success.

Formally: GRPO objective that maximizes the group-relative advantage A_i, computed from the quantum-verifiable reward (pass/fail), subject to a KL divergence penalty.

Purpose: Align the model to prefer executable code over non-executable code.

Formally: DPO loss that widens the log-likelihood margin between accepted and rejected code samples, measured relative to a reference model.
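
The two objectives above can be illustrated with a small, framework-free sketch. It shows only the core arithmetic: the group-normalized advantage used by GRPO and the per-pair DPO loss. The clipping and KL terms of the full GRPO objective are omitted, and the function names, the use of the population standard deviation, and the `eps` constant are assumptions rather than details from the paper.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.2,
) -> float:
    """-log sigmoid(beta * (policy margin - reference margin)) for one pair."""
    margin = (policy_logp_chosen - policy_logp_rejected) - (
        ref_logp_chosen - ref_logp_rejected
    )
    x = beta * margin
    # softplus(-x) == -log(sigmoid(x)); written to stay stable for large |x|.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# One prompt, four sampled completions, binary pass/fail rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

# A group where every sample passes carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# The loss shrinks as the policy favors the accepted sample more than the reference does.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0) < dpo_loss(-5.0, -1.0, -3.0, -3.0))  # True
```

With a pass/fail reward, the single passing sample in the first group receives a positive advantage and the failing ones a negative advantage. This is how a binary signal still produces a graded update.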

Adaptation: LoRA (rank=64, alpha=128) used for EPT and DPO stages; Full weight training for GRPO
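
For a sense of what rank=64 and alpha=128 imply, the following sketch computes the adapter size and scaling for one weight matrix. It assumes the standard LoRA formulation, in which the update is scaled by alpha / rank. The 4096-wide layer is a hypothetical shape chosen for illustration, not a dimension reported in the paper.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is (rank x d_in), B is (d_out x rank); the frozen weight is untouched.
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / rank


rank, alpha = 64, 128   # values reported for the EPT and DPO stages
d_in = d_out = 4096     # hypothetical layer width, for illustration only

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, rank)

print(lora_scale(alpha, rank))    # 2.0
print(extra, full, extra / full)  # 524288 16777216 0.03125
```

Under these assumptions the adapter trains about 3% as many parameters as the full matrix.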

Training Data:

EPT: 82 Million tokens of Qiskit code/notebooks
SFT: 10,076 synthetic samples (problem, quantum explanation, code) distilled from DeepSeek V3
DPO: 4.9k accepted/rejected pairs (rejected chosen via cosine similarity)
GRPO: 4.5k prompt/unit-test pairs

Key Hyperparameters:

learning_rate_grpo: 1e-6
learning_rate_dpo: 3e-5
learning_rate_ept: 2e-4
batch_size_grpo: 512
batch_size_dpo: 64
group_size_G: 32
kl_beta_grpo: 0.01
dpo_beta: 0.2

Compute: GRPO: 32 NVIDIA A100 80GB GPUs; SFT/EPT: 16 NVIDIA A100 80GB GPUs

Comparison to Prior Work

vs. DeepSeek-R1: Applies GRPO specifically to the domain of quantum hardware execution rather than general reasoning
vs. Ether0: Focuses on quantum circuit generation and compilation correctness rather than chemical property optimization
vs. Standard Instruct Models: Incorporates domain-specific execution feedback (transpilation success, primitive execution) rather than just text-based instruction following

Limitations

Relies on synthetic data generation which may not capture all real-world edge cases
Quantum hardware execution is slow and costly compared to CPU execution, potentially creating bottlenecks during RL training
The approach is currently specific to Qiskit and may need adaptation for other quantum SDKs

Reproducibility

Training data is synthetic or crawled (licenses listed). Base models (Qwen, Mixtral) are open weights. Qiskit-HumanEval benchmark is available. No specific code repository for the *training pipeline* of this paper is provided.

📊 Experiments & Results

Evaluation Setup

Code generation assessed by unit test execution

Benchmarks:

Qiskit-HumanEval (QHE) (Function completion with signature provided)
Qiskit-HumanEval-hard (QHE-hard) (Full function generation from the prompt alone, with no imports or signature provided) [New]

Metrics:

pass@1
pass@k
Statistical methodology: Not explicitly reported in the paper
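
The summary does not say how pass@k was estimated. A common choice, assumed here, is the unbiased estimator introduced with the original HumanEval benchmark: draw n samples per problem, count the c that pass, and compute the probability that a random subset of size k contains at least one passing sample.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance a size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for one problem, 3 of which pass the unit tests.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

For k = 1 the estimator reduces to the plain pass rate c / n. The benchmark score is the mean of this value over all problems.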

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Qiskit-HumanEval (QHE)	pass@1	51.65	50.33	-1.32
Qiskit-HumanEval-hard (QHE-hard)	pass@1	12.00	28.48	+16.48
Qiskit-HumanEval (QHE)	pass@1	42.00	50.33	+8.33

Experiment Figures

Pass@k curves comparing base Instruct models against various Qiskit-finetuned variants on QHE and QHE-hard.

Main Takeaways

Reinforcement Learning with quantum-verifiable rewards (GRPO) significantly improves performance on hard tasks where correct imports and API usage are critical
Small, domain-optimized models (14B) can match or outperform massive generalist models (480B) on specialized quantum coding tasks
Combining DPO (offline preference learning) and GRPO (online reinforcement learning) yields the best overall performance
SFT with 'reasoning traces' (distilled from teacher) helps understanding but sometimes reduces coding accuracy compared to pure RL optimization

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Model (LLM) post-training (SFT, RLHF)
Familiarity with Quantum Computing concepts (circuits, transpilation, primitives)
Knowledge of Reinforcement Learning algorithms (PPO, GRPO)

Key Terms

Qiskit: An open-source Software Development Kit (SDK) for working with quantum computers

QPU: Quantum Processing Unit—the physical hardware chip that performs quantum computations

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes advantages within a group of sampled outputs to stabilize training without a separate value network

DPO: Direct Preference Optimization—a method to align models to preferences (e.g., correct vs incorrect code) without explicit reward modeling

SLERP: Spherical Linear Interpolation—a technique for merging model weights that preserves the geometric properties of the parameter space better than simple averaging

Transpilation: The process of rewriting a quantum circuit to match the specific constraints (connectivity, basis gates) of a target quantum device

Estimator/Sampler: Qiskit Runtime primitives; 'Estimator' calculates expectation values of operators, while 'Sampler' returns measured bitstrings from the quantum circuit

SFT: Supervised Fine-Tuning—training the model on high-quality input-output pairs

EPT: Extended Pretraining—continuing the pretraining phase on domain-specific data
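
As an illustration of the SLERP entry above, here is a minimal pure-Python sketch of spherical linear interpolation between two flattened weight vectors. The summary does not state which checkpoints are merged or what interpolation factor is used, so `t`, the fallback threshold `eps`, and the function name are assumptions. Inputs are assumed to be non-zero vectors of equal length.

```python
import math


def slerp(w0: list[float], w1: list[float], t: float, eps: float = 1e-8) -> list[float]:
    """Spherical linear interpolation between two flattened weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in w0))
    norm1 = math.sqrt(sum(x * x for x in w1))
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print([round(x, 4) for x in midpoint])  # [0.7071, 0.7071]
```

Plain averaging of the same two unit vectors gives [0.5, 0.5], whose norm is about 0.71. SLERP keeps the midpoint at norm 1, which is the geometric property the definition refers to.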