Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

📝 Paper Summary

Self-evolving, Agentic reasoning, Memory recall

EvoKernel bridges the data-scarcity gap in NPU kernel programming by learning stage-specific Q-values that retrieve high-utility experiences for drafting and refining kernels, without fine-tuning the LLM.

Core Problem

General-purpose LLMs fail catastrophically on data-scarce hardware backends (like NPUs) because there are no training examples to bridge the gap between CUDA and new DSLs.

Why it matters:

Emerging hardware (NPUs, TPUs) faces a 'Data Wall' with opaque compilers and scarce expert code, unlike the data-rich NVIDIA/CUDA ecosystem
Correctness is binary and machine-verifiable, so 'partially correct' solutions are useless, and standard RAG fails because relevant examples are too sparse to retrieve
Fine-tuning is prohibitively expensive and requires thousands of expert demonstrations which do not exist for new architectures

Concrete Example: GPT-5.2 achieves 92% success on CUDA kernels but drops to 14% on Ascend C (NPU) kernels. It fails to produce code that compiles because it relies on memorized CUDA patterns rather than learning the new NPU syntax and memory hierarchy constraints.

Key Novelty

Value-Driven Memory Retrieval (Q-Memory)

Instead of static semantic similarity, the agent learns Q-values for memory items, estimating their utility for specific stages (Drafting vs. Refining)
Treats retrieval as an action in a Memory-based MDP, updating retrieval priorities via Monte-Carlo returns from compiler/profiler feedback without updating LLM weights
Enables cross-task memory sharing, where insights from simpler kernels (L1) autonomously bootstrap solutions for complex ones (L2) via emergent curricula
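
The stage-specific retrieval idea above can be illustrated with a minimal sketch. The `MemoryItem` schema, the stage names "draft" and "refine", and all Q-values here are hypothetical and chosen for illustration; the paper's actual data structures may differ.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One stored experience (hypothetical schema, not the paper's)."""
    text: str
    # One learned utility estimate per stage.
    q: dict = field(default_factory=lambda: {"draft": 0.0, "refine": 0.0})


def retrieve(memory, stage, k):
    """Return the k items with the highest Q-value for the given stage."""
    return sorted(memory, key=lambda m: m.q[stage], reverse=True)[:k]


# Illustrative values only: the same item can be useful in one stage
# and nearly useless in the other.
memory = [
    MemoryItem("tiling pattern for matmul", {"draft": 0.2, "refine": 0.9}),
    MemoryItem("minimal compiling kernel skeleton", {"draft": 0.8, "refine": 0.1}),
    MemoryItem("failed build log: wrong buffer scope", {"draft": 0.5, "refine": 0.3}),
]

draft_context = retrieve(memory, "draft", k=2)
refine_context = retrieve(memory, "refine", k=2)
```

With these made-up values, the drafting stage favors the compiling skeleton while the refining stage favors the tiling pattern, which is the behavior that a single similarity score cannot express.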

Architecture

The EvoKernel framework lifecycle: Cold-Start Drafting -> Environment/Memory Interaction -> Continual Refining

Evaluation Highlights

Boosts GPT-5.2 functional correctness on Ascend C NPU kernels from 11.0% (Pass@k) to 83.0% (EvoKernel) on KernelBench
Achieves median speedup of 3.60x over initial correct drafts through iterative refinement
Transfers effectively to unseen architectures: solves 10/15 mHC kernels for DeepSeek architecture, with up to 41.96x speedup over PyTorch baselines

Breakthrough Assessment

9/10

Demonstrates a 72-percentage-point improvement in correctness (11.0% to 83.0%) in a verified, hard-constraint coding domain without fine-tuning. Successfully applies memory-based RL to bridge the data-scarcity gap for new hardware.

⚙️ Technical Details

Problem Definition

Setting: Memory-based Markov Decision Process (M-MDP) for code generation

Inputs: Kernel task specification x (PyTorch reference + metadata)

Outputs: Optimized kernel source code y minimizing latency ℓ_lat subject to correctness constraints

Pipeline Flow

Stage 1: Cold-Start Drafting (Bootstrapping feasibility)
Stage 2: Continual Refining (Optimizing latency)
Memory Update (Unified Value Iteration)
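
The flow above can be sketched as a single loop per operator. This is a toy illustration under stated assumptions: `run_operator`, `make_toy_env`, and the stubbed `generate` and `verify` callables are hypothetical stand-ins for the LLM and the Ascend C toolchain, and the switch from drafting to refining on the first correct kernel is one plausible reading of the two stages.

```python
def run_operator(task, generate, verify, budget_T=30):
    """Toy two-stage loop: draft until correct, then refine for latency."""
    best = None      # (code, latency) of the fastest correct kernel so far
    history = []     # (stage, passed) records, usable later for memory updates
    for _ in range(budget_T):
        stage = "draft" if best is None else "refine"
        code = generate(task, stage, best)
        ok, latency = verify(code)
        history.append((stage, ok))
        if ok and (best is None or latency < best[1]):
            best = (code, latency)
    return best, history


def make_toy_env():
    """Hypothetical environment: the first two attempts fail, later ones get faster."""
    attempts = {"n": 0}

    def generate(task, stage, best):
        attempts["n"] += 1
        return f"{task}-v{attempts['n']}"

    def verify(code):
        n = int(code.rsplit("v", 1)[1])
        return (n >= 3, 10.0 / n)

    return generate, verify


generate, verify = make_toy_env()
best, history = run_operator("softmax", generate, verify, budget_T=5)
```

In this toy run the loop spends three iterations in the drafting stage and the remaining two in the refining stage, keeping the lowest-latency correct kernel.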

System Modules

Value-Driven Retriever

Selects context items (code, traces, docs) based on learned Q-values

Model or implementation: Non-parametric Q-table over memory items

Generator

Synthesizes kernel code based on retrieved context

Model or implementation: GPT-5.2 / DeepSeek-V3.2 / Qwen3-Coder-30B

Multi-Gate Verifier

Evaluates code for hacking, compilation, correctness, and latency

Model or implementation: Ascend C Toolchain / Profiler
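
A multi-gate verifier of this kind can be sketched as a chain of checks that stops at the first failure. Everything below is an assumption made for illustration: the candidate dictionary fields, the `calls_reference` flag as a proxy for hacking detection, and the gate order. The real system invokes the Ascend C toolchain and profiler. The tolerance check uses the same form as `numpy.allclose`, with the atol and rtol values listed under Key Hyperparameters.

```python
def allclose(out, ref, atol=1e-2, rtol=1e-2):
    """Elementwise |out - ref| <= atol + rtol * |ref| (same form as numpy.allclose)."""
    return len(out) == len(ref) and all(
        abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref)
    )


def verify(candidate, reference_output):
    """Run the gates in order and report the first gate that fails."""
    if candidate["calls_reference"]:                          # gate 1: anti-hacking
        return {"gate": "hacking", "passed": False}
    if not candidate["compiles"]:                             # gate 2: compilation
        return {"gate": "compile", "passed": False}
    if not allclose(candidate["output"], reference_output):   # gate 3: correctness
        return {"gate": "correctness", "passed": False}
    # gate 4: latency is only measured for kernels that passed every earlier gate
    return {"gate": "latency", "passed": True, "latency_ms": candidate["latency_ms"]}


good = {"calls_reference": False, "compiles": True,
        "output": [1.005, 2.0], "latency_ms": 0.8}
bad = dict(good, output=[1.5, 2.0])

good_result = verify(good, [1.0, 2.0])
bad_result = verify(bad, [1.0, 2.0])
```

Returning the name of the failing gate, rather than a bare boolean, gives the memory a more informative signal about why an attempt failed.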

Novel Architectural Elements

Dual-stage memory addressing: Q1 values for feasibility (Stage 1) and Q2 values for latency optimization (Stage 2)
Unified Monte-Carlo update rule for retrieval policy that adapts to verifier feedback without gradient updates to the LLM

Modeling

Base Model: GPT-5.2 (primary), DeepSeek-V3.2, Qwen3-Coder-30B

Training Method: In-context Learning with Value-Driven Retrieval (Inference-time RL)

Objective Functions:

Purpose: Update Q-values of retrieved memory items based on outcome.

Formally: Q(s,m) ← Q(s,m) + α(r - Q(s,m))
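
A minimal sketch of this update, assuming a dictionary-backed Q-table keyed by (stage, item id). The step size `alpha`, the reward values, and the function name `mc_update` are illustrative assumptions; the summary does not state how the reward r is computed for each stage.

```python
def mc_update(q_table, retrieved_ids, stage, reward, alpha=0.1):
    """Apply Q(s, m) <- Q(s, m) + alpha * (r - Q(s, m)) to every retrieved item."""
    for item_id in retrieved_ids:
        old = q_table.get((stage, item_id), 0.0)   # unseen items start at 0
        q_table[(stage, item_id)] = old + alpha * (reward - old)
    return q_table


q = {}
mc_update(q, ["a", "b"], "draft", reward=1.0, alpha=0.5)   # both move halfway to 1.0
mc_update(q, ["a"], "draft", reward=1.0, alpha=0.5)        # "a" moves halfway again
```

Each update moves the estimate a fraction `alpha` of the way toward the observed outcome, so an item's Q-value tracks a running average of the rewards seen when it was retrieved. No LLM weights are involved.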

Key Hyperparameters:

budget_T: 30 iterations per operator
candidate_pool_multiplier_lambda: 15 (found to outperform a value of 2)
verification_tolerances: atol=1e-2, rtol=1e-2
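
One plausible reading of the candidate pool multiplier, which this summary does not confirm, is a two-step retrieval: first gather λ·k candidates by semantic similarity, then keep the k candidates with the highest Q-value. The sketch below assumes that reading; `two_step_retrieve` and all scores are hypothetical.

```python
def two_step_retrieve(similarity, q_values, k, lam=15):
    """Pool the lam * k most similar items, then keep the k with the highest Q."""
    pool_size = lam * k
    pool = sorted(similarity, key=similarity.get, reverse=True)[:pool_size]
    return sorted(pool, key=lambda i: q_values.get(i, 0.0), reverse=True)[:k]


# Illustrative scores: the most useful item ("d") is the least similar one.
similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
q_values = {"a": 0.1, "b": 0.6, "c": 0.9, "d": 1.0}

narrow = two_step_retrieve(similarity, q_values, k=1, lam=2)
wide = two_step_retrieve(similarity, q_values, k=1, lam=15)
```

Under this reading, a small multiplier keeps the pool close to pure similarity search, while a larger one gives the Q-values room to surface useful items that look dissimilar on the surface.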

Compute: Ascend C NPU environment for verification/profiling. Inference-only for LLMs.

Comparison to Prior Work

vs. Pass@k: Stateful memory accumulates success/failure info
vs. Refinement: Shared memory across tasks enables transfer learning (L1 -> L2)
vs. Codex: Uses structured value-driven retrieval rather than unguided shell interaction; +37.0 percentage points in correctness
vs. Retrieval-based baselines: Uses learned Q-values rather than static semantic similarity, critical for code where surface similarity != utility

Limitations

Dependency on verifiable feedback (compilation/execution), making it less applicable to open-ended generation
Performance gains on weaker models (DeepSeek, Qwen) are smaller than on GPT-5.2
Requires NPU hardware environment for the specific feedback loop

Reproducibility

Code: https://evokernel.zhuo.li

The code is publicly available. NPU experiments require the Ascend C toolchain and hardware. The baselines (Codex, Refinement) are described in detail. The experiments use a commercial model (GPT-5.2) and open models (DeepSeek, Qwen).

📊 Experiments & Results

Evaluation Setup

Kernel synthesis on Ascend C (NPU) and CUDA

Benchmarks:

KernelBench: NPU kernel synthesis (L1 and L2 operators)
Attention Set: application-driven kernels (70 ops)
mHC Kernels: DeepSeek architecture operators (15 ops)

Metrics:

Compilation Rate (CR)
Correctness (Acc)
Speedup (vs. Reference & Initial Draft)
Statistical methodology: Not explicitly reported in the paper
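
As a small worked example of the speedup metric, the sketch below divides a baseline latency by the kernel latency and takes the median across operators. The latency numbers are made up for illustration and are not results from the paper.

```python
from statistics import median


def speedup(baseline_ms, kernel_ms):
    """Speedup > 1 means the kernel is faster than the baseline."""
    return baseline_ms / kernel_ms


# Hypothetical (baseline latency, refined kernel latency) pairs in milliseconds.
measurements = [(4.0, 2.0), (8.0, 2.0), (3.0, 3.0)]
speedups = [speedup(base, kern) for base, kern in measurements]
median_speedup = median(speedups)
```

The median is less sensitive than the mean to a few operators with very large speedups, which matters when the distribution is heavily skewed.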

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
KernelBench (Ascend C)	Correctness (Acc), Overall, vs. Pass@k	11.0	83.0	+72.0
KernelBench (Ascend C)	Correctness (Acc), Overall, vs. Codex	46.0	83.0	+37.0
KernelBench (Ascend C)	Correctness (Acc), Level 2	22.0	76.0	+54.0
Ablation study demonstrating the value of cross-task memory transfer and value-based retrieval:
KernelBench L2	Correctness (Acc)	34.0	64.0	+30.0
KernelBench L2	Correctness (Acc)	67.0	77.0	+10.0

Experiment Figures

Optimization outcomes: Speedup distribution and specific operator optimization trajectories

Transfer learning efficiency: Cumulative correctness on L2 operators with different memory initializations

Main Takeaways

Frontier models (GPT-5.2) benefit disproportionately from memory compared to smaller models, suggesting strong in-context learning is a prerequisite
Emergent curriculum: The agent autonomously solves simpler operators first and uses them as references for harder ones
Memory is transferable across backbones: Memories generated by GPT-5.2 significantly boost DeepSeek-V3.2 (6% -> 58% Acc on holdout set)
Value-driven retrieval outperforms heuristic baselines by learning to identify high-utility examples that semantic similarity misses

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Q-learning, Monte-Carlo updates)
Kernel programming concepts (CUDA, Ascend C, memory hierarchy)
Retrieval-Augmented Generation (RAG)

Key Terms

Cold-Start: A scenario where an agent must perform a task without prior training data or expert demonstrations

Ascend C: A domain-specific language (DSL) for programming Huawei Ascend NPUs, similar to CUDA but with different memory hierarchies and APIs

Pass@k: A metric measuring the probability that at least one out of k generated code samples is correct
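
For reference, a common way to estimate Pass@k from n samples with c correct ones is the unbiased estimator 1 - C(n-c, k) / C(n, k), popularized by code-generation evaluations. Whether the paper uses this exact estimator is not stated here, so the sketch is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


one_in_ten = pass_at_k(n=10, c=1, k=1)
none_correct = pass_at_k(n=5, c=0, k=3)
```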

Monte-Carlo update: An RL update method that uses the total accumulated reward from a complete episode to update value estimates

Q-value: An estimate of the expected future reward of taking a specific action (here, retrieving a specific memory item) in a given state

M-MDP: Memory-based Markov Decision Process, an extension of MDPs where the state includes a dynamic external memory bank

EvoKernel: The proposed framework, a self-evolving agent that drafts and refines kernels using value-driven memory

KernelBench: A benchmark suite for evaluating LLMs on GPU/NPU kernel generation tasks

mHC kernels: Kernels for Manifold-Constrained Hyper-Connections (mHC), specialized operators for DeepSeek architectures

epsilon-greedy: An exploration strategy where the agent chooses the best known action most of the time but selects a random action with probability epsilon to explore
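
A minimal sketch of epsilon-greedy selection over memory items, assuming a plain dictionary of Q-values. The function name and the values are illustrative, and how EvoKernel schedules epsilon is not described in this summary.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random item with probability epsilon, otherwise the best-known item."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))   # explore
    return max(q_values, key=q_values.get)    # exploit


q_values = {"skeleton": 0.8, "tiling": 0.3, "build_log": 0.5}
rng = random.Random(0)

greedy_pick = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explore_pick = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
```

With epsilon set to 0 the choice is always the highest-valued item; with epsilon set to 1 every choice is uniformly random. Intermediate values trade off reusing known-good memories against testing items whose Q-values are still uncertain.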

PopArt: Preserving Outputs Precisely, while Adaptively Rescaling Targets; an RL technique that normalizes value-learning targets to handle varying reward scales while keeping the unnormalized predictions unchanged
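
The "preserve outputs" half of PopArt can be shown with a one-dimensional sketch. It assumes a scalar linear output layer w·h + b in normalized space and is not taken from the paper; how EvoKernel applies PopArt to its rewards is not described in this summary.

```python
def unnormalized(h, w, b, mu, sigma):
    """Map the normalized prediction w*h + b back to the original reward scale."""
    return sigma * (w * h + b) + mu


def popart_rescale(w, b, mu, sigma, new_mu, new_sigma):
    """Adjust w and b so unnormalized predictions stay the same under new statistics."""
    new_w = w * sigma / new_sigma
    new_b = (sigma * b + mu - new_mu) / new_sigma
    return new_w, new_b


h, w, b = 2.0, 0.5, 0.1
mu, sigma = 1.0, 2.0            # old running mean and scale of the targets
new_mu, new_sigma = 3.0, 4.0    # statistics after observing larger rewards

before = unnormalized(h, w, b, mu, sigma)
new_w, new_b = popart_rescale(w, b, mu, sigma, new_mu, new_sigma)
after = unnormalized(h, new_w, new_b, new_mu, new_sigma)
```

Because the weights are rescaled at the same moment the statistics change, `before` and `after` agree, so updating the normalization never silently shifts existing value estimates.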