MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

📝 Paper Summary

Code Generation, Mobile AI / On-device Inference

MobileKernelBench and the MoKA multi-agent system enable LLMs to overcome data scarcity and engineering complexity in generating efficient, compilable kernels for mobile inference engines.

Core Problem

Generating kernels for mobile devices is hindered by ecosystem fragmentation, engineering complexity (heterogeneous backends), and data scarcity, causing standard LLMs to hallucinate APIs and fail compilation.

Why it matters:

Mobile inference requires broad operator support for compatibility, which is labor-intensive to implement manually
Existing benchmarks focus on server-grade GPUs (CUDA), ignoring the unique constraints and lack of reference implementations in the mobile domain
Deploying deep learning on edge devices is critical for data safety and low latency, but the kernel development barrier prevents rapid model migration

Concrete Example: Asked to write a MatMul kernel for the MNN framework, a standard LLM may hallucinate non-existent APIs or mishandle broadcasting semantics, so the generated code does not compile. Across the benchmark, failures of this kind push the compilation failure rate of standard LLMs above 54%.
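To make the broadcasting requirement concrete, here is a minimal Python sketch of the output-shape rule a MatMul kernel has to respect. It assumes NumPy-style matmul semantics, which is what ONNX MatMul follows. The function name `matmul_output_shape` is illustrative and does not come from the paper or from MNN.

```python
def matmul_output_shape(a_shape, b_shape):
    """Output shape of MatMul under NumPy-style broadcasting (illustrative helper)."""
    a, b = list(a_shape), list(b_shape)
    if not a or not b:
        raise ValueError("MatMul operands must have rank >= 1")
    # 1-D operands are temporarily promoted to matrices.
    a_was_vec, b_was_vec = len(a) == 1, len(b) == 1
    if a_was_vec:
        a = [1] + a
    if b_was_vec:
        b = b + [1]
    if a[-1] != b[-2]:
        raise ValueError(f"inner dimensions differ: {a[-1]} vs {b[-2]}")
    # Broadcast the leading (batch) dimensions, aligned from the right.
    batch_a, batch_b = a[:-2], b[:-2]
    rank = max(len(batch_a), len(batch_b))
    batch_a = [1] * (rank - len(batch_a)) + batch_a
    batch_b = [1] * (rank - len(batch_b)) + batch_b
    batch = []
    for x, y in zip(batch_a, batch_b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"batch dimensions not broadcastable: {x} vs {y}")
        batch.append(y if x == 1 else x)
    out = batch + [a[-2], b[-1]]
    # Drop the dimensions that were added for 1-D operands.
    if a_was_vec:
        out.pop(-2)
    if b_was_vec:
        out.pop(-1)
    return tuple(out)
```

A kernel that only handles the plain 2-D case would pass a `(3, 4) x (4, 6)` test but produce wrong results or crash on inputs such as `(2, 1, 3, 4) x (5, 4, 6)`.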

Key Novelty

MobileKernelAgent (MoKA)

A multi-agent system (Coder, Debugger, Accelerator) that follows a plan-and-execute paradigm to iteratively refine kernels
Equipped with domain-specific tools (Repository Tree Builder, Error Extractor) that ground reasoning in the actual codebase structure, compensating for LLMs' lack of framework knowledge
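The summary does not describe how the Repository Tree Builder works internally. As a rough illustration of the idea, the sketch below renders a depth-limited text outline of a source tree that could be placed in an agent's prompt. The function name, the extension filter, and the depth limit are all assumptions made for this example.

```python
import os

def build_repo_tree(root, max_depth=3, exts=(".h", ".hpp", ".cpp")):
    """Return an indented outline of directories and matching source files.

    Hypothetical helper: not the tool from the paper, only an illustration.
    """
    root = os.path.abspath(root)
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # stop descending below the depth limit
        lines.append("  " * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            if name.endswith(exts):
                lines.append("  " * (depth + 1) + name)
    return "\n".join(lines)
```

Giving the model an outline like this lets it see which headers and backend files actually exist, rather than guessing at paths and class names.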

Architecture

The MoKA framework workflow and the MobileKernelBench evaluation pipeline.

Evaluation Highlights

MoKA achieves 93.7% compilation success on MobileKernelBench, whereas standard LLMs fail to compile more than 54% of the time
27.4% of MoKA-generated kernels deliver measurable speedups over native MNN library implementations
Standard LLMs achieve performance parity with native implementations in at most 16.3% of cases, highlighting the difficulty of the mobile domain

Breakthrough Assessment

8/10

First systematic study and benchmark for mobile kernel generation. The shift from <50% to >90% compilation success via agentic tooling is a significant practical leap.

⚙️ Technical Details

Problem Definition

Setting: Generating C++ operator kernels for mobile inference frameworks (specifically MNN) given high-level operator specifications

Inputs: Task description derived from ONNX operator specifications (e.g., PyTorch module logic)

Outputs: Optimized C++ source code compatible with the target mobile framework's CPU backend

Pipeline Flow

Task Description -> Coder -> Initial Code -> Evaluation Pipeline
If Failure -> Debugger -> Repair Plan -> Coder -> Refined Code
If Success -> Accelerator -> Acceleration Plan -> Coder -> Optimized Code
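The flow above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the agents and the evaluation pipeline are passed in as plain callables, and the round limit, the `EvalResult` fields, and the policy of sending a failed optimized kernel back to the Debugger are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class EvalResult:
    compiled: bool
    passed: bool
    latency_ms: Optional[float] = None
    log: str = ""

def refine_kernel(task: str,
                  coder: Callable, debugger: Callable, accelerator: Callable,
                  evaluate: Callable[[str], EvalResult],
                  max_rounds: int = 5) -> Optional[Tuple[str, float]]:
    """Plan-and-execute loop; returns (code, latency_ms) of the best verified kernel."""
    code = coder(task, None, None)  # initial code from the task description
    best = None
    for round_idx in range(max_rounds):
        result = evaluate(code)
        ok = result.compiled and result.passed
        if ok and (best is None or result.latency_ms < best[1]):
            best = (code, result.latency_ms)
        if round_idx == max_rounds - 1:
            break  # no budget left to evaluate another revision
        # Success goes to the Accelerator, failure goes to the Debugger.
        planner = accelerator if ok else debugger
        plan = planner(task, code, result)
        code = coder(task, code, plan)  # the Coder applies the plan
    return best
```

Keeping the Coder as the only component that edits code, with the other two agents producing plans, mirrors the arrows in the flow: every repair or acceleration plan passes through the Coder before it is evaluated again.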

System Modules

Coder

Synthesize C++ source code and execute modifications based on plans

Model or implementation: LLM (specific backbone not explicitly named in text)

Debugger (Planning & Reasoning)

Diagnose compilation and functional errors

Model or implementation: LLM (specific backbone not explicitly named in text)

Accelerator (Planning & Reasoning)

Optimize kernel performance after functional verification

Model or implementation: LLM (specific backbone not explicitly named in text)

Novel Architectural Elements

Integration of repository-aware tools (Tree-sitter based Error Extractor, Repo Tree Builder) directly into agent feedback loops
Automated cross-compilation and on-device verification pipeline (Host-to-Device bridge) embedded in the optimization loop
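The Error Extractor is described as Tree-sitter based. The sketch below deliberately swaps in a simpler technique, a regular expression over GCC/Clang-style diagnostics, only to show the general idea of reducing a long build log to a few structured errors for the Debugger. The names `Diagnostic` and `extract_errors`, and the sample log in the usage note, are hypothetical.

```python
import re
from typing import List, NamedTuple

# Matches lines such as "path/file.cpp:42:17: error: message".
# Warnings and notes are intentionally ignored.
_DIAG = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?:fatal error|error): (?P<msg>.*)$",
    re.MULTILINE,
)

class Diagnostic(NamedTuple):
    file: str
    line: int
    col: int
    message: str

def extract_errors(log: str, limit: int = 5) -> List[Diagnostic]:
    """Return at most `limit` compiler errors found in a build log."""
    found = []
    for m in _DIAG.finditer(log):
        found.append(Diagnostic(m["file"], int(m["line"]), int(m["col"]), m["msg"].strip()))
        if len(found) == limit:
            break
    return found
```

For a made-up log line such as `kernels/MyMatMul.cpp:42:17: error: no member named 'fakeApi' in 'Tensor'`, the function yields one `Diagnostic` with the file, line 42, column 17, and the message. A syntax-aware extractor could additionally attach the enclosing function or class, which a regular expression cannot do reliably.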

Modeling

Base Model: Not explicitly named in text (referred to as 'SOTA LLMs' and 'MoKA')

Training Method: Agentic framework (plan-and-execute) with tool use. The paper also mentions a comparison against standard fine-tuning (LoRA, GRPO).

Adaptation: LoRA and GRPO are mentioned as baseline strategies that yielded negligible improvements

Trainable Parameters: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. KernelBench: Focuses on mobile CPUs (ARM64) and MNN framework rather than CUDA/GPUs
vs. Standard LLMs (Direct Prompting): Uses multi-agent loop with repo-aware tools to solve data scarcity and compilation issues
vs. Fine-tuning (LoRA): MoKA relies on tool-augmented reasoning rather than weight adaptation alone; the paper claims that fine-tuning failed to improve results significantly

Limitations

Evaluation is limited to the CPU backend of the MNN framework
Relies on the availability of a reference ONNX model for functional verification
Requires a complex host-device bridge setup for real-world performance measurement

Reproducibility

Benchmark dataset 'MobileKernelBench' is stated to be available. The evaluation environment requires the MNN framework, the Android NDK, and connected Android devices for verification. The specific backbone model for MoKA is not explicitly named in text.

📊 Experiments & Results

Evaluation Setup

Cross-compilation on host, verification and benchmarking on Android mobile devices
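A host-to-device round trip of this kind is typically built from the Android NDK's CMake toolchain file and `adb`. The sketch below only assembles the command lines and does not run them. The directory layout, the binary name, the `android-21` API level, and the use of `/data/local/tmp` are assumptions for illustration, not details taken from the paper.

```python
def build_commands(src_dir, build_dir, ndk_root, binary, device_dir="/data/local/tmp"):
    """Assemble configure, build, push, and run commands (hypothetical layout)."""
    toolchain = f"{ndk_root}/build/cmake/android.toolchain.cmake"
    configure = [
        "cmake", "-S", src_dir, "-B", build_dir,
        f"-DCMAKE_TOOLCHAIN_FILE={toolchain}",
        "-DANDROID_ABI=arm64-v8a",        # ARM64 target
        "-DANDROID_PLATFORM=android-21",  # assumed minimum API level
    ]
    build = ["cmake", "--build", build_dir]
    push = ["adb", "push", f"{build_dir}/{binary}", f"{device_dir}/{binary}"]
    run = ["adb", "shell", f"{device_dir}/{binary}"]
    return [configure, build, push, run]
```

Each list can be handed to `subprocess.run(cmd, check=True, capture_output=True)` on a machine with the NDK and a connected device; the captured output is what a pipeline would parse for errors or timings.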

Benchmarks:

MobileKernelBench (C++ Kernel Implementation) [New]

Metrics:

Compilation Success Rate (CSR)
Pass Rate (functional correctness)
Performance Speedup (vs native MNN)
Statistical methodology: Not explicitly reported in the paper
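As a minimal sketch of how these three metrics could be aggregated from per-task records: the record fields are hypothetical, and using the total number of tasks as the denominator for every metric is an assumption, since the summary does not state the exact definitions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaskRecord:
    compiled: bool
    passed: bool = False
    kernel_ms: Optional[float] = None  # latency of the generated kernel
    native_ms: Optional[float] = None  # latency of the native MNN kernel

def summarize(records: List[TaskRecord]) -> dict:
    """Return CSR, pass rate, and share of kernels faster than native, in percent."""
    n = len(records)
    compiled = sum(r.compiled for r in records)
    correct = [r for r in records if r.compiled and r.passed]
    # A kernel counts as a speedup when native latency / kernel latency > 1.
    faster = sum(
        1 for r in correct
        if r.kernel_ms and r.native_ms and r.native_ms / r.kernel_ms > 1.0
    )
    return {
        "csr": 100.0 * compiled / n,
        "pass_rate": 100.0 * len(correct) / n,
        "speedup_rate": 100.0 * faster / n,
    }
```

Only kernels that both compile and pass functional verification are considered for speedup, since timing an incorrect kernel says nothing useful.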

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MobileKernelBench	Compilation Success Rate (%)	45.3	93.7	+48.4
MobileKernelBench	Kernels with Speedup > Native (%)	0.0	27.4	+27.4
MobileKernelBench	Performance Parity Rate (%)	16.3	Not reported	Not reported
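The Δ value for compilation success and the ">54%" failure figure quoted elsewhere in this summary both follow from the two compilation rates in the table. A quick arithmetic check, with variable names chosen only for this example:

```python
baseline_csr = 45.3  # compilation success of standard LLMs, in percent
moka_csr = 93.7      # compilation success of MoKA, in percent

# Rounded to one decimal place to avoid floating-point noise.
delta_pp = round(moka_csr - baseline_csr, 1)          # gain in percentage points
baseline_failure = round(100.0 - baseline_csr, 1)     # baseline failure rate
moka_failure = round(100.0 - moka_csr, 1)             # MoKA failure rate
```

This gives a gain of 48.4 percentage points, a baseline failure rate of 54.7%, and a remaining MoKA failure rate of 6.3%.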

Main Takeaways

Standard LLMs, and even fine-tuned variants, produce mobile kernels that fail to compile more than 54% of the time, largely because of hallucinated APIs.
MoKA's agentic approach with repository-aware tools effectively bridges the data scarcity gap, achieving SOTA compilation rates (93.7%).
Although 27.4% of MoKA-generated kernels run faster than the native implementations, the majority still do not outperform the highly optimized native library, indicating room for improvement.

📚 Prerequisite Knowledge

Prerequisites

Understanding of deep learning operators (Convolution, MatMul)
Familiarity with C++ and compilation pipelines (CMake, NDK)
Knowledge of mobile inference engines (MNN, NCNN)

Key Terms

Kernel: A low-level routine (usually C++ or assembly) that performs a specific mathematical operation (like matrix multiplication) optimized for specific hardware

ONNX: Open Neural Network Exchange—a standard format for representing machine learning models to allow interoperability between frameworks

MNN: Mobile Neural Network—a lightweight deep learning inference engine designed for mobile devices, developed by Alibaba

NDK: Native Development Kit—a toolset that allows developers to implement parts of Android apps in native code like C and C++

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique for LLMs

Opset: Operator Set—the versioned collection of operator definitions supported by ONNX

SOTA: State-of-the-Art—the current best performance achievable by existing methods