
MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A. R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang
Zhejiang University, Westlake University, Alibaba
arXiv (2026)
Agent Benchmark RL

📝 Paper Summary

Code Generation Mobile AI / On-device Inference
MobileKernelBench and the MoKA multi-agent system enable LLMs to overcome data scarcity and engineering complexity in generating efficient, compilable kernels for mobile inference engines.
Core Problem
Generating kernels for mobile devices is hindered by ecosystem fragmentation, engineering complexity (heterogeneous backends), and data scarcity, causing standard LLMs to hallucinate APIs and fail compilation.
Why it matters:
  • Mobile inference requires broad operator support for compatibility, which is labor-intensive to implement manually
  • Existing benchmarks focus on server-grade GPUs (CUDA), ignoring the unique constraints and lack of reference implementations in the mobile domain
  • Deploying deep learning on edge devices is critical for data safety and low latency, but the kernel development barrier prevents rapid model migration
Concrete Example: Asked to write a MatMul kernel for the MNN framework, an LLM may call non-existent APIs or mishandle broadcasting semantics. Errors of this kind contribute to the compilation failure rate of over 54% observed for standard LLMs.
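To make "broadcasting semantics" concrete, the following is a minimal sketch of the shape rule a batched MatMul kernel has to respect. It assumes NumPy-style broadcasting over the leading batch dimensions; the summary does not state MNN's exact rules, so both the function name `matmul_output_shape` and the rule itself are illustrative assumptions rather than MNN's API.

```python
from itertools import zip_longest


def matmul_output_shape(a_shape, b_shape):
    """Output shape of a batched MatMul, assuming NumPy-style broadcasting.

    Hypothetical helper for illustration; covers inputs with >= 2 dims only.
    """
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("this sketch only covers inputs with at least 2 dims")

    *a_batch, m, k_a = a_shape
    *b_batch, k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(f"inner dims differ: {k_a} vs {k_b}")

    batch = []
    # Align batch dims from the right; a missing dim behaves like size 1.
    pairs = zip_longest(reversed(a_batch), reversed(b_batch), fillvalue=1)
    for da, db in pairs:
        if da != db and 1 not in (da, db):
            raise ValueError(f"batch dims not broadcastable: {da} vs {db}")
        # A size-1 dim stretches to match the other operand.
        batch.append(db if da == 1 else da)

    return tuple(reversed(batch)) + (m, n)


shape = matmul_output_shape((2, 1, 3, 4), (5, 4, 6))
```

A kernel that only handles equal batch shapes would be correct for `(5, 3, 4) x (5, 4, 6)` but wrong for the mixed case above, which is the kind of silent gap the example describes.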
Key Novelty
MobileKernelAgent (MoKA)
  • A multi-agent system (Coder, Debugger, Accelerator) that follows a plan-and-execute paradigm to iteratively refine kernels
  • Equipped with domain-specific tools (Repository Tree Builder, Error Extractor) that ground reasoning in the actual codebase structure, compensating for LLMs' lack of framework knowledge
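The iterative refinement described above can be sketched as a small control loop. This is a hedged illustration, not MoKA's implementation: the function `refine_kernel`, the `KernelResult` type, the round limit, and all `toy_*` stand-ins are hypothetical names introduced here, and the real agents would be LLM calls backed by the repository tools.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    source: str
    compiled: bool
    rounds: int


def refine_kernel(task, coder, debugger, accelerator,
                  compile_fn, extract_errors, max_rounds=3):
    """Hypothetical Coder -> Debugger -> Accelerator loop (assumed structure)."""
    source = coder(task)
    for round_idx in range(1, max_rounds + 1):
        ok, log = compile_fn(source)
        if ok:
            # Only keep the optimized variant if it still compiles.
            optimized = accelerator(source)
            still_ok, _ = compile_fn(optimized)
            return KernelResult(optimized if still_ok else source, True, round_idx)
        # Feed only the extracted error lines back to the Debugger.
        source = debugger(source, extract_errors(log))
    return KernelResult(source, False, max_rounds)


# Toy stand-ins so the loop can be run end to end.
def toy_coder(task):
    return "bad_api();\nkernel_body();"  # first draft calls a made-up API


def toy_compile(source):
    if "bad_api" in source:
        return False, "error: use of undeclared identifier 'bad_api'\nnote: 1 error"
    return True, ""


def toy_extract_errors(log):
    return [line for line in log.splitlines() if line.startswith("error")]


def toy_debugger(source, errors):
    return source.replace("bad_api();\n", "") if errors else source


def toy_accelerator(source):
    return source + "\n// tuned"


result = refine_kernel("MatMul", toy_coder, toy_debugger, toy_accelerator,
                       toy_compile, toy_extract_errors)
```

In this toy run the first draft fails to compile, the Debugger removes the hallucinated call, and the second round compiles and is passed to the Accelerator.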
Evaluation Highlights
  • MoKA achieves 93.7% compilation success on MobileKernelBench, whereas standard LLMs fail to compile in more than 54% of cases
  • 27.4% of MoKA-generated kernels deliver measurable speedups over native MNN library implementations
  • Standard LLMs achieve performance parity with native implementations in at most 16.3% of cases, highlighting the difficulty of the mobile domain
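The highlights above are rates over a set of benchmark tasks. A minimal sketch of how such rates could be computed is shown below; the record fields, the choice of all tasks as the denominator, and the sample values are assumptions made for illustration and are not taken from the paper.

```python
def summarize(records):
    """Compute compile and speedup rates over per-task records.

    Assumes each record is a dict with 'compiled', 'latency_ms', and
    'baseline_ms' keys (hypothetical field names), and that both rates
    use the total task count as the denominator.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")

    compiled = [r for r in records if r["compiled"]]
    # A kernel counts as faster only if it compiled and beats the baseline.
    faster = [r for r in compiled if r["latency_ms"] < r["baseline_ms"]]

    return {
        "compile_rate": len(compiled) / total,
        "faster_rate": len(faster) / total,
    }


# Made-up records, for demonstration only.
toy_records = [
    {"compiled": True, "latency_ms": 0.8, "baseline_ms": 1.0},
    {"compiled": True, "latency_ms": 1.2, "baseline_ms": 1.0},
    {"compiled": True, "latency_ms": 1.0, "baseline_ms": 1.0},
    {"compiled": False, "latency_ms": None, "baseline_ms": 1.0},
]
summary = summarize(toy_records)
```

If the reported speedup share were instead measured over compiled kernels only, the denominator of `faster_rate` would be `len(compiled)`, so the choice is worth checking against the paper before comparing numbers.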
Breakthrough Assessment
8/10
First systematic study and benchmark for mobile kernel generation. The shift from <50% to >90% compilation success via agentic tooling is a significant practical leap.