
HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC

Maoliang Li, Jiayu Chen, Zihao Zheng, Ziqian Li, Xinhao Sun, Guojie Luo, Chenchen Liu, Xiang Chen
School of Computer Science, Peking University; School of Computer Science, Northwestern Polytechnical University
arXiv, 3/2026 (2026)
RAG Agent Memory

📝 Paper Summary

Agentic RAG pipeline, On-device inference optimization
HeRo reduces mobile agentic RAG latency by orchestrating heterogeneous accelerators (CPU/GPU/NPU) through profiling-based performance modeling, critical-path prioritized mapping, and bandwidth-aware concurrency control.
Core Problem
Deploying complex agentic RAG workflows on mobile SoCs is inefficient because of unmanaged contention for shared memory bandwidth and a mismatch between dynamic task shapes and heterogeneous accelerator capabilities.
Why it matters:
  • Mobile devices store sensitive data requiring local processing, but naive scheduling fails to handle the multi-stage complexity of agentic RAG.
  • Existing mobile optimizations focus on single-model inference, ignoring the inter-stage dependencies and dynamic execution flows of multi-agent systems.
  • Uncoordinated concurrent execution on shared-memory SoCs (CPU, GPU, NPU) causes bandwidth saturation, negating the benefits of parallelism.
Concrete Example: A query rewriter might issue multiple search requests. Naive scheduling might run the rewriter on a slow CPU, or saturate DRAM bandwidth by running all retrievals in parallel with generation, which stalls the critical path.
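To make the bandwidth-saturation point concrete, here is a toy model, not taken from the paper. It assumes every task is memory-bound and that, once total bandwidth demand exceeds the SoC's peak, all concurrent tasks slow down by the same factor. The function names, the latencies, and the bandwidth figures are all hypothetical.

```python
def concurrent_latency(tasks, peak_bw):
    """Finish time when all tasks run at once (toy model).

    tasks: list of (solo_latency_s, bandwidth_demand_gbps) pairs.
    Assumption: if total demand exceeds peak_bw, every task slows
    by the same factor (total demand / peak bandwidth).
    """
    total_demand = sum(bw for _, bw in tasks)
    slowdown = max(1.0, total_demand / peak_bw)
    return max(latency * slowdown for latency, _ in tasks)


def serial_latency(tasks):
    """Finish time when tasks run one after another, with no contention."""
    return sum(latency for latency, _ in tasks)


# Hypothetical numbers: one generation step plus three retrievals.
tasks = [(1.0, 50.0), (0.2, 20.0), (0.2, 20.0), (0.2, 20.0)]
all_at_once = concurrent_latency(tasks, peak_bw=60.0)
one_by_one = serial_latency(tasks)
print(f"all at once: {all_at_once:.2f}s, one by one: {one_by_one:.2f}s")
```

With these made-up numbers, launching everything at once is slower than running the tasks back to back, because total demand (110 GB/s) is well above the assumed 60 GB/s peak. Real contention behavior is more complicated than a single slowdown factor, so this only illustrates the direction of the effect.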
Key Novelty
Heterogeneity-Aware RAG Orchestration (HeRo)
  • Models the performance of RAG sub-stages on different accelerators (CPU/GPU/NPU) considering input shape and memory bandwidth contention.
  • Partitions stages into optimal batch sizes or token groups based on hardware affinity to balance latency and efficiency.
  • Dynamically schedules tasks by prioritizing the 'critical path' (longest estimated time to completion) and throttling concurrency when memory bandwidth saturation would degrade overall speed.
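The scheduling idea in the bullets above can be sketched as a small list scheduler. This is a minimal illustration under stated assumptions, not HeRo's actual algorithm. It assumes each task has a fixed profiled latency per processing unit and a fixed bandwidth demand, and it admits a task only if the running total stays within a bandwidth budget. The `Task` fields, the `schedule` function, and all numbers are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    latency: dict      # hypothetical profiled latency (s) per unit, e.g. {"NPU": 0.2}
    bandwidth: float   # hypothetical DRAM bandwidth demand (GB/s)
    deps: list = field(default_factory=list)


def critical_path(tasks):
    """Estimated time to completion: best-case latency plus the longest successor chain."""
    by_name = {t.name: t for t in tasks}
    succ = {t.name: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            succ[d].append(t.name)
    memo = {}

    def rank(name):
        if name not in memo:
            best = min(by_name[name].latency.values())
            memo[name] = best + max((rank(s) for s in succ[name]), default=0.0)
        return memo[name]

    return {t.name: rank(t.name) for t in tasks}


def schedule(tasks, units, bw_budget):
    """Greedy list scheduling: longest critical path first, throttled by bandwidth."""
    rank = critical_path(tasks)
    done, started, free = set(), set(), set(units)
    running = []  # min-heap of (finish_time, name, unit, bandwidth)
    now, used_bw, plan = 0.0, 0.0, []
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t.name not in started and all(d in done for d in t.deps)]
        for t in sorted(ready, key=lambda t: -rank[t.name]):
            options = [u for u in free if u in t.latency]
            if not options:
                continue
            # Throttle: skip if the budget would be exceeded, unless the SoC is idle.
            if running and used_bw + t.bandwidth > bw_budget:
                continue
            unit = min(options, key=lambda u: t.latency[u])
            free.remove(unit)
            used_bw += t.bandwidth
            started.add(t.name)
            heapq.heappush(running, (now + t.latency[unit], t.name, unit, t.bandwidth))
            plan.append((t.name, unit, now))
        # Advance simulated time to the next task completion.
        now, name, unit, bw = heapq.heappop(running)
        done.add(name)
        free.add(unit)
        used_bw -= bw
    return plan, now


# Hypothetical agentic RAG workflow: rewrite -> two retrievals -> generate.
tasks = [
    Task("rewrite", {"CPU": 0.6, "NPU": 0.2}, 20.0),
    Task("retrieve_a", {"CPU": 0.3, "GPU": 0.2}, 30.0, ["rewrite"]),
    Task("retrieve_b", {"CPU": 0.3, "GPU": 0.2}, 30.0, ["rewrite"]),
    Task("generate", {"GPU": 1.0, "NPU": 0.5}, 40.0, ["retrieve_a", "retrieve_b"]),
]
plan, makespan = schedule(tasks, units=["CPU", "GPU", "NPU"], bw_budget=50.0)
for name, unit, start in plan:
    print(f"{start:.1f}s  {name} -> {unit}")
```

In this example the second retrieval is held back while the first one runs, because the two together would exceed the assumed 50 GB/s budget. The sketch only makes admission decisions. It does not simulate the slowdown that contention would cause, and it does not model input-shape-dependent latency or stage partitioning, both of which the paper's performance model is described as covering.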
Evaluation Highlights
  • Reduces end-to-end latency by up to 10.94x compared to existing deployment strategies on commercial mobile devices.
  • Orchestrates work effectively across heterogeneous units (CPU, GPU, NPU) on Snapdragon 8 Gen 3 and Snapdragon 8 Elite platforms.
  • Achieves practical on-device agentic RAG performance where naive baselines fail to meet latency requirements.
Breakthrough Assessment
7/10
Significant system-level improvement for a specific, high-value problem (on-device agentic RAG). The claimed speedup of up to 10.94x is impressive, though the scope is limited to inference orchestration rather than new model architectures.