HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC

📝 Paper Summary

Agentic RAG pipeline · On-device inference optimization

HeRo reduces mobile agentic RAG latency by orchestrating heterogeneous accelerators (CPU/GPU/NPU) through profiling-based performance modeling, critical-path prioritized mapping, and bandwidth-aware concurrency control.

Core Problem

Deploying complex agentic RAG workflows on mobile SoCs is inefficient for two reasons: unmanaged contention for shared memory bandwidth, and a mismatch between dynamic task shapes and the capabilities of heterogeneous accelerators.

Why it matters:

Mobile devices store sensitive data requiring local processing, but naive scheduling fails to handle the multi-stage complexity of agentic RAG.
Existing mobile optimizations focus on single-model inference, ignoring the inter-stage dependencies and dynamic execution flows of multi-agent systems.
Uncoordinated concurrent execution on shared-memory SoCs (CPU, GPU, NPU) causes bandwidth saturation, negating the benefits of parallelism.

Concrete Example: A query rewriter might issue multiple search requests. A naive scheduler might place the rewriter on a slow CPU, or saturate DRAM by running all retrievals in parallel with generation. Either choice stalls the critical path.

Key Novelty

Heterogeneity-Aware RAG Orchestration (HeRo)

Models the performance of RAG sub-stages on different accelerators (CPU/GPU/NPU) considering input shape and memory bandwidth contention.
Partitions stages into sub-stages (document batches or token groups) sized to match each accelerator's hardware affinity, balancing latency and efficiency.
Dynamically schedules tasks by prioritizing the 'critical path' (longest estimated time to completion) and throttling concurrency when memory bandwidth saturation would degrade overall speed.

Architecture

Conceptual comparison between naive RAG and Agentic RAG workflows, and the proposed HeRo framework stack.

Evaluation Highlights

Reduces end-to-end latency by up to 10.94x compared to existing deployment strategies on commercial mobile devices.
Effective orchestration across heterogeneous units (CPU, GPU, NPU) on Snapdragon 8 Gen 3 and 8 Elite platforms.
Achieves practical on-device agentic RAG performance where naive baselines fail to meet latency requirements.

Breakthrough Assessment

7/10

Significant system-level improvement for a specific, high-value problem (on-device agentic RAG). The claimed speedup of up to 10.94x is impressive, though the scope is limited to inference orchestration rather than new model architectures.

⚙️ Technical Details

Problem Definition

Setting: Minimizing end-to-end latency of a dynamic RAG task graph G(V, E) on a set of heterogeneous processing units K sharing unified DRAM.
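
One way to write this setting down for a single snapshot of the graph is as a makespan-minimization problem. The formulation below is a sketch, not the paper's own notation: the mapping m(v) of task v to a processing unit, the start time s_v, and the latency t_v of v on m(v) are symbols assumed here for illustration. In HeRo's setting t_v would also depend on input shape and on the bandwidth used by co-running tasks, and G itself can grow at runtime, so the problem is re-solved (heuristically) as the graph evolves.

```latex
\begin{aligned}
\min_{m,\,s}\quad & \max_{v \in V}\; \bigl(s_v + t_v\bigr) \\
\text{s.t.}\quad & s_v \ge s_u + t_u && \forall (u, v) \in E \\
& s_u + t_u \le s_v \;\lor\; s_v + t_v \le s_u && \forall u \ne v,\; m(u) = m(v) \\
& m(v) \in K,\quad s_v \ge 0 && \forall v \in V
\end{aligned}
```

The first constraint enforces the dependencies in E; the second prevents two tasks mapped to the same processing unit from overlapping in time.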

Inputs: User query processed by an agentic workflow (potentially spawning dynamic sub-tasks like rewrite, retrieve, rerank).

Outputs: Final generated response.

Pipeline Flow

Input Query → [Shape-Aware Partitioner] → Sub-stages
Sub-stages → [Priority Estimator] → Criticality Score
Ready Tasks → [Concurrency Controller] → Accelerator Mapping (CPU/GPU/NPU)

System Modules

Performance Modeler (Orchestration)

Profiles and predicts latency and bandwidth usage for each model-PU configuration.
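
A profiling-based predictor of this kind can be sketched as a lookup table with interpolation. The table layout, the function names, and every number below are hypothetical and chosen only for illustration; the paper's actual model and measurements are not reproduced here.

```python
from bisect import bisect_left

# Hypothetical offline profile: (model, PU) -> sorted (input_size, latency_ms) points.
# The values are invented for illustration, not measurements from the paper.
PROFILE = {
    ("reranker", "NPU"): [(1, 12.0), (4, 20.0), (8, 36.0)],
    ("reranker", "CPU"): [(1, 30.0), (4, 110.0), (8, 220.0)],
}

def predict_latency_ms(model, pu, size):
    """Linearly interpolate profiled latency; clamp outside the profiled range."""
    points = PROFILE[(model, pu)]
    sizes = [s for s, _ in points]
    if size <= sizes[0]:
        return points[0][1]
    if size >= sizes[-1]:
        return points[-1][1]
    i = bisect_left(sizes, size)          # first profiled size >= requested size
    (s0, l0), (s1, l1) = points[i - 1], points[i]
    return l0 + (l1 - l0) * (size - s0) / (s1 - s0)

def best_pu(model, size, pus=("CPU", "NPU")):
    """Pick the PU with the lowest predicted latency for this input size."""
    return min(pus, key=lambda pu: predict_latency_ms(model, pu, size))
```

For example, `predict_latency_ms("reranker", "NPU", 6)` interpolates between the profiled points at sizes 4 and 8. A real modeler would also predict bandwidth usage per configuration, which this sketch omits.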

Sub-Stage Partitioner (Orchestration)

Splits logical stages into sub-stages (e.g., token groups or document batches) to optimize for hardware affinity.
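
As a rough illustration of shape-aware partitioning, the sketch below picks a batch size from a per-batch latency profile and then splits a stage's inputs accordingly. The helper names and the profile values are assumptions made for this example; the paper's partitioner may weigh other factors, such as concurrent execution across PUs.

```python
def choose_batch_size(n_items, batch_latency_ms):
    """Pick the profiled batch size that minimizes total sequential time on one PU.

    batch_latency_ms maps batch size -> latency of one batch (hypothetical profile).
    """
    def total_ms(b):
        n_batches = -(-n_items // b)      # ceiling division
        return n_batches * batch_latency_ms[b]
    return min(batch_latency_ms, key=total_ms)

def partition(items, batch_size):
    """Split a logical stage's inputs into sub-stages of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Invented NPU profile: larger batches amortize overhead until they stop paying off.
npu_profile = {1: 10.0, 4: 22.0, 8: 50.0}
docs = list(range(10))
batch = choose_batch_size(len(docs), npu_profile)
sub_stages = partition(docs, batch)
```

With this invented profile, batches of four documents give the lowest total time for ten documents, so the stage splits into three sub-stages.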

Priority Estimator (Orchestration)

Calculates criticality scores based on the observed DAG and probabilistic future expansions.
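
A two-part score of this shape might look like the following sketch. It is a simplification under stated assumptions: the observed part is the longest remaining path through the currently known DAG, and the probabilistic part is the expected latency of stages the task may still spawn. The `expansions` prior, the function name, and all latencies are hypothetical.

```python
def criticality(task, succ, latency_ms, expansions):
    """Observed longest remaining path from `task` plus expected future latency.

    succ:       task -> list of known successor tasks (the observed DAG)
    latency_ms: task -> predicted latency on its chosen PU
    expansions: task -> list of (probability, extra_latency_ms) for stages the
                task may spawn later (a hypothetical statistical prior)
    """
    memo = {}

    def observed(t):
        # Longest path from t to any sink of the observed DAG, including t itself.
        if t not in memo:
            memo[t] = latency_ms[t] + max(
                (observed(s) for s in succ.get(t, [])), default=0.0
            )
        return memo[t]

    expected_future = sum(p * extra for p, extra in expansions.get(task, []))
    return observed(task) + expected_future

# Invented workflow: rewrite -> retrieve -> generate, plus an independent index task.
succ = {"rewrite": ["retrieve"], "retrieve": ["generate"]}
latency_ms = {"rewrite": 40.0, "retrieve": 15.0, "generate": 200.0, "index": 120.0}
# Assume a 50% chance that the rewriter triggers one extra 30 ms retrieval round.
expansions = {"rewrite": [(0.5, 30.0)]}
```

Here the rewriter scores higher than the indexing task, so a scheduler using this score would serve it first.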

Concurrency Controller (Orchestration)

Selects the best PU for critical tasks and limits non-critical parallelism to prevent memory bandwidth saturation.
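
The controller's behavior can be approximated by a greedy admission loop, sketched below. This is an illustrative policy, not the paper's algorithm: the task dictionaries, the single scalar bandwidth budget, and the rule that a task is always admitted when nothing else is using bandwidth are all assumptions of this example.

```python
def admit(ready, free_pus, used_bw, bw_budget):
    """Greedily map ready tasks to free PUs under a shared bandwidth budget.

    ready:     list of {"name", "criticality", "options": {pu: (latency_ms, bw)}}
    free_pus:  PUs that are currently idle
    used_bw:   bandwidth already consumed by running tasks (same unit as bw)
    bw_budget: aggregate bandwidth above which contention is assumed harmful
    """
    free, plan = set(free_pus), []
    # Most critical tasks get first pick of PUs and of the bandwidth budget.
    for task in sorted(ready, key=lambda t: t["criticality"], reverse=True):
        options = sorted(
            (lat, pu, bw) for pu, (lat, bw) in task["options"].items() if pu in free
        )
        for lat, pu, bw in options:       # fastest free PU first
            fits = used_bw + bw <= bw_budget
            if fits or used_bw == 0:      # always make progress on an idle system
                plan.append((task["name"], pu))
                free.remove(pu)
                used_bw += bw
                break
    return plan

# Invented ready queue; latencies in ms, bandwidth in GB/s.
ready = [
    {"name": "index", "criticality": 80.0,
     "options": {"NPU": (50.0, 25.0)}},
    {"name": "generate", "criticality": 270.0,
     "options": {"GPU": (200.0, 35.0), "CPU": (600.0, 20.0)}},
    {"name": "rerank", "criticality": 120.0,
     "options": {"NPU": (30.0, 20.0), "CPU": (110.0, 10.0)}},
]
plan = admit(ready, ["CPU", "GPU", "NPU"], used_bw=0.0, bw_budget=50.0)
```

In this example generation takes the GPU, reranking falls back to the CPU because its NPU option would exceed the budget, and indexing is held back until bandwidth frees up.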

Novel Architectural Elements

Profiling-based interference model that quantifies slowdown as a function of aggregate memory bandwidth usage.
Two-part criticality score combining deterministic observed path length with probabilistic future path estimation.
Bandwidth-aware admission control that explicitly throttles low-priority tasks to protect the critical path on shared-memory SoCs.
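
To make the interference model in the first item concrete, here is a minimal sketch of one possible functional form. It assumes that tasks slow down only once aggregate demand exceeds peak DRAM bandwidth, and that only the memory-bound fraction of a task is stretched; the paper fits its model from profiling data, so this closed form and all parameter names are assumptions.

```python
def slowdown(demands_gbps, peak_gbps):
    """Slowdown factor of memory-bound work given aggregate bandwidth demand."""
    total = sum(demands_gbps)
    # No penalty below saturation; proportional sharing above it (assumed form).
    return max(1.0, total / peak_gbps)

def contended_latency_ms(solo_latency_ms, memory_bound_fraction, demands_gbps, peak_gbps):
    """Stretch only the memory-bound share of a task's solo latency."""
    s = slowdown(demands_gbps, peak_gbps)
    compute_part = solo_latency_ms * (1.0 - memory_bound_fraction)
    memory_part = solo_latency_ms * memory_bound_fraction * s
    return compute_part + memory_part
```

For instance, two tasks demanding 35 and 25 GB/s on a 50 GB/s bus give a slowdown factor of 1.2 under this form, so a 100 ms task that is half memory-bound would take roughly 110 ms.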

Modeling

Base Model: Various LLMs and embedding models. Specific models are implied by the workload rather than listed (e.g., a Llama-style LLM for generation and embedding models for retrieval).

Comparison to Prior Work

vs. llama.cpp / mllm.npu: Optimizes multi-model agentic workflows rather than single-model inference.
vs. HeteroInfer: Adds bandwidth contention modeling and critical-path awareness specifically for the dynamic DAGs of agentic RAG.
vs. Ayo: HeRo handles dynamic inter-stage dependencies and heterogeneous hardware affinity, whereas Ayo focuses on static task decomposition.
vs. HedraRAG: HeRo supports a broader set of heterogeneous PUs (including NPU) and complex agentic workflows beyond simple retrieval-generation.

Limitations

Relies on offline profiling which may not capture all runtime variances or new model types without re-profiling.
Heuristic-based scheduling (greedy) does not guarantee global optimality for the NP-hard scheduling problem.
Evaluation limited to Qualcomm Snapdragon SoCs; generalizability to other mobile chipsets (e.g., MediaTek, Apple Silicon) is not empirically verified.
Assumes statistical priors for future agent behavior which might be inaccurate for highly unpredictable user queries.

Reproducibility

No code is provided. The commercial devices used (Snapdragon 8 Gen 3 / 8 Elite) are readily available, but the profiling tools and scheduler implementation are not linked, and no artifacts such as profiling data or RAG workflow definitions are mentioned as released.

📊 Experiments & Results

Evaluation Setup

On-device inference of agentic RAG workflows on commercial mobile phones.

Benchmarks:

Custom Agentic RAG Workflows (Retrieval-Augmented Generation with multiple agents) [New]

Metrics:

End-to-end Latency (ms)
Speedup (x)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Agentic RAG Workflows	End-to-end latency speedup (x)	1.00x	up to 10.94x	up to +9.94x

Experiment Figures

Latency variation of different RAG stages (e.g., LLM generation vs. Indexing) across different accelerators (CPU, GPU, NPU) and batch sizes.

Impact of memory bandwidth contention on execution latency.

Three sub-figures illustrating: (a) Partitioning trade-offs, (b) Dynamic graph evolution, (c) Concurrency pitfalls.

Main Takeaways

Heterogeneous execution (CPU+GPU+NPU) significantly outperforms single-accelerator baselines for complex RAG workflows.
Bandwidth contention is a major bottleneck; naive parallelism can degrade performance without contention-aware scheduling.
Shape-aware partitioning allows NPUs to be utilized efficiently for batchable tasks (like indexing/reranking), leaving GPUs free for generation.
Critical-path awareness is essential for agentic workflows where dynamic dependencies create shifting bottlenecks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of System-on-Chip (SoC) architectures (shared memory, heterogeneous accelerators)
Basics of RAG (Retrieval-Augmented Generation) workflows
Task graph scheduling concepts (critical path, DAG)

Key Terms

SoC: System-on-Chip—an integrated circuit that integrates all components of a computer or other electronic system (CPU, GPU, NPU, memory) on a single chip.

NPU: Neural Processing Unit—a specialized circuit designed to accelerate machine learning algorithms.

Agentic RAG: A RAG system where autonomous agents dynamically decide steps like rewriting queries or multiple retrieval rounds, creating a complex, evolving execution graph.

DAG: Directed Acyclic Graph—a representation of tasks and dependencies where edges point from earlier tasks to later ones without loops.

Critical Path: The sequence of dependent tasks that determines the minimum possible duration of the entire process.
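
For a static DAG with known task durations, the critical path can be found with one pass in topological order. The sketch below uses Python's standard `graphlib`; the task names and durations are invented, and it ignores PU assignment and contention, which HeRo's scheduler also has to consider.

```python
from graphlib import TopologicalSorter

def critical_path(durations, deps):
    """Return (total_duration, path) of the longest dependency chain.

    durations: task -> duration
    deps:      task -> set of tasks that must finish first
    """
    finish, prev = {}, {}
    for task in TopologicalSorter(deps).static_order():
        preds = deps.get(task, ())
        # The latest-finishing prerequisite determines when this task can start.
        best = max(preds, key=lambda p: finish[p], default=None)
        prev[task] = best
        finish[task] = durations[task] + (finish[best] if best is not None else 0)
    end = max(finish, key=finish.get)     # task that finishes last
    total, path = finish[end], []
    while end is not None:                # walk back along the chain
        path.append(end)
        end = prev[end]
    return total, path[::-1]

# Invented durations in ms: a short query chain and a longer indexing branch.
durations = {"rewrite": 40, "retrieve": 15, "rerank": 30, "index": 120, "generate": 200}
deps = {
    "retrieve": {"rewrite"},
    "rerank": {"retrieve"},
    "generate": {"rerank", "index"},
}
total, path = critical_path(durations, deps)
```

In this example the indexing branch, not the rewrite chain, gates generation, so it is the branch a scheduler should avoid delaying.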

SJF: Shortest Job First—a scheduling policy that selects the waiting process with the smallest execution time to execute next.

PU: Processing Unit—generic term for a computation core like a CPU, GPU, or NPU.

Sub-stage: A fine-grained partition of a logical RAG stage (e.g., processing a batch of documents or generating a group of tokens) to expose parallelism.