CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

📝 Paper Summary

FPGA/ACAP Acceleration, Hardware-Software Co-design, Deep Learning Inference

CHARM improves deep learning inference on Versal ACAP platforms by partitioning hardware resources into multiple diverse accelerators that concurrently handle large and small matrix multiplications, avoiding the inefficiency of monolithic designs.

Core Problem

Deep learning models contain both large and small matrix multiplication layers; running small layers on a monolithic accelerator designed for large layers results in massive efficiency loss due to padding and underutilization.

Why it matters:

Off-chip bandwidth scales more slowly than compute resources, so accelerators need high on-chip data reuse, which favors large tiles; large tiles, however, cripple performance on small layers
Real-world models like BERT have highly variable layer shapes; a single static accelerator configuration cannot optimally execute all of them
Current approaches either waste resources via padding (monolithic) or fail to maximize data reuse for large layers (many small duplicates)
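
The data-reuse argument can be made concrete with a rough arithmetic-intensity estimate. The sketch below assumes 4-byte (fp32) elements and that each operand tile is read once and the output tile written once; `tile_intensity` is a hypothetical helper for illustration, not part of the CHARM toolchain.

```python
def tile_intensity(ti, tk, tj, bytes_per_elem=4):
    """FLOPs per byte of off-chip traffic for one TI x TK x TJ tile.

    Assumption: each input tile is read once and the output tile is
    written once, with no further reuse across tiles.
    """
    flops = 2 * ti * tk * tj  # one multiply and one add per MAC
    traffic = bytes_per_elem * (ti * tk + tk * tj + ti * tj)
    return flops / traffic


small = tile_intensity(32, 32, 32)
large = tile_intensity(1536, 128, 1024)
print(f"32x32x32 tile:      {small:.2f} FLOPs/byte")
print(f"1536x128x1024 tile: {large:.2f} FLOPs/byte")
```

Under these assumptions the large tile performs roughly ten times more work per byte moved, which is why bandwidth-limited designs gravitate toward large tiles.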

Concrete Example: In BERT, small 'batch dot' layers constitute only 8% of operations but consume 88% of execution time on a monolithic accelerator, because the 512x512x64 shapes must be padded to the accelerator's native 1536x128x1024 tile size and therefore achieve <5% of peak performance.
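
The padding cost in this example can be estimated with a short calculation. This is a sketch that assumes the layer shape and the tile size are both given in the same M x K x N order and that each dimension is rounded up to a whole number of tiles; `padding_efficiency` is a hypothetical helper.

```python
def padding_efficiency(shape, tile):
    """Fraction of executed MACs that are useful after padding to whole tiles."""
    useful = 1
    padded = 1
    for size, t in zip(shape, tile):
        useful *= size
        padded *= -(-size // t) * t  # round size up to a multiple of t
    return useful / padded


eff = padding_efficiency((512, 512, 64), (1536, 128, 1024))
print(f"useful work after padding: {eff:.2%}")
```

Under the stated ordering assumption, only about 1/48 of the executed MACs are useful, which bounds utilization from padding alone and is consistent with the <5% figure quoted above.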

Key Novelty

Composing Heterogeneous Accelerators (CHARM)

Instead of one monolithic accelerator, the framework generates a system with 'diverse' accelerators (e.g., one huge, one small) co-existing on the same chip
Analytical modeling (CDAC) automatically partitions AIE cores, PL logic, and bandwidth between these accelerators based on the specific distribution of layer shapes in the target model
A runtime scheduler dynamically dispatches each layer to the accelerator best suited to its shape (large layers to the large accelerator, small layers to the small accelerator) to maximize global throughput

Architecture

System architecture of the CHARM diverse MM accelerators design on Versal ACAP

Evaluation Highlights

5.29x throughput gain (1.46 TFLOPs) for BERT inference on VCK190 compared to a highly optimized monolithic accelerator baseline
32.51x throughput gain (1.61 TFLOPs) for Vision Transformer (ViT) inference, which is dominated by irregular small matrix shapes
94.7% computational efficiency achieved on single AI Engine cores for 32x32x32 matrix multiplication blocks

Breakthrough Assessment

8/10

Significant practical breakthrough for deploying Transformers on ACAP. The gain is largest for irregular workloads (32.51x on ViT). The white-box, open-source release adds high value for the hardware community.

⚙️ Technical Details

Problem Definition

Setting: Maximize end-to-end inference throughput for a neural network whose layers have diverse Matrix Multiply shapes $M \times K \times N$ under fixed hardware resource constraints (AIE cores, PLIO, Bandwidth).

Inputs: Deep Learning Model (workload shapes), Platform Constraints (Versal VCK190 spec), Bandwidth Profile

Outputs: Bitstream containing partitioned heterogeneous accelerators and Host Runtime Binary for scheduling

Pipeline Flow

Workload Partitioning (Group layers by size)
Resource Partitioning (Allocate AIEs/PLIOs to groups)
Accelerator Generation (Generate AIE/PL code)
Runtime Scheduling (Dispatch layers to specific accelerators)
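
The first step, grouping layers by size, can be illustrated with a simple sort-and-split heuristic. This is only a sketch of one plausible approach and is not the algorithm CDAC uses: it sorts layers by MAC count and cuts at the largest relative jump. Apart from 512x512x64, the layer shapes are illustrative assumptions, and `macs` and `split_by_size` are hypothetical names.

```python
def macs(shape):
    """Multiply-accumulate count of an M x K x N matrix multiply."""
    m, k, n = shape
    return m * k * n


def split_by_size(layers):
    """Sort layers by MAC count and split at the largest relative jump.

    Expects at least two layers; returns (small_group, large_group).
    """
    ordered = sorted(layers, key=macs)
    # Ratio between each layer and its next-larger neighbour.
    gaps = [macs(ordered[i + 1]) / macs(ordered[i])
            for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Illustrative shapes only (assumed, not taken from the paper's tables).
layers = [(512, 768, 768), (512, 64, 512), (512, 768, 3072),
          (512, 512, 64), (512, 3072, 768)]
small_group, large_group = split_by_size(layers)
print("small:", small_group)
print("large:", large_group)
```

A real composer would also weigh how much time each group consumes on a candidate accelerator; the split here looks only at operation counts.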

System Modules

MM Accelerator 0 (Large) (Compute)

Execute large, compute-bound matrix multiplication layers (e.g., projection layers)

Model or implementation: Custom AIE Array (e.g., 8x4x8 configuration)

MM Accelerator 1 (Small) (Compute)

Execute small, bandwidth/latency-sensitive layers (e.g., attention heads/batch dots)

Model or implementation: Custom AIE Array (e.g., smaller configuration)

Runtime Scheduler (CRTS)

Dynamically dispatch kernel tasks to the appropriate accelerator based on dependency graph

Model or implementation: Host CPU (ARM) Process
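
A minimal sketch of dependency-aware dispatch is shown below, assuming a fixed MAC-count threshold decides which accelerator a layer goes to. The task names, the threshold, and the `dispatch` function are hypothetical, and the sketch does not model overlap between the two accelerators or CRTS's actual policy.

```python
from collections import deque


def dispatch(tasks, deps, threshold):
    """Return (task, accelerator) pairs in a dependency-respecting order.

    tasks: dict mapping task name -> MAC count
    deps:  dict mapping task name -> list of prerequisite task names
    Accelerator 0 is the large one, accelerator 1 the small one.
    """
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    children = {t: [] for t in tasks}
    for task, prerequisites in deps.items():
        for p in prerequisites:
            children[p].append(task)

    ready = deque(t for t in tasks if remaining[t] == 0)
    schedule = []
    while ready:
        task = ready.popleft()
        acc = 0 if tasks[task] >= threshold else 1
        schedule.append((task, acc))
        # Release any task whose prerequisites have all been dispatched.
        for child in children[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return schedule


tasks = {"proj": 300_000_000, "batch_dot": 16_777_216, "ffn": 1_200_000_000}
deps = {"batch_dot": ["proj"], "ffn": ["batch_dot"]}
schedule = dispatch(tasks, deps, threshold=100_000_000)
print(schedule)
```

Independent tasks that become ready at the same time would land on different accelerators and could run concurrently, which is where the throughput gain of a diverse design comes from.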

Novel Architectural Elements

Heterogeneous Tiled Architecture: Co-location of accelerators with different native tile sizes ($TI \times TK \times TJ$) on the same AIE array
Partitioned PLIO interconnect: Dedicating specific PL-AIE bandwidth channels to specific accelerators to prevent contention

Modeling

Base Model: Custom Hardware Accelerators generated for BERT-base, ViT, NCF, MLP

Comparison to Prior Work

vs. AutoSA: CHARM leverages Versal AIEs (1 GHz) + PL, whereas AutoSA targets PL only (lower frequency and compute)
vs. Monolithic Designs (Eyeriss, etc.): CHARM uses diverse heterogeneous accelerators rather than a one-size-fits-all design
vs. DPU/DNNExplorer: CHARM performs rigorous DSE to specialize *each* diverse accelerator for a subset of model layers, rather than duplicating generic ones

Limitations

Assumes off-chip bandwidth is evenly partitioned among accelerators (could be optimized further)
Currently specialized for dense Matrix Multiplication; non-MM layers handled by separate PL modules
Complexity of finding optimal partitions grows with the number of diverse accelerators (though reduced by sort-based heuristic)

Reproducibility

Code: https://github.com/arc-research-lab/CHARM

Fully open-source, white-box toolchain. Detailed parameters for reproduction are provided in the paper's tables. Uses Vitis 2021.1 and the VCK190 board.

📊 Experiments & Results

Evaluation Setup

End-to-end inference on AMD/Xilinx Versal VCK190 evaluation board

Benchmarks:

BERT (Natural Language Processing)
ViT (Vision Transformer)
NCF (Neural Collaborative Filtering)
MLP (Multilayer Perceptron)

Metrics:

Throughput (GFLOPs and TFLOPs)
Power (Watts)
Energy Efficiency (GFLOPs/W)
Statistical methodology: Not explicitly reported in the paper

Key Results

End-to-end throughput comparisons on VCK190 board showing CHARM's diverse accelerator approach (Two_diverse) outperforms Monolithic baselines significantly on irregular workloads (BERT, ViT).

Benchmark	Metric	Baseline	This Paper	Δ
BERT	GFLOPs	276.8	1464.2	+1187.4
ViT	GFLOPs	49.5	1609.0	+1559.5
NCF	GFLOPs	1736.0	1736.0	0.0
BERT	GFLOPs/W	7.48	35.98	+28.50
Microbenchmark (32x32x32)	Efficiency (%)	45.50	94.70	+49.20
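
The headline speedups follow directly from the throughput rows of the table. The short check below recomputes them from the table values; the variable names are illustrative.

```python
# (baseline GFLOPs, CHARM GFLOPs) taken from the Key Results table.
results = {
    "BERT": (276.8, 1464.2),
    "ViT": (49.5, 1609.0),
    "NCF": (1736.0, 1736.0),
}

speedups = {name: new / base for name, (base, new) in results.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The BERT and ViT ratios round to the 5.29x and 32.51x reported under Evaluation Highlights, while NCF is unchanged because a single monolithic accelerator is already the best choice for that workload.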

Experiment Figures

Execution timeline comparing One Monolithic Acc vs. Two Diverse Accs for BERT

Throughput (GFLOPs) vs. Square Matrix Size for Monolithic vs. 8 Duplicated Accelerators

Main Takeaways

Monolithic accelerators lose most of their throughput on models with mixed layer sizes (49.5 GFLOPs on ViT and 276.8 GFLOPs on BERT, versus 1736.0 GFLOPs on uniform NCF) because small layers incur heavy padding overhead
Diverse accelerators (CHARM) solve this by specializing hardware resources: a small accelerator handles small layers efficiently while a large one handles large layers
For uniform workloads (NCF/MLP), the CHARM framework correctly identifies that a single monolithic accelerator is optimal, demonstrating robustness
The approach yields massive energy efficiency gains (up to 26.6x for ViT) by maximizing utilization and minimizing wasted operations

📚 Prerequisite Knowledge

Prerequisites

Understanding of FPGA/ACAP architecture (Programmable Logic vs. AI Engines)
Matrix Multiplication Tiling and Dataflow
Roofline performance model
Basic neural network layer shapes (BERT, ViT)

Key Terms

ACAP: Adaptive Compute Acceleration Platform—a hybrid architecture combining CPU, FPGA (PL), and vector processors (AIE)

AIE: AI Engine—a VLIW vector processor core on Xilinx Versal chips optimized for compute-intensive math

PL: Programmable Logic—the traditional FPGA fabric used here for data orchestration and non-MM layers

PLIO: Programmable Logic Input/Output—interfaces connecting the AIE array to the PL fabric

Monolithic Accelerator: A single large accelerator design that processes all layers sequentially, often requiring padding for smaller layers

VLIW: Very Long Instruction Word, a processor architecture in which each instruction word bundles several independent operations that issue in the same cycle

Tiling: Splitting large matrices into smaller blocks to fit into on-chip cache/memory for data reuse

CDSE: CHARM Design Space Exploration—module for optimizing single accelerator parameters

CDAC: CHARM Diverse Accelerator Composer—module for partitioning resources among multiple accelerators