
Measuring What Actually Matters: Per-Layer DPU Profiling on Kria KV260 with MobileNet and ResNet-50

Ivan Gubochkin, Iuliia Gorshkova, Pavel Salovskii

Simfero — dAIEDGE Project #101120726 — March 2026

This work was supported by the dAIEDGE Open Call Programme, funded by the European Union’s Horizon Europe research and innovation programme.

Aggregate FPS figures tell you whether a neural network meets a real-time threshold, but they reveal nothing about where the time is actually spent. When optimising an edge AI deployment, the question that matters is not “how fast is the whole model?” but “which layers dominate latency, and are they running at the hardware’s theoretical throughput?”

This article presents a systematic per-layer profiling methodology applied to three ImageNet classifiers — MobileNet V1, MobileNet V2, and ResNet-50 — deployed on the DPUCZDX8G deep learning processor unit of the AMD Xilinx Kria KV260 embedded platform. Using the Vitis AI vaitrace fine-grained profiling tool, we decompose end-to-end inference into 15, 36, and 55 individual DPU subgraphs respectively, reporting per-subgraph mean latency, standard deviation, coefficient of variation (CoV), time share, and effective throughput in GOP/s. We quantify the profiler’s own overhead at approximately 0.17 ms per subgraph boundary and show how to correct for it when interpreting profiling data. The analysis exposes hardware utilisation patterns — from sub-60 GOP/s at the first convolution layer to above 500 GOP/s at mid-stage 3×3 convolutions — that are invisible in aggregate benchmarks.

All measurements were collected on a production Kria KV260 board running Ubuntu 22.04 and the PYNQ DPU framework, providing a reproducible reference for engineers working with FPGA-accelerated CNN inference.

1. Introduction: Why Aggregate FPS Is Not Enough

Edge AI deployment typically begins with a simple question: does the model run fast enough? For a 30 fps camera pipeline, any model delivering inference in under 33 ms qualifies. On the Kria KV260 with the DPUCZDX8G DPU, MobileNet V1 achieves 187 FPS, MobileNet V2 reaches 169 FPS, and even ResNet-50 delivers 62 FPS — all comfortably above the real-time threshold.

But the aggregate number hides critical information. A model achieving 170 FPS might have one layer consuming 10% of total time at only 54 GOP/s while deeper layers run at 500 GOP/s. That bottleneck layer is where architectural changes, quantisation strategy adjustments, or DPU configuration tuning would yield the largest gains. Without per-layer visibility, engineers optimise blind.

This article provides that visibility. We use the AMD Vitis AI vaitrace tool in fine-grained mode to profile every DPU subgraph in three widely-used classifiers, then cross-reference the profiled latencies against production-mode (bypass) benchmarks to quantify the profiler’s own overhead. The result is a practical, reproducible methodology for identifying computational bottlenecks in FPGA-accelerated CNN inference.

2. Hardware and Software Platform

All experiments were conducted on a Kria KV260 Vision AI Starter Kit, which integrates a Zynq UltraScale+ MPSoC with a quad-core Arm Cortex-A53 CPU at 1.5 GHz, 4 GB DDR4 memory, and programmable logic with 256K logic cells and 1.2K DSP slices. The DPU IP core — specifically the DPUCZDX8G in the B4096 configuration — is loaded onto the FPGA fabric via the PYNQ DPU overlay. The B4096 designation indicates a peak throughput of 4,096 INT8 operations per clock cycle, where each multiply-accumulate counts as two operations.

The software stack consists of Ubuntu 22.04, the PYNQ framework for hardware abstraction and DPU management, and the Vitis AI 3.0 toolchain for model compilation and profiling. Three pre-compiled xmodel files from the Vitis AI model zoo were used: MobileNet V1, MobileNet V2, and ResNet-50, all targeting the DPUCZDX8G_ISA1_B4096 DPU variant with INT8 quantisation and 224×224×3 input tensors.

3. Benchmarking and Profiling Methodology

3.1 Synthetic Benchmark Design

To isolate DPU inference performance from camera acquisition and image preprocessing overhead, we built a standalone benchmark script that feeds randomly generated 224×224×3 BGR images (fixed random seed 42) directly into the DPU input buffer. Each benchmark run consists of 10 warmup iterations to bring the DPU and memory subsystem to steady state, followed by 100 timed inference passes. Per-frame latency is recorded, and aggregate statistics — mean, standard deviation, minimum, maximum, P95, and P99 — are computed at the end of each run.
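The measurement loop above can be sketched as follows. This is a minimal, self-contained version that times an opaque `runner_fn` callable; in the actual benchmark that callable wraps the PYNQ DPU runner (submitting the frame to the DPU input buffer and waiting for completion), which is omitted here.

```python
import time
import numpy as np

def run_benchmark(runner_fn, warmup=10, iters=100, seed=42):
    """Time runner_fn(frame) over `iters` passes after `warmup` untimed ones."""
    rng = np.random.default_rng(seed)
    # Fixed-seed random 224x224x3 BGR frame, reused for every pass
    frame = rng.integers(0, 256, size=(1, 224, 224, 3), dtype=np.uint8)
    for _ in range(warmup):              # bring DPU/memory to steady state
        runner_fn(frame)
    lat = np.empty(iters)
    for i in range(iters):
        t0 = time.perf_counter()
        runner_fn(frame)
        lat[i] = (time.perf_counter() - t0) * 1e3  # per-frame latency, ms
    return {
        "mean_ms": lat.mean(), "std_ms": lat.std(ddof=1),
        "min_ms": lat.min(), "max_ms": lat.max(),
        "p95_ms": np.percentile(lat, 95), "p99_ms": np.percentile(lat, 99),
        "fps": 1e3 / lat.mean(),
    }
```

Keeping the frame fixed across passes removes input-dependent variation, so any spread in the latency samples reflects the execution path rather than the data.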

3.2 Two Measurement Modes

Each model is evaluated under two distinct modes. Bypass mode (production) executes inference exactly as a deployed application would, with no instrumentation overhead. The DPU’s internal scheduler processes all subgraphs in its native pipelined fashion. Latencies measured here represent actual deployment performance.

vaitrace fine-grained mode (profiling) intercepts execution at every subgraph boundary, recording per-subgraph timing. In this mode the DPU executes each subgraph independently, introducing scheduling and DMA overhead at every boundary. The per-layer data is invaluable for optimisation, but the measured end-to-end latency no longer matches production.

3.3 Quantifying Profiler Overhead

By running both modes on the same model with identical input, the difference in mean end-to-end latency isolates the profiler’s cumulative overhead. Dividing this overhead by the number of subgraph boundaries gives the per-boundary scheduling cost — a figure that engineers need when interpreting profiling data or when estimating whether merging subgraphs would yield measurable production gains.
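The arithmetic is a one-liner; the worked example below uses the mean latencies reported for MobileNet V1 in Tables 1 and 2.

```python
def per_boundary_overhead_ms(bypass_mean_ms, vaitrace_mean_ms, n_subgraphs):
    """Cumulative profiler overhead divided evenly across subgraph boundaries."""
    return (vaitrace_mean_ms - bypass_mean_ms) / n_subgraphs

# MobileNet V1: (7.894 - 5.343) ms of overhead over 15 subgraphs
print(round(per_boundary_overhead_ms(5.343, 7.894, 15), 2))  # -> 0.17
```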

4. Results

4.1 Production-Mode Performance

Table 1 presents the bypass-mode benchmark results. All three models deliver real-time inference with excellent determinism.

Table 1. DPU inference performance — bypass mode, no profiler overhead (DPUCZDX8G B4096, 100-frame synthetic benchmark, 224×224 input).

Model        | Mean (ms) | FPS   | Std (ms) | Min (ms) | Max (ms) | P95 (ms) | P99 (ms)
-------------|-----------|-------|----------|----------|----------|----------|---------
MobileNet V1 | 5.343     | 187.2 | 0.064    | 5.213    | 5.545    | 5.456    | 5.536
MobileNet V2 | 5.935     | 168.5 | 0.054    | 5.810    | 6.111    | 6.022    | 6.084
ResNet-50    | 16.075    | 62.2  | 0.057    | 15.953   | 16.236   | 16.164   | 16.231

The standard deviations of 0.054–0.064 ms across all three models confirm highly deterministic hardware execution. MobileNet V1 achieves the lowest latency at 5.343 ms; MobileNet V2 adds only 11% latency despite its more complex inverted-residual architecture; ResNet-50 is approximately 3× slower but still delivers 62 FPS.

4.2 Profiling-Mode Performance and Overhead Analysis

Table 2 shows the same benchmark repeated under vaitrace fine-grained profiling. Table 3 isolates the overhead by comparing the two modes.

Table 2. DPU inference performance — vaitrace fine-grained profiling mode. Latencies include per-subgraph profiler overhead.

Model        | Subgraphs | Mean (ms) | FPS   | Std (ms) | Min (ms) | Max (ms) | P95 (ms) | P99 (ms)
-------------|-----------|-----------|-------|----------|----------|----------|----------|---------
MobileNet V1 | 15        | 7.894     | 126.7 | 0.122    | 7.736    | 8.871    | 8.021    | 8.142
MobileNet V2 | 36        | 11.289    | 88.6  | 0.106    | 11.168   | 11.673   | 11.492   | 11.634
ResNet-50    | 55        | 25.333    | 39.5  | 0.271    | 25.106   | 27.668   | 25.550   | 26.090

Table 3. Profiler overhead quantification — difference between vaitrace and bypass mode.

Model        | Subgraphs | Overhead (ms) | ≈ ms / subgraph
-------------|-----------|---------------|----------------
MobileNet V1 | 15        | 2.6           | 0.17
MobileNet V2 | 36        | 5.4           | 0.15
ResNet-50    | 55        | 9.3           | 0.17

The profiler inflates end-to-end latency by 48% (MobileNet V1), 90% (MobileNet V2), and 58% (ResNet-50), and standard deviations increase by 2–5×. The overhead scales linearly with the number of subgraph boundaries at approximately 0.15–0.17 ms per boundary. This is a scheduling and DMA cost absent in production, and it must be accounted for when using profiling data to estimate real-world performance of individual layers.

4.3 Per-Layer Analysis: MobileNet V1

MobileNet V1 decomposes into 15 DPU subgraphs with a total profiled time of 5.651 ms. Table 4 highlights representative layers from the full dataset (110 inference runs).

Table 4. Selected per-subgraph profiling data for MobileNet V1 (DPUCZDX8G B4096).

Layer               | Mean (ms) | Std (ms) | CoV (%) | Time (%) | GOP/s | Notes
--------------------|-----------|----------|---------|----------|-------|------------------------------
Conv2d_0            | 0.557     | 0.033    | 6.0     | 9.9      | 54.1  | Most expensive; 224×224 input
Conv2d_3_pointwise  | 0.368     | 0.067    | 18.2    | 6.5      | 302.4 | Highest CoV — sched. jitter
Conv2d_5_pointwise  | 0.327     | 0.004    | 1.2     | 5.8      | 306.1 | Peak efficiency in V1
Conv2d_12_pointwise | 0.310     | 0.007    | 2.3     | 5.5      | 161.3 | Late-stage standard conv
Logits/Conv2d_1c_1x1 | 0.324    | 0.009    | 2.7     | 5.7      | 0.0   | FC via 1×1; negligible ops

The initial standard 3×3 convolution (Conv2d_0) processes the full 224×224 input and is the single most expensive layer at 9.9% of total time, despite achieving only 54 GOP/s. This low efficiency is characteristic of input layers where the spatial resolution is high but the channel depth is shallow (3 channels), underutilising the DPU’s MAC array which is optimised for deeper channel counts.

Pointwise convolutions in deeper stages — where channel depth grows to 256–512 — achieve efficiencies above 300 GOP/s, a 5.7× improvement over the input layer. This demonstrates how the DPU’s parallelism is only fully exploited when there are sufficient channels to fill the processing element array.

Layer Conv2d_3_pointwise exhibits the highest coefficient of variation (18.2%), which we attribute to occasional scheduling jitter in the profiling debug mode rather than to a hardware bottleneck. The bypass-mode standard deviation of 0.064 ms for the full model confirms that this variability disappears in production.
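The per-subgraph statistics reported in Table 4 (mean, standard deviation, CoV, and time share) can be derived from raw per-run vaitrace samples with a short script. The subgraph names and values below are illustrative placeholders, not the published dataset.

```python
import numpy as np

def subgraph_stats(samples):
    """Per-subgraph mean, std, CoV (%), and share (%) of total profiled time.

    `samples` maps each subgraph name to a sequence of per-run latencies (ms).
    """
    means = {name: float(np.mean(v)) for name, v in samples.items()}
    total = sum(means.values())          # total profiled time per inference
    stats = {}
    for name, v in samples.items():
        v = np.asarray(v, dtype=float)
        m, s = v.mean(), v.std(ddof=1)
        stats[name] = {
            "mean_ms": m,
            "std_ms": s,
            "cov_pct": 100.0 * s / m,        # coefficient of variation
            "share_pct": 100.0 * m / total,  # share of total profiled time
        }
    return stats
```

Computing CoV per subgraph rather than per model is what exposes outliers like Conv2d_3_pointwise, whose jitter is invisible in the aggregate standard deviation.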

4.4 Per-Layer Analysis: MobileNet V2

MobileNet V2’s inverted residual bottleneck architecture results in 36 subgraphs — 2.4× the count of MobileNet V1 — with a total profiled time of 8.884 ms. This higher subgraph count accounts for the larger profiler overhead (5.4 ms vs. 2.6 ms). The initial strided convolution (subgraph_263) and first expansion layer (subgraph_271) dominate at 6.2% and 6.1% of total time respectively. Execution variance is low across all subgraphs (CoV below 10%), indicating stable DPU scheduling even with the higher subgraph count.

The effective throughput profile follows a similar pattern to MobileNet V1: early layers operate at 35–55 GOP/s due to limited channel depth, with mid-to-late layers reaching 65–133 GOP/s. The overall lower GOP/s figures compared to MobileNet V1's pointwise convolutions reflect MobileNet V2's inverted-residual architecture, which fragments the network into more numerous subgraphs with smaller per-subgraph workloads.

4.5 Per-Layer Analysis: ResNet-50

ResNet-50 is the deepest model tested with 55 subgraphs and a total profiled time of 22.660 ms. Table 5 highlights key layers.

Table 5. Selected per-subgraph profiling data for ResNet-50 (DPUCZDX8G B4096).

Layer          | Mean (ms) | Std (ms) | CoV (%) | Time (%) | GOP/s | Notes
---------------|-----------|----------|---------|----------|-------|---------------------------
conv1 (7×7)    | 0.927     | 0.034    | 3.6     | 4.1      | 259.2 | Most expensive; full-res
res3a_branch2b | 0.487     | 0.003    | 0.5     | 2.1      | 472.7 | Peak mid-stage efficiency
res4a_branch2b | 0.462     | 0.002    | 0.5     | 2.0      | 497.6 | Near-peak utilisation
res4c          | 0.531     | 0.077    | 14.5    | 2.3      | 189.9 | High CoV — debug jitter
res5b_branch2b | 0.572     | 0.100    | 17.6    | 2.5      | 406.8 | Highest CoV in ResNet-50
pool5          | 0.156     | 0.003    | 2.1     | 0.7      | 0.0   | Global avg pool; no MACs

The 7×7 convolution at the network input (conv1) is again the single most expensive layer at 4.1% of total time, though it achieves 259 GOP/s — better than MobileNet V1’s input layer because ResNet-50’s conv1 has 64 output channels, partially filling the PE array. The 3×3 convolutions in stages 3–4 achieve the highest efficiencies, consistently reaching 470–500 GOP/s. This represents substantial utilisation of the B4096 DPU’s theoretical peak.

Two layers exhibit elevated CoV: res4c at 14.5% and res5b_branch2b at 17.6%. Both occur in the middle of long residual chains where the profiler’s per-subgraph scheduling introduces occasional delays. As with MobileNet V1, these anomalies are artefacts of the profiling mode and do not affect production execution, where the full model runs with a standard deviation of only 0.057 ms.

5. Practical Implications for Edge AI Engineers

5.1 The Input Layer Is the Universal Bottleneck

Across all three architectures, the very first convolution is the least efficient layer in terms of MAC utilisation. This is a direct consequence of processing high-resolution spatial data (224×224) with shallow channel depth (3 for RGB input). For engineers seeking to improve throughput, strategies like reducing input resolution, increasing stride in the first layer, or using a stem block that quickly expands channel depth would specifically target this bottleneck.

5.2 Subgraph Count as a Proxy for Scheduling Overhead

The 0.17 ms per-boundary cost is a profiling artefact, but subgraph boundaries also exist in production as scheduling points in the DPU’s instruction pipeline. While bypass mode pipelines these boundaries efficiently, the data suggests that models with fewer, larger subgraphs (MobileNet V1 with 15) achieve slightly better per-operation efficiency than architecturally more complex models with many small subgraphs (MobileNet V2 with 36). When choosing between architectures for a latency-sensitive deployment, subgraph count is a useful proxy for scheduling overhead.

5.3 How to Read vaitrace Profiling Data Correctly

Engineers using vaitrace should apply three corrections when interpreting results:

  • Subtract the per-boundary overhead (~0.17 ms × number of subgraph boundaries) from the total profiled time to estimate production-mode latency.
  • Treat high CoV values (>10%) in individual layers as profiler scheduling jitter, not hardware variability — unless they persist in bypass mode.
  • Compare effective throughput (GOP/s) across layers relative to each other, not as absolute utilisation figures, since the profiler’s overhead distorts individual layer timings.
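The first correction can be sketched as follows. The 0.17 ms per-boundary figure comes from Table 3; the ResNet-50 values in the example come from Table 2, and the result lands close to the measured bypass-mode mean of 16.075 ms.

```python
PER_BOUNDARY_MS = 0.17  # per-boundary profiler overhead (Table 3)

def estimate_production_ms(profiled_mean_ms, n_subgraphs,
                           per_boundary_ms=PER_BOUNDARY_MS):
    """Estimate bypass-mode latency from a vaitrace end-to-end measurement."""
    return profiled_mean_ms - per_boundary_ms * n_subgraphs

# ResNet-50: 25.333 ms profiled, 55 subgraphs -> ~15.983 ms estimated
print(round(estimate_production_ms(25.333, 55), 3))
```

The residual error (15.983 ms estimated vs. 16.075 ms measured) reflects the approximation of a uniform per-boundary cost; the correction is meant for first-order estimates, not exact reconstruction.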

5.4 What We Would Do Differently

With hindsight, two methodological improvements would strengthen the analysis. First, running more than 100 timed iterations (ideally 500–1,000) would reduce confidence intervals on per-layer statistics and clarify whether high-CoV layers are genuinely variable or simply under-sampled. Second, capturing DPU hardware counters (if exposed by the overlay) alongside timing data would allow correlating throughput drops with specific resource constraints — such as BRAM bandwidth saturation or DDR access contention — rather than relying on inference from GOP/s figures alone.

6. Conclusion

Per-layer profiling transforms edge AI optimisation from guesswork into engineering. On the Kria KV260 with the DPUCZDX8G B4096 DPU, this study demonstrated that:

  • All three tested models (MobileNet V1, V2, ResNet-50) exceed real-time 30 fps requirements with highly deterministic latency (standard deviation < 0.065 ms in production mode).
  • The vaitrace fine-grained profiler adds approximately 0.17 ms per subgraph boundary, scaling linearly with model depth (2.6 ms for 15 subgraphs, 9.3 ms for 55 subgraphs).
  • Input convolution layers operate at 54–259 GOP/s while mid-stage 3×3 convolutions reach 470–500 GOP/s — a 5–9× efficiency gap that defines the optimisation frontier for these architectures.
  • High CoV values observed in profiling mode (up to 18%) are profiler artefacts that vanish in production, confirming that the DPU hardware delivers consistent performance.

The complete per-subgraph datasets for all three models — 15 layers for MobileNet V1, 36 for MobileNet V2, and 55 for ResNet-50 — are available in the full technical report (dAIEDGE Project #101120726, Technical Report #1). We encourage engineers working with FPGA-accelerated inference to adopt dual-mode benchmarking (bypass + vaitrace) as a standard practice: aggregate metrics validate deployment readiness, while per-layer profiles guide architectural optimisation.
