
What One Inference Costs in Watts: A Practical Power Measurement Guide for FPGA Edge AI

Ivan Gubochkin, Iuliia Gorshkova, Pavel Salovskii

Simfero — dAIEDGE Project #101120726 — March 2026

This work was supported by the dAIEDGE Open Call Programme, funded by the European Union’s Horizon Europe research and innovation programme.

Every edge AI datasheet quotes inference speed. Almost none quote the power cost of that inference. Yet for battery-powered drones, robots, and field-deployed sensors, the question is not “how many frames per second?” but “how many inferences per joule?”

This article presents a complete, reproducible power measurement methodology for FPGA-accelerated edge AI systems, demonstrated on the AMD Xilinx Kria KV260 platform running MobileNet V1, MobileNet V2, and ResNet-50 on the DPUCZDX8G deep learning processor unit. We describe two complementary measurement levels: SOM-internal monitoring via on-board voltage/current sensors (4.3–4.8 W during active DPU inference), and external hardware measurement at the 12 V DC supply rail using a shunt resistor, digital multimeter, and oscilloscope (9.15 W idle, 10.13 W under DPU load — a 0.98 W increment attributable to the DPU compute array). We provide the full bill of materials (under $100 in additional equipment), the circuit schematic, the oscilloscope capture methodology showing four distinct operating phases, and the derived energy-efficiency metric of inferences per joule. The methodology is directly transferable to any embedded board with a DC supply rail.

1. Introduction: The Missing Metric in Edge AI

When evaluating an edge AI platform, engineers typically focus on two numbers: inference latency and throughput (FPS). These are necessary but insufficient. A model that runs at 187 FPS is useless in a battery-powered application if it drains the battery in 20 minutes. Conversely, a “slow” 62 FPS model might be perfectly viable if its power draw fits within a 10 W thermal envelope.

The problem is that power data is rarely measured at the right level. SOM-internal sensors report the power consumed by the processing system and programmable logic, but they miss the voltage regulator losses, board peripherals, and cooling overhead that determine actual battery life. External supply-rail measurements capture everything but lack the granularity to attribute power to specific subsystems.

The answer is to measure at both levels and compare. This article describes how to do exactly that, using inexpensive off-the-shelf instruments, on the Kria KV260 platform. The methodology is general: any embedded board with a DC power input and accessible supply rails can be characterised the same way.

2. Equipment and Bill of Materials

The external measurement setup requires minimal additional hardware. Table 1 lists the complete bill of materials. With the lower-priced options in each range, the total cost of additional equipment beyond what ships with the Kria kit is under $100.

Table 1. Bill of materials for external power measurement.

Component | Model / Spec | Role | Approx. Cost
Shunt resistor | 0.1 Ω, 5 W, wirewound | Current→voltage conversion | $1–2
Digital multimeter | AICEVOOS AS-98D (or equiv.) | Absolute DC current reading | $15–40
Oscilloscope | FNIRSI DPOX180H (or equiv.) | Current transient capture | $70–150
Breadboard + wires | Standard, 22 AWG clip leads | Shunt mounting & connections | $5–10
AC/DC power supply | 12 V, 3 A (Kria stock PSU) | Board supply rail | Included

The key component is the shunt resistor: a 0.1 Ω, 5 W wirewound resistor inserted in series with the +12 V supply rail. At the measured current range of 0.75–0.83 A, the voltage drop across the shunt is 75–83 mV — well within the measurement range of both the multimeter and the oscilloscope, while introducing only ~70 mW of measurement overhead (less than 0.7% of total board power).
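The shunt arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `shunt_figures` is our own, and it assumes the 12.2 V rail voltage reported later in the article holds at both currents.

```python
def shunt_figures(current_a, shunt_ohm=0.1, supply_v=12.2):
    """Return (drop in mV, shunt dissipation in mW, overhead in % of rail power)."""
    drop_v = current_a * shunt_ohm        # Ohm's law: V = I * R
    loss_w = current_a ** 2 * shunt_ohm   # power dissipated in the shunt: P = I^2 * R
    rail_w = current_a * supply_v         # total power drawn from the supply rail
    return drop_v * 1e3, loss_w * 1e3, 100.0 * loss_w / rail_w

# Evaluate at the idle and peak currents quoted in the text.
for amps in (0.75, 0.83):
    drop_mv, loss_mw, overhead_pct = shunt_figures(amps)
    print(f"{amps:.2f} A: {drop_mv:.0f} mV drop, "
          f"{loss_mw:.1f} mW in shunt, {overhead_pct:.2f}% overhead")
```

The same function can be reused when trying a different shunt value: pass `shunt_ohm` and compare the resulting drop against what your multimeter and oscilloscope can resolve.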

3. Measurement Architecture

3.1 Level 1: SOM-Internal Monitoring

The Kria K26 SOM includes on-board INA current/voltage sensors that are accessible via sysfs or programmatically through the PYNQ framework. These sensors report real-time power consumption of the SOM module itself, covering the processing system (PS), programmable logic (PL), and DDR memory interface. The readings are displayed in our camera demonstration application as a “System Info” overlay, showing power in milliwatts, current, voltage, temperatures across three thermal zones (LPD, FPD, PL), per-core CPU utilisation, and RAM usage.
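As an illustration of the sysfs route, the sketch below walks the standard Linux hwmon tree and reads every `power1_input` file it finds (the hwmon ABI reports power in microwatts). Which hwmon node corresponds to the SOM sensor depends on the kernel image and is not stated here, so the code simply lists all candidates; the function name `read_hwmon_power_mw` is our own.

```python
from pathlib import Path

def read_hwmon_power_mw(root="/sys/class/hwmon"):
    """Return {"hwmonN/name": power in mW} for nodes that expose power1_input."""
    readings = {}
    for node in sorted(Path(root).glob("hwmon*")):
        try:
            microwatts = int((node / "power1_input").read_text().strip())
            name = (node / "name").read_text().strip()
        except (OSError, ValueError):
            continue  # no power channel, unreadable node, or unexpected content
        readings[f"{node.name}/{name}"] = microwatts / 1000.0  # microwatts -> milliwatts
    return readings

# On a machine without power sensors this loop prints nothing.
for label, milliwatts in read_hwmon_power_mw().items():
    print(f"{label}: {milliwatts:.0f} mW")
```

Polling this function in a loop while a workload runs gives a simple SOM-level power log that can later be lined up against the external measurement.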

This level of monitoring is valuable for understanding the power distribution within the SOM, but it systematically understates total system power because it excludes board-level voltage regulators, the Ethernet PHY, USB interfaces, the heatsink fan (if present), and any switching losses in the 12 V to SOM voltage conversion chain.

3.2 Level 2: External Supply-Rail Measurement

To capture total system power including all board-level losses, we insert a precision shunt resistor in series with the positive 12 V supply rail. Two instruments operate simultaneously:

  • The AICEVOOS AS-98D digital multimeter, configured in DC current measurement mode, provides absolute steady-state current readings with milliamp resolution.
  • The FNIRSI DPOX180H digital oscilloscope, connected differentially across the shunt resistor (CH1, 1:1 probe, 20 mV/div), captures current transients at an effective resolution of 200 mA/div. This reveals dynamic behaviour — such as initialisation surges, inference-loop ripple, and post-benchmark settling — that the multimeter’s averaging filter masks.

The circuit is straightforward: the shunt resistor sits between the AC/DC power supply’s +12 V output and the Kria board’s power input. The multimeter is in series (DC current mode). The oscilloscope probes connect across the shunt (not to ground). The ground reference for the oscilloscope is the power supply’s negative terminal.
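Converting what the oscilloscope shows into current and power is a single division by the shunt resistance. The sketch below assumes the samples have already been exported as volts and that the rail stays at a constant 12.2 V, which is a simplification; the function names are illustrative.

```python
SHUNT_OHM = 0.1  # series shunt value from the bill of materials

def volts_per_div_to_amps_per_div(volts_per_div, shunt_ohm=SHUNT_OHM):
    """Vertical scale of the shunt channel expressed as current per division."""
    return volts_per_div / shunt_ohm

def shunt_samples_to_power(samples_v, supply_v=12.2, shunt_ohm=SHUNT_OHM):
    """Convert shunt-voltage samples (volts) into (current_A, power_W) pairs."""
    pairs = []
    for v in samples_v:
        current_a = v / shunt_ohm                       # I = V_shunt / R_shunt
        pairs.append((current_a, supply_v * current_a))  # P = V_supply * I
    return pairs

print(f"{volts_per_div_to_amps_per_div(0.020):.3f} A/div at 20 mV/div")
for amps, watts in shunt_samples_to_power([0.075, 0.083]):
    print(f"{amps:.2f} A -> {watts:.2f} W")
```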

3.3 Why Both Levels Matter

Neither measurement alone tells the full story. The SOM sensors report 4.3–4.8 W during DPU inference. The external shunt reads 10.13 W. The difference — roughly 5.3–5.8 W — represents the combined cost of board-level voltage regulators, the Ethernet PHY, DDR memory power not captured by SOM sensors, and other peripheral circuitry. An engineer designing a custom carrier board could potentially recover a significant fraction of this overhead by eliminating unused peripherals and optimising the power delivery network.

4. Measurement Procedure

4.1 Test Conditions

All measurements were taken with the following configuration: the Kria KV260 board powered from its stock 12 V AC/DC adapter, Ethernet connected (link up, minimal traffic), no camera attached during benchmark runs, and the DPU overlay loaded with the DPUCZDX8G B4096 configuration. The benchmark script executes 10 warmup iterations followed by 100 timed DPU inference passes using synthetic 224×224×3 input data. Room temperature was approximately 23°C.
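The warmup-then-timed structure of the benchmark can be sketched independently of the DPU runtime. In the example below, `run_inference` is a placeholder callable and `dummy_inference` is a stand-in workload, not the actual benchmark script; the one-byte-per-value input buffer is likewise an assumption about the data type.

```python
import time

def benchmark(run_inference, warmup=10, iterations=100):
    """Time `iterations` calls of run_inference() after `warmup` untimed calls."""
    for _ in range(warmup):
        run_inference()                  # untimed passes, excluded from the result
    start = time.perf_counter()
    for _ in range(iterations):
        run_inference()
    elapsed_s = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed_s / iterations   # mean time per pass
    fps = iterations / elapsed_s                   # back-to-back throughput
    return latency_ms, fps

# Synthetic 224x224x3 input buffer (assumed: one byte per channel value).
synthetic_input = bytes(224 * 224 * 3)

def dummy_inference():
    # Placeholder workload; on the board this would invoke the DPU instead.
    return sum(synthetic_input[:1024])

latency_ms, fps = benchmark(dummy_inference)
print(f"mean latency {latency_ms:.4f} ms, {fps:.0f} FPS")
```

A fixed iteration count keeps the benchmarking phase the same length on every run, which makes the corresponding plateau easy to find in the oscilloscope trace.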

4.2 Idle Baseline

Before launching any workload, we record the idle baseline: the board fully booted with Ubuntu 22.04 running, DPU overlay not yet loaded, no user applications executing. This gives the quiescent power draw of the complete system including the CPU at idle, DDR refresh, Ethernet PHY, and all voltage regulators. Our measured idle baseline is 0.75 A at 12.2 V, yielding 9.15 W.

4.3 Active Workload Measurement

With the oscilloscope running in continuous capture mode and the multimeter recording, we launch the DPU benchmark. The oscilloscope trace reveals four distinct operating phases, summarised in Table 2.

Table 2. Operating phases visible in the oscilloscope trace during a complete benchmark run.

Phase | Description | Shunt Voltage | Current (A) | Power (W)
1. Idle CPU | System at rest before benchmark | ~75 mV | ~0.75 | ~9.15
2. DPU init | Bitstream load + weight DMA | Transient spikes | Variable | Variable
3. Benchmarking | 100 inference iterations | ~83 mV peak | ~0.83 | ~10.13
4. Results saving | vaitrace CSV serialisation | Declining to idle | → 0.75 | → 9.15

Phase 1 (idle CPU) provides the baseline reference. Phase 2 (DPU initialisation) shows a transient current surge as the PYNQ framework loads the DPU bitstream onto the FPGA fabric and DMA-transfers quantised model weights into on-chip buffers; this phase exhibits elevated current with high-frequency noise bursts reflecting intensive DDR and AXI bus activity. Phase 3 (benchmarking) is the measurement target: a sustained, slightly elevated current plateau with a characteristic repetitive ripple pattern driven by the DPU MAC array cycling through subgraph execution. Phase 4 (results saving) shows a brief residual elevation as the CPU serialises profiling output before returning to idle.

The peak current reading from the multimeter (0.83 A) is cross-validated against the oscilloscope’s peak shunt voltage (~83 mV across 0.1 Ω = 0.83 A). Agreement between the two instruments confirms measurement consistency.
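The cross-validation step amounts to converting the oscilloscope reading to amperes and comparing it with the multimeter. A minimal sketch follows; the 5% agreement tolerance is our own assumption, not a figure from the measurement campaign.

```python
def cross_check(multimeter_a, scope_peak_mv, shunt_ohm=0.1, tolerance=0.05):
    """Compare the multimeter current with the current implied by the scope.

    Returns (scope current in A, relative difference, True if within tolerance).
    """
    scope_a = (scope_peak_mv / 1000.0) / shunt_ohm       # I = V_shunt / R_shunt
    rel_diff = abs(scope_a - multimeter_a) / multimeter_a
    return scope_a, rel_diff, rel_diff <= tolerance

scope_a, rel_diff, ok = cross_check(multimeter_a=0.83, scope_peak_mv=83.0)
print(f"scope: {scope_a:.2f} A, difference {100 * rel_diff:.1f}%, consistent: {ok}")
```

If the two instruments disagree by more than the chosen tolerance, the usual suspects are probe attenuation settings, lead resistance in series with the shunt, or a multimeter range change.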

5. Results

5.1 SOM-Internal Power Profile

Table 3 presents the SOM-internal sensor readings captured during active camera streaming and DPU inference (MobileNet V2 running at 30 fps with live camera feed).

Table 3. SOM-internal sensor readings during active DPU inference with camera streaming.

Parameter | Value | Unit
SOM Total Power | 4,300 – 4,800 | mW
SOM Total Current | ~860 | mA
SOM Operating Voltage | ~5,056 | mV
LPD Temperature | ~30 | °C
FPD Temperature | ~31 | °C
PL Temperature | ~29 | °C
CPU Utilisation (4 cores) | 7 / 18 / 3 / 10 | %
RAM Usage | 956 / 3,911 (24.5%) | MB

The SOM draws 4.3–4.8 W total, with thermal readings of 29–31°C across all three zones, confirming that the passive heatsink on the KV260 provides adequate cooling for this workload. CPU utilisation is asymmetric across cores because the Python application, OpenCV preprocessing, and PYNQ framework do not fully parallelise across all four Cortex-A53 cores.

5.2 External Supply-Rail Measurements

Table 4 summarises the system-level power at the 12 V DC supply rail.

Table 4. System-level power measurements at the 12 V DC supply rail.

Condition | V_supply (V) | I_meas (A) | P = V × I (W) | ΔP (W)
Idle (CPU only) | 12.2 | 0.75 | 9.15 | -
DPU benchmark (peak) | 12.2 | 0.83 | 10.13 | +0.98

The idle board draws 9.15 W, encompassing the Kria board’s switching regulators, Arm CPU cores at idle, DDR memory refresh, Ethernet PHY, and all supporting circuitry. Under DPU load, the current rises by 0.08 A to a peak of 0.83 A (10.13 W). The 0.98 W increment is attributable to the DPU compute array, increased DMA activity, and higher DDR bandwidth demand during inference.

5.3 Power Budget Reconciliation

Table 5 reconciles the two measurement levels.

Table 5. Power budget reconciliation across measurement levels.

What is measured | Method | Value (W) | Includes
SOM internal | On-board INA sensors | 4.3 – 4.8 | PS + PL + DDR (SOM only)
System at 12 V rail (idle) | External shunt | 9.15 | SOM + regulators + peripherals
System at 12 V rail (DPU) | External shunt | 10.13 | Everything above + DPU compute
DPU increment | Rail_DPU − Rail_idle | 0.98 | DPU array + extra DDR bandwidth

The 5.3–5.8 W gap between SOM-internal readings and external rail measurements represents board-level overhead: switching-regulator conversion losses in the 12 V → SOM voltage chain, Ethernet PHY power, USB hub, and other peripheral circuitry. This gap is important for system designers: it quantifies the overhead of the stock KV260 carrier board and sets the reference against which any custom carrier design for the K26 SOM should be budgeted.
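The reconciliation is plain arithmetic on the measured values, reproduced here as a short sketch so the figures can be recomputed when any input changes. The variable names are ours; the numbers are the ones reported in this section.

```python
V_SUPPLY = 12.2                 # volts at the DC supply rail
I_IDLE, I_LOAD = 0.75, 0.83     # amperes: idle baseline and DPU benchmark peak
SOM_W = (4.3, 4.8)              # SOM-internal sensor range in watts

p_idle = V_SUPPLY * I_IDLE              # system power at idle
p_load = V_SUPPLY * I_LOAD              # system power under DPU load
dpu_increment = p_load - p_idle         # power attributable to the DPU workload
gap_low = p_load - SOM_W[1]             # board overhead against the highest SOM reading
gap_high = p_load - SOM_W[0]            # board overhead against the lowest SOM reading

print(f"idle {p_idle:.2f} W, load {p_load:.2f} W, increment {dpu_increment:.2f} W")
print(f"board-level overhead between {gap_low:.1f} W and {gap_high:.1f} W")
```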

5.4 Energy Efficiency: Inferences per Joule

The most operationally useful metric for battery-powered deployments is not FPS or watts in isolation, but their ratio: how many inferences can you perform per unit of energy? Table 6 computes this using the system-level power (10.13 W under DPU load) and the production-mode bypass latencies from our companion profiling study.

Table 6. Energy efficiency metrics computed from system-level power and bypass-mode DPU latency.

Model | Latency (ms) | FPS | System Power (W) | Inferences / Joule | mJ / Inference
MobileNet V1 | 5.343 | 187.2 | 10.13 | 18.5 | 54.1
MobileNet V2 | 5.935 | 168.5 | 10.13 | 16.6 | 60.1
ResNet-50 | 16.075 | 62.2 | 10.13 | 6.1 | 162.8

MobileNet V1 delivers 18.5 inferences per joule at 54.1 mJ per inference, making it the most energy-efficient option. MobileNet V2 is close behind at 16.6 inferences per joule. ResNet-50 is roughly 3× slower and, at 6.1 inferences per joule, roughly 3× less efficient: the DPU’s power increment (0.98 W) is modest regardless of model complexity and the system’s idle power (9.15 W) dominates the total, so energy per inference scales almost directly with latency.
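The efficiency metrics follow directly from latency and system power. The sketch below assumes back-to-back single-stream inference (so FPS is simply the reciprocal of latency) and the same 10.13 W system power for every model, as in the table; the function name `efficiency` is illustrative.

```python
SYSTEM_POWER_W = 10.13   # system-level power under DPU load
LATENCY_MS = {"MobileNet V1": 5.343, "MobileNet V2": 5.935, "ResNet-50": 16.075}

def efficiency(latency_ms, power_w=SYSTEM_POWER_W):
    """Return (FPS, inferences per joule, millijoules per inference)."""
    fps = 1000.0 / latency_ms                 # one inference every latency_ms
    inferences_per_joule = fps / power_w      # (inferences/s) / (J/s)
    mj_per_inference = power_w * latency_ms   # W * ms = mJ
    return fps, inferences_per_joule, mj_per_inference

for model, latency in LATENCY_MS.items():
    fps, ipj, mj = efficiency(latency)
    print(f"{model}: {fps:.1f} FPS, {ipj:.1f} inf/J, {mj:.1f} mJ/inf")
```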

This last point is critical for system design: because idle power is about 90% of total system power (9.15 W of 10.13 W), the most effective way to improve energy efficiency is not to optimise the DPU workload but to reduce board-level quiescent consumption, for example by disabling unused peripherals, power-gating idle subsystems, or designing a leaner custom carrier board.
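A quick what-if comparison makes the point concrete. The two scenarios below are hypothetical, chosen only for illustration: halving the DPU increment versus trimming 2 W of board-level idle power, with throughput assumed unchanged in both cases.

```python
P_IDLE_W, P_DPU_W = 9.15, 0.98   # measured idle baseline and DPU increment
FPS = 187.2                      # MobileNet V1 throughput

def inferences_per_joule(idle_w, dpu_w, fps=FPS):
    """Efficiency for a given split between idle power and DPU increment."""
    return fps / (idle_w + dpu_w)

idle_share = P_IDLE_W / (P_IDLE_W + P_DPU_W)
baseline = inferences_per_joule(P_IDLE_W, P_DPU_W)
halved_dpu = inferences_per_joule(P_IDLE_W, P_DPU_W / 2)       # hypothetical scenario
leaner_board = inferences_per_joule(P_IDLE_W - 2.0, P_DPU_W)   # hypothetical 2 W saving

print(f"idle share of total power: {100 * idle_share:.1f}%")
print(f"baseline {baseline:.1f}, halved DPU increment {halved_dpu:.1f}, "
      f"leaner board {leaner_board:.1f} inferences per joule")
```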

6. Applying This Methodology to Your Board

The measurement approach described here is not specific to the Kria KV260. Any embedded system with a DC power input can be characterised the same way. Here is the step-by-step procedure:

  1. Identify the main DC supply rail and its voltage. For the Kria KV260 this is +12 V; for Raspberry Pi it would be +5 V; for Jetson Nano, +5 V at the barrel jack or the USB power input (micro-USB or USB-C, depending on the kit revision).
  2. Select a shunt resistor value that produces a measurable voltage drop at your expected current without significantly affecting the supply. A good rule of thumb: keep the shunt voltage drop to at most a few percent of the supply voltage, and lower if your instruments can still resolve it. For a 12 V supply at ~0.8 A, 0.1 Ω gives 80 mV (0.7%).
  3. Insert the shunt in series with the positive rail. Wire a multimeter in DC current mode in series as well, as an independent cross-check of the current. Connect an oscilloscope across the shunt for transient capture.
  4. Record the idle baseline with the system fully booted but no AI workload running.
  5. Run your AI workload (preferably a synthetic benchmark with fixed iteration count) and record peak current from the multimeter and the oscilloscope waveform.
  6. Compute: ΔP = V_supply × (I_load − I_idle). This isolates the power attributable to the AI accelerator.
  7. Derive inferences per joule: FPS ÷ P_total. Derive mJ per inference: (P_total ÷ FPS) × 1000.
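The steps above can be sketched as two small helpers: one for sizing the shunt (step 2) and one for turning the recorded readings into the derived metrics (steps 6 and 7). The names `size_shunt`, `characterise`, and `PowerReport` are our own, and the 5 V, 2 A board in the usage lines is a hypothetical example.

```python
from dataclasses import dataclass

@dataclass
class PowerReport:
    p_idle_w: float
    p_load_w: float
    delta_p_w: float
    inferences_per_joule: float
    mj_per_inference: float

def size_shunt(supply_v, expected_a, max_drop_fraction):
    """Largest shunt resistance (ohms) that keeps the drop under the given fraction."""
    return max_drop_fraction * supply_v / expected_a

def characterise(supply_v, i_idle_a, i_load_a, fps):
    """Apply steps 6 and 7 to a set of readings taken at the supply rail."""
    p_idle = supply_v * i_idle_a
    p_load = supply_v * i_load_a
    delta_p = supply_v * (i_load_a - i_idle_a)     # step 6
    return PowerReport(p_idle, p_load, delta_p,
                       fps / p_load,               # step 7: inferences per joule
                       1000.0 * p_load / fps)      # step 7: mJ per inference

# Hypothetical 5 V board drawing about 2 A, with a 1% drop budget.
print(f"max shunt: {size_shunt(5.0, 2.0, 0.01):.3f} ohm")
# The Kria KV260 readings from this article, with MobileNet V1 throughput.
print(characterise(12.2, 0.75, 0.83, 187.2))
```

In practice the computed resistance is an upper bound; pick the nearest standard value below it and check that the resistor's power rating exceeds the dissipation at peak current.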

If your board has on-board power sensors (many SoMs do), read those simultaneously. The gap between on-board and external measurements quantifies your board-level overhead — actionable data for carrier board redesign.

7. Limitations and Caveats

  • The shunt resistor measurement captures average and peak power but not instantaneous sub-microsecond transients. For high-resolution power profiling, a dedicated power analyser (e.g., Keysight N6705C) or high-bandwidth current probe would be required.
  • SOM-internal sensors have limited update rates (typically tens of milliseconds) and cannot capture per-layer power variation within a single inference pass.
  • The 10.13 W system-level figure excludes the camera (Intel RealSense D435 was disconnected during benchmark runs). With the camera attached and streaming, total system power would increase by the camera’s own consumption (~1.5–2 W for the D435 over USB 3.0).
  • Power measurements were taken at a single ambient temperature (~23°C). In deployed systems, elevated temperatures increase leakage current and can raise both idle and active power.
  • We measured only classification workloads. Object detection and segmentation models with larger feature maps may exhibit different DDR bandwidth patterns and correspondingly different power profiles.

8. Conclusion

Power measurement for edge AI does not require expensive lab equipment. With a $2 shunt resistor, a budget multimeter, and a portable oscilloscope, engineers can build a complete power characterisation test stand that reveals both steady-state consumption and dynamic transient behaviour.

On the Kria KV260, this methodology revealed that the DPU compute array adds only 0.98 W to a 9.15 W idle baseline — meaning the accelerator itself is remarkably power-efficient, while the board-level overhead dominates total consumption. The SOM-internal sensors report 4.3–4.8 W, leaving a 5+ W gap attributable to voltage regulators and peripherals. For battery-powered applications, this gap — not the DPU itself — is the primary target for power optimisation.

The derived metric of inferences per joule (18.5 for MobileNet V1, 6.1 for ResNet-50 at system level) provides a directly actionable figure for battery life estimation. For a 50 Wh battery pack, MobileNet V1 at 10.13 W system power would sustain continuous inference for approximately 4.9 hours, or roughly 3.3 million inferences (50 Wh = 180,000 J at 18.5 inferences per joule).
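A battery-life estimate of this kind is easy to script. The sketch below is idealised: it assumes the full rated capacity is usable, ignores any conversion losses between the battery and the 12 V rail, and uses a function name of our own choosing.

```python
def battery_estimate(battery_wh, system_power_w, fps):
    """Return (runtime in hours, total inferences) for continuous inference."""
    runtime_h = battery_wh / system_power_w          # Wh / W = hours
    total_inferences = fps * runtime_h * 3600.0      # inferences/s * seconds
    return runtime_h, total_inferences

hours, total = battery_estimate(battery_wh=50.0, system_power_w=10.13, fps=187.2)
print(f"runtime {hours:.1f} h, about {total / 1e6:.1f} million inferences")
```

Real packs deliver less than their rated energy, so the result is best read as an upper bound.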

We encourage the edge AI community to adopt dual-level power measurement (on-board sensors + external supply rail) as a standard reporting practice alongside latency and throughput. Only with all three metrics (latency, throughput, and energy cost) can engineers make informed deployment decisions.

References

[1] AMD. Kria SOMs. https://www.amd.com/en/products/system-on-modules/kria.html

[2] AMD. Vision AI DPU-PYNQ. https://www.amd.com/en/developer/resources/kria-apps/vision-ai-dpu-pynq.html

[3] AMD. Zynq UltraScale+ MPSoC Data Sheet: Overview (DS891).

[4] AMD Xilinx. Deep-Learning Processor Unit — Vitis AI 3.0 Documentation.

[5] AMD Xilinx. PYNQ: Python Productivity for AMD Adaptive Computing Platforms. http://www.pynq.io/

[6] Intel Corporation. Intel RealSense SDK 2.0. https://github.com/IntelRealSense/librealsense

[7] Gubochkin, I., Gorshkova, I., Salovskii, P. “Measuring What Actually Matters: Per-Layer DPU Profiling on Kria KV260 with MobileNet and ResNet-50.” dAIEDGE Technical Article #1, March 2026.
