Help understanding how to deploy and validate the model and how to follow one of the paper's implementations! #1445

norahatem · 2026-02-25T18:29:52Z

Hello everyone,
First of all, thank you so much to everyone who has contributed to this amazing project. I have learnt a lot from it. Thanks for making it available and providing docs and tutorials!

I have some questions and points of confusion, and I was hoping someone could help point me in the right direction. I will split them into two parts:

1. I need help understanding the different paths available after building the project and exporting the IP.
2. I need help understanding how the model in the paper "Fast convolutional neural networks on FPGAs with hls4ml" was actually synthesised for a PYNQ-Z2 FPGA with the reported resource utilisation.

Please feel free to just reply to a tiny bit of this if you know the answer/reason and can help me understand!

Some context:
First, I have been able to produce an IP for the model I have. I will attach the model plot below. I am using the Nexys A7 board, part: xc7a100t-csg324-1

I trained a classification model using QKeras, with the kernels quantised to 6 bits and no bias. Activations (ReLU) are also limited to 6 bits. The quantised model has no batchnorm layers and no softmax output layer.

In the configuration, I set max_precision to 'fixed<16,5,TRN, WRAP,0>', the default reuse factor to 50 and the strategy to Resource. I then override the reuse factors of the conv and dense layers, since I get warnings about invalid reuse factors for these layers. I use (78,48,48,40,48,32,32), in layer order. The choice of reuse factors is really arbitrary, but the network fits!
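For anyone hitting the same invalid reuse factor warnings: below is a minimal sketch of how one might list acceptable reuse factors for a layer and pick the one closest to a target. The validity rule is an assumption, modelled on a recollection of the check hls4ml applies with the Resource strategy, and may differ between versions. All function names here are hypothetical helpers, not hls4ml API. For a dense layer, `n_in` and `n_out` are the input and output sizes; for a conv layer, `n_in` is assumed to be the kernel size times the number of input channels and `n_out` the number of filters.

```python
import math


def is_valid_reuse_factor(n_in: int, n_out: int, rf: int) -> bool:
    """Assumed validity rule (not taken from the thread) for one layer."""
    mult_limit = math.ceil((n_in * n_out) / min(n_in, rf))
    # Assumed functional constraints on how multipliers are shared.
    ok = (mult_limit % n_out == 0) or (rf >= n_in)
    ok = ok and ((rf % n_in == 0) or (rf < n_in))
    # Assumed constraint: rf must divide the total number of multiplications.
    ok = ok and ((n_in * n_out) % rf == 0)
    return ok


def valid_reuse_factors(n_in: int, n_out: int) -> list:
    """All reuse factors in 1..n_in*n_out that pass the assumed rule."""
    return [rf for rf in range(1, n_in * n_out + 1)
            if is_valid_reuse_factor(n_in, n_out, rf)]


def closest_valid_reuse_factor(n_in: int, n_out: int, target: int) -> int:
    """Valid reuse factor nearest to target (ties go to the smaller one)."""
    return min(valid_reuse_factors(n_in, n_out),
               key=lambda rf: (abs(rf - target), rf))


def set_layer_reuse_factors(config: dict, layer_shapes: dict, target: int) -> dict:
    """Write per-layer reuse factors into a name-granularity style config dict."""
    for name, (n_in, n_out) in layer_shapes.items():
        layer_cfg = config.setdefault("LayerName", {}).setdefault(name, {})
        layer_cfg["ReuseFactor"] = closest_valid_reuse_factor(n_in, n_out, target)
    return config


# Hypothetical layer: 16 inputs, 8 outputs, aiming for a reuse factor near 50.
example_config = set_layer_reuse_factors({}, {"dense_example": (16, 8)}, 50)
```

The resulting dictionary follows the `config['LayerName'][name]['ReuseFactor']` layout used with per-layer (name) granularity, so the chosen values could be copied into an existing configuration rather than picked by trial and error.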

I use hls_model.build(export=True,synth=True,vsynth=True,cosim=True) to build and export the IP.

The first bit
One thing I am not entirely sure about is the difference between the resource utilisation estimate from C synthesis and the Vivado synthesis report. I understand that the actual resources used may be fewer than estimated, but the Vivado report shows 0 DSPs used, whereas the C synthesis estimate is 170. I am slightly confused and was wondering whether this is normal behaviour.

Additionally, I have realised that the design implemented (using the build command) is using AXI communication. The samples are also not fed into the model one sample at a time; it takes a number of clock cycles to complete feeding in one sample (input shape is 64x13 and it takes at least 64 cycles to feed one sample into the network). Is that something normal in the implementation of the models, or does it have something to do with the shape of my inputs?
Here is the project's module header:

module myproject (
        input_1_TDATA,
        layer25_out_TDATA,
        ap_clk,
        ap_rst_n,
        input_1_TVALID,
        input_1_TREADY,
        ap_start,
        layer25_out_TVALID,
        layer25_out_TREADY,
        ap_done,
        ap_ready,
        ap_idle
);


input  [207:0] input_1_TDATA;
output  [159:0] layer25_out_TDATA;
input   ap_clk;
input   ap_rst_n;
input   input_1_TVALID;
output   input_1_TREADY;
input   ap_start;
output   layer25_out_TVALID;
input   layer25_out_TREADY;
output   ap_done;
output   ap_ready;
output   ap_idle;

I also want to test the RTL model in simulation and deploy it onto the hardware (the Nexys A7 board I have!). I do not wish to integrate the model within a bigger design. I am really lost on what my next step should be. I did some research, and one solution I found is using a MicroBlaze soft core as the controller that handles the data transfer over AXI. I am currently considering this, but I was wondering whether there are any other routes that would let me communicate with an external PC and handle the data transfer from there. I would really appreciate it if someone could help me understand the different routes that could be taken from this point onwards. That would be super helpful for making a decision.
(Apologies if this question is not clear; as I said, I am really lost on what the next step should be, which may be why it comes across as vague :) )

The second bit
In the paper "Fast convolutional neural networks on FPGAs with hls4ml" (doi: 10.1088/2632-2153/ac0ea1), the authors mention the following:

Although particle physics experiments mostly use large FPGAs, the hls4ml library can be readily used for smaller FPGAs, like those found on system-on-chip (SoC) or internet-of-things (IoT) devices, through increasing the reuse factor. To demonstrate this, we synthesize and deploy the smallest model that retains the original model accuracy, QP 7-bit, onto a low-cost TUL PYNQ-Z2 development board, equipped with a Xilinx Zynq XC7Z020 SoC (FPGA part number xc7z020clg400-1). This FPGA is significantly smaller than the Xilinx Virtex UltraScale+ VU9P, and consists of 13,300 logic slices, each with four 6-input LUTs and 8 FFs, 630 kB of BRAM, and 220 DSP slices. As expected, a large reuse factor is needed in order to fit the QP 7-bit model onto the Zynq XC7Z020. For a clock frequency of 100MHz, the resulting inference latency is 171 µs and up to 2,831 image classifications per second. This implementation uses a total of 91% of the LUTs, 97% of the DSPs, 33% of the FFs, and 44% of the BRAM. A summary is provided in Table 4. This demonstrates the flexibility of hls4ml to accommodate SoC/IoT use cases, which can demand smaller FPGAs and tolerate millisecond latencies.

Does this use the Resource strategy or the Latency strategy? I tried experimenting with different reuse factors and strategies, but I am not sure how the paper's implementation ended up using 97% of the DSPs. Did it involve setting larger bit widths? The paper mentions that below a specific bit width the HLS tool performs multiplications using LUTs, so how did they get pretty much all the DSPs in use?

Is there a code repo that includes the code used to report results in any of the papers that are affiliated with hls4ml? That would be helpful in understanding the papers' implementations more!

Thank you so much in advance for any help you can provide. I tried to give as much useful context as possible, but I may have left out important information that would help people help me. If that is the case, kindly let me know.

Nora

model plot:

Answered by bo3z

bo3z · 2026-03-03T08:18:27Z

Maintainer

Hi @norahatem,

Regarding resources, HLS estimates can be quite off. The ground truth should be the Vivado synthesis reports or, even better, the reports after Place and Route (so-called Implementation on AMD/Xilinx FPGAs). There shouldn't be large differences between synthesis and PnR, but between Vivado synthesis and HLS reports we have seen as much as a 5x difference in resources. Though these differences are typically in LUTs and FFs, they can happen for DSPs as well (it will depend on the technology mapping and the board you are targeting).
Regarding sample input latency, it will depend on your input shape. hls4ml partitions the inputs/layer intermediate results in streaming CNNs along a particular dimension, which then affects HLS scheduling. For more details, I suggest having a look at the conv_stream implementation.
How you deploy the model on real hardware will depend on the underlying FPGA (the Nexys A7 in your case). There are typically two routes: (1) writing your own DMA logic and driver (potentially based on vendor IP blocks), or (2) relying on pre-existing FPGA shells that set up data movement, control logic and interaction with the host PC. These vary considerably with the FPGA board (e.g., there are different DMA engines for data center and development cards).

For your specific case, I would suggest taking a look at PR #1376, which adds support for running hls4ml on various edge and development boards and also includes links to tutorials on how to extend it to the Nexys A7, in case it is not supported out of the box.
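On the input-latency point, the port widths in the module header posted above are consistent with one row of the input being transferred per stream beat: 208 bits is 13 values of 16 bits, so a 64x13 sample would need 64 beats, which matches the "at least 64 cycles" observation. Likewise, 160 bits on the output would be 10 values of 16 bits. The sketch below illustrates that arithmetic by packing a sample into 64 words of 208 bits. The 16-bit element width and the ordering of elements within TDATA (element 0 in the lowest bits) are assumptions, and `pack_rows` is a hypothetical helper, so check the generated testbench before relying on the layout.

```python
def pack_rows(sample, width=16):
    """Pack each row into one integer, element 0 in the lowest bits (assumed order)."""
    mask = (1 << width) - 1
    beats = []
    for row in sample:
        word = 0
        for i, value in enumerate(row):
            # Keep only the low `width` bits of each raw fixed-point value.
            word |= (value & mask) << (i * width)
        beats.append(word)
    return beats


rows, channels, width = 64, 13, 16  # input shape from the thread, assumed 16-bit words

# Dummy raw integer values standing in for quantised inputs.
sample = [[(r + c) & 0xFFFF for c in range(channels)] for r in range(rows)]

beats = pack_rows(sample, width)
tdata_bits = channels * width  # width of one beat on input_1_TDATA

print(f"{len(beats)} beats of {tdata_bits} bits per sample")
```

Under these assumptions, each entry of `beats` is what a testbench or DMA descriptor would present on `input_1_TDATA` for one `TVALID`/`TREADY` handshake.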

2 replies

norahatem Mar 26, 2026
Author

Hi @bo3z,

Thank you so much for your reply, and apologies for the late reply; I have been meaning to reply much sooner.

From my understanding, the tutorials for extending the PR rely on platforms (an XPFM or XSA file) that include a PS. Since the Nexys A7 is a pure FPGA and I didn't really want to use a MicroBlaze soft core, I instead set up the DMA logic and driver manually using the JTAG-to-AXI Master and AXI DMA IPs provided by Xilinx, along with a few additional IPs that I needed for my specific case.

I really appreciate your help; it was very useful in guiding me through this.

Regards,
Nora

bo3z Apr 1, 2026
Maintainer

Great to hear!
