Hello everyone, I have some questions/confusions and I was hoping someone could help point me in the right direction. I will split them into two parts:
Please feel free to reply to just a small part of this if you know the answer or the reason and can help me understand!

Some context: I trained a classification model using qkeras, setting the kernel bit widths to 6 bits and using no bias. Activations (relu) are also limited to 6 bits. I don't use batchnorm layers in the quantised model, nor do I use a softmax output layer. In the configuration, I set max_precision to 'fixed<16,5,TRN,WRAP,0>', the default reuse factor to 50 and the strategy to Resource. I changed the reuse factors for the conv and dense layers, since I got warnings about invalid reuse factors for these layers; I use (78,48,48,40,48,32,32) for the consecutive layers. The choice of reuse factors is really arbitrary, but the network fits! I use hls_model.build(export=True,synth=True,vsynth=True,cosim=True) to build and export the IP.

The first bit

Additionally, I have realised that the design implemented (using the build command) uses AXI communication. The samples are also not fed into the model in a single cycle; it takes a number of clock cycles to feed in one sample (the input shape is 64x13 and it takes at least 64 cycles to feed one sample into the network). Is that normal for these model implementations, or does it have something to do with the shape of my inputs?

I also want to test the RTL model in simulation and deploy it onto hardware (the Nexys A7 board I have!). I do not wish to integrate the model within a bigger design. I am really lost on what my next step should be. I did some research, and one solution I found is to use a MicroBlaze soft core as the AXI controller that handles the data transfer. I am currently considering this as a potential solution, but I was wondering if there are any other routes that would allow me to communicate with an external PC and handle the data transfer from there?
I would really appreciate it if someone could help me understand the different routes that could be taken from this point onwards. This would be super helpful for making a decision.

The second bit
Is this perhaps utilising the Resource strategy or the Latency strategy? I tried experimenting with different reuse factors and different strategies, but I am not sure how the paper's implementation managed to use 97% of the DSPs. Does this perhaps involve setting larger bit widths? The paper mentions that below a specific bit width the HLS tool performs multiplications using LUTs, so how did they get pretty much all the DSPs working?

Is there a code repo that includes the code used to report results in any of the papers affiliated with hls4ml? That would be helpful for understanding the papers' implementations better!

Thank you so much in advance for any help you can provide. I tried to provide as much useful context as possible, but I may have left out important bits of information that could help people help me. If that is the case, kindly let me know.

Nora
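For readers unfamiliar with the `fixed<16,5,TRN,WRAP,0>` type mentioned in the question, here is a minimal sketch of what it does to a value, assuming it maps to the usual `ap_fixed` semantics: 16 bits in total, 5 integer bits (sign included), 11 fractional bits, truncation toward minus infinity, and wrap-around on overflow. The function name `quantize_fixed` is hypothetical and not part of hls4ml or qkeras.

```python
import math


def quantize_fixed(x, width=16, integer=5):
    """Emulate ap_fixed<width, integer, TRN, WRAP> on a Python float.

    Hypothetical helper for illustration only; not an hls4ml API.
    """
    frac_bits = width - integer                # 11 fractional bits for <16,5>
    scaled = math.floor(x * (1 << frac_bits))  # TRN: truncate toward minus infinity
    half = 1 << (width - 1)
    # WRAP: keep only the low `width` bits, interpreted as two's complement
    wrapped = (scaled + half) % (1 << width) - half
    return wrapped / (1 << frac_bits)


if __name__ == "__main__":
    for value in (0.3, -0.3, 15.9999, 16.0):
        print(value, "->", quantize_fixed(value))
```

Under these assumptions the representable range is [-16, 16 - 2^-11] with a step of 2^-11, so a value of 16.0 wraps to -16.0 rather than saturating.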
Replies: 1 comment 2 replies
Hi @norahatem,
For your specific case, I would suggest taking a look at PR #1376, which adds support for running hls4ml on various edge and development boards. It also includes links to a tutorial on how to extend it to the Nexys A7, in case the board is not supported out of the box.

Hi @norahatem,
Regarding resources, HLS estimates can be quite far off. The "golden truth" should be the Vivado synthesis reports or, even better, the reports after Place and Route (so-called Implementation on AMD/Xilinx FPGAs). There shouldn't be large differences between synthesis and PnR, but between Vivado synthesis and HLS reports we have seen as much as a 5x difference in resources. These differences typically affect LUTs and FFs, but they can happen for DSPs as well (it depends on the technology mapping and the board you are targeting).
Regarding sample input latency, it will depend on your input shape. hls4ml partitions the inputs/layer intermediate results in streaming CNNs along a partic…