Distributed Arithmetic strategy for Dense, Conv1/2D, and EinsumDense#1191
Conversation
softmax fix fix softmax parsing issue ckpt softmax fix fix table size after overriding inv_inp_t
|
pre-commit.ci autofix |
|
pre-commit.ci autofix |
JanFSchulte
left a comment
There was a problem hiding this comment.
Just leaving a few preliminary comments, there are still some parts that I need to look at in a bit more detail.
| } else if (CONFIG_T::strategy == nnet::resource || CONFIG_T::strategy == nnet::resource_unrolled) { | ||
| conv_2d_resource_cl<data_T, res_T, CONFIG_T>(data, res, weights, biases); | ||
| } else { | ||
| assert(false && "Invalid strategy for conv_2d_cl"); |
There was a problem hiding this comment.
Shouldn't this be caught earlier and not at runtime of the HLS code?
| void maximum(input1_T data1[CONFIG_T::n_elem], input2_T data2[CONFIG_T::n_elem], res_T res[CONFIG_T::n_elem]) { | ||
| for (int ii = 0; ii < CONFIG_T::n_elem; ii++) { | ||
| res[ii] = (data1[ii] > data2[ii]) ? static_cast<res_T>(data1[ii]) : static_cast<res_T>(data2[ii]); | ||
| res[ii] = (data1[ii] > data2[ii]) ? static_cast<res_T>(data1[ii]) : static_cast<res_T>(data2[ii]); |
There was a problem hiding this comment.
Why is this line duplicated?
| self.add_weights_variable(name='table', var_name='table{index}', precision=table_t, data=self.table) | ||
| self.table = self.attributes['table_data'] | ||
| self.attributes['table_size'] = len(self.table) | ||
| self.table_size = len(self.table) |
There was a problem hiding this comment.
Why do we need to table_size both in the attributes dict and as a standalone attribute?
| node.model.config.layer_name_precision[node.name] = str(new_type) | ||
| return False | ||
|
|
||
| if not all(isinstance(l, FixedPointQuantizer) for l in out_layers): |
There was a problem hiding this comment.
Isn't this case covered by the assert above?
| Ks, Is, Fs = Ks[0], Is[0], Fs[0] # remove batch dimension | ||
| else: | ||
| Ks = np.ones(inp_shape, dtype=np.int8) | ||
| Is = Fs = np.full(inp_shape, 126, dtype=np.int8) |
There was a problem hiding this comment.
Out of curiosity, why 8 bit in this fallback option?
There was a problem hiding this comment.
I used 8 bit for bw everywhere for the bit_exact.py file.
|
|
||
|
|
||
| def _get_input_kif(node: Layer): | ||
| """Get the input k, i, f to a layer. |
There was a problem hiding this comment.
I think for future maintainability of this code it would be good to document somewhere (here or elsewhere, what k, i, f, and b stand for.
|
|
||
| layer_config = None | ||
| if model_arch['class_name'] == 'Sequential': | ||
| print('Interpreting Sequential') |
There was a problem hiding this comment.
As a general comment for this PR: I personally like that currently hls4ml is giving the user some feedback on what it is actually doing at a given time, especially when running multiple steps in sequence. And printing out the layer structure it creates is also helpful as a sanity check. So I would prefer not to remove all these print outs.
|
The latest set of pytests showed some actual issues, so I want to re-run them one more time. Otherwise I'm ok with merging, my comments are mostly nitpicky. |
|
Apart from the issue with test_merge that we're investigating, the others seem to be stability issues (not actual issues, but do need to be addressed too). |
|
There was also an issue with da4ml, but I see that Chang bumped the version of that and it is gone. So all good from my side. |
* added Cropping1D and Cropping2D keras layers support * removed .bak templates files * added cropping layers tests for vivado and vitis * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
|
The test_merge failure and the other precision test failure will be addressed in a subsequent PR. Merging this. Huge thanks to Chang for the monumental effort of building all of it. |
Description
This PR introduces a new strategy,
distributed_arithmeticforDense(io parallel / stream)Conv1/2D(io parallel / stream)EinsumDense(io parallel)With this strategy, all
matmullike operations in there layers are decomposed into optimized adder trees. Heavy lifting tasks are offloaded toda4ml, where everything is jitted withnumba. There,CMVMproblem is optimized with greedy common subexpression elimination. A reduction ofLUTconsumption of over30%is frequently seen whenWRAPis used as overflow mode with improved latency.DSPconsumption will almost always be 0 with this strategy.This PR depends on the
s-quark-prand includes all changes made there. (QEinsumDensenot available otherwise)Type of change
Tests
Tests added to
test_hgq_layers.pyandtest_einsum_dense.py.EinsumDensetest will NOT be triggered in the current configuration due tokeras v3dependency.Checklist