Vector Masking#104
Conversation
f173c36 to
d3c5f33
Compare
|
One question, do you have some synthesis results on the area overhead for the mask? |
I'm make the synthesis right now in order to compare it with the reports obtained starting from the actual state of the main branch. Could you please tell me where the CI is failing right now? because I used my local CI before opening the draft PR and all seemed ok |
Not seems an actual failure to me. I will try to rerun the CI |
|
All fake failures have resolved in new CI run, but there is still an issue with doublebw configuration in FFT kernel: |
Ok, I'm going to debug the issue. Meanwhile I obtained the synthesis results with GF22 at 1GHz and there are not timing loops and time violations. Regarding the area there is an overhead of 1,54% with masking. |
Sounds great! Thanks a lot! |
| if(spatz_req.op_arith.is_reduction == 1'b1) begin | ||
| case(spatz_req.op) | ||
| VADD: // VREDSUM_VS, VFREDUSUM_VS, VFREDOSUM_VS | ||
| reduction_useless_value = '0; |
There was a problem hiding this comment.
Maybe we can rename it to reduction_neutral_value or reduction_identity_value?
|
Thanks for the changes, Looks good to me! |
Perfect, thanks! I just need some time to fix the benchmark issue because I found an error in the indexed store. I hope to fix it soon, then we can perform another CI run. |
4cfe2ee to
36d78f8
Compare
36d78f8 to
87bc3e9
Compare
87bc3e9 to
fce34f6
Compare
|
Ventaglio tests failed. the computed results are somehow all zeros |
The mistake I did was not to add ventaglio cfg to my local CI when I performed the test of Ventaglio with masking on top. Unfortunately since I'm not having the possibility to work with your CI I'm exposed to this kind of errors and I've to remember to update manually my CI every time a new cfg is added. I'm sorry, my bad. I'll try to fix the issues as soon as I can. |
Actually, we can think about moving some benchmarks to GitHub so that external people from Unibo and ChipsIT can also trigger CI. This should be much easier for you |
PR purpose
This PR introduces the vector masking for the Spatz supported instructions. It fixes and extend the opened PR #45 .
Main changes
Vector masking implementation follows the same scheme in all units then each unit specializes it to its own datapath.
Mask fetch. On a masked instruction (
vm=0) whose mask has not been fetched yet, a dedicated FSM state reuses the VRF read ports to fetch v0 and latches it into a local register.Datapath freeze during the fetch. While v0 is being read, the unit is stalled. This both prevents operating with a stale mask and avoids conflicts on the shared VRF read ports. In-flight memory responses are never dropped: they are buffered in the reorder buffer until the flow resumes.
Byte-granular expansion. The latched v0 is combinationally expanded into a VLEN-wide, byte-granular mask (
vm_masking): each v0 bit replicated over 1/2/4/8 bytes according to vsew. The slice matching the instruction's current position in the vector is selected from it (slice base derived from the progress counters, rounded to the VRF word size).Mask application. For each destination word, the mask slice is ANDed into the write-byte-enable: not masked bytes are simply not written, so they keep their previous value (mask-undisturbed semantics). This applies to loads (
vrf_req_d.wbe = load_wbe & vm_wbe), arithmetic write-back (vreg_wbe), and slides (slide_wbe & vm_masking[...]). For stores, the same idea is applied to the memory strobe (mem_req_strb = store_strb & vm_strb). Since the store data is rotated for strided/indexed and unaligned accesses, the mask slice goes through the same rotations, so each mask bit always stays attached to its byte.Mask-producing compares (VFCMP) are the only exception: the destination is a mask register, so the mask is ANDed into the result bits themselves (
result & v0_bit) rather than into the write-byte-enable.