Skip to content

Vector Masking#104

Open
rgiunti wants to merge 26 commits into
pulp-platform:mainfrom
FondazioneChipsIT:test/maoyuan/masking
Open

Vector Masking#104
rgiunti wants to merge 26 commits into
pulp-platform:mainfrom
FondazioneChipsIT:test/maoyuan/masking

Conversation

@rgiunti

@rgiunti rgiunti commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

PR purpose

This PR introduces the vector masking for the Spatz supported instructions. It fixes and extend the opened PR #45 .

Main changes

Vector masking implementation follows the same scheme in all units then each unit specializes it to its own datapath.

  • Mask fetch. On a masked instruction (vm=0) whose mask has not been fetched yet, a dedicated FSM state reuses the VRF read ports to fetch v0 and latches it into a local register.

  • Datapath freeze during the fetch. While v0 is being read, the unit is stalled. This both prevents operating with a stale mask and avoids conflicts on the shared VRF read ports. In-flight memory responses are never dropped: they are buffered in the reorder buffer until the flow resumes.

  • Byte-granular expansion. The latched v0 is combinationally expanded into a VLEN-wide, byte-granular mask (vm_masking): each v0 bit replicated over 1/2/4/8 bytes according to vsew. The slice matching the instruction's current position in the vector is selected from it (slice base derived from the progress counters, rounded to the VRF word size).

  • Mask application. For each destination word, the mask slice is ANDed into the write-byte-enable: not masked bytes are simply not written, so they keep their previous value (mask-undisturbed semantics). This applies to loads (vrf_req_d.wbe = load_wbe & vm_wbe), arithmetic write-back (vreg_wbe), and slides (slide_wbe & vm_masking[...]). For stores, the same idea is applied to the memory strobe (mem_req_strb = store_strb & vm_strb). Since the store data is rotated for strided/indexed and unaligned accesses, the mask slice goes through the same rotations, so each mask bit always stays attached to its byte.
    Mask-producing compares (VFCMP) are the only exception: the destination is a mask register, so the mask is ANDed into the result bits themselves (result & v0_bit) rather than into the write-byte-enable.

@rgiunti rgiunti self-assigned this Jun 12, 2026
@rgiunti rgiunti force-pushed the test/maoyuan/masking branch from f173c36 to d3c5f33 Compare June 12, 2026 09:26
@rgiunti rgiunti requested a review from DiyouS June 12, 2026 09:43
@DiyouS

DiyouS commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

One question, do you have some synthesis results on the area overhead for the mask?

@rgiunti

rgiunti commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

One question, do you have some synthesis results on the area overhead for the mask?

I'm make the synthesis right now in order to compare it with the reports obtained starting from the actual state of the main branch. Could you please tell me where the CI is failing right now? because I used my local CI before opening the draft PR and all seemed ok

@DiyouS

DiyouS commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

One question, do you have some synthesis results on the area overhead for the mask?

I'm make the synthesis right now in order to compare it with the reports obtained starting from the actual state of the main branch. Could you please tell me where the CI is failing right now? because I used my local CI before opening the draft PR and all seemed ok

Not seems an actual failure to me. I will try to rerun the CI

@DiyouS

DiyouS commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

All fake failures have resolved in new CI run, but there is still an issue with doublebw configuration in FFT kernel:
10 - spatzBenchmarks-rtl-dp-fft_M128_N2 (Timeout)

@rgiunti

rgiunti commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

All fake failures have resolved in new CI run, but there is still an issue with doublebw configuration in FFT kernel: 10 - spatzBenchmarks-rtl-dp-fft_M128_N2 (Timeout)

Ok, I'm going to debug the issue. Meanwhile I obtained the synthesis results with GF22 at 1GHz and there are not timing loops and time violations. Regarding the area there is an overhead of 1,54% with masking.

@DiyouS

DiyouS commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

All fake failures have resolved in new CI run, but there is still an issue with doublebw configuration in FFT kernel: 10 - spatzBenchmarks-rtl-dp-fft_M128_N2 (Timeout)

Ok, I'm going to debug the issue. Meanwhile I obtained the synthesis results with GF22 at 1GHz and there are not timing loops and time violations. Regarding the area there is an overhead of 1,54% with masking.

Sounds great! Thanks a lot!

Comment thread hw/ip/spatz/src/spatz_vfu.sv Outdated
if(spatz_req.op_arith.is_reduction == 1'b1) begin
case(spatz_req.op)
VADD: // VREDSUM_VS, VFREDUSUM_VS, VFREDOSUM_VS
reduction_useless_value = '0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can rename it to reduction_neutral_value or reduction_identity_value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Navaneeth-KunhiPurayil

Copy link
Copy Markdown
Contributor

Thanks for the changes, Looks good to me!

@rgiunti

rgiunti commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the changes, Looks good to me!

Perfect, thanks! I just need some time to fix the benchmark issue because I found an error in the indexed store. I hope to fix it soon, then we can perform another CI run.

@rgiunti rgiunti force-pushed the test/maoyuan/masking branch 2 times, most recently from 4cfe2ee to 36d78f8 Compare June 22, 2026 10:38
@rgiunti rgiunti marked this pull request as ready for review June 22, 2026 10:39
@rgiunti rgiunti force-pushed the test/maoyuan/masking branch from 36d78f8 to 87bc3e9 Compare June 22, 2026 15:00
@rgiunti rgiunti force-pushed the test/maoyuan/masking branch from 87bc3e9 to fce34f6 Compare June 23, 2026 07:18
@DiyouS

DiyouS commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Ventaglio tests failed. the computed results are somehow all zeros

@rgiunti

rgiunti commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Ventaglio tests failed. the computed results are somehow all zeros

The mistake I did was not to add ventaglio cfg to my local CI when I performed the test of Ventaglio with masking on top. Unfortunately since I'm not having the possibility to work with your CI I'm exposed to this kind of errors and I've to remember to update manually my CI every time a new cfg is added. I'm sorry, my bad. I'll try to fix the issues as soon as I can.

@DiyouS

DiyouS commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Ventaglio tests failed. the computed results are somehow all zeros

The mistake I did was not to add ventaglio cfg to my local CI when I performed the test of Ventaglio with masking on top. Unfortunately since I'm not having the possibility to work with your CI I'm exposed to this kind of errors and I've to remember to update manually my CI every time a new cfg is added. I'm sorry, my bad. I'll try to fix the issues as soon as I can.

Actually, we can think about moving some benchmarks to GitHub so that external people from Unibo and ChipsIT can also trigger CI. This should be much easier for you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants