Commit 20e4053
dilithium: faster Montgomery q^-1 via 64-bit-widened multiply on 16-bit CPUs
The TI cl2000 (C2000 C28x) compiler miscompiles the 32x32->32 low multiply
used for the q^-1 step of mldsa_mont_red(): on a TMS320F28P550SJ the
ML-DSA-87 verify KAT fails (res=0). It does compile the 32x64->64 widening
multiply correctly, so compute the q^-1 product through the 64-bit path
(MLDSA_MUL_QINV_WIDE64). This is correct on any conforming compiler and, on
the C28x, ~4% faster than the shift-based reduction (305 vs 317 ms/op for
ML-DSA-87 verify). dilithium.h auto-selects it for WC_16BIT_CPU and leaves
the q multiply enabled (it compiles correctly); a user can still force the
shift form with MLDSA_MUL_QINV_SLOW / MLDSA_MUL_Q_SLOW. Validated on hardware
for keygen+sign+verify (round-trip res=1). No effect on 8-bit/>=32-bit-int
builds.
1 parent 12a3dce commit 20e4053
2 files changed
Lines changed: 14 additions & 12 deletions
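The diff body is not reproduced here, so the following is only a minimal sketch of the idea the commit message describes, not the actual wolfSSL change. It assumes the standard ML-DSA constants (q = 8380417, q^-1 mod 2^32 = 58728449); the function names `qinv_mul_low32`, `qinv_mul_wide64`, and `mont_red_sketch` are hypothetical, and only the macro name `MLDSA_MUL_QINV_WIDE64` comes from the message. On a conforming compiler both multiply helpers return the same value; the wide form merely routes the product through a 64-bit multiply.

```c
#include <stdint.h>

#define MLDSA_Q     8380417   /* ML-DSA modulus q = 2^23 - 2^13 + 1 */
#define MLDSA_QINV  58728449  /* q^-1 mod 2^32 */

/* Hypothetical helper: low 32 bits of a * q^-1 via a 32x32->32 multiply.
 * Unsigned arithmetic wraps mod 2^32, so the truncation is well defined. */
static uint32_t qinv_mul_low32(uint32_t a)
{
    return a * (uint32_t)MLDSA_QINV;
}

/* Hypothetical helper: the same value, but computed as a 64-bit product
 * and then truncated to its low 32 bits. */
static uint32_t qinv_mul_wide64(uint32_t a)
{
    return (uint32_t)((uint64_t)a * (uint64_t)MLDSA_QINV);
}

/* Sketch of a Montgomery reduction: returns r with r * 2^32 == a (mod q).
 * Assumes |a| < 2^31 * q, two's-complement conversion to int32_t, and an
 * arithmetic right shift of negative values (true on common compilers). */
static int32_t mont_red_sketch(int64_t a)
{
    uint32_t t;
#ifdef MLDSA_MUL_QINV_WIDE64
    t = qinv_mul_wide64((uint32_t)a);   /* 64-bit-widened q^-1 multiply */
#else
    t = qinv_mul_low32((uint32_t)a);    /* plain low multiply */
#endif
    /* a - t*q is an exact multiple of 2^32, so the shift loses nothing. */
    return (int32_t)((a - (int64_t)(int32_t)t * MLDSA_Q) >> 32);
}
```

Building with `-DMLDSA_MUL_QINV_WIDE64` switches the sketch to the widened multiply; the result should be identical either way, which is the property the commit relies on.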