Commit 20e4053

dilithium: faster Montgomery q^-1 via 64-bit-widened multiply on 16-bit CPUs
The TI cl2000 (C2000 C28x) compiler miscompiles the 32x32->32 low multiply used for the q^-1 step of mldsa_mont_red(), but compiles the 32x64->64 widening multiply correctly. Verified on a TMS320F28P550SJ: with the 32x32->32 form the ML-DSA-87 verify KAT fails (res=0).

Compute the q^-1 product through the 64-bit path (MLDSA_MUL_QINV_WIDE64). This is correct on any conforming compiler and, on the C28x, ~4% faster than the shift-based reduction (305 vs 317 ms/op for ML-DSA-87 verify).

dilithium.h auto-selects MLDSA_MUL_QINV_WIDE64 for WC_16BIT_CPU and leaves the multiply by q enabled, since that multiply compiles correctly; a user can still force the shift form with MLDSA_MUL_QINV_SLOW / MLDSA_MUL_Q_SLOW. Validated on hardware for keygen+sign+verify (round-trip res=1). No effect on 8-bit/>=32-bit-int builds.
1 parent 12a3dce commit 20e4053

2 files changed

Lines changed: 14 additions & 12 deletions

wolfcrypt/src/wc_mldsa.c

Lines changed: 6 additions & 0 deletions
@@ -5594,7 +5594,13 @@ static void mldsa_vec_use_hint(sword32* w1, byte k, sword32 gamma2,
 static sword32 mldsa_mont_red(sword64 a)
 {
 #ifndef MLDSA_MUL_QINV_SLOW
+#ifdef MLDSA_MUL_QINV_WIDE64
+    /* Low 32 bits of the q^-1 product via the 64-bit multiply; see
+     * MLDSA_MUL_QINV_WIDE64 in dilithium.h. */
+    sword64 t = (sword32)((sword64)(sword32)a * (sword64)MLDSA_QINV);
+#else
     sword64 t = (sword32)((sword32)a * (sword32)MLDSA_QINV);
+#endif
 #else
     sword64 t = (sword32)((sword32)a + (sword32)((sword32)a << 13) -
         (sword32)((sword32)a << 23) + (sword32)((sword32)a << 26));
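
The three variants of the q^-1 step in this hunk should all yield the same low 32 bits, since q^-1 mod 2^32 = 58728449 = 2^26 - 2^23 + 2^13 + 1. The following is a minimal host-side sketch of that equivalence, not code from the commit: the `SKETCH_*` constants and all function names are hypothetical, and it uses `<stdint.h>` types with unsigned wraparound instead of wolfSSL's `sword32`/`sword64` so that every operation has defined behaviour. It also illustrates why the widened form cannot overflow: a 32-bit operand times a constant below 2^26 stays below 2^57 in magnitude.

```c
#include <stddef.h>
#include <stdint.h>

/* ML-DSA modulus q = 2^23 - 2^13 + 1 and its inverse modulo 2^32.
 * The SKETCH_ names are hypothetical, chosen for this example only. */
#define SKETCH_Q    8380417u
#define SKETCH_QINV 58728449u

/* Sign-extend the low 32 bits of a without relying on
 * implementation-defined narrowing conversions. */
static int64_t low32_signed(int64_t a)
{
    uint32_t lo = (uint32_t)a;
    return (lo & 0x80000000u) ? (int64_t)lo - 4294967296LL : (int64_t)lo;
}

/* 32x32->32 low multiply (the form cl2000 is reported to miscompile). */
static uint32_t qinv_mul32(int64_t a)
{
    return (uint32_t)((uint32_t)a * (uint32_t)SKETCH_QINV);
}

/* Widened multiply: the product is done in 64 bits, then truncated. */
static uint32_t qinv_wide64(int64_t a)
{
    int64_t p = low32_signed(a) * (int64_t)SKETCH_QINV; /* |p| < 2^57 */
    return (uint32_t)p;
}

/* Shift form: a * (1 + 2^13 - 2^23 + 2^26) modulo 2^32. */
static uint32_t qinv_shift(int64_t a)
{
    uint32_t x = (uint32_t)a;
    return x + (x << 13) - (x << 23) + (x << 26);
}

/* Returns the number of disagreements found; 0 means all forms agree. */
static int check_qinv_forms(void)
{
    static const int64_t samples[] = {
        0, 1, -1, 8380416, -8380416, 0x7fffffffLL, -0x80000000LL,
        0x0123456789abcdefLL, INT64_MAX, INT64_MIN
    };
    int mismatches = 0;
    size_t i;

    /* q * q^-1 must be 1 modulo 2^32. */
    if ((uint32_t)((uint32_t)SKETCH_Q * (uint32_t)SKETCH_QINV) != 1u)
        mismatches++;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        uint32_t ref = qinv_mul32(samples[i]);
        if (qinv_wide64(samples[i]) != ref || qinv_shift(samples[i]) != ref)
            mismatches++;
    }
    return mismatches;
}
```

Calling `check_qinv_forms()` from a small host test is one way to confirm the arithmetic before measuring on the target; it says nothing about what a particular cross-compiler emits, which still has to be checked on hardware as the commit message describes.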

wolfssl/wolfcrypt/dilithium.h

Lines changed: 8 additions & 12 deletions
@@ -306,19 +306,15 @@
 #endif
 #endif
 
-/* On a 16-bit-int CPU (WC_16BIT_CPU) the toolchain synthesizes 32-/64-bit
- * multiplies, and some (e.g. TI cl2000 for the C2000 C28x) miscompile the
- * multiply-based Montgomery reduction in mldsa_mont_red(). The shift-based
- * reduction is mathematically identical (q = 2^23 - 2^13 + 1, and the q^-1
- * expansion likewise), avoids the wide multiply, and is also cheaper on such
- * targets. Select it by default there; a user can still override either
- * macro explicitly. */
+/* MLDSA_MUL_QINV_WIDE64: compute the q^-1 step of mldsa_mont_red() through the
+ * 32x64->64 widening multiply instead of a 32x32->32 low multiply. Some 16-bit
+ * toolchains miscompile the 32x32->32 form (e.g. TI cl2000 on the C28x); the
+ * widening form is correct on any conforming compiler and no slower. Default
+ * it on for WC_16BIT_CPU; a user can force the shift form with
+ * MLDSA_MUL_QINV_SLOW / MLDSA_MUL_Q_SLOW. */
 #if defined(WC_16BIT_CPU)
-#ifndef MLDSA_MUL_QINV_SLOW
-#define MLDSA_MUL_QINV_SLOW
-#endif
-#ifndef MLDSA_MUL_Q_SLOW
-#define MLDSA_MUL_Q_SLOW
+#if !defined(MLDSA_MUL_QINV_SLOW) && !defined(MLDSA_MUL_QINV_WIDE64)
+#define MLDSA_MUL_QINV_WIDE64
 #endif
 #endif
 
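
For a build that should stay on the shift-based reduction, the override mentioned in the header comment could look roughly like the fragment below. It assumes the project passes its options through a `user_settings.h` (wolfSSL's `WOLFSSL_USER_SETTINGS` mechanism); the macro names are the ones from this commit, and whether both are wanted depends on the target.

```c
/* user_settings.h (assumed location for build options) */

/* Force the shift-based q^-1 step. With this defined, the new check in
 * dilithium.h no longer defines MLDSA_MUL_QINV_WIDE64 on WC_16BIT_CPU. */
#define MLDSA_MUL_QINV_SLOW

/* Optionally use the shift form for the multiply by q as well. After this
 * commit it is no longer selected automatically on WC_16BIT_CPU. */
#define MLDSA_MUL_Q_SLOW
```

Leaving both macros undefined on a WC_16BIT_CPU build selects the widened multiply, which per the commit message is the faster option on the C28x.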
