dilithium: faster Montgomery q^-1 via 64-bit-widened multiply on 16-bit CPUs

dgarske · dgarske · commit 20e4053a9fa0 · 2026-06-18T15:26:50.000-07:00
The TI cl2000 (C2000 C28x) compiler miscompiles the 32x32->32 low multiply
used for the q^-1 step of mldsa_mont_red(): on a TMS320F28P550SJ the
ML-DSA-87 verify KAT fails (res=0). It does compile the 32x64->64 widening
multiply correctly. Compute the q^-1 product through the 64-bit path
(MLDSA_MUL_QINV_WIDE64): correct on any conforming compiler and, on the C28x,
~4% faster than the shift-based reduction (305 vs 317 ms/op for ML-DSA-87
verify). dilithium.h auto-selects it for WC_16BIT_CPU and leaves the q
multiply enabled (it compiles correctly); a user can still force the shift
form with MLDSA_MUL_QINV_SLOW / MLDSA_MUL_Q_SLOW. Validated on hardware for
keygen+sign+verify (round-trip res=1). No effect on 8-bit or >=32-bit-int builds.
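
For readers who want to see why the three q^-1 forms named above are interchangeable, here is a minimal standalone sketch. It is not the wolfSSL code: the helper names (qinv_low32, qinv_wide64, qinv_shift, to_signed32, mont_red_sketch, forms_agree) are hypothetical, and it uses unsigned arithmetic and an exact division rather than the signed casts and shift in the real source, so that every step is well defined in portable C. The constants are the usual ML-DSA values, q = 2^23 - 2^13 + 1 and q^-1 mod 2^32 = 58728449.

```c
#include <stdint.h>

#define Q_MOD   INT64_C(8380417)     /* q = 2^23 - 2^13 + 1 */
#define QINV_U  UINT32_C(58728449)   /* q^-1 mod 2^32 = 1 + 2^13 - 2^23 + 2^26 */

/* Low 32 bits of a * q^-1 via a 32x32->32 multiply (the form that was
 * reported as miscompiled on the C28x). */
static uint32_t qinv_low32(int64_t a)
{
    return (uint32_t)a * QINV_U;
}

/* Same value computed through a 64-bit multiply, then truncated. */
static uint32_t qinv_wide64(int64_t a)
{
    return (uint32_t)((uint64_t)(uint32_t)a * (uint64_t)QINV_U);
}

/* Same value from the shift expansion of q^-1 (the "slow" form). */
static uint32_t qinv_shift(int64_t a)
{
    uint32_t x = (uint32_t)a;
    return x + (x << 13) - (x << 23) + (x << 26);
}

/* Reinterpret 32 bits as a signed value without relying on
 * implementation-defined narrowing conversions. */
static int64_t to_signed32(uint32_t u)
{
    return (u & UINT32_C(0x80000000)) ? (int64_t)u - (INT64_C(1) << 32)
                                      : (int64_t)u;
}

/* Montgomery reduction sketch: returns r with r * 2^32 == a (mod q).
 * a - t*q is an exact multiple of 2^32, so the division is exact. */
static int32_t mont_red_sketch(int64_t a)
{
    int64_t t = to_signed32(qinv_wide64(a));
    return (int32_t)((a - t * Q_MOD) / (INT64_C(1) << 32));
}

/* Returns 1 when all three q^-1 forms agree for the given input. */
static int forms_agree(int64_t a)
{
    uint32_t ref = qinv_low32(a);
    return ref == qinv_wide64(a) && ref == qinv_shift(a);
}
```

On a conforming compiler forms_agree() returns 1 for every input, which is why switching the q^-1 step to the 64-bit path changes only which multiply the toolchain emits, not the result.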
diff --git a/wolfcrypt/src/wc_mldsa.c b/wolfcrypt/src/wc_mldsa.c
@@ -5594,7 +5594,13 @@ static void mldsa_vec_use_hint(sword32* w1, byte k, sword32 gamma2,
 static sword32 mldsa_mont_red(sword64 a)
 {
 #ifndef MLDSA_MUL_QINV_SLOW
+#ifdef MLDSA_MUL_QINV_WIDE64
+    /* Low 32 bits of the q^-1 product via the 64-bit multiply; see
+     * MLDSA_MUL_QINV_WIDE64 in dilithium.h. */
+    sword64 t = (sword32)((sword64)(sword32)a * (sword64)MLDSA_QINV);
+#else
     sword64 t = (sword32)((sword32)a * (sword32)MLDSA_QINV);
+#endif
 #else
     sword64 t = (sword32)((sword32)a + (sword32)((sword32)a << 13) -
         (sword32)((sword32)a << 23) + (sword32)((sword32)a << 26));
diff --git a/wolfssl/wolfcrypt/dilithium.h b/wolfssl/wolfcrypt/dilithium.h
@@ -306,19 +306,15 @@
     #endif
 #endif
 
-/* On a 16-bit-int CPU (WC_16BIT_CPU) the toolchain synthesizes 32-/64-bit
- * multiplies, and some (e.g. TI cl2000 for the C2000 C28x) miscompile the
- * multiply-based Montgomery reduction in mldsa_mont_red().  The shift-based
- * reduction is mathematically identical (q = 2^23 - 2^13 + 1, and the q^-1
- * expansion likewise), avoids the wide multiply, and is also cheaper on such
- * targets.  Select it by default there; a user can still override either
- * macro explicitly. */
+/* MLDSA_MUL_QINV_WIDE64: compute the q^-1 step of mldsa_mont_red() through the
+ * 32x64->64 widening multiply instead of a 32x32->32 low multiply.  Some 16-bit
+ * toolchains miscompile the 32x32->32 form (e.g. TI cl2000 on the C28x); the
+ * widening form is correct on any conforming compiler and no slower.  Default
+ * it on for WC_16BIT_CPU; a user can force the shift form with
+ * MLDSA_MUL_QINV_SLOW / MLDSA_MUL_Q_SLOW. */
 #if defined(WC_16BIT_CPU)
-    #ifndef MLDSA_MUL_QINV_SLOW
-        #define MLDSA_MUL_QINV_SLOW
-    #endif
-    #ifndef MLDSA_MUL_Q_SLOW
-        #define MLDSA_MUL_Q_SLOW
+    #if !defined(MLDSA_MUL_QINV_SLOW) && !defined(MLDSA_MUL_QINV_WIDE64)
+        #define MLDSA_MUL_QINV_WIDE64
     #endif
 #endif