k256: endomorphism-aware wNAF for vartime scalar multiplication #1745
42Pupusas wants to merge 5 commits
Conversation
Replaces the placeholder MulVartime / MulByGeneratorVartime impls (which just called the constant-time path and had TODOs to match) with a width-5 wNAF that uses the GLV endomorphism to split each scalar into two ~128-bit halves. Schnorr verify: ~62 µs -> ~53 µs (14% faster, no precomputed-tables; ~55 µs with tables). Addresses RustCrypto#1725. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
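For orientation, here is a minimal sketch of the width-5 signed-digit recoding this commit introduces (the helper name and the `u128` residual are illustrative; the real code tracks the residual in 64-bit limbs, and this sketch assumes inputs below 2^127, since the full-range carry case is what a later commit in this PR fixes):

```rust
/// Width-5 wNAF sketch: emits odd digits in [-15, 15] with at least four
/// zeros between nonzero digits, such that sum(digits[i] * 2^i) == k.
/// Assumes k < 2^127 so the add-back on a negative digit cannot overflow.
fn wnaf5(mut k: u128) -> Vec<i8> {
    let mut digits = Vec::new();
    while k != 0 {
        if k & 1 == 1 {
            // Low 5 bits, recentered from 0..32 into the odd range -15..=15.
            let w = (k & 0x1f) as i8;
            let d = if w > 16 { w - 32 } else { w };
            // Subtracting a negative digit adds |d|, clearing the low 5 bits
            // either way (possibly with a carry into bit 5).
            k = (k as i128 - d as i128) as u128;
            digits.push(d);
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits
}
```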
Folds the combined `mul_by_generator_and_mul_add_vartime` into a single wNAF ladder over all 4 GLV sub-scalars (s1, s2 for G and the endomorphism; e1, e2 for P and the endomorphism). One `double()` per step instead of two independent ladders. Factors out a small `WnafSlot` (odd-multiples table + digits) and a `wnaf_ladder` helper so the single-point `mul_vartime` and the combined op share the same loop body. Schnorr verify: ~53 µs -> ~50 µs (no precomputed-tables; ~51 µs with tables). Total vs. pre-wNAF baseline: ~62 µs -> ~50 µs (~19% faster). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
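The shared-doublings loop has roughly this shape (a sketch: the `WnafSlot` field names and the `ladder` signature are illustrative, not the PR's exact code):

```rust
use k256::ProjectivePoint;

/// Illustrative slot: odd multiples [1P, 3P, ..., 15P] plus that
/// sub-scalar's wNAF digit stream. All streams are padded to length `len`.
struct WnafSlot {
    table: [ProjectivePoint; 8],
    digits: Vec<i8>,
}

/// One ladder over all slots: a single double() per bit position, then at
/// most one add (or subtract, for negative digits) per slot.
fn ladder(slots: &[WnafSlot], len: usize) -> ProjectivePoint {
    let mut acc = ProjectivePoint::IDENTITY;
    for i in (0..len).rev() {
        acc = acc.double();
        for slot in slots {
            let d = slot.digits[i];
            if d > 0 {
                acc += slot.table[(d as usize - 1) / 2];
            } else if d < 0 {
                acc -= slot.table[((-d) as usize - 1) / 2];
            }
        }
    }
    acc
}
```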
`wnaf_128` writes into a fixed 130-entry buffer; the bound holds for the current `WNAF_WIDTH = 5` and the ≤128-bit GLV sub-scalars, but it's implicit. Add a `debug_assert!` in the loop so that any future change to `WNAF_WIDTH` that invalidates the bound is caught at test time rather than silently writing out of bounds in worst-case inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
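The guard itself is a one-liner; illustratively (variable names assumed):

```rust
// Inside the digit-emission loop: `pos` is the next write index into the
// fixed 130-entry output buffer. For WNAF_WIDTH = 5 and <=128-bit
// sub-scalars the index can never reach 130; this documents that bound
// and enforces it under test.
debug_assert!(pos < 130, "wNAF output exceeded its buffer; revisit WNAF_WIDTH");
```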
`wnaf_128` tracked the residual scalar in two u64 limbs, but a negative recentered digit adds up to 2^(W-1) − 1 to the value, which can legitimately overflow past bit 127 when the input is close to 2^128 − 1. The old code let `hi.wrapping_add(1)` silently wrap, losing the carried bit and producing a NAF that reconstructs to the wrong value. The GLV decomposition's `(r1, r2)` each have magnitude strictly less than 2^128, so values in the carry-out window are possible (though vanishingly rare in random scalars — which is why the existing randomized tests never caught it). Fix by carrying the overflow bit into a third limb `top` that is absorbed back on the next right-shift. Perf impact is in the noise: the `top` branch is almost never taken and the predictor handles it cleanly. Add two regression tests: - `test_wnaf_128_reconstruction_adversarial` — reconstructs the NAF of a scalar with low 128 bits = 0xFF..FF and asserts it equals 2^128 − 1. - `test_mul_vartime_adversarial_scalars` — end-to-end check that `mul_vartime(P, k)` matches the constant-time reference when `k`'s low 128 bits trigger the carry window. Also add a `debug_assert!` on `idx` in `WnafSlot::apply` to guard the parallel invariant (`idx < WNAF_TABLE_SIZE`) if `WNAF_WIDTH` is ever widened without growing the table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
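A sketch of the fixed limb arithmetic (function and variable names are illustrative): the residual is `lo + 2^64*hi + 2^128*top`, and `top` only ever holds the carry-out bit, which the next right-shift folds back into `hi`.

```rust
/// Subtract a wNAF digit from the residual lo + 2^64*hi + 2^128*top.
fn subtract_digit(lo: &mut u64, hi: &mut u64, top: &mut u64, d: i8) {
    if d >= 0 {
        // Positive digit: plain two-limb subtraction with borrow.
        let (l, borrow) = lo.overflowing_sub(d as u64);
        *lo = l;
        *hi -= borrow as u64; // cannot underflow: the digit <= the residual
    } else {
        // Negative digit: adds |d| <= 15, which can carry all the way out
        // of bit 127 when the residual is near 2^128 - 1. The old code
        // dropped this carry; it now lands in `top`.
        let (l, c0) = lo.overflowing_add((-d) as u64);
        *lo = l;
        let (h, c1) = hi.overflowing_add(c0 as u64);
        *hi = h;
        *top += c1 as u64;
    }
}

/// Shift the residual right by one bit, absorbing `top` back into `hi`.
fn shift_right(lo: &mut u64, hi: &mut u64, top: &mut u64) {
    *lo = (*lo >> 1) | (*hi << 63);
    *hi = (*hi >> 1) | (*top << 63);
    *top = 0;
}
```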
It would be good if this could be implemented in terms of types from the `group` crate. That will probably require API changes to `group`.
that makes sense, I should have thought of that but was kind of tunnel-visioned on this crate. will explore this direction @tarcieri and see if an API change is needed to the `group` crate
Yes, and you should be able to use `WnafBase`/`WnafScalar` for it
42Pupusas force-pushed from 9e3e659 to d22cd8a
Replace custom wNAF implementation (wnaf_128, build_odd_multiples, WnafSlot, wnaf_ladder) with the group crate's WnafBase/WnafScalar types and WnafBase::multiscalar_mul_array. A new WnafScalar::from_le_bytes constructor accepts short (128-bit) GLV half-scalars, producing ~half the wNAF digits and ~half the doublings in the evaluation loop. multiscalar_mul_array avoids the two collect() heap allocations of the iterator-based multiscalar_mul. Depends on RustCrypto/group#15 for the group crate changes (wnaf_table size fix, from_le_bytes, multiscalar_mul_array, pre-sized Vec allocations). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
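For orientation, the pre-existing `group` wNAF types are used roughly like this for the single-point case (a sketch assuming the trait bounds line up for `k256::ProjectivePoint`; the short-scalar `from_le_bytes` constructor and `multiscalar_mul_array` are the group#15 additions, so their exact signatures live in that PR):

```rust
use group::{WnafBase, WnafScalar};
use k256::{ProjectivePoint, Scalar};

// Window-5 wNAF via the existing group API: precompute odd multiples of p
// once, recode k once, then evaluate. Both halves are reusable across calls.
fn mul_vartime(p: ProjectivePoint, k: &Scalar) -> ProjectivePoint {
    let base = WnafBase::<ProjectivePoint, 5>::new(p);
    let scalar = WnafScalar::<Scalar, 5>::new(k);
    &base * &scalar
}
```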
42Pupusas force-pushed from d22cd8a to 577a58b
@tarcieri the new approach maintained most of the performance gains, but still falls a bit behind the hand-rolled version due to some extra allocations that happen in the `group` crate. From my initial exploration, those allocations look avoidable with further `group` API changes. If that approach seems valuable to you, I can follow up with a PR for that
@42Pupusas I would suggest keeping the changes as minimal as possible, as they need to get upstreamed to https://github.com/zkcrypto/group. It does look like there are a few things in the PR you opened which could be split out into their own PRs and upstreamed directly though
Summary
Replaces the placeholder `MulVartime`/`MulByGeneratorVartime` impls for `k256::ProjectivePoint` (which fell back to the constant-time path and had TODOs to match) with a real endomorphism-aware width-5 wNAF, then folds the combined `mul_by_generator_and_mul_add_vartime` into a single shared-doublings ladder over all 4 GLV sub-scalars.
Closes #1725.
What changes
- Splits each scalar into two ~128-bit GLV halves, computes a width-5 signed-digit NAF of each magnitude, and runs a standard left-to-right double-and-add with precomputed odd multiples `[P, 3P, ..., 15P]` (table setup sketched after this list). Sign is folded into the precomputed points at setup.
- Factors out a small `WnafSlot` (odd-multiples table + digits) and a `wnaf_ladder` helper. The combined `a*G + b*P` variant runs one ladder over all 4 GLV slots (G + βG + P + βP), doing one `double()` per step instead of two independent ladders.
- Adds a `debug_assert!` where `wnaf_128` writes into its fixed 130-entry buffer; the bound is currently implicit in `WNAF_WIDTH = 5`, and this makes it explicit at test time.
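As referenced in the first bullet, the odd-multiples setup can be sketched as follows (hypothetical free function; in the PR this lives inside `WnafSlot` construction):

```rust
use k256::ProjectivePoint;

// Precompute [P, 3P, 5P, ..., 15P] for a width-5 window. If the GLV
// sub-scalar was negative, store the negated multiples instead, so the
// ladder body never has to branch on the sub-scalar's sign.
fn odd_multiples(p: ProjectivePoint, negate: bool) -> [ProjectivePoint; 8] {
    let base = if negate { -p } else { p };
    let twice = base.double();
    let mut table = [base; 8];
    for i in 1..8 {
        table[i] = table[i - 1] + twice; // (2i+1)P = (2i-1)P + 2P
    }
    table
}
```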
Perf (Schnorr verify, default features, x86_64)
~62 µs -> ~50 µs, ~19% faster end-to-end. Also speeds up any other user of `MulVartime`/`MulByGeneratorVartime` on the k256 curve.
Test plan
- `cargo test -p k256 --lib --features getrandom` — 89 passed, including randomized comparisons of `mul_vartime` and `mul_and_mul_add_vartime` vs. the constant-time reference (32 iterations each with `ProjectivePoint::generate()` and `Scalar::generate()`; sketched below).
- `cargo bench -p k256 --bench schnorr -- verify` on an idle host confirms the numbers above (criterion-reported change is stable across runs).
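The randomized comparison has roughly this shape (illustrative: `Group::random`/`Field::random` stand in for the PR's `generate()` helpers, and the `mul_vartime` method name is assumed from the `MulVartime` trait):

```rust
use elliptic_curve::group::{ff::Field, Group};
use k256::{ProjectivePoint, Scalar};
use rand_core::OsRng;

// Vartime multiplication must agree bit-for-bit with the constant-time
// `*` operator for random points and scalars.
#[test]
fn mul_vartime_matches_constant_time() {
    for _ in 0..32 {
        let p = ProjectivePoint::random(&mut OsRng);
        let k = Scalar::random(&mut OsRng);
        // `mul_vartime` is an assumed method name; import of the
        // MulVartime trait omitted.
        assert_eq!(p.mul_vartime(&k), p * k);
    }
}
```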
Notes
- `SECURITY:` comments are on the two vartime impls. Only reachable via the `MulVartime`/`MulByGeneratorVartime` traits that callers opt into for non-secret scalars.
- Alternatives were explored (odd-multiples via Montgomery's trick, sketched below; static precomputed G tables with mixed-add; wider window for the G side). The first two regressed perf at this width/scalar size; the last gave ~4% more but added a ~6 KB static, const-generics, and a new `LazyLock` path — not worth the complexity for a single-curve specialization. This PR sticks to the change that's pure upside.
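For reference, Montgomery's trick is batch inversion: it would normalize all the odd multiples to affine with a single field inversion. A generic sketch over `ff::Field` (not the PR's code, which dropped this approach):

```rust
use ff::Field;

/// Montgomery's trick: invert every element of `xs` in place using one
/// field inversion plus ~3n multiplications. All inputs must be nonzero.
fn batch_invert<F: Field>(xs: &mut [F]) {
    if xs.is_empty() {
        return;
    }
    // Prefix products: prefix[i] = x_0 * x_1 * ... * x_i.
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = F::ONE;
    for x in xs.iter() {
        acc *= *x;
        prefix.push(acc);
    }
    // One inversion of the running total, then unwind it back down:
    // at the top of each iteration, inv == (x_0 * ... * x_i)^-1.
    let mut inv = prefix[xs.len() - 1].invert().unwrap();
    for i in (1..xs.len()).rev() {
        let x_inv = inv * prefix[i - 1];
        inv *= xs[i];
        xs[i] = x_inv;
    }
    xs[0] = inv;
}
```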