Skip to content

k256: endomorphism-aware wNAF for vartime scalar multiplication#1745

Open
42Pupusas wants to merge 5 commits into
RustCrypto:masterfrom
42Pupusas:k256/schnorr-verify-perf
Open

k256: endomorphism-aware wNAF for vartime scalar multiplication#1745
42Pupusas wants to merge 5 commits into
RustCrypto:masterfrom
42Pupusas:k256/schnorr-verify-perf

Conversation

@42Pupusas
Copy link
Copy Markdown
Contributor

Summary

Replaces the placeholder MulVartime / MulByGeneratorVartime impls
for k256::ProjectivePoint (which fell back to the constant-time path
and had TODOs to match) with a real endomorphism-aware width-5 wNAF,
then folds the combined mul_by_generator_and_mul_add_vartime into a
single shared-doublings ladder over all 4 GLV sub-scalars.

Closes #1725.

What changes

  • Commit 1 — wNAF core. GLV-decompose the scalar into two ~128-bit
    halves, compute a width-5 signed-digit NAF of each magnitude, and run
    a standard left-to-right double-and-add with precomputed odd
    multiples [P, 3P, ..., 15P]. Sign is folded into the precomputed
    points at setup.
  • Commit 2 — share doublings. Extract a small WnafSlot
    (odd-multiples table + digits) and a wnaf_ladder helper. The
    combined a*G + b*P variant runs one ladder over all 4 GLV slots
    (G + βG + P + βP), doing one double() per step instead of two
    independent ladders.
  • Commit 3 — debug_assert. Guard the fixed 130-entry wNAF digit
    buffer; the bound is currently implicit in WNAF_WIDTH = 5, this
    makes it explicit at test time.

Perf (Schnorr verify, default features, x86_64)

Stage time/verify
Master (constant-time fallback) ~62 µs
After commit 1 (wNAF) ~53 µs
After commit 2 (shared ladder) ~50 µs

~19% faster end-to-end. Also speeds up any other user of
MulVartime / MulByGeneratorVartime on the k256 curve.

Test plan

  • cargo test -p k256 --lib --features getrandom — 89 passed
  • New randomized tests for mul_vartime and
    mul_and_mul_add_vartime vs. the constant-time reference
    (32 iterations each with ProjectivePoint::generate() and
    Scalar::generate()).
  • Edge-case tests: scalar = 0, 1, −1, point = identity.
  • cargo bench -p k256 --bench schnorr -- verify on an idle host
    confirms the numbers above (criterion-reported change is stable
    across runs).

Notes

  • Not constant time; SECURITY: comments are on the two vartime impls.
    Only reachable via the MulVartime / MulByGeneratorVartime traits
    that callers opt into for non-secret scalars.
  • Briefly explored several further optimizations (batched-affine
    odd-multiples via Montgomery's trick; static precomputed G tables
    with mixed-add; wider window for the G side). The first two
    regressed perf at this width/scalar size; the last gave ~4% more but
    added a ~6 KB static, const-generics, and a new LazyLock path —
    not worth the complexity for a single-curve specialization. This PR
    sticks to the change that's pure upside.

42Pupusas and others added 4 commits May 9, 2026 09:20
Replaces the placeholder MulVartime / MulByGeneratorVartime impls (which
just called the constant-time path and had TODOs to match) with a width-5
wNAF that uses the GLV endomorphism to split each scalar into two
~128-bit halves.

Schnorr verify: ~62 µs -> ~53 µs (14% faster, no precomputed-tables;
~55 µs with tables). Addresses RustCrypto#1725.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the combined `mul_by_generator_and_mul_add_vartime` into a single
wNAF ladder over all 4 GLV sub-scalars (s1, s2 for G and the
endomorphism; e1, e2 for P and the endomorphism). One `double()` per
step instead of two independent ladders.

Factors out a small `WnafSlot` (odd-multiples table + digits) and a
`wnaf_ladder` helper so the single-point `mul_vartime` and the combined
op share the same loop body.

Schnorr verify: ~53 µs -> ~50 µs (no precomputed-tables; ~51 µs with
tables). Total vs. pre-wNAF baseline: ~62 µs -> ~50 µs (~19% faster).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` writes into a fixed 130-entry buffer; the bound holds for the
current `WNAF_WIDTH = 5` and the ≤128-bit GLV sub-scalars, but it's
implicit. Add a `debug_assert!` in the loop so that any future change to
`WNAF_WIDTH` that invalidates the bound is caught at test time rather
than silently writing out of bounds in worst-case inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` tracked the residual scalar in two u64 limbs, but a negative
recentered digit adds up to 2^(W-1) − 1 to the value, which can legitimately
overflow past bit 127 when the input is close to 2^128 − 1. The old code
let `hi.wrapping_add(1)` silently wrap, losing the carried bit and
producing a NAF that reconstructs to the wrong value.

The GLV decomposition's `(r1, r2)` each have magnitude strictly less than
2^128, so values in the carry-out window are possible (though
vanishingly rare in random scalars — which is why the existing
randomized tests never caught it).

Fix by carrying the overflow bit into a third limb `top` that is absorbed
back on the next right-shift. Perf impact is in the noise: the `top`
branch is almost never taken and the predictor handles it cleanly.

Add two regression tests:

- `test_wnaf_128_reconstruction_adversarial` — reconstructs the NAF of a
  scalar with low 128 bits = 0xFF..FF and asserts it equals 2^128 − 1.
- `test_mul_vartime_adversarial_scalars` — end-to-end check that
  `mul_vartime(P, k)` matches the constant-time reference when `k`'s low
  128 bits trigger the carry window.

Also add a `debug_assert!` on `idx` in `WnafSlot::apply` to guard the
parallel invariant (`idx < WNAF_TABLE_SIZE`) if `WNAF_WIDTH` is ever
widened without growing the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tarcieri
Copy link
Copy Markdown
Member

It would be good if this could be implemented in terms of types from the (rustcrypto-)group crate and its existing group::{Wnaf, WnafBase, WnafScalar}.

That will probably require API changes to (rustcrypto-)group, e.g. potentially moving scalar multiplication to the WnafGroup trait which can provide a method but let this crate plug in its own that's aware of the endomorphism. But if done right I think it would let us reuse all the types, including the new BasepointTableVartime which many other curves in this repo are using to accelerate e.g. ECDSA verification.

@42Pupusas
Copy link
Copy Markdown
Contributor Author

42Pupusas commented May 10, 2026

that makes sense, I should have thought of that but was kind of tunnel visioned on this crate.

will explore this direction @tarcieri , if an API change is needed to the group crate I am guessing a corresponding PR should be raised there?

@tarcieri
Copy link
Copy Markdown
Member

Yes, and you should be able to use patch.crates-io to update this PR

@42Pupusas 42Pupusas force-pushed the k256/schnorr-verify-perf branch 2 times, most recently from 9e3e659 to d22cd8a Compare May 14, 2026 17:06
Replace custom wNAF implementation (wnaf_128, build_odd_multiples,
WnafSlot, wnaf_ladder) with the group crate's WnafBase/WnafScalar
types and WnafBase::multiscalar_mul_array.

A new WnafScalar::from_le_bytes constructor accepts short (128-bit)
GLV half-scalars, producing ~half the wNAF digits and ~half the
doublings in the evaluation loop. multiscalar_mul_array avoids the
two collect() heap allocations of the iterator-based multiscalar_mul.

Depends on RustCrypto/group#15 for the group
crate changes (wnaf_table size fix, from_le_bytes, multiscalar_mul_array,
pre-sized Vec allocations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@42Pupusas 42Pupusas force-pushed the k256/schnorr-verify-perf branch from d22cd8a to 577a58b Compare May 14, 2026 17:09
@42Pupusas
Copy link
Copy Markdown
Contributor Author

@tarcieri the new approach maintained most of the performance gains, but still falls behind a bit behind the hand-rolled version due to some extra allocations that happen in the group crate.

From my initial exploration, the Wnaf primitives could add a const generic for TABLE_SIZE, basically dropping heap allocations to 0 for the mul paths. However this is much more intrusive and potentially cascades down to most other curves that use it, so I held off from moving forward on that path.

If that approach seems valuable to you, I can follow up with a PR for that

@tarcieri
Copy link
Copy Markdown
Member

@42Pupusas I would suggest keeping the changes as minimal as possible, as they need to get upstreamed to https://github.com/zkcrypto/group

It does look like there are a few things in the PR you opened which could be split out into their own PRs and upstreamed directly though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

k256: endomorphism-aware wNAF implementation

2 participants