Add contrib intdiv: fast integer division by invariant scalars using multiplication #2875

Open
abhishek-iitmadras wants to merge 2 commits into google:master from abhishek-iitmadras:abhishekk_intdiv

Contributor

@abhishek-iitmadras abhishek-iitmadras commented Feb 23, 2026

This change adds a contrib module implementing fast integer division by invariant (loop-constant) divisors using multiplication and shifts, following Granlund & Montgomery, “Division by Invariant Integers Using Multiplication” (PLDI 1994).

  • Supports all scalar lane widths and signs:

    • Unsigned: uint8_t, uint16_t, uint32_t, uint64_t
    • Signed: int8_t, int16_t, int32_t, int64_t
  • This contrib module provides a general-purpose, cross-architecture implementation of division by invariant scalars using multiplication, suitable for vectorized code built on Highway. It mirrors the Granlund-Montgomery scheme and is conceptually similar to the integer SIMD division intrinsics used in NumPy’s npyv_intdiv, but expressed purely in Highway’s portable SIMD API.

Benchmarking script: https://gist.github.com/abhishek-iitmadras/169f995df2bf9b1e7827c712528af0c2
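
For readers unfamiliar with the technique, here is a minimal scalar sketch of unsigned 32-bit division by an invariant divisor, following the multiplier-plus-two-shifts form described by Granlund and Montgomery. The names U32DivParams, ComputeU32Params, DivideU32 and DivideArrayU32 are hypothetical and are not the API added by this PR; the sketch only illustrates the precompute-once, reuse-many pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical parameter bundle; the field names are illustrative only.
struct U32DivParams {
  uint32_t multiplier;
  int shift1;
  int shift2;
};

// Precompute once per divisor. Requires d != 0.
inline U32DivParams ComputeU32Params(uint32_t d) {
  assert(d != 0);
  int l = 0;  // l = ceil(log2(d)), in [0, 32]
  while ((uint64_t{1} << l) < d) ++l;
  const uint64_t two_l = uint64_t{1} << l;
  // m = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits.
  const uint64_t m = ((two_l - d) << 32) / d + 1;
  return {static_cast<uint32_t>(m), l < 1 ? l : 1, l > 1 ? l - 1 : 0};
}

// Divide using one widening multiply, one subtract/add and two shifts.
inline uint32_t DivideU32(uint32_t n, const U32DivParams& p) {
  // High 32 bits of the 64-bit product (a "multiply high").
  const uint32_t q =
      static_cast<uint32_t>((uint64_t{n} * p.multiplier) >> 32);
  const uint32_t t = q + ((n - q) >> p.shift1);
  return t >> p.shift2;
}

// "Hot" usage: parameters are computed once and reused for every element.
inline void DivideArrayU32(uint32_t* data, size_t count, uint32_t d) {
  const U32DivParams p = ComputeU32Params(d);
  for (size_t i = 0; i < count; ++i) data[i] = DivideU32(data[i], p);
}
```

The divisor is a runtime value here, as in the PR's benchmarks; the benefit comes from amortizing ComputeU32Params over many elements.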

Comment thread BUILD Outdated
Member

@jan-wassenberg jan-wassenberg left a comment

A bunch of comments :) Some influence others, so please read them all before addressing any.

Comment thread hwy/contrib/intdiv/intdiv-inl.h Outdated
*
* We split the work into two steps:
* 1) Precompute parameters from the scalar divisor (multiplier + shifts).
* DivisorParams{U,S}<T> ComputeDivisorParams(T divisor);
Member

Can we reuse any of the existing logic from base.h Divisor[64]?

Contributor Author

@abhishek-iitmadras abhishek-iitmadras May 3, 2026

OK, it makes sense to use the existing API, so:

  • I now reuse the existing Divisor64 128-bit division primitive instead of duplicating it in intdiv-inl.h.

  • For this to work correctly, base.h needs one change: Divisor64::Div128 must be made public in the HWY_HAVE_DIV128 implementation (it is currently private). Only then can intdiv-inl.h call hwy::Divisor64::Div128(high, divisor).

  • I also keep the full DivisorParamsU / DivisorParamsS logic in intdiv-inl.h, because Divisor / Divisor64 do not cover signed params, 8/16-bit widening params, pow2 metadata, divisor storage for remainder, or floor-division correction. The 128-bit high-half division primitive, however, should definitely be shared.
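
To illustrate why a 128-bit division primitive is needed for 64-bit divisors, the following sketch computes the 64-bit multiplier as floor(2^64 * (2^l - d) / d) + 1, which requires dividing a 128-bit numerator by d. It assumes the GCC/Clang unsigned __int128 extension; DivHigh128Sketch is a hypothetical stand-in and is not Highway's Divisor64::Div128, whose exact signature is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the compiler provides the GCC/Clang 128-bit integer extension.
__extension__ typedef unsigned __int128 U128;

// Returns floor((high * 2^64) / d). Requires high < d so that the quotient
// fits in 64 bits.
inline uint64_t DivHigh128Sketch(uint64_t high, uint64_t d) {
  assert(d != 0 && high < d);
  return static_cast<uint64_t>((static_cast<U128>(high) << 64) / d);
}

// Hypothetical parameter bundle; the field names are illustrative only.
struct U64DivParams {
  uint64_t multiplier;
  int shift1;
  int shift2;
};

inline U64DivParams ComputeU64Params(uint64_t d) {
  assert(d != 0);
  int l = 0;  // l = ceil(log2(d)), in [0, 64]
  while (l < 64 && (uint64_t{1} << l) < d) ++l;
  // 2^l - d, computed modulo 2^64 so that l == 64 also works.
  const uint64_t two_l = (l == 64) ? uint64_t{0} : (uint64_t{1} << l);
  const uint64_t high = two_l - d;  // always < d
  const uint64_t m = DivHigh128Sketch(high, d) + 1;
  return {m, l < 1 ? l : 1, l > 1 ? l - 1 : 0};
}

inline uint64_t DivideU64(uint64_t n, const U64DivParams& p) {
  // High 64 bits of the 128-bit product.
  const uint64_t q =
      static_cast<uint64_t>((static_cast<U128>(n) * p.multiplier) >> 64);
  const uint64_t t = q + ((n - q) >> p.shift1);
  return t >> p.shift2;
}
```

Only the parameter setup needs the 128-by-64 division; the per-element path uses a multiply-high and shifts.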

Comment thread hwy/contrib/intdiv/intdiv-inl.h Outdated
};

template <>
struct MulType<uint8_t> {
Member

We can use existing functionality: base.h also has

template <>
struct Relations<uint8_t> {
  using Unsigned = uint8_t;
  using Signed = int8_t;
  using Wide = uint16_t;
};

etc, so we could use MulType = Relations::Wide.

Contributor Author

I tried, but it is only partially applicable and not a direct replacement, because the mapping diverges for 32-bit and 64-bit types.

Comment thread hwy/contrib/intdiv/intdiv.h
Comment thread hwy/contrib/intdiv/intdiv.h Outdated
return HWY_NAMESPACE::ComputeDivisorParams<T>(d);
}

template <typename T, HWY_IF_T_SIZE(T, 1), HWY_IF_SIGNED(T)>
Member

Rather than SFINAE for T size 1..8, we can HWY_IF_T_SIZE_ONE_OF(T, (1 << 1) | (1 << 2) | (1 << 4) | (1 << 8)), or better yet, just static_assert IsPow2(sizeof(T)) within one function.

Contributor Author

OK, got it. I have now collapsed the repeated size-based SFINAE overloads into a single signed and a single unsigned overload, using static_assert for the supported integer sizes.
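
As an illustration of the pattern discussed here, the sketch below replaces per-size overloads with one function template that validates its type using static_assert. IsSupportedSize and CeilLog2Sketch are hypothetical names chosen for this example; they are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: lane sizes (in bytes) that the sketch accepts.
constexpr bool IsSupportedSize(size_t bytes) {
  return bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8;
}

// One template for every unsigned lane type instead of four SFINAE overloads.
// Unsupported types fail at compile time with a readable message.
template <typename T>
int CeilLog2Sketch(T divisor) {
  static_assert(std::is_integral<T>::value && std::is_unsigned<T>::value,
                "T must be an unsigned integer type");
  static_assert(IsSupportedSize(sizeof(T)), "T must be 1, 2, 4 or 8 bytes");
  assert(divisor != 0);
  const int bits = static_cast<int>(sizeof(T) * 8);
  int l = 0;  // smallest l with 2^l >= divisor
  while (l < bits && (uint64_t{1} << l) < divisor) ++l;
  return l;
}
```

Compared with SFINAE, a mismatched type produces a direct diagnostic rather than an overload-resolution failure, at the cost of no longer participating in overload selection.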

Comment thread hwy/contrib/intdiv/intdiv.h Outdated
return HWY_NAMESPACE::ComputeDivisorParams<T>(d);
}

template <class D, class V = VecD<D>, typename T = TFromD_<D>, HWY_IF_UNSIGNED_D(D)>
Member

Here also we could static_assert(IsUnsigned()) inside the function, given that you have a DivisorParamsU argument.

Contributor Author

Same here as well.

Comment thread hwy/contrib/intdiv/intdiv_test.cc Outdated
if constexpr (sizeof(T) <= 4) {
return static_cast<T>(Random32(&rng));
} else {
const uint64_t hi = Random32(&rng);
Member

We do have a Random64().

Contributor Author

@abhishek-iitmadras abhishek-iitmadras Mar 16, 2026

Thanks. Done, I now use Random64().

Comment thread hwy/contrib/intdiv/intdiv_test.cc Outdated
}

template <typename T>
bool IsPow2(T x) {
Member

Already defined in intdiv-inl.h?

Contributor Author

@abhishek-iitmadras abhishek-iitmadras Mar 16, 2026

Done, removed from here.

@abhishek-iitmadras abhishek-iitmadras marked this pull request as ready for review March 16, 2026 16:13
Comment thread hwy/contrib/intdiv/intdiv-inl.h Outdated

template <>
struct MultiplierType<uint32_t> {
using type = uint32_t;
Member

We can use using type = If<(sizeof(T) < 4), Relations::Wide, T>.

Contributor Author

@abhishek-iitmadras abhishek-iitmadras May 2, 2026

I agree with your suggestion to reuse the existing Highway type relation, but I don't think If<(sizeof(T) < 4), MakeWide<T>, T> is safe here.

Looking at hwy/base.h, Relations<int64_t> has no Wide member (only Unsigned/Signed/Float/Narrow), so MakeWide<int64_t> is ill-formed.

If I am not mistaken, the main issue is that the type arguments to If still have to be well-formed before selection. Even though sizeof(int64_t) < 4 is false, naming MakeWide<int64_t> as an argument still fails during substitution.

The partial specialization keeps the selection lazy:

template <typename T, bool kNeedsWiden = (sizeof(T) < 4)>
struct MultiplierType {
  using type = T;
};
template <typename T>
struct MultiplierType<T, true> {
  using type = MakeWide<T>;
};

This still reuses Highway's MakeWide<T> / Relations<T>::Wide for 8- and 16-bit types, but avoids instantiating MakeWide<T> for 32- and 64-bit types. That is also intentional, because the 32/64-bit paths use same-width multiplier storage and have separate MulHigh / Div128 handling.

Correct me if I am wrong.
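
The eager-versus-lazy distinction can be reproduced without Highway. In the sketch below, WideOf is a hypothetical stand-in for a Relations-like trait that deliberately lacks a wide type for 32- and 64-bit integers; it is not Highway's Relations, and std::conditional_t stands in for If. The commented-out alias shows the eager form that fails, followed by two lazy alternatives.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait: only 8- and 16-bit types have a wider counterpart.
template <typename T>
struct WideOf {};  // primary template has no `type` member
template <> struct WideOf<uint8_t> { using type = uint16_t; };
template <> struct WideOf<uint16_t> { using type = uint32_t; };
template <> struct WideOf<int8_t> { using type = int16_t; };
template <> struct WideOf<int16_t> { using type = int32_t; };

// Eager selection: both branches are named before the condition is applied,
// so instantiating this alias with uint64_t is a hard error.
// template <typename T>
// using EagerType =
//     std::conditional_t<(sizeof(T) < 4), typename WideOf<T>::type, T>;
// EagerType<uint64_t> x;  // error: no type named 'type' in WideOf<uint64_t>

// Lazy selection via partial specialization: WideOf<T>::type is only named
// in the specialization that is actually chosen.
template <typename T, bool kNeedsWiden = (sizeof(T) < 4)>
struct MultiplierTypeSketch {
  using type = T;
};
template <typename T>
struct MultiplierTypeSketch<T, true> {
  using type = typename WideOf<T>::type;
};

// Lazy alternative: select between two wrapper types first, then ask the
// selected wrapper for its nested `type`.
template <typename T>
struct IdentitySketch {
  using type = T;
};
template <typename T>
using MultiplierTypeAlt =
    typename std::conditional_t<(sizeof(T) < 4), WideOf<T>,
                                IdentitySketch<T>>::type;
```

Both lazy forms yield the same mapping; the partial specialization is arguably easier to read, while the wrapper form avoids a second template.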

Signed-off-by: Abhishek Kumar <abhishek.r.kumar@fujitsu.com>
Contributor Author

abhishek-iitmadras commented May 12, 2026

Hi @jan-wassenberg

Benchmarking script: https://gist.github.com/abhishek-iitmadras/169f995df2bf9b1e7827c712528af0c2

Machine details: https://instances.vantage.sh/aws/ec2/c7g.metal?currency=USD (SVE-256 machine)

Benchmarks were run on Arm SVE-256. The divisor is supplied via Google Benchmark arguments (state.range(1)), so it is a runtime divisor, not a compile-time constant. IntDivHot precomputes divisor parameters once from that runtime divisor and reuses them across the array.

Measured sizes:

| Expression | N |
| --- | --- |
| 1 << 10 | 1,024 |
| 1 << 14 | 16,384 |
| 1 << 18 | 262,144 |
| 1 << 20 | 1,048,576 |

Runtime divisor sets:

| Type | Divisors |
| --- | --- |
| uint8_t | 1, 2, 3, 255, 254 |
| uint16_t | 1, 2, 3, 65535, 65534 |
| uint32_t | 1, 2, 3, 4294967295, 4294967294 |
| uint64_t | 1, 2, 3, 9223372036854775807, 9223372036854775806 |
| int8_t | 1, 2, 3, -1, -2, -3, 127, 126 |
| int16_t | 1, 2, 3, -1, -2, -3, 32767, 32766 |
| int32_t | 1, 2, 3, -1, -2, -3, 2147483647, 2147483646 |
| int64_t | 1, 2, 3, -1, -2, -3, 9223372036854775807, 9223372036854775806 |

Results:

The table below reports N = 1,048,576 and divisor = 3, because 3 is a representative non-power-of-two runtime divisor and this size shows steady-state throughput. Similar speedups were observed across the other measured N and divisor combinations; full raw benchmark logs are linked below.

All results are in ns.

| Type | Scalar Div | hwy::Div | IntDiv hot | Div Speedup vs Scalar Div | Speedup vs hwy::Div | Scalar Rem | Hwy::Mod | IntRem hot | Rem Speedup vs Scalar Rem | Speedup vs Hwy::Mod |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uint8_t | 2420803 | 354136 | 110614 | 21.89x | 3.20x | 2824895 | 354777 | 124355 | 22.72x | 2.85x |
| uint16_t | 3181084 | 354224 | 222489 | 14.30x | 1.59x | 3590825 | 354230 | 250584 | 14.33x | 1.41x |
| uint32_t | 404994 | 405026 | 156150 | 2.59x | 2.59x | 5201766 | 405090 | 182853 | 28.45x | 2.22x |
| uint64_t | 1416638 | 1416703 | 376722 | 3.76x | 3.76x | 8453969 | 1416804 | 453290 | 18.65x | 3.13x |
| int8_t | 354220 | 364346 | 133731 | 2.65x | 2.72x | 2821562 | 365327 | 152098 | 18.55x | 2.40x |
| int16_t | 354205 | 354713 | 268272 | 1.32x | 1.32x | 3531486 | 354317 | 304257 | 11.61x | 1.16x |
| int32_t | 405057 | 405025 | 209654 | 1.93x | 1.93x | 5148002 | 405842 | 234037 | 22.00x | 1.73x |
| int64_t | 1412837 | 1410457 | 457177 | 3.09x | 3.09x | 8384109 | 1410448 | 539478 | 15.54x | 2.61x |

Raw benchmark logs

| Type | Log |
| --- | --- |
| int8_t | sve256_int8.txt |
| int16_t | sve256_int16.txt |
| int32_t | sve256_int32.txt |
| int64_t | sve256_int64.txt |
| uint8_t | sve256_uint8.txt |
| uint16_t | sve256_uint16.txt |
| uint32_t | sve256_uint32.txt |
| uint64_t | sve256_uint64.txt |

Contributor Author

abhishek-iitmadras commented May 12, 2026

A few details about the benchmarking script: https://gist.github.com/abhishek-iitmadras/169f995df2bf9b1e7827c712528af0c2

| Benchmark | What it measures | Divisor handling | Purpose |
| --- | --- | --- | --- |
| BM_ScalarDivAutoVec<T> | Plain C++ scalar loop using / | Runtime divisor from state.range(1) | Baseline for normal C++ division. Auto-vectorization is allowed. |
| BM_HwyDiv<T> | Existing Highway hn::Div(v, divv) | Runtime divisor broadcast with hn::Set(d, divisor) | Baseline for existing generic Highway vector division. |
| BM_IntDivHot<T> | New hn::IntDiv(d, v, params) path | Runtime divisor; ComputeDivisorParams is done once outside the timed loop | Main benchmark for intended invariant-divisor use case. |
| BM_IntDivCold<T> | New hn::IntDiv(d, v, params) path including setup | Runtime divisor; ComputeDivisorParams is done inside the timed loop | Measures cost when divisor parameters are not precomputed ahead of time. |
| BM_IntDivMixedDivisors<T> | New hn::IntDiv with different precomputed params per vector chunk | Params are precomputed; timed loop only selects among them | Branch-stress benchmark for params.is_pow2 behavior. |
| BM_ScalarRemAutoVec<T> | Plain C++ scalar loop using % | Runtime divisor from state.range(1) | Baseline for normal C++ remainder. Auto-vectorization is allowed. |
| BM_HwyMod<T> | Existing Highway hn::Mod(v, divv) | Runtime divisor broadcast with hn::Set(d, divisor) | Baseline for existing generic Highway vector remainder. |
| BM_IntRemHot<T> | Remainder via q = IntDiv(...), then r = a - q * divisor | Runtime divisor; params are precomputed outside timed loop | Measures new invariant-divisor remainder path. |
| BM_ScalarFloorDivAutoVec<T> | Scalar signed floor division using Python/NumPy semantics | Runtime divisor from state.range(1) | Baseline for signed floor division. Signed types only. |
| BM_IntFloorDivHot<T> | New vectorized signed floor division via hn::IntDivFloor | Runtime divisor; params are precomputed outside timed loop | Measures vectorized Python/NumPy-style floor division. Signed types only. |
| BM_ComputeParams<T> | Only hn::ComputeDivisorParams(divisor) | Runtime divisor from state.range(0) | Measures divisor-parameter precomputation cost alone. |
| BM_HwyBaseDivisorU32 | Scalar hwy::Divisor from hwy/base.h | Runtime uint32_t divisor; scalar precomputed helper | Optional comparison against existing scalar unsigned 32-bit helper. Not registered by default. |
| BM_HwyBaseDivisorU64 | Scalar hwy::Divisor64 from hwy/base.h | Runtime uint64_t divisor; scalar precomputed helper | Optional comparison against existing scalar unsigned 64-bit helper. Not registered by default. |
| BM_ScalarDivInPlace<T> | In-place scalar array division | Runtime divisor from state.range(1) | Baseline for in-place array API. |
| BM_IntDivArrayInPlace<T> | In-place array division using hn::DivideArrayByScalar | Runtime divisor; params are computed inside array helper | Measures end-to-end array API performance. |
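
The remainder and floor-division rows above rely on two small identities, sketched here in scalar form. The function names are hypothetical, and the plain / operator stands in for the multiply-and-shift quotient; this shows the semantics being benchmarked, not the PR's vector code. The INT32_MIN / -1 overflow case is deliberately not handled.

```cpp
#include <cassert>
#include <cstdint>

// Truncating remainder recovered from a quotient: r = a - q * d.
inline int32_t RemFromQuotient(int32_t a, int32_t q, int32_t d) {
  return a - q * d;
}

// Python/NumPy-style floor division: round toward negative infinity.
inline int32_t FloorDivSketch(int32_t a, int32_t d) {
  assert(d != 0);
  const int32_t q = a / d;  // stand-in for the multiply-and-shift quotient
  const int32_t r = RemFromQuotient(a, q, d);
  // C++ truncates toward zero, so adjust when the remainder is nonzero and
  // its sign differs from the divisor's sign.
  const bool needs_fix = (r != 0) && ((r < 0) != (d < 0));
  return needs_fix ? q - 1 : q;
}

// Floor-style modulo: the result has the sign of the divisor (or is zero).
inline int32_t FloorModSketch(int32_t a, int32_t d) {
  return a - FloorDivSketch(a, d) * d;
}
```

For example, -7 divided by 2 truncates to -3 with remainder -1, whereas floor division gives -4 with modulo 1.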

Comment thread hwy/contrib/intdiv/intdiv-inl.h Outdated

constexpr int kShift = static_cast<int>(sizeof(T) * 8);

#if defined(HWY_HAVE_ORDEREDDEMOTE2TO)
Member

What's the issue here? I'd think/hope this is always available. What are the circumstances when it isn't? Can we remove the #if?

Contributor Author

> What's the issue here? I'd think/hope this is always available. What are the circumstances when it isn't? Can we remove the #if?

Thanks, you are right.

I confused myself: the #if evaluates to false everywhere, so every build has been silently going through the Combine + DemoteTo fallback path.

As per the docs, OrderedDemote2To is available whenever HWY_TARGET != HWY_SCALAR.

However, if constexpr (D::kPrivateLanes < 2) already routes HWY_SCALAR (and any other single-lane tag) to ScalarDivPerLane, so by the time we reach this line we are guaranteed to be on a target where OrderedDemote2To exists.

So the answer to "when isn't it available?" is: only on HWY_SCALAR, which can't reach this code path anyway.

I am therefore removing the #if and calling OrderedDemote2To unconditionally in both the signed and unsigned IntDiv branches.

I expect the numbers to improve slightly, since they will now actually hit the optimized demote path.
Thanks again for pointing this out.
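
For context, the widen, multiply, shift by kShift, then narrow pattern that this thread touches on can be sketched for a single 8-bit lane as follows. This is a scalar illustration under the assumption that the 8-bit path multiplies in 16-bit and keeps the upper half; the names are hypothetical and no Highway ops are used. Because the 8-bit domain is small, the sketch includes an exhaustive check against the built-in division.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle for 8-bit lanes.
struct U8DivParams {
  uint8_t multiplier;
  int shift1;
  int shift2;
};

inline U8DivParams ComputeU8Params(uint8_t d) {
  assert(d != 0);
  int l = 0;  // l = ceil(log2(d)), in [0, 8]
  while ((1u << l) < d) ++l;
  // m = floor(2^8 * (2^l - d) / d) + 1, which always fits in 8 bits.
  const uint32_t m = (((1u << l) - d) << 8) / d + 1;
  return {static_cast<uint8_t>(m), l < 1 ? l : 1, l > 1 ? l - 1 : 0};
}

inline uint8_t DivideU8(uint8_t n, const U8DivParams& p) {
  // Widen to 16 bits, multiply, then keep the upper 8 bits (shift by 8,
  // i.e. sizeof(uint8_t) * 8), which narrows back to the lane width.
  const uint16_t wide = static_cast<uint16_t>(uint16_t{n} * p.multiplier);
  const uint8_t q = static_cast<uint8_t>(wide >> 8);
  const uint8_t t = static_cast<uint8_t>(q + ((n - q) >> p.shift1));
  return static_cast<uint8_t>(t >> p.shift2);
}

// Exhaustively compares against n / d for every divisor and numerator.
inline int CountU8Mismatches() {
  int mismatches = 0;
  for (int d = 1; d <= 255; ++d) {
    const U8DivParams p = ComputeU8Params(static_cast<uint8_t>(d));
    for (int n = 0; n <= 255; ++n) {
      if (DivideU8(static_cast<uint8_t>(n), p) != n / d) ++mismatches;
    }
  }
  return mismatches;
}
```

In vector code the narrowing step would be performed across two widened halves at once, which is where a two-input demote comes in.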

jan-wassenberg previously approved these changes May 15, 2026
Member

@jan-wassenberg jan-wassenberg left a comment

I'll be out from Sat to May 26, so let's try to land this first and address the #if and branch later.

Contributor Author

Sorry, @jan-wassenberg,
it looks like pushing the new commit dismissed your now-stale review approval. Please approve again 😊, which should also trigger the two import/copybara CI checks.
