Skip to content

Add Highway SIMD acceleration to ImageBufAlgo [add, sub, mul, div, mad, resample]#4994

Open
ssh4net wants to merge 2 commits intoAcademySoftwareFoundation:mainfrom
ssh4net:_hwy
Open

Add Highway SIMD acceleration to ImageBufAlgo [add, sub, mul, div, mad, resample]#4994
ssh4net wants to merge 2 commits intoAcademySoftwareFoundation:mainfrom
ssh4net:_hwy

Conversation

@ssh4net
Copy link
Copy Markdown
Contributor

@ssh4net ssh4net commented Jan 7, 2026

Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library: • add/sub
• mul/div
• mad
• resample
Adds CMake and build system support, new implementation helpers, and developer documentation.

Code mostly wrote using frontier Opus4.5 and Codex GPT5.2 High models with a strict rules.

Checklist:

  • I have read the guidelines on contributions and code review procedures.
  • I have updated the documentation if my PR adds features or changes
    behavior.
  • I am sure that this PR's changes are tested somewhere in the
    testsuite
    .
  • I have run and passed the testsuite in CI before submitting the
    PR, by pushing the changes to my fork and seeing that the automated CI
    passed there. (Exceptions: If most tests pass and you can't figure out why
    the remaining ones fail, it's ok to submit the PR and ask for help. Or if
    any failures seem entirely unrelated to your change; sometimes things break
    on the GitHub runners.)
  • My code follows the prevailing code style of this project and I
    fixed any problems reported by the clang-format CI test.
  • If I added or modified a public C++ API call, I have also amended the
    corresponding Python bindings. If altering ImageBufAlgo functions, I also
    exposed the new functionality as oiiotool options.

@lgritz
Copy link
Copy Markdown
Collaborator

lgritz commented Jan 7, 2026

I suspect you used LLM for some of this? Which is fine, but I think you should document in the PR description (commit comment) which tool you used and for what parts.

Comment on lines +127 to +131
template<class Rtype, class Atype, class Btype>
static bool
add_impl_hwy(ImageBuf& R, const ImageBuf& A, const ImageBuf& B, ROI roi,
int nthreads)
{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done a line-by-line comparison, but it seems to me that the only difference between add_impl_hwy, sub_impl_hwy, and mul_impl_hwy is likely going to be

[](auto d, auto a, auto b) { return hn::Add(a, b); }

versus that one lambda changing for Sub and Mul.

I would love for even the initial commit to reduce this whole thing to a shared hwy_binary_perpixel_op() template that takes the lambda housing the op kernel as a templated parameter.

Comment on lines +155 to +161
// Process pixel by pixel (scalar fallback for strided channels)
for (int x = roi.xbegin; x < roi.xend; ++x) {
Rtype* r_ptr = ChannelPtr<Rtype>(Rv, x, y, roi.chbegin);
const Atype* a_ptr = ChannelPtr<Atype>(Av, x, y,
roi.chbegin);
const Btype* b_ptr = ChannelPtr<Btype>(Bv, x, y,
roi.chbegin);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should benchmark the strided case and see how it compares to the contiguous case and the full scalar fallback that we've always had. If there is no big speed gain, I would be in favor of eliminating this whole clause and let non-contiguous strides use the old scalar path, then there is much less template expansion for hwy in the cases where there is not a large gain to be had. Note that this means that the "to hwy or not to hwy" test would need to test contiguity in addition to just localpixels().

@lgritz
Copy link
Copy Markdown
Collaborator

lgritz commented Feb 19, 2026

@ssh4net It's been a while since this PR has been updated, but after your last push, it's failing to build. Can you please rebase on main, fix so it passes CI, and ensure that there is a DCO sign-off on each commit? I would like to proceed with this in some form.

@ssh4net
Copy link
Copy Markdown
Contributor Author

ssh4net commented Feb 20, 2026

@lgritz sure! Give me a bit of time. I have added some fixes based on discussion above, but not verified fully yet, and switched to other projects 😅
Will check, rebase and push soon.

@ssh4net
Copy link
Copy Markdown
Contributor Author

ssh4net commented Feb 24, 2026

@lgritz, please check. lateast code. It now should support SIMD path for RGB ROI in case of RGBA inputs.

MAD-Specific Note
mad can legitimately differ from the scalar a*b + c by tiny amounts when HWY uses FMA (MulAdd), because fused vs non-fused evaluation changes rounding. In normalized integer formats that often show up as exactly 1 LSB (e.g., uint16 max error 1/65535 = 1.52588e-05), and for uint32 it can reflect float32 precision limits after scaling/rounding.

test tool in the archive
hwy_test.zip

And I see CL errors in this repo when in my repo all of them run successfully.

@1div0
Copy link
Copy Markdown
Contributor

1div0 commented Mar 1, 2026

Small error in the maximum error 1/65536 instead of the 1/65535. This is really MAD.

Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library:
• add/sub
• mul/div
• mad
• resample
Adds CMake and build system support, new implementation helpers, and developer documentation.

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@lgritz
Copy link
Copy Markdown
Collaborator

lgritz commented Apr 6, 2026

@ssh4net I'm sorry, I think I didn't realize how big a change you made after my last comments. I actually came back to this today thinking that maybe I'd pitch in and do some of the refactoring that I suggested, but... you had already done it! I really like what you've done here, it's much simpler at the IBA call sites than last time I viewed your earlier drafts.

There are just a handful of tests failing, and I'm looking into it.

I will update you soon, but I wanted you to know that this is still a very active PR and I do expect us to merge this with only a few other minor changes.

@ssh4net
Copy link
Copy Markdown
Contributor Author

ssh4net commented Apr 6, 2026

@lgritz sure, no problem.

@lgritz
Copy link
Copy Markdown
Collaborator

lgritz commented Apr 6, 2026

Would you mind if I squashed these, rebased on main, fix the conflicts, and force push the results so we have a clean single commit to build on? I know why the tests are failing. (At least some of them.)

@ssh4net
Copy link
Copy Markdown
Contributor Author

ssh4net commented Apr 6, 2026

@lgritz sure, do what you think is needed.

* Make cmake option OIIO_USE_HWY able to be overridden by end (by using
  `set_option()`)

* Let env variable OPENIMAGEIO_ENABLE_HWY override runtime hwy
  enabling. (It was documented to do so, but not yet implemented.)

* Clarify that hwy support will be added for 3.2, not 3.1. And that
  for right now, it defaults to OFF until we have worked out all the
  bugs and gain more confidence in it. The intention is for it to be
  enabled by default in time for the release of OIIO 3.2.

* When hwy is disabled at build time, don't allow attribute() to appear
  to enable it.

Signed-off-by: Larry Gritz <lg@larrygritz.com>
@lgritz
Copy link
Copy Markdown
Collaborator

lgritz commented Apr 6, 2026

I squashed, rebased (fixing conflicts), and pushed, so now this is directly versus the current main.

I made the following additions so far, amended to your work:

  • Make the build-time cmake option OIIO_USE_HWY able to be overridden by environment (by using set_option())

  • Let env variable OPENIMAGEIO_ENABLE_HWY override runtime hwy enabling. (It was documented to do so, but not yet implemented.)

  • Clarify that hwy support will be added for 3.2, not 3.1. And that for right now, it defaults to OFF until we have worked out all the bugs and gain more confidence in it. The intention is for it to be enabled by default in time for the release of OIIO 3.2.

  • When hwy is disabled at build time, don't allow attribute() to appear to enable it.

Note in particular that I've changed the default behavior to

  1. Build with support if hwy is present AND it is enabled at build time with -DOIIO_USE_HWY=ON.
  2. Enable runtime support only if either (a) env variable OPENIMAGEIO_ENABLE_HWY is nonzero, or (b) OIIO::attribute("enable_hwy") is set by the program to nonzero.

I believe that makes it totally safe, even with any current flaws, because people need to take action to build it, and also to run it even if it was enabled at build time. The intent, of course, is to switch both of those defaults to "enable" by the time we ship 3.2, but we have some more work and fixes to do first.

I think that with this, I would like to propose that we merge what we have here (assuming that CI passes), and then I will have some additional fixes coming shortly. But then with this merged, all the changes from here on out should be small and easy to review.

@lgritz
Copy link
Copy Markdown
Collaborator

lgritz commented Apr 6, 2026

Like I said, the default as I have it here now is that it's disabled (making bugs harmless until we fix them).

I think that after merging, the remaining tasks are as follows:

  1. There are some test failures currently when it's enabled, and I think I have identified at least one of the causes: the test for when we can use the hwy code paths (versus when we fall back to the old pixel-by-pixel scalar code) is not quite right. You test for the images all being local pixels and for those pixels to be contiguous across a scanline, but that's not enough! It turns out that also all of the input and output ImageBufs need to have their valid pixel data regions fully overlap the ROI that we are operating on, because the hwy code paths are not considering pixels outside the data windows. Because the ImageBuf::Iterator knows how to handle this (automatically pointing to black pixel data when "off the edge"), the logic in the old pixel loops didn't need to explicitly handle this, so it was probably not apparent to you looking at the old code that it was a condition we needed to account for.

    I have a fix cooking for that, though I also need to think about if there are any other hidden conditions that haven't occurred to either of us yet.

  2. The next thing I want to do (and I have a plan for how to do it) is set up the testsuite and the CI to be sure we are testing both scalar and hwy code.

  3. At that point, and if there are no apparent failures left, we can think about switching the build and runtime defaults to enable hwy.

  4. It would be great to have "self-builders" set up for hwy so that we can use it even if not found already installed on the system.

  5. I want to re-benchmark everything, and make sure that we're only enabling hwy for the particular cases where it seems to speed things up, on the various platforms we test on. We want it enabled by default, and to be confident that it only kicks in for cases where it will improve performance.

    As part of this, I think there will be some cases where it's slow and we probably should expect it. There may also be cases where we think it should be faster, but it's not, and we want to take the time to investigate that and make sure there isn't some simple reason that we can fix.

At that point, I think we an return to looking at other IBA functions and trying to figure out which cases have benefits to hwy code paths and enhance them one by one. (With some attention to a priority that makes sense -- a combination of which look most promising for SIMD speedups and which are the most commonly used operations.)

But I'm gating this all on your approving the small number of changes from me that I attached here; I don't want to merge them into your PR without you approving it. If we are both satisfied with what we have here for the moment, then I will merge and move on to the other items listed above.

@1div0
Copy link
Copy Markdown
Contributor

1div0 commented Apr 6, 2026

I will take another look at the code this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants