Skip to content

Commit 556198f

Browse files
authored
Add Highway SIMD acceleration to ImageBufAlgo [add, sub, mul, div, mad, resample] (#4994)
Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library: * add/sub * mul/div * mad * resample Adds CMake and build system support, new implementation helpers, and developer documentation. This is currently off by default, needing to be enabled at build time both by setting CMake variable `OIIO_USE_HWY` to ON *and* having the Highway library installed and able to be found (we do not yet auto-build it). And even if enabled at build time, currently it is only enabled at runtime by setting the environment variable `OPENIMAGEIO_ENABLE_HWY=1` *or* setting the global attribute `OIIO::attribute("enable_hwy", 1)`. We hope that the defaults might be changed to fully enabled by the time we release OIIO 3.2. Assisted by: Opus4.5 and Codex GPT5.2 _Code mostly wrote using frontier Opus4.5 and Codex GPT5.2 High models with a strict rules._ --------- Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
1 parent 59a3201 commit 556198f

16 files changed

+2658
-75
lines changed

CMakeLists.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -112,6 +112,7 @@ else ()
112112
endif ()
113113
option (${PROJ_NAME}_BUILD_TOOLS "Build the command-line tools" ON)
114114
option (${PROJ_NAME}_BUILD_TESTS "Build the unit tests" ON)
115+
set_option (OIIO_USE_HWY "Enable experimental Google Highway SIMD optimizations (if Highway is available)" OFF)
115116
set (OIIO_LIBNAME_SUFFIX "" CACHE STRING
116117
"Optional name appended to ${PROJECT_NAME} libraries that are built")
117118
option (BUILD_OIIOUTIL_ONLY "If ON, will build *only* libOpenImageIO_Util" OFF)

docs/dev/Architecture.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -117,6 +117,10 @@ objects. These algorithms include simple operations like copying, resizing,
117117
and compositing images, as well as more complex operations like color
118118
conversions, resizing, filtering, etc.
119119

120+
Some performance-critical `ImageBufAlgo` implementations have SIMD-accelerated
121+
paths using Google Highway. For implementation details and guidance for adding
122+
new kernels, see `docs/dev/ImageBufAlgo_Highway.md`.
123+
120124
## Image caching: TextureSystem and ImageCache
121125

122126
There are situations where ImageBuf is still not the right abstraction,

docs/dev/ImageBufAlgo_Highway.md

Lines changed: 264 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,264 @@
1+
ImageBufAlgo Highway (hwy) Implementation Guide
2+
==============================================
3+
4+
This document explains how OpenImageIO uses Google Highway (hwy) to accelerate
5+
selected `ImageBufAlgo` operations, and how to add or modify kernels in a way
6+
that preserves OIIO semantics while keeping the code maintainable.
7+
8+
This is a developer-facing document about the implementation structure in
9+
`src/libOpenImageIO/`. It does not describe the public API behavior of the
10+
algorithms.
11+
12+
13+
Goals and non-goals
14+
-------------------
15+
16+
Goals:
17+
- Make the hwy-backed code paths easy to read and easy to extend.
18+
- Centralize repetitive boilerplate (type conversion, tails, ROI pointer math).
19+
- Preserve OIIO's numeric semantics (normalized integer model).
20+
- Keep scalar fallbacks as the source of truth for tricky layout cases.
21+
22+
Non-goals:
23+
- Explain Highway itself. Refer to the upstream Highway documentation.
24+
- Guarantee that every ImageBufAlgo op has a hwy implementation.
25+
26+
27+
Where the code lives
28+
--------------------
29+
30+
Core helpers:
31+
- `src/libOpenImageIO/imagebufalgo_hwy_pvt.h`
32+
33+
Typical hwy call sites:
34+
- `src/libOpenImageIO/imagebufalgo_addsub.cpp`
35+
- `src/libOpenImageIO/imagebufalgo_muldiv.cpp`
36+
- `src/libOpenImageIO/imagebufalgo_mad.cpp`
37+
- `src/libOpenImageIO/imagebufalgo_pixelmath.cpp`
38+
- `src/libOpenImageIO/imagebufalgo_xform.cpp` (some ops are hwy-accelerated)
39+
40+
41+
Enabling and gating the hwy path
42+
-------------------------------
43+
44+
The hwy path is only used when:
45+
- Highway usage is enabled at runtime (`OIIO::pvt::enable_hwy`).
46+
- The relevant `ImageBuf` objects have local pixel storage (`localpixels()` is
47+
non-null), meaning the data is in process memory rather than accessed through
48+
an `ImageCache` tile abstraction.
49+
- The operation can be safely expressed as contiguous streams of pixels/channels
50+
for the hot path, or the code falls back to a scalar implementation for
51+
strided/non-contiguous layouts.
52+
53+
The common gating pattern looks like:
54+
- In a typed `*_impl` dispatcher: check `OIIO::pvt::enable_hwy` and `localpixels`
55+
and then call a `*_impl_hwy` function; otherwise call `*_impl_scalar`.
56+
57+
Important: the hwy path is an optimization. Correctness must not depend on hwy.
58+
59+
60+
OIIO numeric semantics: why we promote to float
61+
----------------------------------------------
62+
63+
OIIO treats integer image pixels as normalized values:
64+
- Unsigned integers represent [0, 1].
65+
- Signed integers represent approximately [-1, 1] with clamping for INT_MIN.
66+
67+
Therefore, most pixel math must be performed in float (or double) space, even
68+
when the stored data is integer. This is why the hwy layer uses the
69+
"LoadPromote/Operate/DemoteStore" pattern.
70+
71+
For additional discussion (and pitfalls of saturating integer arithmetic), see:
72+
- `HIGHWAY_SATURATING_ANALYSIS.md`
73+
74+
75+
The core pattern: LoadPromote -> RunHwy* -> DemoteStore
76+
-------------------------------------------------------
77+
78+
The helper header `imagebufalgo_hwy_pvt.h` defines the reusable building blocks:
79+
80+
1) Computation type selection
81+
- `SimdMathType<T>` selects `float` for most types, and `double` only when
82+
the destination type is `double`.
83+
84+
Rationale:
85+
- Float math is significantly faster on many targets.
86+
- For OIIO, integer images are normalized to [0,1] (or ~[-1,1]), so float
87+
precision is sufficient for typical image processing workloads.
88+
89+
2) Load and promote (with normalization)
90+
- `LoadPromote(d, ptr)` and `LoadPromoteN(d, ptr, count)` load values and
91+
normalize integer ranges into the computation space.
92+
93+
Rationale:
94+
- Consolidates all normalization and conversion logic in one place.
95+
- Prevents subtle drift where each operation re-implements integer scaling.
96+
- Ensures tail handling ("N" variants) is correct and consistent.
97+
98+
3) Demote and store (with denormalization/clamp/round)
99+
- `DemoteStore(d, ptr, v)` and `DemoteStoreN(d, ptr, v, count)` reverse the
100+
normalization and store results in the destination pixel type.
101+
102+
Rationale:
103+
- Centralizes rounding and clamping behavior for all destination types.
104+
- Ensures output matches OIIO scalar semantics.
105+
106+
4) Generic kernel runners (streaming arrays)
107+
- `RunHwyUnaryCmd`, `RunHwyCmd` (binary), `RunHwyTernaryCmd`
108+
- These are the primary entry points for most hwy kernels.
109+
110+
Rationale:
111+
- Encapsulates lane iteration and tail processing once.
112+
- The call sites only provide the per-lane math lambda, not the boilerplate.
113+
114+
115+
Native integer runners: when they are valid
116+
-------------------------------------------
117+
118+
Some operations are "scale-invariant" under OIIO's normalized integer model.
119+
For example, for unsigned integer add:
120+
- `(a/max + b/max)` in float space, then clamped to [0,1], then scaled by max
121+
matches saturated integer add `SaturatedAdd(a, b)` for the same bit depth.
122+
123+
For those cases, `imagebufalgo_hwy_pvt.h` provides:
124+
- `RunHwyUnaryNativeInt<T>`
125+
- `RunHwyBinaryNativeInt<T>`
126+
127+
These should only be used when all of the following are true:
128+
- The operation is known to be scale-invariant under the normalization model.
129+
- Input and output types are the same integral type.
130+
- The operation does not depend on mixed types or float-range behavior.
131+
132+
Rationale:
133+
- Avoids promotion/demotion overhead and can be materially faster.
134+
- Must be opt-in and explicit, because many operations are NOT compatible with
135+
raw integer arithmetic (e.g. multiplication, division, pow).
136+
137+
138+
Local pixel pointer helpers: reducing boilerplate safely
139+
-------------------------------------------------------
140+
141+
Most hwy call sites need repeated pointer and stride computations:
142+
- Pixel size in bytes.
143+
- Scanline size in bytes.
144+
- Base pointer to local pixels.
145+
- Per-row pointer for a given ROI and scanline.
146+
- Per-pixel pointer for non-contiguous fallbacks.
147+
148+
To centralize that, `imagebufalgo_hwy_pvt.h` defines:
149+
- `HwyPixels(ImageBuf&)` and `HwyPixels(const ImageBuf&)`
150+
returning a small view (`HwyLocalPixelsView`) with:
151+
- base pointer (`std::byte*` / `const std::byte*`)
152+
- `pixel_bytes`, `scanline_bytes`
153+
- `xbegin`, `ybegin`, `nchannels`
154+
- `RoiNChannels(roi)` for `roi.chend - roi.chbegin`
155+
- `ChannelsContiguous<T>(view, nchannels)`:
156+
true only when the pixel stride exactly equals `nchannels * sizeof(T)`
157+
- `PixelBase(view, x, y)`, `ChannelPtr<T>(view, x, y, ch)`
158+
- `RoiRowPtr<T>(view, y, roi)` for the start of the ROI row at `roi.xbegin` and
159+
`roi.chbegin`.
160+
161+
Rationale:
162+
- Avoids duplicating fragile byte-offset math across many ops.
163+
- Makes it visually obvious what the code is doing: "get row pointer" vs
164+
"compute offset by hand."
165+
- Makes non-contiguous fallback paths less error-prone by reusing the same
166+
pointer computations.
167+
168+
Important: these helpers are only valid for `ImageBuf` instances with local
169+
pixels (`localpixels()` non-null). The call sites must check that before using
170+
them.
171+
172+
173+
Contiguous fast path vs non-contiguous fallback
174+
-----------------------------------------------
175+
176+
Most operations implement two paths:
177+
178+
1) Contiguous fast path:
179+
- Used when pixels are tightly packed for the ROI's channel range.
180+
- The operation is executed as a 1D stream of length:
181+
`roi.width() * (roi.chend - roi.chbegin)`
182+
- Uses `RunHwy*Cmd` (or native-int runner) and benefits from:
183+
- fewer branches
184+
- fewer pointer computations
185+
- auto tail handling
186+
187+
2) Non-contiguous fallback:
188+
- Used when pixels have padding, unusual strides, or channel subsets that do
189+
not form a dense stream.
190+
- Typically loops pixel-by-pixel and channel-by-channel.
191+
- May still use the `ChannelPtr` helpers to compute correct addresses.
192+
193+
Rationale:
194+
- The contiguous path is where SIMD delivers large gains.
195+
- Trying to SIMD-optimize arbitrary strided layouts often increases complexity
196+
and risk for marginal benefit. Keeping a scalar fallback preserves
197+
correctness and maintainability.
198+
199+
200+
How to add a new hwy kernel
201+
---------------------------
202+
203+
Step 1: Choose the kernel shape
204+
- Unary: `R = f(A)` -> use `RunHwyUnaryCmd`
205+
- Binary: `R = f(A, B)` -> use `RunHwyCmd`
206+
- Ternary: `R = f(A, B, C)` -> use `RunHwyTernaryCmd`
207+
208+
Step 2: Decide if a native-int fast path is valid
209+
- Only for scale-invariant ops and same-type integral inputs/outputs.
210+
- Use `RunHwyUnaryNativeInt` / `RunHwyBinaryNativeInt` when safe.
211+
- Otherwise, always use the promote/demote runners.
212+
213+
Step 3: Implement the hwy body with a contig check
214+
Typical structure inside `*_impl_hwy`:
215+
- Acquire views once:
216+
- `auto Rv = HwyPixels(R);`
217+
- `auto Av = HwyPixels(A);` etc.
218+
- In the parallel callback:
219+
- compute `nchannels = RoiNChannels(roi)`
220+
- compute `contig = ChannelsContiguous<...>(...)` for each image
221+
- for each scanline y:
222+
- `Rtype* r_row = RoiRowPtr<Rtype>(Rv, y, roi);`
223+
- `const Atype* a_row = RoiRowPtr<Atype>(Av, y, roi);` etc.
224+
- if contig: call `RunHwy*` with `n = roi.width() * nchannels`
225+
- else: fall back per pixel, per channel
226+
227+
Step 4: Keep the scalar path as the reference
228+
- The scalar implementation should remain correct for all layouts and types.
229+
- The hwy path should match scalar results for supported cases.
230+
231+
232+
Design rationale summary
233+
------------------------
234+
235+
This design intentionally separates concerns:
236+
- Type conversion and normalization are centralized (`LoadPromote`,
237+
`DemoteStore`).
238+
- SIMD lane iteration and tail handling are centralized (`RunHwy*` runners).
239+
- Image address computations are centralized (`HwyPixels`, `RoiRowPtr`,
240+
`ChannelPtr`).
241+
- Operation-specific code is reduced to short lambdas expressing the math.
242+
243+
This makes the hwy layer:
244+
- Easier to maintain: fewer places to fix bugs when semantics change.
245+
- Easier to extend: adding an op mostly means writing the math lambda and the
246+
dispatch glue.
247+
- Safer: correctness for unusual layouts remains in scalar fallbacks.
248+
249+
250+
Notes on `half`
251+
---------------
252+
253+
The hwy conversion helpers handle `half` by converting through
254+
`hwy::float16_t`. This currently assumes the underlying `half` representation
255+
is compatible with how Highway loads/stores 16-bit floats.
256+
257+
If this assumption is revisited in the future, it should be changed as a
258+
separate, explicit correctness/performance project.
259+
260+
261+
<!-- SPDX-License-Identifier: CC-BY-4.0 -->
262+
<!-- Copyright Contributors to the OpenImageIO Project. -->
263+
264+

src/cmake/externalpackages.cmake

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -226,6 +226,11 @@ if (USE_QT AND OPENGL_FOUND)
226226
endif ()
227227

228228

229+
# Google Highway for SIMD (optional optimization)
230+
if (OIIO_USE_HWY)
231+
checked_find_package (hwy)
232+
endif ()
233+
229234
# Tessil/robin-map
230235
checked_find_package (Robinmap REQUIRED
231236
VERSION_MIN 1.2.0

src/doc/imagebufalgo.rst

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -152,6 +152,68 @@ the computation without spawning additional threads, which might tend to
152152
crowd out the other application threads.
153153

154154

155+
SIMD Performance and Data Types
156+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
157+
158+
Many ImageBufAlgo operations use SIMD (Single Instruction, Multiple Data)
159+
optimizations powered by the Google Highway library to achieve significant
160+
performance improvements, particularly for integer image formats.
161+
162+
**Integer Type Optimizations:**
163+
164+
OpenImageIO treats all integer images as normalized Standard Dynamic Range
165+
(SDR) data:
166+
167+
* Unsigned integers (``uint8``, ``uint16``, ``uint32``, ``uint64``) are
168+
normalized to the [0.0, 1.0] range: ``float_value = int_value / max_value``
169+
* Signed integers (``int8``, ``int16``, ``int32``, ``int64``) are normalized
170+
to approximately the [-1.0, 1.0] range: ``float_value = int_value / max_value``
171+
172+
Most ImageBufAlgo operations convert integer data to float, perform the
173+
operation, and convert back. Highway SIMD provides 3-5x speedup for these
174+
operations compared to scalar code.
175+
176+
**Scale-Invariant Operations:**
177+
178+
Certain operations are *scale-invariant*, meaning they produce identical
179+
results whether performed on raw integers or normalized floats. For these
180+
operations, OpenImageIO uses native integer SIMD paths that avoid float
181+
conversion entirely, achieving 6-12x speedup (2-3x faster than the float
182+
promotion path):
183+
184+
* ``add``, ``sub`` (with saturation)
185+
* ``min``, ``max``
186+
* ``abs``, ``absdiff``
187+
188+
These optimizations automatically activate when all input and output images
189+
have matching integer types (e.g., all ``uint8``). When types differ or when
190+
mixing integer and float images, the standard float promotion path is used.
191+
192+
**Controlling SIMD Optimizations:**
193+
194+
Highway SIMD is enabled by default. To disable it globally::
195+
196+
OIIO::attribute("enable_hwy", 0);
197+
198+
Or via environment variable::
199+
200+
export OPENIMAGEIO_ENABLE_HWY=0
201+
202+
This is primarily useful for debugging or performance comparison. In normal
203+
use, the optimizations should remain enabled for best performance.
204+
205+
**Performance Expectations:**
206+
207+
Typical speedups with Highway SIMD (compared to scalar code):
208+
209+
* Float operations: 3-5x faster
210+
* Integer operations (with float conversion): 3-5x faster
211+
* Integer scale-invariant operations (native int): 6-12x faster
212+
* Half-float operations: 3-5x faster
213+
214+
Actual performance depends on the specific operation, image size, data types,
215+
and hardware capabilities (AVX2, AVX-512, ARM NEON, etc.).
216+
155217

156218
.. _sec-iba-patterns:
157219

0 commit comments

Comments
 (0)