
Commit 252edbb

ssh4net authored and lgritz committed
Add Google Highway SIMD acceleration to ImageBufAlgo
Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library:
• add/sub
• mul/div
• mad
• resample

Adds CMake and build system support, new implementation helpers, and developer documentation.

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
1 parent c75e31f commit 252edbb

16 files changed

Lines changed: 2646 additions & 75 deletions

CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ else ()
 endif ()
 option (${PROJ_NAME}_BUILD_TOOLS "Build the command-line tools" ON)
 option (${PROJ_NAME}_BUILD_TESTS "Build the unit tests" ON)
+option (OIIO_USE_HWY "Enable experimental Google Highway SIMD optimizations (if Highway is available)" OFF)
 set (OIIO_LIBNAME_SUFFIX "" CACHE STRING
      "Optional name appended to ${PROJECT_NAME} libraries that are built")
 option (BUILD_OIIOUTIL_ONLY "If ON, will build *only* libOpenImageIO_Util" OFF)

docs/dev/Architecture.md

Lines changed: 4 additions & 0 deletions
@@ -117,6 +117,10 @@ objects. These algorithms include simple operations like copying, resizing,
 and compositing images, as well as more complex operations like color
 conversions, resizing, filtering, etc.

+Some performance-critical `ImageBufAlgo` implementations have SIMD-accelerated
+paths using Google Highway. For implementation details and guidance for adding
+new kernels, see `docs/dev/ImageBufAlgo_Highway.md`.
+
 ## Image caching: TextureSystem and ImageCache

 There are situations where ImageBuf is still not the right abstraction,

docs/dev/ImageBufAlgo_Highway.md

Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
ImageBufAlgo Highway (hwy) Implementation Guide
==============================================

This document explains how OpenImageIO uses Google Highway (hwy) to accelerate
selected `ImageBufAlgo` operations, and how to add or modify kernels in a way
that preserves OIIO semantics while keeping the code maintainable.

This is a developer-facing document about the implementation structure in
`src/libOpenImageIO/`. It does not describe the public API behavior of the
algorithms.


Goals and non-goals
-------------------

Goals:
- Make the hwy-backed code paths easy to read and easy to extend.
- Centralize repetitive boilerplate (type conversion, tails, ROI pointer math).
- Preserve OIIO's numeric semantics (normalized integer model).
- Keep scalar fallbacks as the source of truth for tricky layout cases.

Non-goals:
- Explain Highway itself. Refer to the upstream Highway documentation.
- Guarantee that every ImageBufAlgo op has a hwy implementation.


Where the code lives
--------------------

Core helpers:
- `src/libOpenImageIO/imagebufalgo_hwy_pvt.h`

Typical hwy call sites:
- `src/libOpenImageIO/imagebufalgo_addsub.cpp`
- `src/libOpenImageIO/imagebufalgo_muldiv.cpp`
- `src/libOpenImageIO/imagebufalgo_mad.cpp`
- `src/libOpenImageIO/imagebufalgo_pixelmath.cpp`
- `src/libOpenImageIO/imagebufalgo_xform.cpp` (some ops are hwy-accelerated)


Enabling and gating the hwy path
-------------------------------

The hwy path is only used when:
- Highway usage is enabled at runtime (`OIIO::pvt::enable_hwy`).
- The relevant `ImageBuf` objects have local pixel storage (`localpixels()` is
  non-null), meaning the data is in process memory rather than accessed through
  an `ImageCache` tile abstraction.
- The operation can be safely expressed as contiguous streams of pixels/channels
  for the hot path, or the code falls back to a scalar implementation for
  strided/non-contiguous layouts.

The common gating pattern looks like:
- In a typed `*_impl` dispatcher: check `OIIO::pvt::enable_hwy` and `localpixels`
  and then call a `*_impl_hwy` function; otherwise call `*_impl_scalar`.

Important: the hwy path is an optimization. Correctness must not depend on hwy.


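The gating pattern described above can be sketched in isolation. This is a minimal illustration, not the code in the commit: `MockImage`, `g_enable_hwy`, `add_impl_fast`, and the `Path` return value are hypothetical stand-ins for `ImageBuf`, `OIIO::pvt::enable_hwy`, and a `*_impl_hwy` function, and the "fast" path is plain C++ rather than Highway. Calling `add_impl(r, a, b)` with all three images local and the switch on returns `Path::Fast`; clearing the switch or marking any image as non-local returns `Path::Scalar` with the same pixel results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are OpenImageIO names.
static bool g_enable_hwy = true;  // plays the role of the runtime switch

struct MockImage {
    std::vector<float> data;
    bool local;  // false models an image without local pixel storage
    float* localpixels() { return local ? data.data() : nullptr; }
    const float* localpixels() const { return local ? data.data() : nullptr; }
    float get(std::size_t i) const { return data[i]; }  // generic accessor
    void set(std::size_t i, float v) { data[i] = v; }
};

enum class Path { Scalar, Fast };

// Reference path: goes through the accessors, so it works for any image.
inline void add_impl_scalar(MockImage& r, const MockImage& a, const MockImage& b)
{
    for (std::size_t i = 0; i < r.data.size(); ++i)
        r.set(i, a.get(i) + b.get(i));
}

// Stand-in for the accelerated path: it works on raw pointers, so it is
// only legal when every image has local pixels.
inline void add_impl_fast(MockImage& r, const MockImage& a, const MockImage& b)
{
    float* rp       = r.localpixels();
    const float* ap = a.localpixels();
    const float* bp = b.localpixels();
    for (std::size_t i = 0; i < r.data.size(); ++i)
        rp[i] = ap[i] + bp[i];
}

// Dispatcher: take the fast path only when every gate passes.
inline Path add_impl(MockImage& r, const MockImage& a, const MockImage& b)
{
    assert(r.data.size() == a.data.size() && r.data.size() == b.data.size());
    if (g_enable_hwy && r.localpixels() && a.localpixels() && b.localpixels()) {
        add_impl_fast(r, a, b);
        return Path::Fast;
    }
    add_impl_scalar(r, a, b);
    return Path::Scalar;
}
```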
OIIO numeric semantics: why we promote to float
----------------------------------------------

OIIO treats integer image pixels as normalized values:
- Unsigned integers represent [0, 1].
- Signed integers represent approximately [-1, 1] with clamping for INT_MIN.

Therefore, most pixel math must be performed in float (or double) space, even
when the stored data is integer. This is why the hwy layer uses the
"LoadPromote/Operate/DemoteStore" pattern.

For additional discussion (and pitfalls of saturating integer arithmetic), see:
- `HIGHWAY_SATURATING_ANALYSIS.md`


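To make the normalized model concrete, here is a small scalar sketch for `uint8` only. The function names are hypothetical, and round-to-nearest on the way back to integer is an assumption; the guide does not spell out the rounding mode. Under these assumptions, `mul_normalized_u8(128, 128)` gives 64 (about 0.5 times 0.5), whereas multiplying the raw integers would give 16384 and saturate.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative helpers; the names are hypothetical, not OIIO API.

// Map a stored uint8 value into the normalized [0, 1] range.
inline float normalize_u8(std::uint8_t v)
{
    return static_cast<float>(v) / 255.0f;
}

// Map a float back to uint8: clamp to [0, 1], scale, round to nearest
// (the rounding mode is an assumption of this sketch).
inline std::uint8_t denormalize_u8(float f)
{
    assert(!std::isnan(f));
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return static_cast<std::uint8_t>(std::lround(f * 255.0f));
}

// Multiply two uint8 pixels in normalized space.
inline std::uint8_t mul_normalized_u8(std::uint8_t a, std::uint8_t b)
{
    return denormalize_u8(normalize_u8(a) * normalize_u8(b));
}
```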
The core pattern: LoadPromote -> RunHwy* -> DemoteStore
-------------------------------------------------------

The helper header `imagebufalgo_hwy_pvt.h` defines the reusable building blocks:

1) Computation type selection
- `SimdMathType<T>` selects `float` for most types, and `double` only when
  the destination type is `double`.

Rationale:
- Float math is significantly faster on many targets.
- For OIIO, integer images are normalized to [0,1] (or ~[-1,1]), so float
  precision is sufficient for typical image processing workloads.

2) Load and promote (with normalization)
- `LoadPromote(d, ptr)` and `LoadPromoteN(d, ptr, count)` load values and
  normalize integer ranges into the computation space.

Rationale:
- Consolidates all normalization and conversion logic in one place.
- Prevents subtle drift where each operation re-implements integer scaling.
- Ensures tail handling ("N" variants) is correct and consistent.

3) Demote and store (with denormalization/clamp/round)
- `DemoteStore(d, ptr, v)` and `DemoteStoreN(d, ptr, v, count)` reverse the
  normalization and store results in the destination pixel type.

Rationale:
- Centralizes rounding and clamping behavior for all destination types.
- Ensures output matches OIIO scalar semantics.

4) Generic kernel runners (streaming arrays)
- `RunHwyUnaryCmd`, `RunHwyCmd` (binary), `RunHwyTernaryCmd`
- These are the primary entry points for most hwy kernels.

Rationale:
- Encapsulates lane iteration and tail processing once.
- The call sites only provide the per-lane math lambda, not the boilerplate.


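The division of labor between the helpers and the call site can be illustrated with a scalar analogue. This sketch does not use Highway and does not reproduce the signatures in `imagebufalgo_hwy_pvt.h`: `load_promote`, `demote_store`, and `run_binary` are hypothetical one-element-at-a-time stand-ins, restricted to `uint8`, with round-to-nearest assumed. The point is the shape: conversion lives in two helpers, iteration lives in the runner, and the call site (`sub_u8`) supplies only the math lambda.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for "load and promote": read one uint8, normalize to float.
inline float load_promote(const std::uint8_t* p)
{
    return static_cast<float>(*p) / 255.0f;
}

// Scalar stand-in for "demote and store": clamp, scale, round, write.
inline void demote_store(std::uint8_t* p, float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    *p = static_cast<std::uint8_t>(std::lround(v * 255.0f));
}

// Scalar stand-in for a binary runner: owns the loop and the conversions.
// A SIMD runner would process a vector of lanes per step plus a tail.
template <typename Op>
void run_binary(std::uint8_t* r, const std::uint8_t* a, const std::uint8_t* b,
                std::size_t n, Op op)
{
    assert(n == 0 || (r && a && b));
    for (std::size_t i = 0; i < n; ++i)
        demote_store(r + i, op(load_promote(a + i), load_promote(b + i)));
}

// Call site: only the per-element math is spelled out.
inline void sub_u8(std::uint8_t* r, const std::uint8_t* a,
                   const std::uint8_t* b, std::size_t n)
{
    run_binary(r, a, b, n, [](float x, float y) { return x - y; });
}
```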
Native integer runners: when they are valid
-------------------------------------------

Some operations are "scale-invariant" under OIIO's normalized integer model.
For example, for unsigned integer add:
- `(a/max + b/max)` in float space, then clamped to [0,1], then scaled by max
  matches saturated integer add `SaturatedAdd(a, b)` for the same bit depth.

For those cases, `imagebufalgo_hwy_pvt.h` provides:
- `RunHwyUnaryNativeInt<T>`
- `RunHwyBinaryNativeInt<T>`

These should only be used when all of the following are true:
- The operation is known to be scale-invariant under the normalization model.
- Input and output types are the same integral type.
- The operation does not depend on mixed types or float-range behavior.

Rationale:
- Avoids promotion/demotion overhead and can be materially faster.
- Must be opt-in and explicit, because many operations are NOT compatible with
  raw integer arithmetic (e.g. multiplication, division, pow).


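The scale-invariance claim for unsigned add can be checked exhaustively at 8 bits. The sketch below is a standalone check, not code from the commit; all names are hypothetical, and the float path assumes round-to-nearest when converting back. `count_add_mismatches()` compares saturated integer add against the normalized float path for every `uint8` pair, and the two multiply helpers show why the same shortcut is not valid for multiplication.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Saturated integer add on raw uint8 values.
inline std::uint8_t saturated_add_u8(std::uint8_t a, std::uint8_t b)
{
    unsigned s = unsigned(a) + unsigned(b);
    return static_cast<std::uint8_t>(s > 255u ? 255u : s);
}

// Same add through normalized float space: a/max + b/max, clamp, scale back.
inline std::uint8_t float_add_u8(std::uint8_t a, std::uint8_t b)
{
    float v = float(a) / 255.0f + float(b) / 255.0f;
    if (v > 1.0f) v = 1.0f;
    return static_cast<std::uint8_t>(std::lround(v * 255.0f));
}

// Count the uint8 pairs for which the two add paths disagree.
inline int count_add_mismatches()
{
    int mismatches = 0;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            if (saturated_add_u8(std::uint8_t(a), std::uint8_t(b))
                != float_add_u8(std::uint8_t(a), std::uint8_t(b)))
                ++mismatches;
    return mismatches;
}

// Multiplication is not scale-invariant: raw integer multiply saturates...
inline std::uint8_t raw_saturated_mul_u8(std::uint8_t a, std::uint8_t b)
{
    unsigned p = unsigned(a) * unsigned(b);
    return static_cast<std::uint8_t>(p > 255u ? 255u : p);
}

// ...while the normalized product of two values in [0, 1] never exceeds either.
inline std::uint8_t normalized_mul_u8(std::uint8_t a, std::uint8_t b)
{
    float v = (float(a) / 255.0f) * (float(b) / 255.0f);
    assert(v >= 0.0f && v <= 1.0f);
    return static_cast<std::uint8_t>(std::lround(v * 255.0f));
}
```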
Local pixel pointer helpers: reducing boilerplate safely
-------------------------------------------------------

Most hwy call sites need repeated pointer and stride computations:
- Pixel size in bytes.
- Scanline size in bytes.
- Base pointer to local pixels.
- Per-row pointer for a given ROI and scanline.
- Per-pixel pointer for non-contiguous fallbacks.

To centralize that, `imagebufalgo_hwy_pvt.h` defines:
- `HwyPixels(ImageBuf&)` and `HwyPixels(const ImageBuf&)`
  returning a small view (`HwyLocalPixelsView`) with:
  - base pointer (`std::byte*` / `const std::byte*`)
  - `pixel_bytes`, `scanline_bytes`
  - `xbegin`, `ybegin`, `nchannels`
- `RoiNChannels(roi)` for `roi.chend - roi.chbegin`
- `ChannelsContiguous<T>(view, nchannels)`:
  true only when the pixel stride exactly equals `nchannels * sizeof(T)`
- `PixelBase(view, x, y)`, `ChannelPtr<T>(view, x, y, ch)`
- `RoiRowPtr<T>(view, y, roi)` for the start of the ROI row at `roi.xbegin` and
  `roi.chbegin`.

Rationale:
- Avoids duplicating fragile byte-offset math across many ops.
- Makes it visually obvious what the code is doing: "get row pointer" vs
  "compute offset by hand."
- Makes non-contiguous fallback paths less error-prone by reusing the same
  pointer computations.

Important: these helpers are only valid for `ImageBuf` instances with local
pixels (`localpixels()` non-null). The call sites must check that before using
them.


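The byte-offset arithmetic these helpers centralize looks roughly like the following. This is a sketch under assumptions, not the definitions in `imagebufalgo_hwy_pvt.h`: `PixelsViewSketch`, `channels_contiguous`, `channel_ptr`, and `roi_row_ptr` are hypothetical names, the ROI is passed as plain integers instead of an `ROI` object, and `unsigned char*` is used in place of `std::byte*` so the sketch also compiles before C++17.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view, loosely modeled on the fields listed above.
struct PixelsViewSketch {
    unsigned char* base;         // address of pixel (xbegin, ybegin), channel 0
    std::size_t pixel_bytes;     // stride between adjacent pixels in a row
    std::size_t scanline_bytes;  // stride between adjacent rows
    int xbegin, ybegin, nchannels;
};

// True only when the pixel stride equals nchannels * sizeof(T), i.e. the
// requested channels fill each pixel with no padding or skipped channels.
template <typename T>
bool channels_contiguous(const PixelsViewSketch& v, int nchannels)
{
    return v.pixel_bytes == std::size_t(nchannels) * sizeof(T);
}

// Address of channel `ch` of the pixel at image coordinates (x, y).
template <typename T>
T* channel_ptr(const PixelsViewSketch& v, int x, int y, int ch)
{
    assert(x >= v.xbegin && y >= v.ybegin && ch >= 0 && ch < v.nchannels);
    unsigned char* p = v.base
                       + std::size_t(y - v.ybegin) * v.scanline_bytes
                       + std::size_t(x - v.xbegin) * v.pixel_bytes
                       + std::size_t(ch) * sizeof(T);
    return reinterpret_cast<T*>(p);
}

// Start of a ROI row: the pixel at the ROI's first x and first channel.
template <typename T>
T* roi_row_ptr(const PixelsViewSketch& v, int y, int roi_xbegin, int roi_chbegin)
{
    return channel_ptr<T>(v, roi_xbegin, y, roi_chbegin);
}
```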
Contiguous fast path vs non-contiguous fallback
-----------------------------------------------

Most operations implement two paths:

1) Contiguous fast path:
- Used when pixels are tightly packed for the ROI's channel range.
- The operation is executed as a 1D stream of length:
  `roi.width() * (roi.chend - roi.chbegin)`
- Uses `RunHwy*Cmd` (or native-int runner) and benefits from:
  - fewer branches
  - fewer pointer computations
  - auto tail handling

2) Non-contiguous fallback:
- Used when pixels have padding, unusual strides, or channel subsets that do
  not form a dense stream.
- Typically loops pixel-by-pixel and channel-by-channel.
- May still use the `ChannelPtr` helpers to compute correct addresses.

Rationale:
- The contiguous path is where SIMD delivers large gains.
- Trying to SIMD-optimize arbitrary strided layouts often increases complexity
  and risk for marginal benefit. Keeping a scalar fallback preserves
  correctness and maintainability.


How to add a new hwy kernel
---------------------------

Step 1: Choose the kernel shape
- Unary: `R = f(A)` -> use `RunHwyUnaryCmd`
- Binary: `R = f(A, B)` -> use `RunHwyCmd`
- Ternary: `R = f(A, B, C)` -> use `RunHwyTernaryCmd`

Step 2: Decide if a native-int fast path is valid
- Only for scale-invariant ops and same-type integral inputs/outputs.
- Use `RunHwyUnaryNativeInt` / `RunHwyBinaryNativeInt` when safe.
- Otherwise, always use the promote/demote runners.

Step 3: Implement the hwy body with a contig check
Typical structure inside `*_impl_hwy`:
- Acquire views once:
  - `auto Rv = HwyPixels(R);`
  - `auto Av = HwyPixels(A);` etc.
- In the parallel callback:
  - compute `nchannels = RoiNChannels(roi)`
  - compute `contig = ChannelsContiguous<...>(...)` for each image
  - for each scanline y:
    - `Rtype* r_row = RoiRowPtr<Rtype>(Rv, y, roi);`
    - `const Atype* a_row = RoiRowPtr<Atype>(Av, y, roi);` etc.
    - if contig: call `RunHwy*` with `n = roi.width() * nchannels`
    - else: fall back per pixel, per channel

Step 4: Keep the scalar path as the reference
- The scalar implementation should remain correct for all layouts and types.
- The hwy path should match scalar results for supported cases.


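The per-row structure from Step 3 can be sketched without Highway as follows. Everything here is a hypothetical simplification: `FloatImageSketch` stands in for a float `ImageBuf` with local pixels, the ROI is the whole image, there is no parallel callback, and `run_stream` is a plain loop where the real code would call a `RunHwy*` runner. The sketch shows the contiguity check made once, the row pointers taken per scanline, the 1D stream of length `width * nchannels` on the fast path, and the per-pixel, per-channel fallback. Running the same lambda on a tightly packed image and on one with a padding channel should give identical results for the processed channels, which is the property Step 4 asks the hwy path to preserve.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a float image with local pixel storage.
struct FloatImageSketch {
    int width, height;
    int nchannels;     // channels the operation touches
    int pixel_stride;  // floats per stored pixel (>= nchannels if padded)
    std::vector<float> data;

    float* row(int y) { return data.data() + std::size_t(y) * width * pixel_stride; }
    const float* row(int y) const { return data.data() + std::size_t(y) * width * pixel_stride; }
    bool contiguous() const { return pixel_stride == nchannels; }
};

// Stand-in for a unary runner: one pass over a dense 1D stream.
template <typename Op>
void run_stream(float* r, const float* a, std::size_t n, Op op)
{
    for (std::size_t i = 0; i < n; ++i)
        r[i] = op(a[i]);
}

template <typename Op>
void unary_impl(FloatImageSketch& R, const FloatImageSketch& A, Op op)
{
    assert(R.width == A.width && R.height == A.height && R.nchannels == A.nchannels);
    // Contiguity is a property of the layout, so check it once.
    const bool contig = R.contiguous() && A.contiguous();
    for (int y = 0; y < R.height; ++y) {
        float* r_row       = R.row(y);
        const float* a_row = A.row(y);
        if (contig) {
            // Fast path: the whole row is one stream of width * nchannels values.
            run_stream(r_row, a_row, std::size_t(R.width) * R.nchannels, op);
        } else {
            // Fallback: address each pixel and channel through its own stride.
            for (int x = 0; x < R.width; ++x)
                for (int c = 0; c < R.nchannels; ++c)
                    r_row[x * R.pixel_stride + c] = op(a_row[x * A.pixel_stride + c]);
        }
    }
}
```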
Design rationale summary
------------------------

This design intentionally separates concerns:
- Type conversion and normalization are centralized (`LoadPromote`,
  `DemoteStore`).
- SIMD lane iteration and tail handling are centralized (`RunHwy*` runners).
- Image address computations are centralized (`HwyPixels`, `RoiRowPtr`,
  `ChannelPtr`).
- Operation-specific code is reduced to short lambdas expressing the math.

This makes the hwy layer:
- Easier to maintain: fewer places to fix bugs when semantics change.
- Easier to extend: adding an op mostly means writing the math lambda and the
  dispatch glue.
- Safer: correctness for unusual layouts remains in scalar fallbacks.


Notes on `half`
---------------

The hwy conversion helpers handle `half` by converting through
`hwy::float16_t`. This currently assumes the underlying `half` representation
is compatible with how Highway loads/stores 16-bit floats.

If this assumption is revisited in the future, it should be changed as a
separate, explicit correctness/performance project.


<!-- SPDX-License-Identifier: CC-BY-4.0 -->
<!-- Copyright Contributors to the OpenImageIO Project. -->


src/cmake/externalpackages.cmake

Lines changed: 5 additions & 0 deletions
@@ -226,6 +226,11 @@ if (USE_QT AND OPENGL_FOUND)
 endif ()


+# Google Highway for SIMD (optional optimization)
+if (OIIO_USE_HWY)
+    checked_find_package (hwy)
+endif ()
+
 # Tessil/robin-map
 checked_find_package (Robinmap REQUIRED
                       VERSION_MIN 1.2.0

src/doc/imagebufalgo.rst

Lines changed: 62 additions & 0 deletions
@@ -152,6 +152,68 @@ the computation without spawning additional threads, which might tend to
 crowd out the other application threads.


+SIMD Performance and Data Types
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Many ImageBufAlgo operations use SIMD (Single Instruction, Multiple Data)
+optimizations powered by the Google Highway library to achieve significant
+performance improvements, particularly for integer image formats.
+
+**Integer Type Optimizations:**
+
+OpenImageIO treats all integer images as normalized Standard Dynamic Range
+(SDR) data:
+
+* Unsigned integers (``uint8``, ``uint16``, ``uint32``, ``uint64``) are
+  normalized to the [0.0, 1.0] range: ``float_value = int_value / max_value``
+* Signed integers (``int8``, ``int16``, ``int32``, ``int64``) are normalized
+  to approximately the [-1.0, 1.0] range: ``float_value = int_value / max_value``
+
+Most ImageBufAlgo operations convert integer data to float, perform the
+operation, and convert back. Highway SIMD provides 3-5x speedup for these
+operations compared to scalar code.
+
+**Scale-Invariant Operations:**
+
+Certain operations are *scale-invariant*, meaning they produce identical
+results whether performed on raw integers or normalized floats. For these
+operations, OpenImageIO uses native integer SIMD paths that avoid float
+conversion entirely, achieving 6-12x speedup (2-3x faster than the float
+promotion path):
+
+* ``add``, ``sub`` (with saturation)
+* ``min``, ``max``
+* ``abs``, ``absdiff``
+
+These optimizations automatically activate when all input and output images
+have matching integer types (e.g., all ``uint8``). When types differ or when
+mixing integer and float images, the standard float promotion path is used.
+
+**Controlling SIMD Optimizations:**
+
+Highway SIMD is enabled by default. To disable it globally::
+
+    OIIO::attribute("enable_hwy", 0);
+
+Or via environment variable::
+
+    export OPENIMAGEIO_ENABLE_HWY=0
+
+This is primarily useful for debugging or performance comparison. In normal
+use, the optimizations should remain enabled for best performance.
+
+**Performance Expectations:**
+
+Typical speedups with Highway SIMD (compared to scalar code):
+
+* Float operations: 3-5x faster
+* Integer operations (with float conversion): 3-5x faster
+* Integer scale-invariant operations (native int): 6-12x faster
+* Half-float operations: 3-5x faster
+
+Actual performance depends on the specific operation, image size, data types,
+and hardware capabilities (AVX2, AVX-512, ARM NEON, etc.).
+

 .. _sec-iba-patterns:
