| 1 | +ImageBufAlgo Highway (hwy) Implementation Guide |
| 2 | +============================================== |
| 3 | + |
| 4 | +This document explains how OpenImageIO uses Google Highway (hwy) to accelerate |
| 5 | +selected `ImageBufAlgo` operations, and how to add or modify kernels in a way |
| 6 | +that preserves OIIO semantics while keeping the code maintainable. |
| 7 | + |
| 8 | +This is a developer-facing document about the implementation structure in |
| 9 | +`src/libOpenImageIO/`. It does not describe the public API behavior of the |
| 10 | +algorithms. |
| 11 | + |
| 12 | + |
| 13 | +Goals and non-goals |
| 14 | +------------------- |
| 15 | + |
| 16 | +Goals: |
| 17 | +- Make the hwy-backed code paths easy to read and easy to extend. |
| 18 | +- Centralize repetitive boilerplate (type conversion, tail handling, ROI pointer math). |
| 19 | +- Preserve OIIO's numeric semantics (normalized integer model). |
| 20 | +- Keep scalar fallbacks as the source of truth for tricky layout cases. |
| 21 | + |
| 22 | +Non-goals: |
| 23 | +- Explain Highway itself. Refer to the upstream Highway documentation. |
| 24 | +- Guarantee that every ImageBufAlgo op has a hwy implementation. |
| 25 | + |
| 26 | + |
| 27 | +Where the code lives |
| 28 | +-------------------- |
| 29 | + |
| 30 | +Core helpers: |
| 31 | +- `src/libOpenImageIO/imagebufalgo_hwy_pvt.h` |
| 32 | + |
| 33 | +Typical hwy call sites: |
| 34 | +- `src/libOpenImageIO/imagebufalgo_addsub.cpp` |
| 35 | +- `src/libOpenImageIO/imagebufalgo_muldiv.cpp` |
| 36 | +- `src/libOpenImageIO/imagebufalgo_mad.cpp` |
| 37 | +- `src/libOpenImageIO/imagebufalgo_pixelmath.cpp` |
| 38 | +- `src/libOpenImageIO/imagebufalgo_xform.cpp` (some ops are hwy-accelerated) |
| 39 | + |
| 40 | + |
| 41 | +Enabling and gating the hwy path |
| 42 | +------------------------------- |
| 43 | + |
| 44 | +The hwy path is used only when all of the following hold: |
| 45 | +- Highway usage is enabled at runtime (`OIIO::pvt::enable_hwy`). |
| 46 | +- The relevant `ImageBuf` objects have local pixel storage (`localpixels()` is |
| 47 | + non-null), meaning the data is in process memory rather than accessed through |
| 48 | + an `ImageCache` tile abstraction. |
| 49 | +- The operation can be safely expressed as contiguous streams of pixels/channels |
| 50 | + for the hot path. Strided or non-contiguous layouts fall back to a scalar |
| 51 | + implementation instead. |
| 52 | + |
| 53 | +The common gating pattern looks like: |
| 54 | +- In a typed `*_impl` dispatcher: check `OIIO::pvt::enable_hwy` and |
| 55 | + `localpixels()`, then call a `*_impl_hwy` function; otherwise call `*_impl_scalar`. |
| 56 | + |
| 57 | +Important: the hwy path is an optimization. Correctness must not depend on hwy. |
| 58 | + |
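The gating pattern above can be sketched as follows. This is a minimal, self-contained illustration and not OIIO code: `BufSketch`, `g_enable_hwy`, `PathTaken`, and the `add_impl*` functions are hypothetical stand-ins for `ImageBuf`, `OIIO::pvt::enable_hwy` (modeled here as a plain `bool`), and a typed `*_impl` dispatcher. The "hwy" branch is an ordinary loop so that the example compiles without Highway.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ImageBuf: pixels are either held locally or
// (conceptually) served by a cache, in which case localpixels() is null.
struct BufSketch {
    std::vector<float> storage;
    bool is_local = true;
    float* localpixels() { return is_local ? storage.data() : nullptr; }
    const float* localpixels() const
    {
        return is_local ? storage.data() : nullptr;
    }
};

// Stand-in for the runtime switch (OIIO::pvt::enable_hwy in the real code).
static bool g_enable_hwy = true;

enum class PathTaken { Hwy, Scalar };

// Reference implementation: must be correct on its own.
static void add_impl_scalar(BufSketch& R, const BufSketch& A,
                            const BufSketch& B)
{
    for (std::size_t i = 0; i < R.storage.size(); ++i)
        R.storage[i] = A.storage[i] + B.storage[i];
}

// Placeholder for the accelerated path. It requires raw local pointers,
// which is why the dispatcher checks localpixels() before calling it.
static void add_impl_hwy(BufSketch& R, const BufSketch& A, const BufSketch& B)
{
    float* r       = R.localpixels();
    const float* a = A.localpixels();
    const float* b = B.localpixels();
    assert(r && a && b);
    for (std::size_t i = 0; i < R.storage.size(); ++i)
        r[i] = a[i] + b[i];
}

// Dispatcher: take the fast path only when every precondition holds.
static PathTaken add_impl(BufSketch& R, const BufSketch& A, const BufSketch& B)
{
    if (g_enable_hwy && R.localpixels() && A.localpixels()
        && B.localpixels()) {
        add_impl_hwy(R, A, B);
        return PathTaken::Hwy;
    }
    add_impl_scalar(R, A, B);
    return PathTaken::Scalar;
}
```

Because both branches produce the same values, disabling the switch or passing a non-local buffer changes only which path runs, not the result.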
| 59 | + |
| 60 | +OIIO numeric semantics: why we promote to float |
| 61 | +---------------------------------------------- |
| 62 | + |
| 63 | +OIIO treats integer image pixels as normalized values: |
| 64 | +- Unsigned integers represent [0, 1]. |
| 65 | +- Signed integers represent approximately [-1, 1] with clamping for INT_MIN. |
| 66 | + |
| 67 | +Therefore, most pixel math must be performed in float (or double) space, even |
| 68 | +when the stored data is integer. This is why the hwy layer uses the |
| 69 | +"LoadPromote/Operate/DemoteStore" pattern. |
| 70 | + |
| 71 | +For additional discussion (and pitfalls of saturating integer arithmetic), see: |
| 72 | +- `HIGHWAY_SATURATING_ANALYSIS.md` |
| 73 | + |
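The normalized integer model from this section can be illustrated with a small scalar sketch for 8-bit types. The function names are hypothetical, and the exact signed mapping and rounding rule are assumptions chosen for illustration (divide by the type's maximum, clamp, round to nearest); the real conversion helpers in OIIO may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned 8-bit: 0..255 maps to [0, 1].
static float normalize_u8(std::uint8_t v) { return v / 255.0f; }

// Signed 8-bit: -127..127 maps to [-1, 1]. The minimum value -128 would
// land slightly below -1, so it is clamped (the "INT_MIN" case).
static float normalize_i8(std::int8_t v)
{
    return std::max(v / 127.0f, -1.0f);
}

// Back to unsigned 8-bit: clamp to [0, 1], scale, round to nearest.
static std::uint8_t denormalize_u8(float f)
{
    f = std::min(std::max(f, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(f * 255.0f + 0.5f);
}

// Any pixel math happens between the two conversions, in float space.
static std::uint8_t scale_u8(std::uint8_t v, float factor)
{
    return denormalize_u8(normalize_u8(v) * factor);
}
```

In this sketch, `scale_u8(200, 2.0f)` clamps at the top of the range and `scale_u8(200, 0.5f)` halves the stored value, while a round trip through `normalize_u8` and `denormalize_u8` leaves every 8-bit value unchanged.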
| 74 | + |
| 75 | +The core pattern: LoadPromote -> RunHwy* -> DemoteStore |
| 76 | +------------------------------------------------------- |
| 77 | + |
| 78 | +The helper header `imagebufalgo_hwy_pvt.h` defines the reusable building blocks: |
| 79 | + |
| 80 | +1) Computation type selection |
| 81 | + - `SimdMathType<T>` selects `float` for most types, and `double` only when |
| 82 | + the destination type is `double`. |
| 83 | + |
| 84 | + Rationale: |
| 85 | + - Float math is significantly faster than double on many targets (twice the lanes per vector). |
| 86 | + - For OIIO, integer images are normalized to [0,1] (or ~[-1,1]), so float |
| 87 | + precision is sufficient for typical image processing workloads. |
| 88 | + |
| 89 | +2) Load and promote (with normalization) |
| 90 | + - `LoadPromote(d, ptr)` and `LoadPromoteN(d, ptr, count)` load values and |
| 91 | + normalize integer ranges into the computation space. |
| 92 | + |
| 93 | + Rationale: |
| 94 | + - Consolidates all normalization and conversion logic in one place. |
| 95 | + - Prevents subtle drift where each operation re-implements integer scaling. |
| 96 | + - Ensures tail handling ("N" variants) is correct and consistent. |
| 97 | + |
| 98 | +3) Demote and store (with denormalization/clamp/round) |
| 99 | + - `DemoteStore(d, ptr, v)` and `DemoteStoreN(d, ptr, v, count)` reverse the |
| 100 | + normalization and store results in the destination pixel type. |
| 101 | + |
| 102 | + Rationale: |
| 103 | + - Centralizes rounding and clamping behavior for all destination types. |
| 104 | + - Ensures output matches OIIO scalar semantics. |
| 105 | + |
| 106 | +4) Generic kernel runners (streaming arrays) |
| 107 | + - `RunHwyUnaryCmd`, `RunHwyCmd` (binary), `RunHwyTernaryCmd` |
| 108 | + - These are the primary entry points for most hwy kernels. |
| 109 | + |
| 110 | + Rationale: |
| 111 | + - Encapsulates lane iteration and tail processing once. |
| 112 | + - The call sites only provide the per-lane math lambda, not the boilerplate. |
| 113 | + |
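The shape of the four building blocks above can be modeled in plain C++ without Highway. In this sketch, a fixed-size `float` array plays the role of a vector register, `kLanes` is an arbitrary pretend lane count, and all function names are hypothetical; only the structure (promote on load, operate in float, demote on store, shorter final chunk for the tail) mirrors the description. The real runners use Highway vectors and a target-dependent lane count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;  // pretend vector width for the sketch

// Promotion: integers are normalized, floats pass through.
static float promote(std::uint8_t v) { return v / 255.0f; }
static float promote(float v) { return v; }

// Demotion: integers are clamped, scaled and rounded; floats pass through.
static void demote(float f, std::uint8_t& out)
{
    f   = std::min(std::max(f, 0.0f), 1.0f);
    out = static_cast<std::uint8_t>(f * 255.0f + 0.5f);
}
static void demote(float f, float& out) { out = f; }

template<typename T>
static void load_promote_n(const T* src, float* lanes, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        lanes[i] = promote(src[i]);
}

template<typename T>
static void demote_store_n(T* dst, const float* lanes, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        demote(lanes[i], dst[i]);
}

// Binary runner: the caller supplies only the math in `op`.
template<typename R, typename A, typename B, typename Op>
static void run_binary_sketch(R* r, const A* a, const B* b, std::size_t n,
                              Op op)
{
    float va[kLanes], vb[kLanes], vr[kLanes];
    for (std::size_t i = 0; i < n; i += kLanes) {
        // count is smaller than kLanes only for the final (tail) chunk.
        const std::size_t count = std::min(kLanes, n - i);
        load_promote_n(a + i, va, count);
        load_promote_n(b + i, vb, count);
        for (std::size_t l = 0; l < count; ++l)
            vr[l] = op(va[l], vb[l]);
        demote_store_n(r + i, vr, count);
    }
}
```

A call site then reduces to something like `run_binary_sketch(r, a, b, n, [](float x, float y) { return x + y; })`, with the input and output element types free to differ.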
| 114 | + |
| 115 | +Native integer runners: when they are valid |
| 116 | +------------------------------------------- |
| 117 | + |
| 118 | +Some operations are "scale-invariant" under OIIO's normalized integer model. |
| 119 | +For example, for unsigned integer add: |
| 120 | +- Computing `a/max + b/max` in float space, clamping to [0,1], and scaling back |
| 121 | + by `max` gives the same result as saturated integer add `SaturatedAdd(a, b)` at the same bit depth. |
| 122 | + |
| 123 | +For those cases, `imagebufalgo_hwy_pvt.h` provides: |
| 124 | +- `RunHwyUnaryNativeInt<T>` |
| 125 | +- `RunHwyBinaryNativeInt<T>` |
| 126 | + |
| 127 | +These should only be used when all of the following are true: |
| 128 | +- The operation is known to be scale-invariant under the normalization model. |
| 129 | +- Input and output types are the same integral type. |
| 130 | +- The operation does not depend on mixed types or float-range behavior. |
| 131 | + |
| 132 | +Rationale: |
| 133 | +- Avoids promotion/demotion overhead and can be materially faster. |
| 134 | +- Must be opt-in and explicit, because many operations are NOT compatible with |
| 135 | + raw integer arithmetic (e.g. multiplication, division, pow). |
| 136 | + |
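The scale-invariance argument can be checked exhaustively for 8-bit values. The sketch below compares a float-space path against a raw saturating integer path for both add and multiply. All names are hypothetical, and the normalization (divide by 255, clamp, round to nearest) is an assumption that stands in for the real promote/demote helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Clamp to [0, 1], scale to 8 bits, round to nearest.
static std::uint8_t store_u8(float f)
{
    f = std::min(std::max(f, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(f * 255.0f + 0.5f);
}

// Add: normalized float path vs. saturating integer path.
static std::uint8_t add_float_path(std::uint8_t a, std::uint8_t b)
{
    return store_u8(a / 255.0f + b / 255.0f);
}
static std::uint8_t add_native_path(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(std::min(int(a) + int(b), 255));
}

// Multiply: normalized float path vs. saturating integer path.
static std::uint8_t mul_float_path(std::uint8_t a, std::uint8_t b)
{
    return store_u8((a / 255.0f) * (b / 255.0f));
}
static std::uint8_t mul_native_path(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(std::min(int(a) * int(b), 255));
}

// Count the input pairs for which two implementations disagree.
template<typename F, typename G>
static int count_mismatches(F f, G g)
{
    int mismatches = 0;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            if (f(static_cast<std::uint8_t>(a), static_cast<std::uint8_t>(b))
                != g(static_cast<std::uint8_t>(a),
                     static_cast<std::uint8_t>(b)))
                ++mismatches;
    return mismatches;
}
```

With these definitions the two add paths agree on every input pair, while the multiply paths do not: for example, 128 times 128 is about 0.25 in normalized space (stored as 64), but the raw integer product saturates to 255. This is the kind of check worth running before opting an operation into a native-int runner.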
| 137 | + |
| 138 | +Local pixel pointer helpers: reducing boilerplate safely |
| 139 | +------------------------------------------------------- |
| 140 | + |
| 141 | +Most hwy call sites need repeated pointer and stride computations: |
| 142 | +- Pixel size in bytes. |
| 143 | +- Scanline size in bytes. |
| 144 | +- Base pointer to local pixels. |
| 145 | +- Per-row pointer for a given ROI and scanline. |
| 146 | +- Per-pixel pointer for non-contiguous fallbacks. |
| 147 | + |
| 148 | +To centralize that, `imagebufalgo_hwy_pvt.h` defines: |
| 149 | +- `HwyPixels(ImageBuf&)` and `HwyPixels(const ImageBuf&)` |
| 150 | + returning a small view (`HwyLocalPixelsView`) with: |
| 151 | + - base pointer (`std::byte*` / `const std::byte*`) |
| 152 | + - `pixel_bytes`, `scanline_bytes` |
| 153 | + - `xbegin`, `ybegin`, `nchannels` |
| 154 | +- `RoiNChannels(roi)` for `roi.chend - roi.chbegin` |
| 155 | +- `ChannelsContiguous<T>(view, nchannels)`: |
| 156 | + true only when the pixel stride exactly equals `nchannels * sizeof(T)` |
| 157 | +- `PixelBase(view, x, y)`, `ChannelPtr<T>(view, x, y, ch)` |
| 158 | +- `RoiRowPtr<T>(view, y, roi)` for the start of the ROI row at `roi.xbegin` and |
| 159 | + `roi.chbegin`. |
| 160 | + |
| 161 | +Rationale: |
| 162 | +- Avoids duplicating fragile byte-offset math across many ops. |
| 163 | +- Makes it visually obvious what the code is doing: "get row pointer" vs |
| 164 | + "compute offset by hand." |
| 165 | +- Makes non-contiguous fallback paths less error-prone by reusing the same |
| 166 | + pointer computations. |
| 167 | + |
| 168 | +Important: these helpers are only valid for `ImageBuf` instances with local |
| 169 | +pixels (`localpixels()` non-null). The call sites must check that before using |
| 170 | +them. |
| 171 | + |
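The address arithmetic these helpers encapsulate can be sketched as below. The struct and function names are hypothetical and the signatures are guesses modeled on the list above, not copies of `imagebufalgo_hwy_pvt.h`. The sketch uses `unsigned char` where the real view is described as using `std::byte`, and it assumes the base pointer addresses channel 0 of the pixel at (`xbegin`, `ybegin`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RoiSketch {
    int xbegin, xend, ybegin, yend, chbegin, chend;
};

// Read-only view of a locally stored image (hypothetical layout).
struct ViewSketch {
    const unsigned char* base;   // pixel (xbegin, ybegin), channel 0
    std::size_t pixel_bytes;     // distance between adjacent pixels
    std::size_t scanline_bytes;  // distance between adjacent rows
    int xbegin, ybegin, nchannels;
};

static int roi_nchannels(const RoiSketch& roi)
{
    return roi.chend - roi.chbegin;
}

// True only when a pixel holds exactly `nchannels` values of type T,
// i.e. the requested channels form a dense stream with no gaps.
template<typename T>
static bool channels_contiguous(const ViewSketch& v, int nchannels)
{
    return v.pixel_bytes == static_cast<std::size_t>(nchannels) * sizeof(T);
}

// Byte address of the first channel of pixel (x, y).
static const unsigned char* pixel_base(const ViewSketch& v, int x, int y)
{
    return v.base + static_cast<std::size_t>(y - v.ybegin) * v.scanline_bytes
           + static_cast<std::size_t>(x - v.xbegin) * v.pixel_bytes;
}

// Typed address of channel `ch` of pixel (x, y).
template<typename T>
static const T* channel_ptr(const ViewSketch& v, int x, int y, int ch)
{
    return reinterpret_cast<const T*>(pixel_base(v, x, y)) + ch;
}

// Typed address of the first ROI value on scanline y.
template<typename T>
static const T* roi_row_ptr(const ViewSketch& v, int y, const RoiSketch& roi)
{
    return channel_ptr<T>(v, roi.xbegin, y, roi.chbegin);
}
```

Note that a channel subset (for example 2 of 3 channels) fails the contiguity test even when the buffer itself is tightly packed, which is what routes such ROIs to the per-pixel fallback.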
| 172 | + |
| 173 | +Contiguous fast path vs non-contiguous fallback |
| 174 | +----------------------------------------------- |
| 175 | + |
| 176 | +Most operations implement two paths: |
| 177 | + |
| 178 | +1) Contiguous fast path: |
| 179 | + - Used when pixels are tightly packed for the ROI's channel range. |
| 180 | + - The operation is executed as a 1D stream of length: |
| 181 | + `roi.width() * (roi.chend - roi.chbegin)` |
| 182 | + - Uses `RunHwy*Cmd` (or native-int runner) and benefits from: |
| 183 | + - fewer branches |
| 184 | + - fewer pointer computations |
| 185 | + - automatic tail handling |
| 186 | + |
| 187 | +2) Non-contiguous fallback: |
| 188 | + - Used when pixels have padding, unusual strides, or channel subsets that do |
| 189 | + not form a dense stream. |
| 190 | + - Typically loops pixel-by-pixel and channel-by-channel. |
| 191 | + - May still use the `ChannelPtr` helpers to compute correct addresses. |
| 192 | + |
| 193 | +Rationale: |
| 194 | +- The contiguous path is where SIMD delivers large gains. |
| 195 | +- Trying to SIMD-optimize arbitrary strided layouts often increases complexity |
| 196 | + and risk for marginal benefit. Keeping a scalar fallback preserves |
| 197 | + correctness and maintainability. |
| 198 | + |
| 199 | + |
| 200 | +How to add a new hwy kernel |
| 201 | +--------------------------- |
| 202 | + |
| 203 | +Step 1: Choose the kernel shape |
| 204 | +- Unary: `R = f(A)` -> use `RunHwyUnaryCmd` |
| 205 | +- Binary: `R = f(A, B)` -> use `RunHwyCmd` |
| 206 | +- Ternary: `R = f(A, B, C)` -> use `RunHwyTernaryCmd` |
| 207 | + |
| 208 | +Step 2: Decide if a native-int fast path is valid |
| 209 | +- Only for scale-invariant ops and same-type integral inputs/outputs. |
| 210 | +- Use `RunHwyUnaryNativeInt` / `RunHwyBinaryNativeInt` when safe. |
| 211 | +- Otherwise, always use the promote/demote runners. |
| 212 | + |
| 213 | +Step 3: Implement the hwy body with a contiguity check |
| 214 | +Typical structure inside `*_impl_hwy`: |
| 215 | +- Acquire views once: |
| 216 | + - `auto Rv = HwyPixels(R);` |
| 217 | + - `auto Av = HwyPixels(A);` etc. |
| 218 | +- In the parallel callback: |
| 219 | + - compute `nchannels = RoiNChannels(roi)` |
| 220 | + - compute `contig = ChannelsContiguous<...>(...)` for each image |
| 221 | + - for each scanline y: |
| 222 | + - `Rtype* r_row = RoiRowPtr<Rtype>(Rv, y, roi);` |
| 223 | + - `const Atype* a_row = RoiRowPtr<Atype>(Av, y, roi);` etc. |
| 224 | + - if contig: call `RunHwy*` with `n = roi.width() * nchannels` |
| 225 | + - else: fall back per pixel, per channel |
| 226 | + |
| 227 | +Step 4: Keep the scalar path as the reference |
| 228 | +- The scalar implementation should remain correct for all layouts and types. |
| 229 | +- The hwy path should match scalar results for supported cases. |
| 230 | + |
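The steps above can be put together in a compact skeleton for a unary "invert" kernel on float images. Everything here is a hypothetical sketch: `RowsSketch` measures strides in elements rather than bytes to stay short, `invert_stream` stands in for a `RunHwyUnaryCmd` call, the image origin is assumed to be (0, 0), and the parallel callback is omitted so that only the per-scanline logic remains.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RoiSketch {
    int xbegin, xend, ybegin, yend, chbegin, chend;
};

// Minimal view: strides are counted in elements of T, origin at (0, 0).
template<typename T> struct RowsSketch {
    T* base;
    std::size_t pixel_stride;  // elements between adjacent pixels
    std::size_t row_stride;    // elements between adjacent rows
    // Pointer to the first ROI value on scanline y.
    T* row(int y, const RoiSketch& roi) const
    {
        return base + static_cast<std::size_t>(y) * row_stride
               + static_cast<std::size_t>(roi.xbegin) * pixel_stride
               + static_cast<std::size_t>(roi.chbegin);
    }
};

// Stand-in for the streaming runner: one dense 1D run of n values.
static void invert_stream(float* r, const float* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        r[i] = 1.0f - a[i];
}

// Returns true if the contiguous fast path was used.
static bool invert_impl_sketch(RowsSketch<float> R, RowsSketch<const float> A,
                               const RoiSketch& roi)
{
    const std::size_t nchannels
        = static_cast<std::size_t>(roi.chend - roi.chbegin);
    const std::size_t width = static_cast<std::size_t>(roi.xend - roi.xbegin);
    const bool contig = R.pixel_stride == nchannels
                        && A.pixel_stride == nchannels;
    for (int y = roi.ybegin; y < roi.yend; ++y) {
        float* r_row       = R.row(y, roi);
        const float* a_row = A.row(y, roi);
        if (contig) {
            // Fast path: the whole ROI row is one stream.
            invert_stream(r_row, a_row, width * nchannels);
        } else {
            // Fallback: per pixel, per channel, honoring each stride.
            for (std::size_t x = 0; x < width; ++x)
                for (std::size_t c = 0; c < nchannels; ++c)
                    r_row[x * R.pixel_stride + c]
                        = 1.0f - a_row[x * A.pixel_stride + c];
        }
    }
    return contig;
}
```

Processing all four channels of an RGBA buffer takes the streaming branch, while restricting the ROI to three channels takes the fallback and leaves the fourth channel untouched. Comparing both against a plain scalar loop is a simple way to confirm that the fast path matches the reference.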
| 231 | + |
| 232 | +Design rationale summary |
| 233 | +------------------------ |
| 234 | + |
| 235 | +This design intentionally separates concerns: |
| 236 | +- Type conversion and normalization are centralized (`LoadPromote`, |
| 237 | + `DemoteStore`). |
| 238 | +- SIMD lane iteration and tail handling are centralized (`RunHwy*` runners). |
| 239 | +- Image address computations are centralized (`HwyPixels`, `RoiRowPtr`, |
| 240 | + `ChannelPtr`). |
| 241 | +- Operation-specific code is reduced to short lambdas expressing the math. |
| 242 | + |
| 243 | +This makes the hwy layer: |
| 244 | +- Easier to maintain: fewer places to fix bugs when semantics change. |
| 245 | +- Easier to extend: adding an op mostly means writing the math lambda and the |
| 246 | + dispatch glue. |
| 247 | +- Safer: correctness for unusual layouts remains in scalar fallbacks. |
| 248 | + |
| 249 | + |
| 250 | +Notes on `half` |
| 251 | +--------------- |
| 252 | + |
| 253 | +The hwy conversion helpers handle `half` by converting through |
| 254 | +`hwy::float16_t`. This currently assumes the underlying `half` representation |
| 255 | +is compatible with how Highway loads/stores 16-bit floats. |
| 256 | + |
| 257 | +If this assumption is revisited in the future, it should be changed as a |
| 258 | +separate, explicit correctness/performance project. |
| 259 | + |
| 260 | + |
| 261 | +<!-- SPDX-License-Identifier: CC-BY-4.0 --> |
| 262 | +<!-- Copyright Contributors to the OpenImageIO Project. --> |
| 263 | + |
| 264 | + |