diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000000..7dc163fd70 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,119 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Build System + +OpenVDB uses CMake (minimum 3.24) and requires out-of-source builds. + +**Minimal core build:** +```bash +mkdir build && cd build +cmake .. -DOPENVDB_BUILD_UNITTESTS=ON +make -j$(nproc) +``` + +**Using the CI build script (recommended for full builds):** +```bash +./ci/build.sh --build-type=Release \ + --components="core,test" \ + --cargs="-DOPENVDB_ABI_VERSION_NUMBER=13" +``` + +Component flags for `--components`: `core`, `python`, `bin`, `view`, `render`, `test`, `hou`, `axcore`, `nano`, `nanotest` + +**Key CMake options:** +| Option | Default | Description | +|--------|---------|-------------| +| `OPENVDB_BUILD_CORE` | ON | Core library | +| `OPENVDB_BUILD_UNITTESTS` | OFF | Unit tests | +| `OPENVDB_BUILD_NANOVDB` | OFF | NanoVDB | +| `OPENVDB_BUILD_AX` | OFF | OpenVDB AX | +| `OPENVDB_ABI_VERSION_NUMBER` | 13 | ABI version (6–13) | +| `OPENVDB_CXX_STRICT` | OFF | Strict warnings | +| `NANOVDB_USE_CUDA` | OFF | CUDA support for NanoVDB | + +## Running Tests + +```bash +cd build +ctest -V # all tests +ctest -V -R TestGrid # single test by name +``` + +To build only specific unit tests (avoids full rebuild): +```bash +cmake .. -DOPENVDB_TESTS="Grid;Tree;LeafNode" +``` + +Tests use Google Test (minimum 1.10). Test sources live in: +- `openvdb/openvdb/unittest/` — core library tests (`TestFoo.cc` pattern) +- `nanovdb/nanovdb/unittest/` — NanoVDB tests + +## Code Architecture + +### Repository Layout + +``` +openvdb/openvdb/ Core OpenVDB library + tree/ Tree node hierarchy (RootNode, InternalNode, LeafNode) + tools/ Algorithm implementations (level sets, CSG, smoothing, etc.) + math/ Math primitives (Vec, Mat, Quat, Transform, BBox) + io/ VDB file format I/O + points/ Point data grids + python/ Python bindings (nanobind) + unittest/ Unit tests + +nanovdb/nanovdb/ NanoVDB — compact, GPU-friendly VDB subset + tools/ CPU algorithms + tools/cuda/ CUDA kernels + examples/ Standalone example programs + +openvdb_ax/openvdb_ax/ OpenVDB AX — JIT expression language for VDB operations + ast/ Abstract syntax tree + codegen/ LLVM code generation + compiler/ Compilation pipeline + +openvdb_cmd/ Command-line tools (vdb_print, vdb_lod, vdb_tool, vdb_view, vdb_render) +openvdb_houdini/ Houdini plugin +openvdb_maya/ Maya plugin +cmake/ CMake find-modules and configuration +ci/ CI build/install scripts +``` + +### Core Data Model + +OpenVDB uses a **B+tree-like hierarchical sparse data structure**: +- `Grid` — top-level container with transform and metadata +- `Tree` — composed of `RootNode → InternalNode(s) → LeafNode` +- Leaf nodes are 8×8×8 voxel blocks; internal nodes are 16³ and 32³ by default +- `ValueAccessor` caches tree traversal paths for repeated access patterns +- `GridBase` / `TypedGrid` provide the runtime-polymorphic/compile-time-typed split + +### NanoVDB vs OpenVDB + +NanoVDB is a read-optimized, single-allocation, GPU-portable subset of OpenVDB. It cannot be modified after construction. The `nanovdb/tools/CreateNanoGrid.h` and adjacent files handle conversion from OpenVDB grids to NanoVDB grids. The current branch (`mesh-to-grid`) adds `tools/cuda/MeshToGrid.cuh` for direct CUDA mesh-to-NanoVDB conversion. 
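+
+A minimal host-side conversion sketch (assumes a build with `NANOVDB_USE_OPENVDB`; exact template parameters and defaults may vary between NanoVDB versions):
+
+```cpp
+#include <openvdb/openvdb.h>
+#include <nanovdb/tools/CreateNanoGrid.h>
+
+// Convert an OpenVDB float grid into a single-buffer, read-only NanoVDB grid.
+uint64_t convertToNanoVDB(openvdb::FloatGrid& srcGrid)
+{
+    auto handle = nanovdb::tools::createNanoGrid(srcGrid);   // nanovdb::GridHandle<>
+    const nanovdb::NanoGrid<float>* grid = handle.grid<float>();
+    // The grid is immediately usable on the CPU; copy the handle's buffer to the
+    // device (or build with a device-capable buffer type) for GPU access.
+    return grid->activeVoxelCount();
+}
+```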
+ +### OpenVDB AX + +AX compiles a domain-specific expression language to LLVM IR for execution over OpenVDB volumes and point grids. The pipeline is: source string → AST (`ast/`) → typed analysis → LLVM codegen (`codegen/`) → JIT execution via `compiler/`. + +## C++ Standard and ABI + +- Requires C++17 minimum +- ABI version is set at compile time via `OPENVDB_ABI_VERSION_NUMBER`; the current version is 13 +- Headers are in `openvdb/openvdb/` and installed to `include/openvdb/` + +## Dependencies + +Core: Boost ≥ 1.82, TBB ≥ 2020.3, Blosc ≥ 1.17, OpenEXR/Imath ≥ 3.2, zlib ≥ 1.2.7 +Tests: GTest ≥ 1.10 +Python bindings: Python ≥ 3.11, nanobind ≥ 2.5.0 +NanoVDB GPU: CUDA toolkit +AX: LLVM + +On Linux, ASWF Docker containers (used by CI) bundle most dependencies. See `ci/install_macos.sh` and `ci/install_windows.ps1` for platform-specific setup. + +## Coding Standards + +Follow the style guide at https://www.openvdb.org/documentation/doxygen/codingStyle.html. Contributions require a Developer Certificate of Origin sign-off (`git commit -s`) and a CLA on file — see CONTRIBUTING.md. diff --git a/nanovdb/nanovdb/NanoVDB.h b/nanovdb/nanovdb/NanoVDB.h index 78955cd251..8ed16899c4 100644 --- a/nanovdb/nanovdb/NanoVDB.h +++ b/nanovdb/nanovdb/NanoVDB.h @@ -4136,15 +4136,26 @@ struct NANOVDB_ALIGN(NANOVDB_DATA_ALIGNMENT) LeafDatahasStats() ? this->lastOffset() + 2u : 0u; } __hostdev__ uint64_t getAvg() const { return this->hasStats() ? this->lastOffset() + 3u : 0u; } __hostdev__ uint64_t getDev() const { return this->hasStats() ? this->lastOffset() + 4u : 0u; } + // Default branchless; define NANOVDB_USE_BRANCHY_GETVALUE to restore the + // pre-2026 branchy form. __hostdev__ uint64_t getValue(uint32_t i) const { - //return mValueMask.isOn(i) ? mOffset + mValueMask.countOn(i) : 0u;// for debugging +#ifdef NANOVDB_USE_BRANCHY_GETVALUE uint32_t n = i >> 6; const uint64_t w = BaseT::mValueMask.words()[n], mask = uint64_t(1) << (i & 63u); - if (!(w & mask)) return uint64_t(0); // if i'th value is inactive return offset to background value + if (!(w & mask)) return uint64_t(0); uint64_t sum = BaseT::mOffset + util::countOn(w & (mask - 1u)); if (n--) sum += BaseT::mPrefixSum >> (9u * n) & 511u; return sum; +#else + const uint32_t n = i >> 6; + const uint64_t w = BaseT::mValueMask.words()[n]; + const uint64_t bit = uint64_t(1) << (i & 63u); + const uint64_t prefix = n == 0u ? uint64_t(0) + : (BaseT::mPrefixSum >> (9u * (n - 1u))) & 511u; + const uint64_t sum = BaseT::mOffset + prefix + util::countOn(w & (bit - 1u)); + return ((w & bit) ? ~uint64_t(0) : uint64_t(0)) & sum; +#endif } }; // LeafData diff --git a/nanovdb/nanovdb/examples/CMakeLists.txt b/nanovdb/nanovdb/examples/CMakeLists.txt index 780a4f59bd..1c337b7f48 100644 --- a/nanovdb/nanovdb/examples/CMakeLists.txt +++ b/nanovdb/nanovdb/examples/CMakeLists.txt @@ -114,6 +114,22 @@ nanovdb_example(NAME "ex_merge_nanovdb_cuda" OPENVDB) nanovdb_example(NAME "ex_refine_nanovdb_cuda" OPENVDB) nanovdb_example(NAME "ex_coarsen_nanovdb_cuda" OPENVDB) +nanovdb_example(NAME "ex_voxelBlockManager_host_cuda") +if(TARGET ex_voxelBlockManager_host_cuda) + target_compile_options(ex_voxelBlockManager_host_cuda PRIVATE + $<$:-Xcompiler=-mavx2,-fopenmp-simd> + $<$:-mavx2 -fopenmp-simd>) +endif() + +# End-to-end CPU WENO5 norm-square-gradient on a narrow-band level set, +# with a scalar reference for correctness validation. 
+nanovdb_example(NAME "ex_weno_nanovdb_cpu" OPENVDB) +if(TARGET ex_weno_nanovdb_cpu) + target_compile_options(ex_weno_nanovdb_cpu PRIVATE -march=native -fopenmp-simd) + target_include_directories(ex_weno_nanovdb_cpu PRIVATE + ${CMAKE_CURRENT_SOURCE_DIR}/../../..) +endif() + if(CUDAToolkit_FOUND) nanovdb_example(NAME "ex_make_mgpu_nanovdb") # requires cuRAND target_link_libraries(ex_make_mgpu_nanovdb PRIVATE CUDA::curand) diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VBMImplementationKnowledge.md b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VBMImplementationKnowledge.md new file mode 100644 index 0000000000..105e769b9a --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VBMImplementationKnowledge.md @@ -0,0 +1,337 @@ +# VBM Implementation Knowledge Base + +This document captures non-obvious design decisions, rejected alternatives, invariants, +and performance phenomenology for the VoxelBlockManager (VBM) subsystem and its CPU +port of `decodeInverseMaps`. It is written as dense, structured, agent-consumable +facts rather than narrative. It complements `VoxelBlockManagerContext.md` (semantics +and API). + +--- + +## 1. NanoVDB Leaf Mask Layout — Ambient Space + +A `Mask<3>` stores 512 bits as 8 `uint64_t` words (`maskWords[x]`, x = 0..7). +The NanoVDB linear voxel index within a leaf is `i = x*64 + y*8 + z`, where: +- x = word index (0..7) +- y = byte index within the word (0..7) — bits 8y..8y+7 of `maskWords[x]` +- z = bit index within the byte (0..7) + +**Invariant**: word x covers leaf-local positions x*64 .. x*64+63. The bit at +position `y*8+z` within word x corresponds to leaf-local index `x*64 + y*8 + z`. + +**Why this matters**: every algorithm that operates on the mask must respect this +layout. The "fast" dimension (z, innermost) is the bit dimension within a byte. +The "slow" dimension (y) is the byte dimension within a word. The "word" dimension +(x) is independent across the 8 words. + +--- + +## 2. mPrefixSum Encoding + +`LeafIndexBase::mPrefixSum` (a `uint64_t`) encodes 7 exclusive cumulative popcounts +at the word boundaries, packed as 7 x 9-bit fields: + + bits 9*(x-1) .. 9*(x-1)+8 = total active voxels in words 0..x-1, x = 1..7 + +- Field x-1 is the **exclusive** prefix at word x — i.e., how many active voxels + precede word x in the leaf. +- xOffset[0] = 0 always (implicit, not stored). +- Maximum value per field: 512 (all 512 voxels in preceding words active) — fits in + 9 bits. +- **Precondition for buildMaskPrefixSums**: the caller must pass the leaf's own + `mPrefixSum` field, not a recomputed value. The field is set during grid + construction and is authoritative. + +--- + +## 3. buildMaskPrefixSums — Output Semantics + +`util::buildMaskPrefixSums(mask, prefixSum, offsets[512])` produces: + + offsets[i] = number of active voxels at leaf-local positions 0..i (inclusive) + +- Output is **inclusive** and **1-based**: for an active voxel at position i, + `offsets[i]` is its 1-based rank among all active voxels in the leaf. +- Output is **leaf-local**: cross-word offsets from `mPrefixSum` are folded in, + but the global leaf-first-offset (`leaf.data()->firstOffset()`) is NOT added. + `offsets[i]` is in range [1, 512]. +- For an **inactive** voxel at position i: `offsets[i] == offsets[i-1]` (no + increment). These values are valid and load-bearing for the `shfl_down` + compaction approach (see §6). +- To recover the **exclusive** (0-based) rank of an active voxel at i: + `offsets[i] - 1`. 
+- To recover the **global sequential index** of an active voxel at position i: + `leafFirstOffset - 1 + offsets[i]` + where `leafFirstOffset = leaf.data()->firstOffset()` is the 1-based global index + of the leaf's first active voxel. + +--- + +## 4. transposeByteRow — What It Does and Why + +`transposeByteRow(src)` treats the low 8 bits of `src` as the first row of an 8x8 +bit matrix and returns the result of transposing it: + + output.ui8[z] = (src >> z) & 1 for z = 0..7 + +It is the single-row specialization of `transposeBits8x8`: if only the first row of +an 8x8 bit matrix is non-zero, the full transpose reduces to this operation. + +**Why it is used in buildMaskPrefixSums (Step 1 — indicator fill)**: +The algorithm needs `data[x][y].ui8[z] = indicator(x, y, z)` — a 0/1 value per +(x, y, z) triple. The indicator for (x, y, z) is bit `y*8+z` of `maskWords[x]`, +which is bit z of byte y of `maskWords[x]`. `transposeByteRow(maskWords[x] >> (y*8))` +extracts byte y and places bit z into byte z of the result — exactly the required +indicator layout. + +**Equivalent hardware instruction**: `_pdep_u64(src & 0xFF, kSpread)` on x86, where +`kSpread = 0x0101010101010101`. The software implementation is used for portability. + +**Why not `& kSpread` (the simpler alternative)**: `(maskWords[x] >> z) & kSpread` +would extract bit z from each byte — which is what Plan #2 needed (data[x][z] layout). +Plan #1 uses data[x][y] layout, requiring the byte dimension to be the output axis, +not the input axis. See §5 for the Plan #1 vs Plan #2 decision. + +--- + +## 5. Plan #1 vs Plan #2 — The Rejected Alternative + +**Plan #2** would have used layout `data[x][z].ui8[y]`: +- Simpler indicator fill: `data[x][z].ui64 = (maskWords[x] >> z) & kSpread` + (just a shift and mask — no `transposeByteRow` needed). +- After the z-pass (Hillis-Steele within-uint64 over y), the output is in + `data[x][z]` order, not `data[x][y]` order. +- **Fatal cost**: before zero-extending to uint16_t, a `transposeBytes8x8` call + per x-slice is required to reorder from `data[x][z].ui8[y]` to the required + linear output order. This is ~200 instructions per call x 8 calls = ~1600 + instructions of transpose overhead, dominating the ~14-cycle z+y passes. + +**Plan #1** (chosen) uses layout `data[x][y].ui8[z]`: +- Indicator fill requires `transposeByteRow` — slightly more expensive than `& kSpread`. +- After the z-pass and y-pass, the output is already in linear order (`x*64 + y*8 + z`). +- **No output transpose**: zero-extension to uint16_t is a straight vpmovzxbw over + contiguous memory. + +**Decision criterion**: Plan #1 eliminates the expensive output transpose at the cost +of a cheaper input transformation. The output transpose (1024 bytes) is intrinsically +more expensive than the input transformation (64 bytes). + +--- + +## 6. Compaction Approaches — Inclusive vs Exclusive, Active vs Inactive + +Two approaches exist for building `leafLocalOffsets[j]` (local position of j-th active +voxel) from `offsets`: + +**Scatter approach** (simple, used in decodeInverseMaps): + For each active position i: `leafLocalOffsets[offsets[i] - 1] = i` + Only requires `offsets[i]` at active positions. Inactive positions are not read. + +**shfl_down approach** (deeper SIMD, see DecodeInverseMapsCPUPlan §4e): + Requires `shifts[i] = i - (offsets[i] - isActive(i))` for ALL positions, active + and inactive. The `move[i]` predicate for each pass depends on `shifts[i]` even + for inactive voxels. 
`buildMaskPrefixSums` correctly provides values at all 512 + positions for this purpose. + +**Key invariant**: `offsets` from `buildMaskPrefixSums` is valid at all 512 positions, +not just active ones. This is intentional and necessary for the shfl_down path. + +--- + +## 7. decodeInverseMaps CPU — Current Implementation and Performance History + +**Current implementation** (`VoxelBlockManager.h`, branch `vbm-cpu-port`): +For each leaf overlapping the block: +1. Early-exit `break` if `leafFirstOffset >= blockFirstOffset + BlockWidth` (monotonicity + invariant: all subsequent leaves also fall outside the block). +2. Build `shifts[513]` directly via `buildMaskPrefixSums`: `shifts[0]=0`, + `buildMaskPrefixSums(..., shifts+1)` writes inclusive 0-bit counts into + `shifts[1..512]`, giving `shifts[i]` = exclusive count of inactive voxels in [0..i-1]. + `leafValueCount = 512 - shifts[512]` as a free by-product. +3. Run 9 in-place shuffle-down passes (Shift=1,2,4,...,256) via `util::shuffleDownMask` with a single buffer. +4. Range fill `leafIndex[pStart..pEnd)` and contiguous copy from `leafLocalOffsets`. + +**Eliminated from earlier design**: the separate `prefixSums[513]` array and the explicit +`shifts[i] = i - prefixSums[i]` subtraction loop. `buildMaskPrefixSums` writes +`shifts[]` directly in one pass by running the prefix sum over the inverted mask words and +adjusting the cross-word offsets as `64*x - ones` instead of `ones`. + +**Performance history (2M voxels / 16384 blocks / 25% occupancy / 24 OMP threads / AVX2)**: + +All figures below are wall-clock time for the full `decodeInverseMaps` loop over all +16384 blocks. Entries marked *unoptimized* were measured with `CMAKE_BUILD_TYPE` unset +(effectively `-O0` host compilation via NVCC); entries marked *Release* used +`-DCMAKE_BUILD_TYPE=Release` (`-O3 -DNDEBUG`). + +- Original `getValue()` loop (unoptimized): ~77 ms +- `buildMaskPrefixSums` + bit-scan scatter (unoptimized): ~65 ms (~15% improvement) +- `shuffleDownMask` without vectorization (-fopenmp missing in CUDA host flags, unoptimized): ~250 ms +- `shuffleDownMask` with vectorization (two-buffer `__restrict__`, -fopenmp fixed, unoptimized): ~15-20 ms +- `shuffleDownMask` single-buffer + `buildMaskPrefixSums` (unoptimized): ~15-17 ms (no regression) +- `shuffleDownMask` single-buffer + `buildMaskPrefixSums` (**Release**): **~0.9-1.6 ms (~15x speedup)** + +**Critical finding — build type**: with `CMAKE_BUILD_TYPE` unset, NVCC compiles host +code without optimization. Template functions (`shuffleDownMask`, `buildMaskPrefixSums`) +are not inlined, the `#pragma omp simd` loops are not vectorized, and all locals are +spilled to the stack. The resulting ~15-17 ms figure is not representative of real +performance. Always benchmark with `-DCMAKE_BUILD_TYPE=Release`. + +**Critical finding — -fopenmp**: `#pragma omp simd` is silently ignored for CUDA host +code unless `-Xcompiler -fopenmp` is explicitly added. Linking `OpenMP::OpenMP_CXX` +does NOT automatically propagate compile flags to CUDA sources via CMake. Without +-fopenmp, the shuffle-down passes compile as scalar loops. + +**Why shuffleDownMask beats bit-scan with vectorization**: +- Bit-scan is inherently scalar (data-dependent BSF/BSR instruction, variable trip count). +- shuffleDownMask's 9 passes are fixed-width, data-independent loops over 512 elements. +- With AVX2 (16 uint16_t per register), each pass takes ~32 vector ops vs ~128 scalar ops + for the bit-scan at 25% occupancy. 
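+
+For orientation, a scalar sketch of what either compaction path must produce, following the contracts in §1, §3, and §6 (illustration only -- the shipped path uses the SIMD `buildMaskPrefixSums` + `shuffleDownMask` routines described above):
+
+```cpp
+#include <cstdint>
+
+// maskWords[8]  : the leaf's 512-bit value mask (§1 layout).
+// offsets[512]  : inclusive, 1-based active-voxel ranks from buildMaskPrefixSums (§3).
+// leafLocalOffsets[j] receives the leaf-local position of the j-th active voxel (§6).
+void scatterCompact(const uint64_t maskWords[8],
+                    const uint16_t offsets[512],
+                    uint16_t       leafLocalOffsets[512])
+{
+    for (uint32_t i = 0; i < 512; ++i) {
+        const bool active = (maskWords[i >> 6] >> (i & 63u)) & 1u;
+        if (active)
+            leafLocalOffsets[offsets[i] - 1] = (uint16_t)i;
+    }
+}
+```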
+ +**Recommended CMake invocation** (from the `build/` directory): +```bash +cmake .. \ + -DCMAKE_BUILD_TYPE=Release \ + -DCMAKE_CUDA_ARCHITECTURES=89 \ + -DOpenVDB_ROOT=/home/esifakis/local \ + -DNANOVDB_USE_CUDA=ON \ + -DNANOVDB_USE_OPENVDB=ON \ + -DNANOVDB_USE_TBB=ON \ + -DNANOVDB_BUILD_EXAMPLES=ON \ + -DNANOVDB_BUILD_UNITTESTS=ON \ + -DNANOVDB_BUILD_TOOLS=ON +``` + +--- + +## 8. Critical Portability Notes + +**`UINT64_C` vs `UL` suffix**: always use `UINT64_C(...)` for 64-bit hex constants. +`UL` is 32 bits on MSVC/Windows. Occurrences of `UL`-suffixed 64-bit constants in +`MorphologyHelpers.h` were corrected to `UINT64_C`. + +**`__restrict__` portability**: `__restrict__` (GCC/Clang) vs `__restrict` (MSVC). +NanoVDB has no existing C++ portability macro for this. `CNanoVDB.h` defines +`RESTRICT` but only for the C API. A `NANOVDB_RESTRICT` macro should be added +if `__restrict__` is used in host-only headers. + +**`#pragma omp simd` is safe without OpenMP**: unknown pragmas are silently ignored +by standard-conforming C++17 compilers. All major compilers recognize the `omp` +namespace. The pragma is present in `buildMaskPrefixSums` for portability; GCC 13.3 +auto-vectorizes the loops correctly even without it. + +**`#pragma omp simd` defeated by hardware POPCNT**: under `-mavx2`, GCC replaces +software Hamming-weight expressions with scalar `popcntl` and ignores the simd pragma. +The `popcount32` + `#pragma omp simd` approach in `vbm_host_cuda_kernels.cu` (sanity +bench) requires `-mno-popcnt` to vectorize, which is unsuitable for production. +`buildMaskPrefixSums` avoids this by not using popcount at all. + +--- + +## 9. No __hostdev__ on buildMaskPrefixSums — Deliberate Decision + +`buildMaskPrefixSums` is CPU-only. A CUDA equivalent would be organized around the +32-thread warp (using `__ballot_sync`, warp-level prefix intrinsics, or cooperative +group reductions) and would look fundamentally different. Marking it `__hostdev__` +would be misleading about intended usage and would invite incorrect porting. + +The CUDA `decodeInverseMaps` already has its own highly optimized implementation. +The CPU and CUDA decode paths are expected to remain separate implementations +indefinitely. + +--- + +## 10. Two Essential VBM Invariants (Not Enforced by NanoVDB Itself) + +These invariants are not structurally guaranteed by the NanoVDB data format, but all +reasonable ways of constructing a VBM-compatible grid enforce them, and the VBM +algorithms depend on them for correctness and for simplifying early-exit logic. + +**Invariant A — No empty leaves**: every leaf node in a VBM grid has at least one active +voxel (`leafValueCount > 0`). This is essential for the JumpMap encoding to give an +unambiguous leafID per block: a leaf with zero active voxels would contribute nothing to +the sequential voxel index but would still occupy a slot in the leaf array, breaking the +1-to-1 mapping between sequential voxel ranges and leaf IDs. + +**Invariant B — Monotonically non-decreasing voxelID → leafID mapping, step ≤ 1**: the +global sequential active voxel index increases monotonically across leaves, and consecutive +leaves' active voxel ranges are contiguous (no gaps, no overlaps). Formally: + `leafFirstOffset[k+1] = leafFirstOffset[k] + leafValueCount[k]` +This means active voxel ranges partition the global index space without gaps. + +**Consequences for `decodeInverseMaps`**: +- The check `leafValueCount == 0` is dead code and can be omitted (Invariant A). 
+- The check `leafFirstOffset + leafValueCount <= blockFirstOffset` (leaf entirely before + block) is impossible for any leaf in the iteration range, given that `firstLeafID` is + chosen to be the leaf containing `blockFirstOffset` (Invariant B + correct firstLeafID). +- When `leafFirstOffset >= blockFirstOffset + BlockWidth` fires for one leaf, it holds for + all subsequent leaves (Invariant B), so `break` is correct instead of `continue`. +- The early-exit collapses to a single line: + `if (leafFirstOffset >= blockFirstOffset + BlockWidth) break;` + +--- + +## 11. decodeInverseMaps CPU — Threading and Calling Philosophy + +`decodeInverseMaps` (CPU) is **intentionally single-threaded and stateless**. It decodes +exactly one voxel block per call. The caller (`buildVoxelBlockManager` CPU path) distributes +blocks across threads via `util::forEach` (TBB or `std::thread`). There is no OpenMP thread +parallelism inside the function — `#pragma omp simd` is present only to hint the compiler +toward SIMD vectorization, not to spawn threads. + +**Consequence for callers**: the output arrays `leafIndex[BlockWidth]` and +`voxelOffset[BlockWidth]` must be caller-allocated. The natural pattern is stack allocation +inside the `util::forEach` lambda, one array per block per thread. Heap allocation (e.g., +`new[]` per block) is wasteful; thread-local storage is an option but introduces alignment +complications (see §12). + +**Contrast with GPU**: the GPU `decodeInverseMaps` is cooperative — a full CUDA thread block +(up to 512 threads) decodes one voxel block together, writing to shared memory. On CPU, one +thread decodes one block sequentially but with SIMD across the 512 voxel positions. + +--- + +## 12. Why mPrefixSum Is Not Used in decodeInverseMaps + +`LeafIndexBase::mPrefixSum` encodes per-word exclusive cumulative popcounts as 7 × 9-bit +packed fields (see §2). It is designed for **random access** — locating the global sequential +index of a single voxel in O(1) without scanning the whole mask. + +`decodeInverseMaps` needs the full 512-entry prefix-sum table for every active voxel in the +leaf. For this **bulk sequential access**, recomputing per-word popcounts from scratch via the +`buildMaskPrefixSums` SIMD algorithm (§3) is cheaper than unpacking the 9-bit fields. +Specifically: +- Unpacking 7 × 9-bit fields requires masked shifts and is awkward to vectorize. +- `buildMaskPrefixSums` operates on the raw 8 `uint64_t` words, uses only shifts and adds, and + vectorizes cleanly to 8 AVX2 `vpaddq`/`vpsllq` pairs. +- At 25% occupancy (typical level-set narrow band), `buildMaskPrefixSums` consumes a small + fraction of the total per-leaf work. + +`mPrefixSum` is still used indirectly: it is the `prefixSum` argument passed to +`buildMaskPrefixSums` to fold in the cross-word exclusive offsets (Step 5 of the algorithm). +What is bypassed is the idea of using `mPrefixSum` alone (without a full table build) to compute +individual voxel offsets one at a time. + +--- + +## 13. 
Output Fill Structure — Range Fill + Contiguous Copy, Not Scatter + +The CPU `decodeInverseMaps` fills its output arrays per leaf as: + +``` +leafIndex[pStart..pEnd) = leafID (range fill, constant value) +voxelOffset[pStart..pEnd) = leafLocalOffsets[jStart..jStart+(pEnd-pStart)) (contiguous copy) +``` + +**Why contiguous copy (not scatter)**: `shuffleDownMask` produces `leafLocalOffsets` as a +compacted array of active local positions in ascending order — position 0 holds the 0-th active +voxel's leaf-local offset, position 1 holds the 1-st, etc. Since the block's output range +`[pStart, pEnd)` maps to a contiguous slice `[jStart, jStart+len)` of this compacted array, a +single `std::copy` (or `memcpy`) suffices. No scatter is needed. + +**`std::fill`/`std::copy` vectorization caveat**: if the output pointers are stack-allocated or +derived from TLS, the compiler may not prove alignment and fall back to scalar or SSE stores even +under `-mavx2`. For the sentinel-fill initialization (filling `UnusedLeafIndex` / +`UnusedVoxelOffset` before the leaf loop), this can dominate runtime. Measurements on the test +machine showed `std::fill` via TLS taking ~1.5 ms vs ~0.22 ms with explicit `_mm256_store_si256` +over `alignas(64)` arrays (16384 blocks, 32 threads). Stack-allocate the output arrays with +`alignas(64)` and replace `std::fill` with explicit AVX2 stores when performance matters. diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VoxelBlockManagerContext.md b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VoxelBlockManagerContext.md new file mode 100644 index 0000000000..1ef2ab9bfb --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VoxelBlockManagerContext.md @@ -0,0 +1,259 @@ +# VoxelBlockManager: Context and Design Guide + +This document captures the intent, semantics, and usage patterns of the +VoxelBlockManager (VBM) acceleration structure for NanoVDB OnIndex grids. +It is intended to give a complete mental model to anyone (human or AI assistant) +working on VBM development or writing SIMT kernels that consume it. + +--- + +## What problem does the VBM solve? + +A NanoVDB `ValueOnIndex` (OnIndex) grid assigns a unique sequential integer +index to each active voxel. These indices are dense: if the grid has N active +voxels, they are numbered 1..N (the index 0 is reserved). This sequential +layout enables efficient SIMT parallelism: a GPU kernel can launch one thread +per active voxel, with thread k processing voxel k. + +The challenge is *decoding*: given a voxel's sequential index, how do you find +its 3D coordinates and leaf node? The index space is flat but the tree is +hierarchical. Scanning the tree for each voxel is too expensive. + +The VBM solves this by precomputing two small metadata arrays that let any +thread, given only its sequential index, quickly determine which leaf node +contains that voxel and where within the leaf it lives. This decode is done +cooperatively at thread-block granularity (one thread block per "voxel block"), +using shared memory. + +--- + +## Core concepts + +### Voxel blocks + +The VBM partitions the active voxel index space into contiguous spans of +`BlockWidth` voxels called *voxel blocks*. `BlockWidth` is a compile-time +power of two (typically 128). Block `b` covers sequential indices: + + [firstOffset + b * BlockWidth, firstOffset + (b+1) * BlockWidth - 1] + +A GPU kernel processes one voxel block per thread block, with `BlockWidth` +threads. 
Thread `t` in block `b` is responsible for voxel index +`firstOffset + b * BlockWidth + t`. + +### firstOffset and lastOffset + +`firstOffset` is the base of the VBM's index range. It must satisfy +`firstOffset == 1 (mod BlockWidth)` (i.e., it is "block-aligned"). For a +single-GPU build covering the full grid, `firstOffset = 1` and +`lastOffset = activeVoxelCount()`. + +In a **multi-GPU** setting, each rank owns a contiguous slice of the active +voxel index space. Rank r uses `firstOffset_r` and `lastOffset_r` to build a +VBM covering only its slice, even though all ranks hold a copy of the full +grid topology. The VBM metadata is then sized and indexed relative to each +rank's own `firstOffset`. + +### blockCount + +`blockCount` is the **allocated capacity** of the VBM metadata buffers. It +must be >= `ceil((lastOffset - firstOffset + 1) / BlockWidth)` but may be +larger. This allows pre-allocating a larger handle and rebuilding in-place +for a range that grows over time, without reallocating. + +--- + +## Metadata arrays + +### firstLeafID (uint32_t[blockCount]) + +`firstLeafID[b]` is the index of the **first leaf node** that overlaps voxel +block `b`. A leaf overlaps block `b` if any of its active voxels fall in +`[firstOffset + b*BlockWidth, firstOffset + (b+1)*BlockWidth - 1]`. + +In a sequential NanoVDB grid, leaf nodes are laid out contiguously in memory +in ascending order of their first active voxel index. So `firstLeafID[b]` is +the smallest leaf index `i` such that leaf `i` has at least one active voxel +in block `b`. + +### jumpMap (uint64_t[blockCount * JumpMapLength]) + +`JumpMapLength = BlockWidth / 64`. The jumpMap for block `b` is a bitfield of +`BlockWidth` bits (stored as `JumpMapLength` uint64_t words) where bit `p` is +set if and only if a new leaf node **begins** at position `p` within block `b` +(i.e., some leaf's first active voxel has sequential index +`firstOffset + b*BlockWidth + p`, and `p > 0`). + +Bit 0 is never set (the leaf starting exactly at the block boundary is +recorded in `firstLeafID`, not the jumpMap). + +Together, `firstLeafID[b]` and the jumpMap for block `b` enumerate all leaf +nodes that overlap block `b`: start at `firstLeafID[b]`, then count the set +bits in the jumpMap to find how many additional leaves follow. + +--- + +## Build API + +### Handle: VoxelBlockManagerHandle + +A pure data holder. Owns two buffers (`firstLeafID` and `jumpMap`) and stores +`blockCount`, `firstOffset`, `lastOffset`. No allocation or build logic is in +the handle itself. A default-constructed handle is the canonical "null/empty" +state (`blockCount == 0`). + +Accessors: +- `blockCount()`, `firstOffset()`, `lastOffset()` +- `hostFirstLeafID()` / `hostJumpMap()` -- host-side pointers +- `deviceFirstLeafID()` / `deviceJumpMap()` -- device-side pointers (only when + BufferT has a device dual, e.g. 
`cuda::DeviceBuffer` or a unified buffer) + +### Two-overload pattern + +Both the CPU (`nanovdb::tools::`) and CUDA (`nanovdb::tools::cuda::`) build +functions follow the same two-overload pattern, mirroring the NodeManager +convention in NanoVDB: + +**Allocating overload** -- returns a new, fully-constructed handle: + + // CPU + auto handle = nanovdb::tools::buildVoxelBlockManager(grid); + + // CUDA + auto handle = nanovdb::tools::cuda::buildVoxelBlockManager(d_grid, stream); + +Optional parameters (with sentinel 0 meaning "derive from grid"): +- `firstOffset` -- defaults to 1 +- `lastOffset` -- defaults to `activeVoxelCount()` (CPU) or read from device + via `DeviceGridTraits::getActiveVoxelCount()` (CUDA) +- `nBlocks` -- defaults to `ceil((lastOffset - firstOffset + 1) / BlockWidth)` + +If `lastOffset < firstOffset` (e.g., empty grid), a default-constructed null +handle is returned immediately with no allocation attempted. + +**Rebuild-in-place overload** -- takes a pre-allocated handle by reference, +zeroes the jumpMap, and recomputes both arrays. No allocation: + + // CPU + nanovdb::tools::buildVoxelBlockManager(grid, handle); + + // CUDA + nanovdb::tools::cuda::buildVoxelBlockManager(d_grid, handle, stream); + +A null handle (`blockCount == 0`) is silently ignored (no-op). This overload +is the right choice for benchmarking or for rebuilding after a +topology-preserving update without paying allocation cost. + +The allocating overload delegates to the rebuild overload after allocating +buffers -- no logic duplication. + +--- + +## CPU vs GPU implementation asymmetry + +### GPU: launch at lower-node granularity + +A NanoVDB level-1 internal node ("lower node") has up to 4096 leaf child +slots (16^3). The GPU kernel launches one CTA per lower node (subdivided +into `SlicesPerLowerNode = 8` slices for additional parallelism), so each +thread handles approximately one leaf child slot per iteration. Threads check +the lower node's `childMask` to skip empty slots. + +This grouping is chosen because: +- It naturally sizes the CTA workload (4096 slots / 8 slices / 128 threads = 4 slots per thread) +- Threads in the same warp access leaves from the same lower node, improving + memory access locality +- The grid is `<<>>` + +The cost is wasted threads for sparse lower nodes (few active leaf children +out of 4096 slots). This is an acceptable trade-off on the GPU. + +### CPU: iterate leaves directly + +On the CPU there is no benefit to the lower-node grouping. The build uses +`std::for_each(std::execution::par, firstLeaf, firstLeaf + leafCount, ...)`, +iterating directly over the flat contiguous leaf array. Each task processes +exactly one leaf -- no child mask checks, no wasted iterations. + +Leaf index is computed by pointer arithmetic: `&leaf - firstLeaf`. + +### Why firstLeafID writes are race-free + +In a sequential NanoVDB grid, leaf offset ranges are non-overlapping and +ordered. A leaf that spans from block `a` to block `a+k` (max k=3 for +BlockWidth=128 and leaves with <=512 active voxels) "backward-fills" +`firstLeafID[a+1..a+k]`. No other leaf can start in a block before `a+k` +without its offset range overlapping leaf `a`'s range -- which is impossible. +Hence at most one leaf writes each `firstLeafID[b]` entry, so the writes +are non-atomic. + +The jumpMap writes, by contrast, require atomic OR because multiple leaves +from different parts of the tree can start at positions within the same block. 
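+
+As an illustration of how the two arrays combine on the consumer side (this is a hypothetical helper, not the shipped `decodeInverseMaps`), the leaf covering position `p` of block `b` is `firstLeafID[b]` plus the number of leaf-start bits at positions 1..p of the block's jumpMap. A host-side sketch, assuming `BlockWidth = 128`:
+
+    // Hypothetical helper, not part of the VBM API.
+    static uint32_t popcount64(uint64_t w) { uint32_t n = 0; for (; w; w &= w - 1) ++n; return n; }
+
+    uint32_t leafForPosition(const uint32_t* firstLeafID,
+                             const uint64_t* jumpMap,    // blockCount * JumpMapLength words
+                             uint32_t b, uint32_t p)     // 0 <= p < BlockWidth
+    {
+        constexpr uint32_t JumpMapLength = 2;            // BlockWidth / 64
+        const uint64_t* jm = jumpMap + (uint64_t)b * JumpMapLength;
+        uint32_t jumps = 0;
+        for (uint32_t w = 0; w <= (p >> 6); ++w) {
+            uint64_t word = jm[w];
+            if (w == (p >> 6) && (p & 63u) != 63u)       // keep only bits 0..p in p's word
+                word &= (uint64_t(1) << ((p & 63u) + 1u)) - 1u;
+            jumps += popcount64(word);                   // bit 0 is never set, so this counts bits 1..p
+        }
+        return firstLeafID[b] + jumps;
+    }
+
+The cooperative decode builds the full per-position map for a block instead of answering one query at a time, but the counting rule is the same.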
+ +--- + +## Kernel usage pattern (SIMT consumer) + +The typical VBM-powered kernel: + + __global__ void myKernel( + NanoGrid* grid, + const uint32_t* firstLeafID, + const uint64_t* jumpMap, + uint64_t firstOffset, uint64_t lastOffset, uint32_t nBlocks) + { + __shared__ uint32_t smem_leafIndex[BlockWidth]; + __shared__ uint16_t smem_voxelOffset[BlockWidth]; + + int blockID = blockIdx.x; + uint64_t blockFirstOffset = firstOffset + (uint64_t)blockID * BlockWidth; + + // Cooperative decode: all threads in the block participate + VoxelBlockManager::decodeInverseMaps( + grid, + firstLeafID[blockID], + &jumpMap[blockID * VoxelBlockManager::JumpMapLength], + blockFirstOffset, + smem_leafIndex, + smem_voxelOffset); + // smem_leafIndex[t] and smem_voxelOffset[t] now hold the leaf index + // and intra-leaf voxel offset for thread t's voxel. + // Entries beyond lastOffset are filled with UnusedLeafIndex / UnusedVoxelOffset. + + int t = threadIdx.x; + uint64_t globalIndex = blockFirstOffset + t; + if (globalIndex > lastOffset) return; + if (smem_leafIndex[t] == VoxelBlockManager::UnusedLeafIndex) return; + + // From here: access the voxel's 3D position, values, or stencil. + // VoxelBlockManager::computeBoxStencil(...) uses the same + // smem arrays to look up the 27 stencil neighbor indices. + } + + // Launch: one block per VBM block + myKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), + vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), + vbmHandle.lastOffset(), + (uint32_t)vbmHandle.blockCount()); + +Key points: +- `decodeInverseMaps` must be called by ALL threads in the block + (it uses `__syncthreads` internally). Do not call from divergent threads. +- `computeBoxStencil` does NOT synchronize and may be called per-thread. +- Voxels in the last partial block beyond `lastOffset` get sentinel values + (`UnusedLeafIndex = 0xffffffff`, `UnusedVoxelOffset = 0xffff`); always + guard with a `globalIndex <= lastOffset` check. + +--- + +## Files + +- `nanovdb/tools/VoxelBlockManager.h` -- Handle class, CPU build functions +- `nanovdb/tools/cuda/VoxelBlockManager.cuh` -- VoxelBlockManager device struct, + CUDA build functions +- `nanovdb/examples/ex_voxelBlockManager_host_cuda/` -- end-to-end example, + benchmarks CPU vs GPU build time, validates CPU/GPU metadata agreement, + validates full inverse map against grid structure diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda.cpp b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda.cpp new file mode 100644 index 0000000000..2ebc9370e0 --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda.cpp @@ -0,0 +1,96 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file vbm_host_cuda.cpp + + \brief Test harness for the VoxelBlockManager (CUDA reference implementation). + + Generates a random sparse domain at a configurable occupancy level using a + Morton-curve layout, builds a ValueOnIndex NanoVDB grid, constructs the + VoxelBlockManager on the GPU, decodes the full inverse map (leafIndex, + voxelOffset per active voxel), and validates the result on the host. 
+ + Usage: vbm_host_cuda [ambient_voxels [occupancy]] + ambient_voxels Total universe of voxel positions (default: 1048576) + occupancy Fraction of positions that are active in [0,1] (default: 0.5) +*/ + +#include + +#include +#include +#include +#include +#include + +void runVBMCudaTest(const std::vector& coords); + +/// @brief Unpack one component of a Morton-encoded index into a coordinate. +/// Keeps every third bit of the input, then packs them into a contiguous integer. +static uint32_t coordinate_bitpack(uint32_t x) +{ + x &= 0x49249249; + x |= (x >> 2); x &= 0xc30c30c3; + x |= (x >> 4); x &= 0x0f00f00f; + x |= (x >> 8); x &= 0xff0000ff; + x |= (x >> 16); x &= 0x0000ffff; + return x; +} + +/// @brief Generate active voxel coordinates at the requested occupancy level. +/// Voxels are drawn uniformly at random from a Morton-curve layout over +/// ambient_voxels positions, giving spatially coherent 3D coordinates. +static std::vector +generateDomain(int ambient_voxels, float occupancy, uint32_t seed = 42) +{ + const int target = (int)(occupancy * (float)ambient_voxels); + + std::mt19937 rng(seed); + std::uniform_int_distribution dist(0, ambient_voxels - 1); + + std::vector voxmap(ambient_voxels, false); + int active = 0; + while (active < target) { + int i = dist(rng); + if (!voxmap[i]) { voxmap[i] = true; ++active; } + } + + std::vector coords; + coords.reserve(active); + for (int i = 0; i < ambient_voxels; ++i) { + if (voxmap[i]) { + coords.emplace_back( + (int)coordinate_bitpack( i & 0x49249249), + (int)coordinate_bitpack((i >> 1) & 0x49249249), + (int)coordinate_bitpack((i >> 2) & 0x49249249)); + } + } + return coords; +} + +int main(int argc, char** argv) +{ + try { + int ambient_voxels = 8 * 1024 * 1024; + float occupancy = 0.25f; + + if (argc > 1) ambient_voxels = std::stoi(argv[1]); + if (argc > 2) occupancy = std::stof(argv[2]); + + occupancy = std::max(0.0f, std::min(1.0f, occupancy)); + + std::cout << "ambient_voxels = " << ambient_voxels << "\n" + << "occupancy = " << occupancy << "\n"; + + auto coords = generateDomain(ambient_voxels, occupancy); + std::cout << "Active voxels generated: " << coords.size() << "\n"; + + runVBMCudaTest(coords); + } + catch (const std::exception& e) { + std::cerr << "An exception occurred: \"" << e.what() << "\"\n"; + return 1; + } + return 0; +} diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda_kernels.cu b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda_kernels.cu new file mode 100644 index 0000000000..91af736e9c --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda_kernels.cu @@ -0,0 +1,398 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file vbm_host_cuda_kernels.cu + + \brief CUDA implementation for the VoxelBlockManager test harness. + + Builds the VoxelBlockManager from a ValueOnIndex grid, decodes the full + inverse map (leafIndex, voxelOffset) for all active voxels, downloads the + result to the host, and validates it against the grid structure. + Also benchmarks CPU vs GPU VBM construction with repeated timed runs. 
+*/ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + + +#include +#include +#include +#include + +namespace { + +static constexpr int Log2BlockWidth = 7; +static constexpr int BlockWidth = 1 << Log2BlockWidth; // 128 + +using VBM = nanovdb::tools::cuda::VoxelBlockManager; +using CPUVBM = nanovdb::tools::VoxelBlockManager; + + +/// @brief For each VBM block, decode the inverse map and store +/// (leafIndex, voxelOffset) for every active voxel into global output arrays +/// indexed by (globalVoxelOffset - firstOffset). +/// Launch configuration: <<>> +__global__ void decodeAllBlocksKernel( + nanovdb::NanoGrid* grid, + const uint32_t* firstLeafID, + const uint64_t* jumpMap, + uint64_t firstOffset, + uint64_t lastOffset, + uint32_t nBlocks, + uint32_t* outLeafIndex, + uint16_t* outVoxelOffset) +{ + __shared__ uint32_t smem_leafIndex[BlockWidth]; + __shared__ uint16_t smem_voxelOffset[BlockWidth]; + + uint32_t bID = blockIdx.x; + if (bID >= nBlocks) return; + + uint64_t blockFirstOffset = firstOffset + (uint64_t)bID * BlockWidth; + + VBM::decodeInverseMaps( + grid, + firstLeafID[bID], + &jumpMap[(uint64_t)bID * VBM::JumpMapLength], + blockFirstOffset, + smem_leafIndex, + smem_voxelOffset); + + int tID = threadIdx.x; + uint64_t globalIdx = blockFirstOffset + tID; + + if (globalIdx <= lastOffset && smem_leafIndex[tID] != VBM::UnusedLeafIndex) { + uint64_t k = globalIdx - firstOffset; // 0-based + outLeafIndex[k] = smem_leafIndex[tID]; + outVoxelOffset[k] = smem_voxelOffset[tID]; + } +} + +/// @brief For each VBM block, decode the inverse map into shared memory and +/// write a token value per block to dummyOut to prevent dead code elimination. +/// No per-voxel output is written to global memory. +/// Launch configuration: <<>> +__global__ void benchDecodeKernel( + nanovdb::NanoGrid* grid, + const uint32_t* firstLeafID, + const uint64_t* jumpMap, + uint64_t firstOffset, + uint64_t lastOffset, + uint32_t nBlocks, + uint32_t* dummyOut) +{ + __shared__ uint32_t smem_leafIndex[BlockWidth]; + __shared__ uint16_t smem_voxelOffset[BlockWidth]; + + uint32_t bID = blockIdx.x; + if (bID >= nBlocks) return; + + uint64_t blockFirstOffset = firstOffset + (uint64_t)bID * BlockWidth; + + VBM::decodeInverseMaps( + grid, + firstLeafID[bID], + &jumpMap[(uint64_t)bID * VBM::JumpMapLength], + blockFirstOffset, + smem_leafIndex, + smem_voxelOffset); + + // Token write from thread 0 only to prevent dead code elimination + if (threadIdx.x == 0) { + dummyOut[bID] = + (smem_leafIndex[0] != VBM::UnusedLeafIndex ? 1u : 0u) + + (smem_voxelOffset[0] != VBM::UnusedVoxelOffset ? 
1u : 0u); + } +} + +} // anonymous namespace + +void runVBMCudaTest(const std::vector& coords) +{ + const uint64_t nCoords = coords.size(); + + // --- Build ValueOnIndex grid on GPU --- + nanovdb::Coord* d_coords = nullptr; + cudaCheck(cudaMalloc(&d_coords, nCoords * sizeof(nanovdb::Coord))); + cudaCheck(cudaMemcpy(d_coords, coords.data(), + nCoords * sizeof(nanovdb::Coord), cudaMemcpyHostToDevice)); + + auto handle = nanovdb::tools::cuda::voxelsToGrid( + d_coords, nCoords); + cudaCheck(cudaFree(d_coords)); + + auto* d_grid = handle.deviceGrid(); + if (!d_grid) throw std::runtime_error("Failed to create device grid"); + + // Download grid to host for validation and CPU build + handle.deviceDownload(); + auto* h_grid = handle.grid(); + if (!h_grid) throw std::runtime_error("Failed to download host grid"); + + const auto& tree = h_grid->tree(); + const uint64_t nVoxels = h_grid->activeVoxelCount(); + const uint32_t nBlocks = (uint32_t)((nVoxels + BlockWidth - 1) >> Log2BlockWidth); + + std::cout << "Active voxels (unique): " << nVoxels << "\n" + << "Leaf nodes : " << tree.nodeCount(0) << "\n" + << "Lower nodes : " << tree.nodeCount(1) << "\n" + << "Upper nodes : " << tree.nodeCount(2) << "\n" + << "VBM blocks : " << nBlocks + << " (BlockWidth=" << BlockWidth << ")\n\n"; + + // --- Benchmark VBM construction: GPU vs CPU --- + // Allocate handles once; timing runs reuse the buffers (memset + kernel only, + // no allocation overhead). First run per device serves as warmup - important + // for unified-memory buffers where the first access triggers page migration. + static constexpr int nRuns = 5; + + auto gpuHandle = nanovdb::tools::cuda::buildVoxelBlockManager(d_grid); + auto cpuHandle = nanovdb::tools::buildVoxelBlockManager(h_grid); + + // GPU build (cudaMemsetAsync + kernel, pre-allocated buffers) + { + float minMs = std::numeric_limits::max(); + for (int i = 0; i < nRuns; ++i) { + cudaCheck(cudaDeviceSynchronize()); // ensure stream is idle before timing + nanovdb::util::cuda::Timer gpuTimer; + nanovdb::tools::cuda::buildVoxelBlockManager(d_grid, gpuHandle); + float ms = gpuTimer.elapsed(); // records stop event and synchronizes + if (i > 0) minMs = std::min(minMs, ms); + } + std::cout << "GPU buildVoxelBlockManager (memset+kernel): min " << minMs + << " ms (over " << nRuns-1 << " post-warmup runs)\n"; + } + + // CPU build (std::memset + std::for_each(par), pre-allocated buffers) + { + float minMs = std::numeric_limits::max(); + for (int i = 0; i < nRuns; ++i) { + nanovdb::util::Timer cpuTimer; + cpuTimer.start(""); + nanovdb::tools::buildVoxelBlockManager(h_grid, cpuHandle); + float ms = (float)cpuTimer.elapsed() / 1000.0f; + if (i > 0) minMs = std::min(minMs, ms); + } + std::cout << "CPU buildVoxelBlockManager (memset+forEachPar): min " << minMs + << " ms (over " << nRuns-1 << " post-warmup runs)\n\n"; + } + + // --- Validate CPU build against GPU build --- + // Download GPU metadata to host and compare byte-for-byte with the CPU handle. 
+ { + const uint64_t firstLeafIDBytes = gpuHandle.blockCount() * sizeof(uint32_t); + const uint64_t jumpMapBytes = gpuHandle.blockCount() * (BlockWidth / 64) * sizeof(uint64_t); + + std::vector gpuFirstLeafID(gpuHandle.blockCount()); + std::vector gpuJumpMap(gpuHandle.blockCount() * (BlockWidth / 64)); + + cudaCheck(cudaMemcpy(gpuFirstLeafID.data(), gpuHandle.deviceFirstLeafID(), + firstLeafIDBytes, cudaMemcpyDeviceToHost)); + cudaCheck(cudaMemcpy(gpuJumpMap.data(), gpuHandle.deviceJumpMap(), + jumpMapBytes, cudaMemcpyDeviceToHost)); + + const bool firstLeafIDMatch = std::memcmp(gpuFirstLeafID.data(), + cpuHandle.hostFirstLeafID(), firstLeafIDBytes) == 0; + const bool jumpMapMatch = std::memcmp(gpuJumpMap.data(), + cpuHandle.hostJumpMap(), jumpMapBytes) == 0; + + if (firstLeafIDMatch && jumpMapMatch) + std::cout << "CPU/GPU metadata match: PASSED\n"; + else + std::cerr << "CPU/GPU metadata match: FAILED" + << (firstLeafIDMatch ? "" : " [firstLeafID mismatch]") + << (jumpMapMatch ? "" : " [jumpMap mismatch]") << "\n"; + } + + // gpuHandle is the last-built GPU VBM; use it for decode/validation below + auto& vbmHandle = gpuHandle; + + // --- Decode all blocks into unified memory --- + // Using thrust::universal_vector so the kernel writes directly into host-accessible + // memory, avoiding a separate cudaMemcpy download step. + thrust::universal_vector outLeafIndex(nVoxels); + thrust::universal_vector outVoxelOffset(nVoxels); + + decodeAllBlocksKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), + vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), vbmHandle.lastOffset(), nBlocks, + outLeafIndex.data().get(), outVoxelOffset.data().get()); + cudaCheckError(); + cudaCheck(cudaDeviceSynchronize()); + + // --- Validate GPU decodeInverseMaps on host --- + // For each active voxel index k+1, the decoded (leafIndex, voxelOffset) + // must satisfy: leaf[leafIndex].getValue(voxelOffset) == k+1 + { + const auto* firstLeaf = tree.getFirstNode<0>(); + uint64_t errors = 0; + + for (uint64_t k = 0; k < nVoxels; ++k) { + const uint32_t leafIdx = outLeafIndex[k]; + const uint16_t voxelOff = outVoxelOffset[k]; + const uint64_t expectedIdx = k + vbmHandle.firstOffset(); + const uint64_t decodedIdx = firstLeaf[leafIdx].getValue(voxelOff); + + if (decodedIdx != expectedIdx) { + if (errors < 5) + std::cerr << "ERROR at k=" << k + << ": expected index " << expectedIdx + << ", decoded " << decodedIdx + << " (leaf=" << leafIdx + << " voxelOff=" << voxelOff << ")\n"; + ++errors; + } + } + + if (errors == 0) + std::cout << "GPU decodeInverseMaps: PASSED (all " << nVoxels + << " entries validated)\n"; + else + std::cerr << "GPU decodeInverseMaps: FAILED (" << errors << " / " + << nVoxels << " entries incorrect)\n"; + } + + // --- Validate CPU decodeInverseMaps against GPU results --- + // cpuHandle metadata was already verified to agree with gpuHandle above. + // Run the CPU decode block-by-block and compare against the unified-memory + // GPU output (which is now host-accessible after the sync above). 
+ { + const uint32_t* firstLeafIDPtr = cpuHandle.hostFirstLeafID(); + const uint64_t* jumpMapPtr = cpuHandle.hostJumpMap(); + + std::vector cpuLeafIndex(BlockWidth); + std::vector cpuVoxelOffset(BlockWidth); + uint64_t errors = 0; + + for (uint32_t bID = 0; bID < nBlocks; ++bID) { + const uint64_t blockFirstOffset = + cpuHandle.firstOffset() + (uint64_t)bID * BlockWidth; + + CPUVBM::decodeInverseMaps( + h_grid, + firstLeafIDPtr[bID], + &jumpMapPtr[(uint64_t)bID * CPUVBM::JumpMapLength], + blockFirstOffset, + cpuLeafIndex.data(), + cpuVoxelOffset.data()); + + for (int tID = 0; tID < BlockWidth; ++tID) { + const uint64_t k = (uint64_t)bID * BlockWidth + tID; + if (k >= nVoxels) break; + if (cpuLeafIndex[tID] != outLeafIndex[k] || + cpuVoxelOffset[tID] != outVoxelOffset[k]) { + if (errors < 5) + std::cerr << "CPU/GPU decodeInverseMaps mismatch at k=" << k + << ": cpu=(" << cpuLeafIndex[tID] << "," + << cpuVoxelOffset[tID] << ") gpu=(" + << outLeafIndex[k] << "," << outVoxelOffset[k] + << ")\n"; + ++errors; + } + } + } + + if (errors == 0) + std::cout << "CPU decodeInverseMaps vs GPU: PASSED\n"; + else + std::cerr << "CPU decodeInverseMaps vs GPU: FAILED (" + << errors << " mismatches)\n"; + } + + // --- GPU decodeInverseMaps performance benchmark --- + // benchDecodeKernel decodes into shared memory only; one token uint32_t per + // block is written to dummyOut to prevent dead code elimination. + { + thrust::universal_vector dummyOut(nBlocks, 0); + + static constexpr int nPerfRuns = 20; + + // Warmup launch (first kernel invocation may incur JIT / page-migration cost) + benchDecodeKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), vbmHandle.lastOffset(), nBlocks, + dummyOut.data().get()); + cudaCheck(cudaDeviceSynchronize()); + + std::cout << "\nGPU decodeInverseMaps performance (" + << nBlocks << " blocks, " << nVoxels << " voxels):\n"; + + for (int run = 0; run < nPerfRuns; ++run) { + cudaCheck(cudaDeviceSynchronize()); + nanovdb::util::cuda::Timer gpuTimer; + benchDecodeKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), vbmHandle.lastOffset(), nBlocks, + dummyOut.data().get()); + const float ms = gpuTimer.elapsed(); + std::cout << " run " << run << ": " << ms << " ms\n"; + } + + // Sum dummyOut on host (unified memory, already sync'd by last gpuTimer.elapsed()) + uint64_t dummy = 0; + for (uint32_t i = 0; i < nBlocks; ++i) dummy += dummyOut[i]; + std::cout << " (dummy=" << dummy << ")\n"; + } + + // --- CPU decodeInverseMaps performance benchmark --- + // Scratch arrays are reused across all blocks so the only memory traffic is + // reading VBM metadata and leaf data -- not writing nVoxels of output. + // Token reads of scratch[0] after each block defeat dead code elimination + // without adding meaningful cost. 
+ { + const uint32_t* firstLeafIDPtr = cpuHandle.hostFirstLeafID(); + const uint64_t* jumpMapPtr = cpuHandle.hostJumpMap(); + + static constexpr int nPerfRuns = 20; + + std::atomic dummy{0}; + + std::cout << "\nCPU decodeInverseMaps performance (" + << nBlocks << " blocks, " << nVoxels << " voxels):\n"; + + for (int run = 0; run < nPerfRuns; ++run) { + nanovdb::util::Timer timer; + timer.start(""); + + nanovdb::util::forEach(0, nBlocks, 1, + [&](const nanovdb::util::Range1D& range) { + alignas(64) uint32_t scratchLeafIndex[BlockWidth]; + alignas(64) uint16_t scratchVoxelOffset[BlockWidth]; + for (auto bID = range.begin(); bID < range.end(); ++bID) { + const uint64_t blockFirstOffset = + cpuHandle.firstOffset() + (uint64_t)bID * BlockWidth; + + CPUVBM::decodeInverseMaps( + h_grid, + firstLeafIDPtr[bID], + &jumpMapPtr[(uint64_t)bID * CPUVBM::JumpMapLength], + blockFirstOffset, + scratchLeafIndex, + scratchVoxelOffset); + + dummy += (scratchLeafIndex[0] != CPUVBM::UnusedLeafIndex) ? 1u : 0u; + dummy += (scratchVoxelOffset[0] != CPUVBM::UnusedVoxelOffset) ? 1u : 0u; + } + }); + + const float ms = + (float)timer.elapsed() / 1000.0f; + std::cout << " run " << run << ": " << ms << " ms\n"; + } + std::cout << " (dummy=" << dummy.load() << ")\n"; + } +} diff --git a/nanovdb/nanovdb/examples/ex_weno_nanovdb_cpu/weno_nanovdb_cpu.cpp b/nanovdb/nanovdb/examples/ex_weno_nanovdb_cpu/weno_nanovdb_cpu.cpp new file mode 100644 index 0000000000..64d20debfe --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_weno_nanovdb_cpu/weno_nanovdb_cpu.cpp @@ -0,0 +1,530 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file weno_nanovdb_cpu.cpp + + \brief End-to-end CPU WENO5 norm-square-gradient on a narrow-band level + set, with a scalar reference for correctness validation. + + End-to-end pipeline: + + VBM decode -> per-batch sidecar value assembly -> out-of-band + sign-extrapolation -> SIMD Godunov WENO5 -> per-voxel |grad phi|^2 + output sidecar. + + Two passes run over the same .vdb input: + + reference : per-voxel scalar nanovdb::math::WenoStencil>::normSqGrad. + Tile values and in-leaf inactive values preserved through the + OpenVDB -> NanoVDB conversion carry correctly-signed + extrapolation "for free", matching our explicit extrapolate() + semantics on in-the-band-typical topology. + + fast : WenoStencil::gatherIndices() -> per-tap SIMD load -> + extrapolate() -> normSqGrad() -> per-lane scalar store. + + Both passes write to the same-shape output buffer, keyed by ValueOnIndex + slot; a histogram of |outputRef - outputFast| follows. + + Usage: + weno_nanovdb_cpu [--grid=] + [--threads=] + [--skip-validation] + + Build: + Configured via CMakeLists.txt in the parent examples/ directory. + Requires OpenVDB (for .vdb IO). No CUDA. 
+*/ + +#include +#include +#include +#include +#include +#include +#include +#include // scalar reference WenoStencil + +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +// ============================================================ +// Constants and type aliases +// ============================================================ + +static constexpr int Log2BlockWidth = 7; +static constexpr int BlockWidth = 1 << Log2BlockWidth; // 128 +static constexpr int SIMDw = 16; // float lane width + +using BuildT = nanovdb::ValueOnIndex; +using IndexGridT = nanovdb::NanoGrid; +using LeafT = nanovdb::NanoLeaf; +using FloatGridT = nanovdb::NanoGrid; +using CPUVBM = nanovdb::tools::VoxelBlockManager; + +// ============================================================ +// VDB loading and NanoVDB conversion +// ============================================================ + +static openvdb::FloatGrid::Ptr +loadFloatGridFromVdb(const std::string& path, const std::string& gridName) +{ + openvdb::io::File file(path); + file.open(false); // delayed loading off + + openvdb::GridBase::Ptr base; + if (!gridName.empty()) { + if (!file.hasGrid(gridName)) + throw std::runtime_error("no grid named \"" + gridName + "\" in " + path); + base = file.readGrid(gridName); + } else { + openvdb::GridPtrVecPtr grids = file.getGrids(); + for (auto& g : *grids) { + if (g && g->isType()) { base = g; break; } + } + if (!base) throw std::runtime_error("no openvdb::FloatGrid found in " + path); + } + file.close(); + + auto floatGrid = openvdb::gridPtrCast(base); + if (!floatGrid) throw std::runtime_error("grid is not an openvdb::FloatGrid"); + return floatGrid; +} + +/// NanoVDB conversion products shared across the two passes. +/// - floatHandle : NanoGrid -- tile values + in-leaf inactive values +/// preserved verbatim, used by the scalar reference stencil. +/// - indexHandle : NanoGrid -- the topology-only index grid. +/// - sidecar : float sidecar (slot 0 = background, slots 1..N = active +/// voxel values in NanoVDB indexing order). +struct ConvertedGrids { + nanovdb::GridHandle floatHandle; + nanovdb::GridHandle indexHandle; + std::vector sidecar; +}; + +static ConvertedGrids +convertFloatGrid(openvdb::FloatGrid& floatGrid) +{ + ConvertedGrids out; + + // Direct OpenVDB -> NanoVDB float conversion. No flags needed: + // - tile values at internal nodes are stored directly (mTable); + // - in-leaf inactive voxels share storage with active ones (dense 8^3). + // Both are preserved verbatim from the source grid. + out.floatHandle = nanovdb::tools::createNanoGrid(floatGrid); + + // Sidecar pipeline: build ValueOnIndex + float sidecar via the CreateNanoGrid builder. + nanovdb::tools::CreateNanoGrid builder(floatGrid); + out.indexHandle = builder.template getHandle< + nanovdb::ValueOnIndex, nanovdb::HostBuffer>( + /*channels =*/ 0u, + /*incStats =*/ false, + /*incTiles =*/ false); + out.sidecar.resize(builder.valueCount()); + builder.template copyValues(out.sidecar.data()); + + // NanoVDB convention: slot 0 is the "not-found / background" sentinel. + // copyValues doesn't touch it. Set it to the grid's background so the + // fast path can gather unconditionally (no per-lane branch on idx==0 during fill). 
+ if (!out.sidecar.empty()) out.sidecar[0] = floatGrid.background(); + + return out; +} + +// ============================================================ +// Reference pass -- scalar WenoStencil per active voxel +// ============================================================ +// +// Uses nanovdb::math::WenoStencil>. Its moveTo(ijk) +// populates 19 taps via the float grid's accessor; taps outside the +// narrow band hit either in-leaf inactive slots (stored +-background by +// the OpenVDB narrow-band builder) or tile values at internal nodes +// (same convention). Both cases yield correctly-signed extrapolated +// values, matching the semantics of our explicit extrapolate(). +// +// Cross-path indexing: both outputRef and outputFast are indexed by +// the ValueOnIndex slot of a voxel. For each active ijk we compute +// normSqGrad on the float grid, then resolve its output slot via the +// index grid's accessor. + +static double +runReference(const FloatGridT& floatGrid, + const IndexGridT& indexGrid, + std::vector& outputRef) +{ + std::fill(outputRef.begin(), outputRef.end(), 0.f); + + const uint32_t nLeaves = indexGrid.tree().nodeCount(0); + + std::ostringstream sink; + nanovdb::util::Timer timer; + auto timeIt = [&](auto&& body) -> double { + timer.start("", sink); + body(); + return static_cast(timer.elapsed()); + }; + + return timeIt([&] { + nanovdb::util::forEach(uint32_t(0), nLeaves, uint32_t(1), + [&](const nanovdb::util::Range1D& range) { + // One scalar stencil + one index accessor per TBB task. + nanovdb::math::WenoStencil stencil(floatGrid); + auto indexAcc = indexGrid.getAccessor(); + + const auto* firstFloatLeaf = + floatGrid.tree().template getFirstNode<0>(); + const auto* firstIndexLeaf = + indexGrid.tree().template getFirstNode<0>(); + + for (uint32_t lid = range.begin(); lid != range.end(); ++lid) { + // The two grids share topology, so leaf LID in the + // index grid aligns with leaf LID in the float grid + // (same order of insertion). Iterate the index grid's + // active voxels -- those are the slots we need to fill. 
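+                    // (This topology assumption can be spot-checked in a debug
+                    // build by comparing firstFloatLeaf[lid].origin() against
+                    // firstIndexLeaf[lid].origin(); no such check is needed for
+                    // the computation below.)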
+ const auto& indexLeaf = firstIndexLeaf[lid]; + (void)firstFloatLeaf; // stencil.moveTo uses its own acc + + for (auto it = indexLeaf.beginValueOn(); it; ++it) { + const nanovdb::Coord ijk = it.getCoord(); + stencil.moveTo(ijk); + const float r = stencil.normSqGrad(/*iso=*/0.f); + const uint64_t idx = indexAcc.getValue(ijk); + outputRef[idx] = r; + } + } + }); + }); +} + +// ============================================================ +// Fast pass -- WenoStencil::gatherIndices + WenoStencil compute +// ============================================================ +// +// Structure: +// for each VBM block: +// decodeInverseMaps -> leafIndex[128], voxelOffset[128] +// for each batch of SIMDw voxels: +// fill: scalar scatter from sidecar into raw_values[SIZE][SIMDw] +// via WenoStencil::gatherIndices() per voxel +// load: per-tap SIMD load into stencil.values[] / isActive[] +// extrapolate (sign-fix OOB lanes in-place, Simd) +// normSqGrad -> FloatV +// store: per-lane scalar write to outputFast[blockBase + p] + +static double +runFast(const IndexGridT& indexGrid, + const nanovdb::tools::VoxelBlockManagerHandle& vbmHandle, + const std::vector& sidecar, + std::vector& outputFast) +{ + std::fill(outputFast.begin(), outputFast.end(), 0.f); + + const LeafT* firstLeaf = indexGrid.tree().template getFirstNode<0>(); + const uint32_t nBlocks = (uint32_t)vbmHandle.blockCount(); + const uint32_t* firstLeafID = vbmHandle.hostFirstLeafID(); + const uint64_t* jumpMap = vbmHandle.hostJumpMap(); + const uint64_t firstOffset = vbmHandle.firstOffset(); + + const float dx = float(indexGrid.voxelSize()[0]); + + std::ostringstream sink; + nanovdb::util::Timer timer; + auto timeIt = [&](auto&& body) -> double { + timer.start("", sink); + body(); + return static_cast(timer.elapsed()); + }; + + return timeIt([&] { + nanovdb::util::forEach(size_t(0), size_t(nBlocks), size_t(1), + [&](const nanovdb::util::Range1D& range) { + using FloatV = nanovdb::util::experimental::Simd ; + using MaskV = nanovdb::util::experimental::SimdMask; + using StencilT = nanovdb::WenoStencil; + constexpr int SIZE = StencilT::size(); + + alignas(64) uint32_t leafIndex[BlockWidth]; + alignas(64) uint16_t voxelOffset[BlockWidth]; + + // Caller-owned fill-side scratch -- scalar scatter from the + // sidecar lands here, then a per-tap SIMD load moves the + // data into the stencil's Simd compute view. + alignas(64) float raw_values[SIZE][SIMDw]; + alignas(64) bool raw_active[SIZE][SIMDw]; + + StencilT stencil(dx); + + // One leaf-only ReadAccessor per TBB task; cache stays warm + // across the SIZE getValue calls in WenoStencil::gatherIndices(). 
+ nanovdb::ReadAccessor acc(indexGrid.tree().root()); + + const float* const scIn = sidecar.data(); + float* const scOut = outputFast.data(); + + for (size_t bID = range.begin(); bID != range.end(); ++bID) { + CPUVBM::decodeInverseMaps( + &indexGrid, firstLeafID[bID], + &jumpMap[bID * CPUVBM::JumpMapLength], + firstOffset + bID * BlockWidth, + leafIndex, voxelOffset); + + const uint64_t blockBase = + firstOffset + (uint64_t)bID * BlockWidth; + + for (int batchStart = 0; batchStart < BlockWidth; batchStart += SIMDw) { + // -------- Fill: per-voxel gatherIndices + sidecar lookup -------- + for (int i = 0; i < SIMDw; ++i) { + const int p = batchStart + i; + + if (leafIndex[p] == CPUVBM::UnusedLeafIndex) { + for (int k = 0; k < SIZE; ++k) { + raw_values[k][i] = 0.f; + raw_active[k][i] = false; + } + continue; + } + + const uint16_t vo = voxelOffset[p]; + const uint32_t li = leafIndex[p]; + const nanovdb::Coord cOrigin = firstLeaf[li].origin(); + const int lx = (vo >> 6) & 7, ly = (vo >> 3) & 7, lz = vo & 7; + const nanovdb::Coord center = + cOrigin + nanovdb::Coord(lx, ly, lz); + + uint64_t indices[SIZE]; + nanovdb::WenoStencil::gatherIndices(acc, center, indices); + for (int k = 0; k < SIZE; ++k) { + raw_values[k][i] = scIn[indices[k]]; + raw_active[k][i] = (indices[k] != 0); + } + } + + // -------- Load: per-tap SIMD load into stencil view -------- + for (int k = 0; k < SIZE; ++k) { + stencil.values [k] = FloatV(raw_values[k], nanovdb::util::experimental::element_aligned); + stencil.isActive[k] = MaskV (raw_active[k], nanovdb::util::experimental::element_aligned); + } + + // -------- Phase-3 arithmetic (in-place on Simd values) -------- + stencil.extrapolate(); + const FloatV result = stencil.normSqGrad(/*iso=*/0.f); + + // -------- Simd -> scalar bridge + per-lane store -------- + alignas(64) float result_lanes[SIMDw]; + result.copy_to(result_lanes, nanovdb::util::experimental::element_aligned); + for (int i = 0; i < SIMDw; ++i) { + const int p = batchStart + i; + if (leafIndex[p] == CPUVBM::UnusedLeafIndex) continue; + scOut[blockBase + p] = result_lanes[i]; + } + } + } + }); + }); +} + +// ============================================================ +// Histogram comparison +// ============================================================ +// +// Per-index |outputRef[i] - outputFast[i]| over all active voxels +// (index 1..N; slot 0 is the background/no-op). Log-decade bins +// from 0 to 1e+1 plus a tail bucket for anything >= 10. +// +// Expected shape: the two leftmost bins ([0,1e-8), [1e-8,1e-7)) hold +// the overwhelming majority -- FP-rounding / FMA-fusion differences. +// Anything to the right of [1e-5,1e-4) warrants investigation. 
+ +static void +reportHistogram(const std::vector& outputRef, + const std::vector& outputFast, + uint64_t nActive) +{ + // Bucket edges: 0, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e+0, 1e+1 + static constexpr int nBuckets = 12; + static constexpr double edges[nBuckets + 1] = { + 0.0, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, + 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0 + }; + static const char* labels[nBuckets] = { + "[0, 1e-10)", + "[1e-10, 1e-9 )", + "[1e-9, 1e-8 )", + "[1e-8, 1e-7 )", + "[1e-7, 1e-6 )", + "[1e-6, 1e-5 )", + "[1e-5, 1e-4 )", + "[1e-4, 1e-3 )", + "[1e-3, 1e-2 )", + "[1e-2, 1e-1 )", + "[1e-1, 1.0 )", + "[1.0, 10.0 )" + }; + + std::array counts{}; // last bucket = [10, inf) + double sumDelta = 0.0; + float maxDelta = 0.f; + uint64_t worstIdx = 0; + uint64_t counted = 0; + + for (uint64_t i = 1; i <= nActive; ++i) { + const float d = std::abs(outputRef[i] - outputFast[i]); + ++counted; + sumDelta += double(d); + if (d > maxDelta) { maxDelta = d; worstIdx = i; } + + int b = nBuckets; // overflow bucket + for (int k = 0; k < nBuckets; ++k) { + if (double(d) < edges[k + 1]) { b = k; break; } + } + ++counts[b]; + } + + std::printf("\n|Delta| histogram across %lu active voxels" + " (outputRef vs outputFast):\n", counted); + for (int k = 0; k < nBuckets; ++k) { + const double pct = counted ? 100.0 * double(counts[k]) / double(counted) : 0.0; + std::printf(" %-18s : %12lu (%6.2f%%)\n", + labels[k], counts[k], pct); + } + const double pctTail = counted ? 100.0 * double(counts[nBuckets]) / double(counted) : 0.0; + std::printf(" %-18s : %12lu (%6.2f%%)\n", + "[10.0, inf )", counts[nBuckets], pctTail); + + std::printf("\n max |Delta| = %.6g (at slot %lu:" + " ref=%.6g, fast=%.6g)\n", + double(maxDelta), worstIdx, + double(outputRef[worstIdx]), + double(outputFast[worstIdx])); + std::printf(" mean |Delta| = %.6g\n", + counted ? sumDelta / double(counted) : 0.0); +} + +// ============================================================ +// Entry point +// ============================================================ + +static void printUsage(const char* argv0) +{ + std::cerr + << "Usage: " << argv0 << " " + << " [--grid=] [--threads=]\n" + << "\n" + << " Input OpenVDB file (single FloatGrid narrow-band)\n" + << " --grid= Select grid by name (default: first FloatGrid)\n" + << " --threads= Limit TBB parallelism (0 = TBB default)\n"; +} + +int main(int argc, char** argv) +{ + try { + if (argc < 2 || std::string(argv[1]) == "--help" + || std::string(argv[1]) == "-h") { + printUsage(argv[0]); + return argc < 2 ? 1 : 0; + } + + std::string vdbPath = argv[1]; + std::string gridName = ""; + int nThreads = 0; + + for (int i = 2; i < argc; ++i) { + std::string a = argv[i]; + if (a.rfind("--grid=", 0) == 0) gridName = a.substr(7); + else if (a.rfind("--threads=", 0) == 0) nThreads = std::stoi(a.substr(10)); + else { printUsage(argv[0]); return 1; } + } + + std::cout << "vdb path = " << vdbPath << "\n" + << "grid name = " << (gridName.empty() ? "(first FloatGrid)" : gridName) << "\n" + << "threads = " << (nThreads > 0 ? std::to_string(nThreads) + : std::string("(TBB default)")) << "\n"; + + // ---- Load the .vdb and convert to both NanoVDB representations ---- + openvdb::initialize(); + auto floatGrid = loadFloatGridFromVdb(vdbPath, gridName); + + const auto bbox = floatGrid->evalActiveVoxelBoundingBox(); + const auto vsize = floatGrid->voxelSize(); + std::cout << "FloatGrid:\n" + << " active voxels = " << floatGrid->activeVoxelCount() << "\n" + << " bbox = [" << bbox.min() << " .. 
" << bbox.max() << "]\n" + << " voxel size = " << vsize << "\n" + << " background = " << floatGrid->background() << "\n"; + + auto payload = convertFloatGrid(*floatGrid); + auto* nanoFloatGrid = payload.floatHandle.grid(); + auto* indexGrid = payload.indexHandle.grid(); + if (!nanoFloatGrid || !indexGrid) + throw std::runtime_error("NanoVDB conversion failed"); + + const auto& tree = indexGrid->tree(); + std::cout << "NanoVDB:\n" + << " leaves = " << tree.nodeCount(0) << "\n" + << " active voxels = " << indexGrid->activeVoxelCount() << "\n" + << " sidecar size = " << payload.sidecar.size() << "\n"; + + // ---- VBM for the fast path ---- + auto vbmHandle = nanovdb::tools::buildVoxelBlockManager(indexGrid); + std::cout << "VBM:\n" + << " blocks = " << vbmHandle.blockCount() + << " (BlockWidth=" << BlockWidth << ")\n\n"; + + // ---- TBB thread cap for timings ---- + std::unique_ptr tbbLimit; + if (nThreads > 0) { + tbbLimit = std::make_unique( + tbb::global_control::max_allowed_parallelism, (size_t)nThreads); + } + + // ---- Output buffers ---- + std::vector outputRef (payload.sidecar.size(), 0.f); + std::vector outputFast(payload.sidecar.size(), 0.f); + + // ---- Run both passes (warm + timed) ---- + // Warm pass (ignored) for both, then one timed pass each. + (void)runReference(*nanoFloatGrid, *indexGrid, outputRef); + (void)runFast(*indexGrid, vbmHandle, payload.sidecar, outputFast); + + const double refUs = runReference(*nanoFloatGrid, *indexGrid, outputRef); + const double fastUs = runFast(*indexGrid, vbmHandle, payload.sidecar, outputFast); + + const uint64_t nActive = indexGrid->activeVoxelCount(); + const double refNs = refUs * 1e3 / double(nActive); + const double fastNs = fastUs * 1e3 / double(nActive); + + std::printf("\nEnd-to-end WENO5 |grad phi|^2 (%lu active voxels):\n", nActive); + std::printf(" reference (scalar): %9.1f ms (%7.1f ns/voxel)\n", + refUs / 1e3, refNs); + std::printf(" fast (VBM+SIMD) : %9.1f ms (%7.2f ns/voxel) speedup: %.1fx\n", + fastUs / 1e3, fastNs, refUs / std::max(fastUs, 1.0)); + + // ---- Histogram of discrepancies ---- + reportHistogram(outputRef, outputFast, nActive); + + } catch (const std::exception& e) { + std::cerr << "Exception: " << e.what() << "\n"; + return 1; + } + return 0; +} diff --git a/nanovdb/nanovdb/math/Math.h b/nanovdb/nanovdb/math/Math.h index 19b1beee81..a3ee0df993 100644 --- a/nanovdb/nanovdb/math/Math.h +++ b/nanovdb/nanovdb/math/Math.h @@ -172,6 +172,11 @@ __hostdev__ inline double Max(double a, double b) { return fmax(a, b); } +template +__hostdev__ inline T Select(bool m, T a, T b) +{ + return m ? a : b; +} __hostdev__ inline float Clamp(float x, float a, float b) { return Max(Min(x, b), a); diff --git a/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh b/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh index 02f5180f67..d53839cab3 100644 --- a/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh +++ b/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh @@ -30,6 +30,7 @@ #include #include #include +#include namespace nanovdb { @@ -166,6 +167,144 @@ struct VoxelBlockManager : nanovdb::tools::VoxelBlockManagerBase } } } + + /// @brief Auxiliary type holding the resolved neighbor leaf pointers for + /// the WENO5 stencil. ptrs[axis][0] is the lo neighbor along that axis + /// (nullptr if outside the narrow band), ptrs[axis][1] is always the + /// center leaf, and ptrs[axis][2] is the hi neighbor (nullptr if outside). 
+ template + struct WenoLeafPtrs { + const NanoLeaf* ptrs[3][3]; + }; + + /// @brief Resolve the neighbor leaf pointers needed by computeWenoStencil. + /// Performs exactly one probeLeaf call per axis (three total). Safe to call + /// per-thread; does not synchronize. + /// @tparam BuildT Build type of the grid (must be an index type) + /// @param grid Device-resident grid + /// @param leaf Center leaf node for the current voxel + /// @param voxelOffset Intra-leaf voxel offset for the current voxel + /// @return WenoLeafPtrs with center entries set to &leaf and lo/hi entries + /// set to the probeLeaf result (nullptr if outside the narrow band) + template + __device__ + static typename util::enable_if::is_index, WenoLeafPtrs>::type + resolveWenoLeafPtrs( + const NanoGrid* grid, + const NanoLeaf& leaf, + uint16_t voxelOffset) + { + WenoLeafPtrs result; + const auto coord = leaf.offsetToGlobalCoord(voxelOffset); + const auto localCoord = leaf.OffsetToLocalCoord(voxelOffset); + const auto& tree = grid->tree(); + + for (int axis = 0; axis < 3; ++axis) { + result.ptrs[axis][0] = nullptr; + result.ptrs[axis][1] = &leaf; + result.ptrs[axis][2] = nullptr; + + auto neighborCoord = coord; + neighborCoord[axis] += (localCoord[axis] & 0x4) ? 4 : -4; + result.ptrs[axis][(localCoord[axis] & 0x4) >> 1] = + tree.root().probeLeaf(neighborCoord); + } + return result; + } + + /// @brief Compute global sequential indices for the 19 WENO5 stencil + /// points of the given voxel, using pre-resolved leaf pointers. + /// + /// Output layout follows nanovdb::math::WenoPt::idx. Note that + /// this convention differs from OpenVDB's NineteenPt::idx. + /// + /// Entries for neighbors outside the narrow band are left unchanged; + /// the caller must zero-initialize data[] before calling this function. + /// Does not synchronize; safe to call from divergent threads. + /// + /// The voxelOffset arithmetic uses octal notation to exploit the fact that + /// the NanoVDB leaf layout encodes (x,y,z) as x*64 + y*8 + z, making x, y, + /// and z strides exactly 0100, 010, and 1 in octal respectively. + /// + /// @tparam BuildT Build type of the grid (must be an index type) + /// @param leaf Center leaf node for the current voxel + /// @param voxelOffset Intra-leaf voxel offset for the current voxel + /// @param leafPtrs Resolved neighbor leaf pointers from resolveWenoLeafPtrs + /// @param data Output array of length >= 19, caller-zero-initialized + template + __device__ + static typename util::enable_if::is_index, void>::type + computeWenoStencil( + const NanoLeaf& leaf, + uint16_t voxelOffset, + const WenoLeafPtrs& leafPtrs, + uint64_t* data) + { + using math::WenoPt; + const auto lc = leaf.OffsetToLocalCoord(voxelOffset); + + data[WenoPt< 0, 0, 0>::idx] = leaf.getValue(voxelOffset); + + // x-axis: stride per step = 64 = 0100 octal; cross-leaf wrap = ±8*64 = ±0700 + if (leafPtrs.ptrs[0][(lc.x() + 5) >> 3]) + data[WenoPt<-3, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 5) >> 3]->getValue( + voxelOffset + ((lc[0] < 3) ? 0500 : -0300)); + if (leafPtrs.ptrs[0][(lc.x() + 6) >> 3]) + data[WenoPt<-2, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 6) >> 3]->getValue( + voxelOffset + ((lc[0] < 2) ? 0600 : -0200)); + if (leafPtrs.ptrs[0][(lc.x() + 7) >> 3]) + data[WenoPt<-1, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 7) >> 3]->getValue( + voxelOffset + ((lc[0] < 1) ? 0700 : -0100)); + if (leafPtrs.ptrs[0][(lc.x() + 9) >> 3]) + data[WenoPt< 1, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 9) >> 3]->getValue( + voxelOffset + ((lc[0] < 7) ? 
0100 : -0700)); + if (leafPtrs.ptrs[0][(lc.x() + 10) >> 3]) + data[WenoPt< 2, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 10) >> 3]->getValue( + voxelOffset + ((lc[0] < 6) ? 0200 : -0600)); + if (leafPtrs.ptrs[0][(lc.x() + 11) >> 3]) + data[WenoPt< 3, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 11) >> 3]->getValue( + voxelOffset + ((lc[0] < 5) ? 0300 : -0500)); + + // y-axis: stride per step = 8 = 010 octal; cross-leaf wrap = ±8*8 = ±070 + if (leafPtrs.ptrs[1][(lc.y() + 5) >> 3]) + data[WenoPt< 0,-3, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 5) >> 3]->getValue( + voxelOffset + ((lc[1] < 3) ? 0050 : -0030)); + if (leafPtrs.ptrs[1][(lc.y() + 6) >> 3]) + data[WenoPt< 0,-2, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 6) >> 3]->getValue( + voxelOffset + ((lc[1] < 2) ? 0060 : -0020)); + if (leafPtrs.ptrs[1][(lc.y() + 7) >> 3]) + data[WenoPt< 0,-1, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 7) >> 3]->getValue( + voxelOffset + ((lc[1] < 1) ? 0070 : -0010)); + if (leafPtrs.ptrs[1][(lc.y() + 9) >> 3]) + data[WenoPt< 0, 1, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 9) >> 3]->getValue( + voxelOffset + ((lc[1] < 7) ? 0010 : -0070)); + if (leafPtrs.ptrs[1][(lc.y() + 10) >> 3]) + data[WenoPt< 0, 2, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 10) >> 3]->getValue( + voxelOffset + ((lc[1] < 6) ? 0020 : -0060)); + if (leafPtrs.ptrs[1][(lc.y() + 11) >> 3]) + data[WenoPt< 0, 3, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 11) >> 3]->getValue( + voxelOffset + ((lc[1] < 5) ? 0030 : -0050)); + + // z-axis: stride per step = 1; cross-leaf wrap = ±8 + if (leafPtrs.ptrs[2][(lc.z() + 5) >> 3]) + data[WenoPt< 0, 0,-3>::idx] = leafPtrs.ptrs[2][(lc.z() + 5) >> 3]->getValue( + voxelOffset + ((lc[2] < 3) ? 0005 : -0003)); + if (leafPtrs.ptrs[2][(lc.z() + 6) >> 3]) + data[WenoPt< 0, 0,-2>::idx] = leafPtrs.ptrs[2][(lc.z() + 6) >> 3]->getValue( + voxelOffset + ((lc[2] < 2) ? 0006 : -0002)); + if (leafPtrs.ptrs[2][(lc.z() + 7) >> 3]) + data[WenoPt< 0, 0,-1>::idx] = leafPtrs.ptrs[2][(lc.z() + 7) >> 3]->getValue( + voxelOffset + ((lc[2] < 1) ? 0007 : -0001)); + if (leafPtrs.ptrs[2][(lc.z() + 9) >> 3]) + data[WenoPt< 0, 0, 1>::idx] = leafPtrs.ptrs[2][(lc.z() + 9) >> 3]->getValue( + voxelOffset + ((lc[2] < 7) ? 0001 : -0007)); + if (leafPtrs.ptrs[2][(lc.z() + 10) >> 3]) + data[WenoPt< 0, 0, 2>::idx] = leafPtrs.ptrs[2][(lc.z() + 10) >> 3]->getValue( + voxelOffset + ((lc[2] < 6) ? 0002 : -0006)); + if (leafPtrs.ptrs[2][(lc.z() + 11) >> 3]) + data[WenoPt< 0, 0, 3>::idx] = leafPtrs.ptrs[2][(lc.z() + 11) >> 3]->getValue( + voxelOffset + ((lc[2] < 5) ? 0003 : -0005)); + } }; /// @brief This functor calculates the firstLeafID and jumpMap for the diff --git a/nanovdb/nanovdb/util/ForEach.h b/nanovdb/nanovdb/util/ForEach.h index d71769c5ab..6ede538a10 100644 --- a/nanovdb/nanovdb/util/ForEach.h +++ b/nanovdb/nanovdb/util/ForEach.h @@ -45,7 +45,7 @@ inline void forEach(RangeT range, const FuncT &func) #ifdef NANOVDB_USE_TBB tbb::parallel_for(range, func); #else// naive and likely slow alternative based on std::thread - if (const size_t threadCount = std::thread::hardware_concurrency()>>1) { + if (const size_t threadCount = std::thread::hardware_concurrency()) { std::vector rangePool{ range }; while(rangePool.size() < threadCount) { const size_t oldSize = rangePool.size(); diff --git a/nanovdb/nanovdb/util/Simd.h b/nanovdb/nanovdb/util/Simd.h new file mode 100644 index 0000000000..363a307fac --- /dev/null +++ b/nanovdb/nanovdb/util/Simd.h @@ -0,0 +1,297 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! 
+    \file nanovdb/util/Simd.h
+
+    \author Efty Sifakis
+
+    \date April 28, 2026
+
+    \brief Minimal SIMD abstraction for NanoVDB stencil kernels.
+*/
+
+#ifndef NANOVDB_UTIL_SIMD_H_HAS_BEEN_INCLUDED
+#define NANOVDB_UTIL_SIMD_H_HAS_BEEN_INCLUDED
+
+#include <array>
+
+#include <nanovdb/util/Util.h> // __hostdev__
+
+// Two implementations, selected automatically at compile time:
+//
+//   stdx backend (default, when <experimental/simd> is available):
+//     Simd<T, W> and SimdMask<T, W> are aliases for
+//     std::experimental::fixed_size_simd<T, W> / fixed_size_simd_mask<T, W>.
+//     Internal flag: NANOVDB_USE_STDX_SIMD.
+//
+//   std::array backend (forced via NANOVDB_SIMD_ARRAY_BACKEND, or when
+//   <experimental/simd> is unavailable):
+//     Simd<T, W> wraps std::array<T, W> with element-wise operator loops.
+//
+// The interface is identical in both cases, so templated kernels
+// (T=float for GPU, T=Simd<float, W> for CPU) compile unmodified.
+
+// ---------------------------------------------------------------------------
+// Auto-detect std::experimental::simd (Parallelism TS v2)
+// ---------------------------------------------------------------------------
+#if !defined(NANOVDB_SIMD_ARRAY_BACKEND) && defined(__has_include) && __has_include(<experimental/simd>)
+#   include <experimental/simd>
+#   ifdef __cpp_lib_experimental_parallel_simd
+#       define NANOVDB_USE_STDX_SIMD 1
+#   endif
+#endif
+
+namespace nanovdb {
+namespace util {
+
+// ===========================================================================
+// nanovdb::util::experimental -- internal SIMD primitives. Names in this
+// nested namespace are unstable API by convention; external callers should
+// not depend on them.
+// ===========================================================================
+namespace experimental {
+
+// element_aligned_tag -- load/store alignment descriptor. Aliases
+// stdx::element_aligned_tag in the stdx backend; empty struct in the
+// array backend.
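+//
+// Example (sketch only): loads, arithmetic, and stores are spelled the same
+// way in either backend, e.g.
+//
+//   using FloatV = nanovdb::util::experimental::Simd<float, 16>;
+//   alignas(64) float in[16], out[16];
+//   FloatV v(in, nanovdb::util::experimental::element_aligned);          // load
+//   (v * v).copy_to(out, nanovdb::util::experimental::element_aligned);  // store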
+#ifdef NANOVDB_USE_STDX_SIMD +namespace stdx = std::experimental; +using element_aligned_tag = stdx::element_aligned_tag; +#else +struct element_aligned_tag {}; +#endif +inline constexpr element_aligned_tag element_aligned{}; + +// =========================================================================== +// Implementation A: std::experimental::simd -- pure type aliases +// =========================================================================== +#ifdef NANOVDB_USE_STDX_SIMD + +template +using SimdMask = stdx::fixed_size_simd_mask; + +template +using Simd = stdx::fixed_size_simd; + +// =========================================================================== +// Implementation B: std::array backend (default) +// =========================================================================== +#else + +template struct Simd; // fwd-decl so SimdMask::simd_type can name it + +template +struct SimdMask { + using value_type = bool; + using simd_type = Simd; + static constexpr size_t size() { return size_t(W); } + + std::array data{}; + SimdMask() = default; + __hostdev__ SimdMask(bool b) { data.fill(b); } // broadcast + __hostdev__ explicit SimdMask(const bool* p, element_aligned_tag) { // load (ctor) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ bool operator[](int i) const { return data[i]; } + __hostdev__ bool& operator[](int i) { return data[i]; } + __hostdev__ void copy_from(const bool* p, element_aligned_tag) { // load (member) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ void copy_to(bool* p, element_aligned_tag) const { // store (member) + for (int i = 0; i < W; i++) p[i] = data[i]; + } + __hostdev__ SimdMask operator!() const { + SimdMask r; for (int i = 0; i < W; i++) r.data[i] = !data[i]; return r; + } + __hostdev__ SimdMask operator&(SimdMask o) const { + SimdMask r; for (int i = 0; i < W; i++) r.data[i] = data[i] && o.data[i]; return r; + } + __hostdev__ SimdMask operator|(SimdMask o) const { + SimdMask r; for (int i = 0; i < W; i++) r.data[i] = data[i] || o.data[i]; return r; + } +}; + +template +struct Simd { + using value_type = T; + using mask_type = SimdMask; + static constexpr size_t size() { return size_t(W); } + + std::array data{}; + + Simd() = default; + __hostdev__ Simd(T scalar) { data.fill(scalar); } // broadcast + __hostdev__ explicit Simd(const T* p, element_aligned_tag) { // load (ctor) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ T operator[](int i) const { return data[i]; } + __hostdev__ T& operator[](int i) { return data[i]; } + __hostdev__ void copy_from(const T* p, element_aligned_tag) { // load (member) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ void copy_to(T* p, element_aligned_tag) const { // store (member) + for (int i = 0; i < W; i++) p[i] = data[i]; + } + __hostdev__ Simd operator-() const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = -data[i]; return r; + } + __hostdev__ Simd operator+(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] + o.data[i]; return r; + } + __hostdev__ Simd operator-(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] - o.data[i]; return r; + } + __hostdev__ Simd operator*(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] * o.data[i]; return r; + } + __hostdev__ Simd operator/(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] / o.data[i]; return r; + } + __hostdev__ Simd& operator+=(Simd o) { + for (int i = 0; i < W; i++) data[i] += o.data[i]; return 
*this; + } + __hostdev__ Simd& operator-=(Simd o) { + for (int i = 0; i < W; i++) data[i] -= o.data[i]; return *this; + } + __hostdev__ Simd& operator*=(Simd o) { + for (int i = 0; i < W; i++) data[i] *= o.data[i]; return *this; + } + __hostdev__ Simd& operator/=(Simd o) { + for (int i = 0; i < W; i++) data[i] /= o.data[i]; return *this; + } + __hostdev__ SimdMask operator<(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] < o.data[i]; return m; + } + __hostdev__ SimdMask operator<=(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] <= o.data[i]; return m; + } + __hostdev__ SimdMask operator>(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] > o.data[i]; return m; + } + __hostdev__ SimdMask operator>=(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] >= o.data[i]; return m; + } + __hostdev__ SimdMask operator==(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] == o.data[i]; return m; + } + __hostdev__ SimdMask operator!=(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] != o.data[i]; return m; + } + // Bitwise and shift operators -- valid for integer element types. + __hostdev__ Simd operator|(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] | o.data[i]; return r; + } + __hostdev__ Simd operator&(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] & o.data[i]; return r; + } + __hostdev__ Simd operator^(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] ^ o.data[i]; return r; + } + // Per-lane variable shift (shift count from corresponding lane of o). + __hostdev__ Simd operator<<(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] << o.data[i]; return r; + } + __hostdev__ Simd operator>>(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] >> o.data[i]; return r; + } + // Uniform shift: all lanes shifted by the same scalar count (vpsllw imm8 / vpsrlw imm8). + __hostdev__ Simd operator<<(T shift) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] << shift; return r; + } + __hostdev__ Simd operator>>(T shift) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] >> shift; return r; + } +}; + +template __hostdev__ +Simd operator+(T a, Simd b) { return Simd(a) + b; } +template __hostdev__ +Simd operator+(Simd a, T b) { return a + Simd(b); } +template __hostdev__ +Simd operator-(T a, Simd b) { return Simd(a) - b; } +template __hostdev__ +Simd operator-(Simd a, T b) { return a - Simd(b); } +template __hostdev__ +Simd operator*(T a, Simd b) { return Simd(a) * b; } +template __hostdev__ +Simd operator*(Simd a, T b) { return a * Simd(b); } +template __hostdev__ +Simd operator/(T a, Simd b) { return Simd(a) / b; } +template __hostdev__ +Simd operator/(Simd a, T b) { return a / Simd(b); } + +#endif // NANOVDB_USE_STDX_SIMD + +} // namespace experimental +} // namespace util + +// --------------------------------------------------------------------------- +// nanovdb::math::Min / Max / Select / Sqrt -- Simd overloads. Scalar +// overloads live in nanovdb/math/Math.h; defining the SIMD overloads here +// avoids a Math.h -> Simd.h dependency. 
+// --------------------------------------------------------------------------- +namespace math { +#ifdef NANOVDB_USE_STDX_SIMD +template +NANOVDB_FORCEINLINE util::experimental::Simd +Min(util::experimental::Simd a, util::experimental::Simd b) { + return std::experimental::min(a, b); +} +template +NANOVDB_FORCEINLINE util::experimental::Simd +Max(util::experimental::Simd a, util::experimental::Simd b) { + return std::experimental::max(a, b); +} +// 3-arg Select(mask, a, b): mask[i] ? a[i] : b[i], via TS v2's where() proxy. +template +NANOVDB_FORCEINLINE util::experimental::Simd +Select(util::experimental::SimdMask mask, + util::experimental::Simd a, + util::experimental::Simd b) { + auto result = b; + util::experimental::stdx::where(mask, result) = a; + return result; +} +template +NANOVDB_FORCEINLINE util::experimental::Simd +Sqrt(util::experimental::Simd a) { + return std::experimental::sqrt(a); +} +#else +template +__hostdev__ util::experimental::Simd +Min(util::experimental::Simd a, util::experimental::Simd b) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = a[i] < b[i] ? a[i] : b[i]; + return r; +} +template +__hostdev__ util::experimental::Simd +Max(util::experimental::Simd a, util::experimental::Simd b) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = a[i] > b[i] ? a[i] : b[i]; + return r; +} +template +__hostdev__ util::experimental::Simd +Select(util::experimental::SimdMask mask, + util::experimental::Simd a, + util::experimental::Simd b) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = mask[i] ? a[i] : b[i]; + return r; +} +template +__hostdev__ util::experimental::Simd +Sqrt(util::experimental::Simd a) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = Sqrt(a[i]); + return r; +} +#endif +} // namespace math + +} // namespace nanovdb + +#endif // end of NANOVDB_UTIL_SIMD_H_HAS_BEEN_INCLUDED diff --git a/nanovdb/nanovdb/util/Util.h b/nanovdb/nanovdb/util/Util.h index bdff640a97..4040e9bfd0 100644 --- a/nanovdb/nanovdb/util/Util.h +++ b/nanovdb/nanovdb/util/Util.h @@ -87,15 +87,22 @@ typedef unsigned long long uint64_t; #endif // if defined(__CUDACC__) || defined(__HIP__) -// NANOVDB_RESTRICT: cross-compiler no-alias hint for pointer parameters. -// GCC and Clang (including NVCC host compilation) spell it __restrict__, -// MSVC spells it __restrict. +// NANOVDB_RESTRICT: cross-compiler no-alias hint for pointer parameters #if defined(_MSC_VER) #define NANOVDB_RESTRICT __restrict #else #define NANOVDB_RESTRICT __restrict__ #endif +// NANOVDB_FORCEINLINE: force inlining at the call site +#if defined(_MSC_VER) +#define NANOVDB_FORCEINLINE __forceinline +#elif defined(__GNUC__) || defined(__clang__) +#define NANOVDB_FORCEINLINE inline __attribute__((always_inline)) +#else +#define NANOVDB_FORCEINLINE inline +#endif + // The following macro will suppress annoying warnings when nvcc // compiles functions that call (host) intrinsics (which is perfectly valid) #if defined(_MSC_VER) && defined(__CUDACC__) diff --git a/nanovdb/nanovdb/util/WenoStencil.h b/nanovdb/nanovdb/util/WenoStencil.h new file mode 100644 index 0000000000..5f5ea7fc58 --- /dev/null +++ b/nanovdb/nanovdb/util/WenoStencil.h @@ -0,0 +1,364 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file WenoStencil.h + + \brief 19-point WENO5 stencil value container + out-of-band extrapolation + + fifth-order upwind Godunov's norm-square gradient. Templated on + a ValueType. 
+ + `WenoStencil` holds the per-point values and activity flags. + ValueType is typically a raw scalar `float` (e.g. for scalar / CUDA code) + or a SIMD vector `Simd` (CPU batch path). The companion mask + type is auto-deduced (bool for raw scalars, ValueType::mask_type + otherwise): + + ValueType values [19] + MaskType isActive[19] + + For raw scalars the class reads as plain scalar arithmetic; for SIMD + vectors the same source compiles to whole-vector ops via Simd.h. The + class is pure-compute -- the caller owns any fill-side C-array storage + and the per-point load step. + + Grid-spacing scalars `mDx2` and `mInvDx2` stay scalar `float` and are + broadcast to ValueType at the point of use. + + Operations provided: + - extrapolate(absBackground) + Repair out-of-band lanes (isActive[k] == false) via + copysign(absBackground, values[innerPoint]), processed in + ascending-|d| order so the inner point is already resolved. + - normSqGrad(isoValue = 0) + Godunov's norm-square of the fifth-order WENO upwind gradient. +*/ + +#pragma once + +#include +#include // Pow2 + +#include +#include +#include +#include + +namespace nanovdb { + +namespace detail { + +// --------------------------------------------------------------------------- +// mask_of::type -- auto-deduced predicate type for ValueType T: +// - T::mask_type if T has a nested mask_type (e.g. Simd); +// - bool otherwise (e.g. raw scalar T=float). +// --------------------------------------------------------------------------- +template +struct mask_of { using type = bool; }; + +template +struct mask_of> { + using type = typename T::mask_type; +}; + +// --------------------------------------------------------------------------- +// Generic WENO5 reconstruction -- templated on ValueType in {float, +// Simd}. Nominally fifth-order finite-difference WENO (Chi-Wang +// Shu, "High Order Finite Difference and Finite Volume WENO Schemes and +// Discontinuous Galerkin Methods for CFD", ICASE Report No 2001-11 page 6; +// see also ICASE 97-65 for a more complete reference, Shu 1997). +// +// Given v1=f(x-2dx), v2=f(x-dx), v3=f(x), v4=f(x+dx), v5=f(x+2dx), returns +// an interpolated f(x+dx/2) with the property that +// (f(x+dx/2)-f(x-dx/2))/dx = df/dx(x) + error, with error fifth-order in +// smooth regions: O(dx) <= error <= O(dx^5). +// +// Body uses only primitives common to scalar and Simd ValueType +// (operator+/-/*, math::Pow2), so the same source compiles in both modes +// via the Simd backend in Simd.h. Integer coefficients carry explicit +// ValueType(...) casts for SIMD typed-operator dispatch, and float +// literals carry .f suffix because stdx::simd's broadcast ctor rejects +// double->float narrowing. +// +// ScalarType (defaults to float, deduced from scale2 if specified) is the +// arithmetic precision of the reference-magnitude epsilon scaling. scale2 +// stays scalar so callers can pass plain float/double grid constants +// without broadcasting. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE ValueType +WENO5(const ValueType& v1, const ValueType& v2, const ValueType& v3, + const ValueType& v4, const ValueType& v5, + ScalarType scale2 = ScalarType(1)) // openvdb uses scale2 = 0.01 +{ + using math::Pow2; + const ValueType C = ValueType(13.f / 12.f); + // WENO is formulated for non-dimensional equations, here the optional scale2 + // is a reference value (squared) for the function being interpolated. 
For + // example if 'v' is of order 1000, then scale2 = 10^6 is ok. But in practice + // leave scale2 = 1. + const ValueType eps = ValueType(ScalarType(1.e-6) * scale2); + // {\tilde \omega_k} = \gamma_k / (\beta_k + \epsilon)^2 in Shu's ICASE report. + const ValueType A1 = ValueType(0.1f) / Pow2( + C * Pow2(v1 - ValueType(2)*v2 + v3) + + ValueType(0.25f) * Pow2(v1 - ValueType(4)*v2 + ValueType(3)*v3) + eps); + const ValueType A2 = ValueType(0.6f) / Pow2( + C * Pow2(v2 - ValueType(2)*v3 + v4) + + ValueType(0.25f) * Pow2(v2 - v4) + eps); + const ValueType A3 = ValueType(0.3f) / Pow2( + C * Pow2(v3 - ValueType(2)*v4 + v5) + + ValueType(0.25f) * Pow2(ValueType(3)*v3 - ValueType(4)*v4 + v5) + eps); + + return (A1 * (ValueType(2)*v1 - ValueType(7)*v2 + ValueType(11)*v3) + + A2 * (ValueType(5)*v3 - v2 + ValueType( 2)*v4) + + A3 * (ValueType(2)*v3 + ValueType(5)*v4 - v5)) + / (ValueType(6) * (A1 + A2 + A3)); +} + +// --------------------------------------------------------------------------- +// Generic Godunov's norm-square gradient -- templated on ValueType +// in {float, Simd} and a companion MaskType (bool for raw scalar, +// ValueType::mask_type for Simd). Differs from a textbook scalar form in +// shape: instead of an if/else on isOutside we compute both branches +// unconditionally and blend via math::Select, so the SIMD path has no +// control-flow divergence across lanes. At ValueType=float the scalar +// math::Select(bool, ValueType, ValueType) overload reduces this to the +// equivalent if/else semantics. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE ValueType +GodunovsNormSqrd(MaskType isOutside, + ValueType dP_xm, ValueType dP_xp, + ValueType dP_ym, ValueType dP_yp, + ValueType dP_zm, ValueType dP_zp) +{ + using math::Min; using math::Max; using math::Pow2; using math::Select; + const ValueType zero(0.f); + + const ValueType outside = Max(Pow2(Max(dP_xm, zero)), Pow2(Min(dP_xp, zero))) // (dP/dx)^2 + + Max(Pow2(Max(dP_ym, zero)), Pow2(Min(dP_yp, zero))) // (dP/dy)^2 + + Max(Pow2(Max(dP_zm, zero)), Pow2(Min(dP_zp, zero))); // (dP/dz)^2 + + const ValueType inside = Max(Pow2(Min(dP_xm, zero)), Pow2(Max(dP_xp, zero))) + + Max(Pow2(Min(dP_ym, zero)), Pow2(Max(dP_yp, zero))) + + Max(Pow2(Min(dP_zm, zero)), Pow2(Max(dP_zp, zero))); + + return Select(isOutside, outside, inside); // |\nabla\phi|^2 +} + +} // namespace detail + +// --------------------------------------------------------------------------- +// WenoStencil -- pure-compute container for a 19-point WENO5 +// stencil state. Holds ValueType-typed values + MaskType-typed activity +// flags + scalar grid constants. Fill-side responsibility (scalar writes +// into any raw buffers, followed by a per-point load into this stencil's +// values[] / isActive[]) lives in the caller. See WenoStencil.md +// for usage patterns. +// --------------------------------------------------------------------------- +template +class WenoStencil +{ +public: + using MaskType = typename detail::mask_of::type; + + // --- Stencil-point offset types (compile-time only) ------------------- + // StencilPoint carries the offset as a type. StencilPoints is + // the 19-point tuple in the canonical idx ordering: + // idx 0 : center < 0, 0, 0> + // idx 1.. 
6 : x-axis <-3,0,0> <-2,0,0> <-1,0,0> <+1,0,0> <+2,0,0> <+3,0,0> + // idx 7..12 : y-axis <0,-3,0> <0,-2,0> <0,-1,0> <0,+1,0> <0,+2,0> <0,+3,0> + // idx 13..18 : z-axis <0,0,-3> <0,0,-2> <0,0,-1> <0,0,+1> <0,0,+2> <0,0,+3> + template + struct StencilPoint { + static constexpr int di = i, dj = j, dk = k; + }; + + using StencilPoints = std::tuple< + StencilPoint< 0, 0, 0>, + StencilPoint<-3, 0, 0>, StencilPoint<-2, 0, 0>, StencilPoint<-1, 0, 0>, + StencilPoint<+1, 0, 0>, StencilPoint<+2, 0, 0>, StencilPoint<+3, 0, 0>, + StencilPoint< 0,-3, 0>, StencilPoint< 0,-2, 0>, StencilPoint< 0,-1, 0>, + StencilPoint< 0,+1, 0>, StencilPoint< 0,+2, 0>, StencilPoint< 0,+3, 0>, + StencilPoint< 0, 0,-3>, StencilPoint< 0, 0,-2>, StencilPoint< 0, 0,-1>, + StencilPoint< 0, 0,+1>, StencilPoint< 0, 0,+2>, StencilPoint< 0, 0,+3> + >; + + static constexpr int SIZE = int(std::tuple_size_v); + static constexpr int size() { return SIZE; } + + // Compute-side storage. At ValueType=float these are plain float / bool + // arrays; at ValueType=Simd they are whole-vector arrays. + ValueType values [SIZE]; + MaskType isActive[SIZE]; + + // Runtime grid-spacing constants -- plain scalars regardless of ValueType, + // broadcast inside normSqGrad(). + float mDx2{1.f}; // dx^2 -- fed to WENO5's epsilon via scale2 + float mInvDx2{1.f}; // 1 / dx^2 -- final normalisation in normSqGrad + + __hostdev__ WenoStencil() = default; + __hostdev__ explicit WenoStencil(float dx) + : mDx2(dx * dx), mInvDx2(1.f / (dx * dx)) {} + + // Compile-time named-point access: returns the index of point (i,j,k) + // in the StencilPoints tuple. + template + static constexpr int pointIndex() + { + constexpr int I = findPoint(std::make_index_sequence{}); + static_assert(I >= 0, "WenoStencil::pointIndex: point not in stencil"); + return I; + } + + // Resolve all SIZE stencil-point indices for the voxel at @a center via + // Acc::getValue(center + offset). Indices land in out[0..SIZE-1] in the + // StencilPoints tuple ordering. Acc is any NanoVDB accessor whose + // getValue() returns a value convertible to uint64_t (e.g. ValueOnIndex's + // sequential active-voxel indices). The accessor's path cache is reused + // across the SIZE getValue calls. + template + static void gatherIndices(Acc& acc, const Coord& center, uint64_t* out) + { + gatherIndicesImpl(acc, center, out, std::make_index_sequence{}); + } + + // ------------------------------------------------------------------ + // extrapolate -- sign-correct out-of-band lanes (isActive[k][i] == false) + // of values[k] by multiplying with Sign(values[innerPoint][i]). Active + // lanes are preserved. Center point (idx 0) is assumed always in-band + // and is not processed. + // + // Convention: the caller must pre-load inactive lanes of values[k] with + // the sidecar slot-0 background value (which the standard NanoVDB fill + // pattern produces automatically: a missing tap resolves to index 0, + // and sidecar[0] is the background). This routine then flips the sign + // when the parent (innerPoint) is negative, leaves it alone when the + // parent is positive, and zeros the lane when the parent is exactly 0. + // + // Processes 18 (point, innerPoint) pairs in ascending-|d| order so the + // inner point is already resolved when the outer point is touched; + // sign-inheritance through |d|=1 -> |d|=2 -> |d|=3 is automatic. 
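+    //
+    // Worked lane example (background magnitude assumed to be 0.3): with a
+    // center value of -0.05 and the <-1,0,0> and <-2,0,0> points inactive
+    // (hence pre-loaded with +0.3),
+    //   pair (<-1,0,0>, center  ) : Sign(-0.05) -> point becomes -0.3
+    //   pair (<-2,0,0>, <-1,0,0>) : Sign(-0.30) -> point becomes -0.3
+    // so out-of-band points inherit the sign of the nearest in-band value.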
+ // ------------------------------------------------------------------ + __hostdev__ NANOVDB_FORCEINLINE void extrapolate(); + + // ------------------------------------------------------------------ + // normSqGrad -- Godunov's norm-square of the fifth-order WENO upwind + // gradient at the stencil center. Returns |\nabla\phi|^2. + // + // Six axial WENO5 reconstructions (one pair +/-x, +/-y, +/-z), then + // Godunov's upwind combinator driven by the sign of (center - iso). + // + // Call only after the stencil has been populated (see usage pattern in + // WenoStencil.md). extrapolate() before normSqGrad() is the + // typical pipeline shape but is not required by this method. + // ------------------------------------------------------------------ + __hostdev__ NANOVDB_FORCEINLINE ValueType normSqGrad(float iso = 0.f) const; + +private: + // Compile-time inverse map: (i,j,k) -> slot index in StencilPoints. + // Returns -1 if no matching point exists; pointIndex() turns that into + // a static_assert. + template + static constexpr int findPoint(std::index_sequence) + { + int result = -1; + ((std::tuple_element_t::di == i && + std::tuple_element_t::dj == j && + std::tuple_element_t::dk == k && + result < 0 ? (result = int(Is)) : 0), ...); + return result; + } + + // Parameter-pack expansion driving gatherIndices(): unrolls SIZE getValue + // calls into a single fold expression. + template + static void gatherIndicesImpl(Acc& acc, const Coord& center, uint64_t* out, + std::index_sequence) + { + ((out[Is] = static_cast(acc.getValue(center + Coord( + std::tuple_element_t::di, + std::tuple_element_t::dj, + std::tuple_element_t::dk)))), ...); + } + + // Hardcoded (point, innerPoint) pairs for the 19-point StencilPoints + // tuple, ordered by ascending |d| so the inner point is always already + // resolved when the outer point is processed. Indices match the + // StencilPoints tuple above. + // + // idx 0 : center ( 0, 0, 0) + // idx 1.. 6 : x-axis (-3..+3 in the order -3,-2,-1,+1,+2,+3) + // idx 7..12 : y-axis (-3..+3) + // idx 13..18 : z-axis (-3..+3) + static constexpr int kNumPairs = 18; + static constexpr int kPairs[kNumPairs][2] = { + // |d|=1 (inner point = center, idx 0) + { 3, 0}, { 4, 0}, // x: -1, +1 + { 9, 0}, {10, 0}, // y: -1, +1 + {15, 0}, {16, 0}, // z: -1, +1 + // |d|=2 (inner point = |d|=1 on same axis) + { 2, 3}, { 5, 4}, // x: -2 <- (-1), +2 <- (+1) + { 8, 9}, {11, 10}, // y + {14, 15}, {17, 16}, // z + // |d|=3 (inner point = |d|=2 on same axis) + { 1, 2}, { 6, 5}, // x: -3 <- (-2), +3 <- (+2) + { 7, 8}, {12, 11}, // y + {13, 14}, {18, 17} // z + }; +}; + +// --------------------------------------------------------------------------- +// extrapolate -- single-source implementation. At ValueType=float this is +// scalar code; at ValueType=Simd the same source compiles to +// whole-SIMD blends via the math::Select dispatch. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE void +WenoStencil::extrapolate() +{ + const ValueType zero(0.f); + + for (int p = 0; p < kNumPairs; ++p) { + const int k = kPairs[p][0]; + const int kInner = kPairs[p][1]; + + // values[k] *= Sign(values[kInner]): + // parent > 0 -> values[k] (already pre-loaded with +background); + // parent < 0 -> -values[k]; + // parent == 0 -> 0. 
+ const MaskType isPosParent = values[kInner] > zero; + const MaskType isNegParent = values[kInner] < zero; + const ValueType signed_k = math::Select(isPosParent, values[k], + math::Select(isNegParent, -values[k], zero)); + + // Active lanes keep their own value; inactive lanes get the sign-corrected background. + values[k] = math::Select(isActive[k], values[k], signed_k); + } +} + +// --------------------------------------------------------------------------- +// normSqGrad -- Godunov's upwind WENO norm-square gradient. Six axial +// WENO5 reconstructions drive GodunovsNormSqrd; point indices 0..18 follow +// the StencilPoints tuple ordering above. mInvDx2 and iso are broadcast to +// ValueType at the final combinator. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE ValueType +WenoStencil::normSqGrad(float iso) const +{ + const ValueType* v = values; + + const ValueType dP_xm = detail::WENO5(v[ 2]-v[ 1], v[ 3]-v[ 2], v[ 0]-v[ 3], v[ 4]-v[ 0], v[ 5]-v[ 4], mDx2); + const ValueType dP_xp = detail::WENO5(v[ 6]-v[ 5], v[ 5]-v[ 4], v[ 4]-v[ 0], v[ 0]-v[ 3], v[ 3]-v[ 2], mDx2); + const ValueType dP_ym = detail::WENO5(v[ 8]-v[ 7], v[ 9]-v[ 8], v[ 0]-v[ 9], v[10]-v[ 0], v[11]-v[10], mDx2); + const ValueType dP_yp = detail::WENO5(v[12]-v[11], v[11]-v[10], v[10]-v[ 0], v[ 0]-v[ 9], v[ 9]-v[ 8], mDx2); + const ValueType dP_zm = detail::WENO5(v[14]-v[13], v[15]-v[14], v[ 0]-v[15], v[16]-v[ 0], v[17]-v[16], mDx2); + const ValueType dP_zp = detail::WENO5(v[18]-v[17], v[17]-v[16], v[16]-v[ 0], v[ 0]-v[15], v[15]-v[14], mDx2); + + return ValueType(mInvDx2) * + detail::GodunovsNormSqrd(v[0] > ValueType(iso), + dP_xm, dP_xp, dP_ym, dP_yp, dP_zm, dP_zp); +} + +} // namespace nanovdb diff --git a/nanovdb/nanovdb/util/WenoStencil.md b/nanovdb/nanovdb/util/WenoStencil.md new file mode 100644 index 0000000000..cedd05f63a --- /dev/null +++ b/nanovdb/nanovdb/util/WenoStencil.md @@ -0,0 +1,434 @@ +# WenoStencil — Single-Source Scalar/SIMD WENO5 Stencil Value Container + +Design reference for `nanovdb/nanovdb/util/WenoStencil.h`. Captures the +rationale behind the templated-on-lane-width class, the out-of-band +extrapolation algorithm, and the Godunov norm-square-gradient method. + +--- + +## 1. Motivation + +The WENO5 CPU pipeline assembles, per voxel batch, a 19-tap value +matrix with per-tap activity flags: + +``` +float values[Ntaps][W] -- real sidecar value, or background for OOB lanes +bool isActive[Ntaps][W] -- true iff the tap voxel is in the narrow band +``` + +Downstream phases are (1) **extrapolation** to repair out-of-band lanes +with sign-corrected background, and (2) **WENO5 arithmetic** (the fifth- +order upwind Godunov norm-square-gradient) to produce a per-voxel +`|∇φ|²` scalar. + +`WenoStencil` encapsulates the compute state and both operations: + +1. Storage of the 19-tap × W-lane Simd values + activity masks. +2. `extrapolate()` — ascending-|Δ| cascade, hardcoded tap pairs. +3. `normSqGrad()` — six axial WENO5 reconstructions + Godunov combinator. +4. A single-source pattern that keeps GPU (W=1, per-thread) code + textually identical to CPU SIMD (W>1) code — only the compile-time + lane width changes. + +--- + +## 2. 
Single-source scalar/SIMD design + +### 2.1 `Simd` as the storage type + +Storage is `Simd` / `SimdMask` directly — *not* raw +`float[W]` / `bool[W]` arrays: + +```cpp +template +class WenoStencil +{ +public: + using FloatV = util::Simd ; + using MaskV = util::SimdMask; + + FloatV values [SIZE]; + MaskV isActive[SIZE]; + + float mDx2{1.f}; // dx² — scalar, broadcast on use + float mInvDx2{1.f}; // 1 / dx² — scalar, broadcast on use + /* ... extrapolate(), normSqGrad() ... */ +}; +``` + +At W=1 the Simd types collapse to plain `float` / `bool` under the +array backend. At W>1 they are the backend-native SIMD type (stdx or +array wrapper). Memory layout at W>1 is identical to `float[SIZE][W]` +directly, so there is no storage-cost penalty — just a cleaner +compute-side type. + +### 2.2 Why first-class Simd storage (vs raw C arrays + `addr()` bridge) + +An earlier version used `std::conditional_t` as +the element type, with an `addr()` helper to normalize the W=1 scalar- +reference vs W>1 array-decay at every SIMD load/store site. That +design was rejected in favour of Simd-typed storage for three reasons: + +- **W=1 ceremony**. Approach 1 forced the scalar case to read + `FloatV val(addr(mValues[k]), element_aligned)` — loading a + `Simd` from a scalar reference. On the CUDA per-thread + path (where the scalar case is the *production* pipeline, not a + degenerate convenience) this ceremony survives into every + `__hostdev__` method that reads the stencil. Under Simd-typed + storage, W=1 reads `FloatV val = values[k]` — pure scalar code. + +- **Arithmetic boundary ceremony**. Approach 1 made `extrapolate()` + and a prospective `normSqGrad()` bracket every read/write with an + explicit Simd load or store. With Simd-typed storage, the + arithmetic reads as if scalar (`values[k] = util::where(...)`) and + the load/store boundary moves out to the caller where the Simd + values meet raw fill-side buffers. + +- **Symmetric, explicit boundary placement**. The caller already owns + the fill-side scalar-scatter loop (sidecar `sidecar[idx]` gathers are + inherently scalar-indexed per lane). Making the array→Simd + conversion an explicit caller-side step (`FloatV(raw_values[k], + element_aligned)`) preserves that ownership — the arithmetic class + doesn't care where its data came from. + +### 2.3 Scalar runtime constants, broadcast on use + +`mDx2` and `mInvDx2` stay plain `float` at every W. They are +broadcast to `FloatV` at the point of use inside `normSqGrad()`: + +```cpp +return FloatV(mInvDx2) * detail::GodunovsNormSqrd(...); +``` + +`vbroadcastss` is free on x86 (folds into the FMA consumer); identity +at W=1. Storing these as `Simd` instead would cost 64 +bytes × 2 of storage and hold two YMM registers across the entire +kernel lifetime for no benefit. + +### 2.4 Caller-owned fill-side buffers + +The class has **no** fill-side storage — no `mValues`/`mIsActive` raw C +arrays. Callers own whatever shape of raw data is natural for them. +For the CPU SIMD case that's typically a pair of stack-local +`alignas(64)` C arrays sized `[SIZE][W]`; for the CUDA per-thread case +no intermediate buffer is needed at all. + +This preserves the arithmetic class's purity and gives callers flex- +ibility — a different Phase-2 path (e.g. a future hardware-gather +fill) can populate the stencil using whatever pattern fits, without +the class having to expose a "fill API" that bakes in one shape. + +--- + +## 3. 
Extrapolation semantics + +### 3.1 The out-of-band problem + +For a narrow-band SDF, only the center tap `<0,0,0>` is guaranteed to +be in the active narrow band. Every other tap may land outside the +band for some lanes — for those lanes the sidecar fill writes +`values[k][lane] = sidecar[0]` (magnitude of the background, since the +sidecar builder pre-sets slot 0 to `floatGrid.background()`) and +`isActive[k][lane] = false`. + +Applying WENO5 arithmetic directly to the unsigned `|background|` +magnitude produces wrong gradients at the band boundary: the +reconstructed field would not track the sign of the underlying signed- +distance function. The standard fix is to **extrapolate from the next +inner tap's sign**: + +``` +if (!isActive[k][lane]) + values[k][lane] = copysign(|background|, values[innerTap][lane]) +``` + +The `|background|` magnitude is preserved; the sign is copied from +whichever "inner" tap (one step closer to center along the same axis) +best represents which side of the surface the lane belongs to. + +### 3.2 Inner-tap cascade — ascending |Δ| order + +| Outer tap |Δ| | Inner tap (source of sign) | +|---|---| +| `<±1,0,0>`, `<0,±1,0>`, `<0,0,±1>` | center `<0,0,0>` | +| `<±2,0,0>`, `<0,±2,0>`, `<0,0,±2>` | `<±1,0,0>`, `<0,±1,0>`, `<0,0,±1>` | +| `<±3,0,0>`, `<0,±3,0>`, `<0,0,±3>` | `<±2,0,0>`, `<0,±2,0>`, `<0,0,±2>` | + +Processing taps in ascending-|Δ| order guarantees the inner tap is +already resolved (real value or previously extrapolated) when the +outer tap is processed. Sign propagation through a |Δ|=1 → |Δ|=2 → +|Δ|=3 chain is transitive — no special casing. + +### 3.3 The `kPairs[]` table + +The inner-tap relationship is WENO5-specific and hardcoded as a static +table inside the class: + +```cpp +static constexpr int kPairs[18][2] = { + // |Δ|=1 (inner = center, idx 0) + {3,0},{4,0},{9,0},{10,0},{15,0},{16,0}, + // |Δ|=2 (inner = |Δ|=1 on same axis) + {2,3},{5,4},{8,9},{11,10},{14,15},{17,16}, + // |Δ|=3 (inner = |Δ|=2 on same axis) + {1,2},{6,5},{7,8},{12,11},{13,14},{18,17} +}; +``` + +Indices match the `WenoStencil::Taps` tuple defined in +`WenoStencil.h` (same ordering as `WenoPt::idx` in +`nanovdb/math/Stencils.h`). Center tap (idx 0) is not processed — +assumed always in-band. + +**Why hardcoded, not template-derived:** a generic scheme would walk +`Taps` at compile time and derive inner-tap indices from |Δ| and axis +alignment. For a single stencil the table is 18 entries, reads +directly, and makes the cascade ordering self-documenting. Worth +revisiting if we add Weno7 or other axis-aligned WENO variants. + +### 3.4 `extrapolate()` implementation + +```cpp +template +void WenoStencil::extrapolate(float absBackground) +{ + const FloatV absBg(absBackground); + const FloatV zero (0.f); + + for (int p = 0; p < kNumPairs; ++p) { + const int k = kPairs[p][0]; + const int kInner = kPairs[p][1]; + + // copysign(absBg, inner): +absBg if inner >= 0, else -absBg. + const MaskV isNegInner = zero > values[kInner]; + const FloatV extrap = util::where(isNegInner, -absBg, absBg); + + // Active lanes keep their own value; inactive take the extrapolated sign-corrected background. + values[k] = util::where(isActive[k], values[k], extrap); + } +} +``` + +No `addr()`. No `element_aligned`. Reads `values[]` as Simd, +operates as Simd, writes Simd — the kernel body never drops to the +underlying scalar/array representation. + +**Per-pair cost (W=16, AVX2):** + +| Op | Cycles (est.) 
+**Per-pair cost (W=16, AVX2):**
+
+| Op | Cycles (est.) |
+|----|--------------:|
+| compare `zero > values[kInner]` | 1 |
+| `where(isNegInner, -absBg, absBg)` (vblendvps) | 1 |
+| `where(isActive[k], values[k], extrap)` (mask convert + vblendvps) | 1–2 |
+| **Total per pair** (`values[]` register-resident) | **≈ 4** |
+
+Total: 18 pairs × ~4 cycles ≈ 72 cycles per call — down from the
+Approach-1 estimate of ~7 cycles per pair, because there is no longer
+an explicit per-pair load. The Simd values live in YMM registers
+across the pair loop. Measured end-to-end cost (sidecar-stencil-extrap
+minus sidecar-stencil): ~0.14–0.19 ns/voxel on 24 threads.
+
+---
+
+## 4. Godunov norm-square-gradient
+
+### 4.1 Semantics — tracking the ground-truth scalar
+
+`nanovdb::math::WenoStencil::normSqGrad(isoValue)` in
+`nanovdb/math/Stencils.h` is the ground-truth scalar reference. Its
+body:
+
+```cpp
+const ValueType* v = mValues;
+const RealT
+    dP_xm = WENO5(v[2]-v[1], v[3]-v[2], v[0]-v[3], v[4]-v[0], v[5]-v[4], mDx2),
+    dP_xp = WENO5(v[6]-v[5], v[5]-v[4], v[4]-v[0], v[0]-v[3], v[3]-v[2], mDx2),
+    dP_ym = ..., dP_yp = ..., dP_zm = ..., dP_zp = ...;
+return mInvDx2 * GodunovsNormSqrd(v[0] > isoValue, dP_xm, dP_xp, dP_ym, dP_yp, dP_zm, dP_zp);
+```
+
+`WenoStencil<W>::normSqGrad(iso)` is a line-for-line transliteration of
+the same body, with three adaptations:
+
+1. `v = values` (Simd-typed storage, not the `mValues` scalar array).
+2. The local `dP_*` are `FloatV` rather than `RealT`.
+3. The final `mInvDx2 * ...` multiplication broadcasts the scalar
+   `mInvDx2` to `FloatV` (via `FloatV(mInvDx2)`); at W=1 this is a
+   no-op.
+
+### 4.2 `WENO5` — generic over scalar and Simd
+
+The six axial reconstructions are driven by a single free-function
+template `nanovdb::detail::WENO5` that mirrors
+`nanovdb::math::WENO5` exactly (Shu ICASE smoothness indicators,
+0.1/0.6/0.3 linear weights, the `static_cast` at the end dropped in
+favour of the trailing division). The structure is the same; only the
+literal constants are wrapped in `RealT(...)` constructors so they
+broadcast at W>1. It lives in `nanovdb::detail` to keep the naming
+convention close to the ground truth without colliding with the
+existing `nanovdb::math::WENO5`.
+
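+For reference, here is a sketch of what that generic body might look
+like. It is an illustration that follows the `nanovdb::math::WENO5`
+structure, not a quote of the actual `nanovdb::detail::WENO5`; the
+epsilon handling, the namespace name, and the local `Pow2` helper are
+assumptions:
+
+```cpp
+namespace nanovdb { namespace detail_sketch {
+
+template<typename RealT> inline RealT Pow2(RealT x) { return x * x; }  // lane-wise square
+
+// WENO5 reconstruction with Shu/ICASE smoothness indicators and
+// 0.1/0.6/0.3 linear weights; every literal is wrapped in RealT(...)
+// so it broadcasts when RealT is a W-lane Simd type.
+template<typename RealT>
+inline RealT WENO5(RealT v1, RealT v2, RealT v3, RealT v4, RealT v5, float scale2)
+{
+    const RealT C(13.f / 12.f);
+    const RealT eps(1.0e-6f * scale2);  // scale2 = dx², as passed in §4.1
+    const RealT A1 = RealT(0.1f) / Pow2(C * Pow2(v1 - RealT(2.f)*v2 + v3)
+                         + RealT(0.25f) * Pow2(v1 - RealT(4.f)*v2 + RealT(3.f)*v3) + eps),
+                A2 = RealT(0.6f) / Pow2(C * Pow2(v2 - RealT(2.f)*v3 + v4)
+                         + RealT(0.25f) * Pow2(v2 - v4) + eps),
+                A3 = RealT(0.3f) / Pow2(C * Pow2(v3 - RealT(2.f)*v4 + v5)
+                         + RealT(0.25f) * Pow2(v3 - RealT(4.f)*v4 + RealT(3.f)*v5) + eps);
+    // Trailing division instead of the scalar version's final static_cast.
+    return (A1 * (RealT(2.f)*v1 - RealT(7.f)*v2 + RealT(11.f)*v3) +
+            A2 * (-v2 + RealT(5.f)*v3 + RealT(2.f)*v4) +
+            A3 * (RealT(2.f)*v3 + RealT(5.f)*v4 - v5)) / (RealT(6.f) * (A1 + A2 + A3));
+}
+
+}} // namespace nanovdb::detail_sketch
+```
+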
+### 4.3 `GodunovsNormSqrd` — `where`-based, no control flow
+
+The scalar ground truth has a runtime `if (isOutside) { … } else { … }`.
+The generic-T version computes both branches unconditionally and
+blends via `util::where`:
+
+```cpp
+template<typename T, typename MaskT>
+inline T GodunovsNormSqrd(MaskT isOutside,
+                          T dP_xm, T dP_xp, T dP_ym, T dP_yp, T dP_zm, T dP_zp)
+{
+    const T zero(0.f);
+    const T outside = max(Pow2(max(dP_xm, zero)), Pow2(min(dP_xp, zero)))   // (dP/dx)²
+                    + max(Pow2(max(dP_ym, zero)), Pow2(min(dP_yp, zero)))
+                    + max(Pow2(max(dP_zm, zero)), Pow2(min(dP_zp, zero)));
+    const T inside  = max(Pow2(min(dP_xm, zero)), Pow2(max(dP_xp, zero)))
+                    + max(Pow2(min(dP_ym, zero)), Pow2(max(dP_yp, zero)))
+                    + max(Pow2(min(dP_zm, zero)), Pow2(max(dP_zp, zero)));
+    return where(isOutside, outside, inside);
+}
+```
+
+At `T = float, MaskT = bool` this compiles to scalar code with both
+branches speculatively evaluated — slightly slower than the ground
+truth's branchy form for scalar workloads in isolation, but identical
+in correctness. At `T = Simd<float, W>` with the matching `SimdMask`,
+the branches are unconditional SIMD compute plus a `vblendvps` —
+no lane-divergent branches, no scalarisation.
+
+Per-lane cost of the full `normSqGrad`:
+
+| Phase | Ops |
+|-------|-----|
+| 6× axial WENO5 | ~60 mul/add/fma + 6× reciprocals (the `0.N / Pow2(…)` terms) |
+| Godunov, both branches | 12× max/min + 12× mul + 5× add ≈ 29 ops |
+| Blend + final multiply by `FloatV(mInvDx2)` | 2 ops |
+
+Roughly ~100 arithmetic ops per voxel per `normSqGrad` call. At W=16
+on AVX2 that's ~100 / 16 ≈ 6.3 ops/voxel, times some FMA-throughput
+factor in cycles — call it ~2 ns/voxel single-threaded, ~0.1 ns/voxel
+on 24 threads. (To be validated by measurement; see §7, Future work.)
+
+---
+
+## 5. API usage
+
+### 5.1 CPU SIMD — caller-owned raw buffers, explicit load
+
+```cpp
+using FloatV = typename nanovdb::WenoStencil<W>::FloatV;
+using MaskV  = typename nanovdb::WenoStencil<W>::MaskV;
+
+// Caller owns its scalar-scatter target.
+alignas(64) float raw_values[SIZE][W];
+alignas(64) bool  raw_active[SIZE][W];
+
+nanovdb::WenoStencil<W> stencil(dx);
+
+// Fill — pure scalar stores, guaranteed fast codegen on all backends.
+for (int k = 0; k < SIZE; ++k) {
+    for (int i = 0; i < W; ++i) {
+        const uint64_t idx = /* sidecar index for tap k, lane i */;
+        raw_values[k][i] = sidecar[idx];
+        raw_active[k][i] = (idx != 0);
+    }
+}
+
+// Bridge — one SIMD load per tap.
+for (int k = 0; k < SIZE; ++k) {
+    stencil.values  [k] = FloatV(raw_values[k], util::element_aligned);
+    stencil.isActive[k] = MaskV (raw_active[k], util::element_aligned);
+}
+
+// Arithmetic — reads/writes stencil.values[] as Simd in place.
+stencil.extrapolate(std::abs(sidecar[0]));
+FloatV normSq = stencil.normSqGrad(/* iso = */ 0.f);
+
+// Simd → scalar bridge at the output side if downstream consumers are scalar.
+alignas(64) float normSq_lanes[W];
+util::store(normSq, normSq_lanes, util::element_aligned);
+```
+
+### 5.2 CUDA scalar — no intermediate buffer
+
+```cpp
+nanovdb::WenoStencil<1> stencil(dx);
+for (int k = 0; k < SIZE; ++k) {
+    const uint64_t idx = gather_index_for_tap(k);
+    stencil.values  [k] = sidecar[idx];
+    stencil.isActive[k] = (idx != 0);
+}
+
+stencil.extrapolate(fabsf(sidecar[0]));
+float normSq = stencil.normSqGrad();
+```
+
+`FloatV` is `float` at W=1; direct scalar assignment. `MaskV` is
+`bool`. No raw buffers, no `element_aligned`, no load loops — the
+per-thread path reads as pure scalar arithmetic.
+
+### 5.3 Compile-time named-tap access
+
+```cpp
+constexpr int ctr = WenoStencil<W>::tapIndex<0, 0, 0>();
+FloatV centerValue = stencil.values[ctr];
+
+constexpr int xm3 = WenoStencil<W>::tapIndex<-3, 0, 0>();
+FloatV xm3Value = stencil.values[xm3];
+```
+
+`tapIndex()` forwards to a private static `findTap` helper
+inside `WenoStencil`, static-asserting at compile time that the
+requested tap exists in the `Taps` tuple.
+
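+A sketch of how such a compile-time lookup can be written in C++17
+follows. The `Tap` type, the truncated `Taps` alias, and the free
+functions are assumptions made for illustration (the real `findTap` is
+a private static member and the real tap tuple lives in
+`WenoStencil.h`); the recursion-plus-`static_assert` pattern is the
+point:
+
+```cpp
+#include <tuple>
+
+template<int I, int J, int K>
+struct Tap { static constexpr int di = I, dj = J, dk = K; };
+
+// Truncated for illustration; the real list has all 19 WENO5 taps.
+using Taps = std::tuple<Tap<0,0,0>, Tap<-3,0,0>, Tap<-2,0,0>, Tap<-1,0,0>, Tap<1,0,0>>;
+
+template<int DI, int DJ, int DK, std::size_t I = 0>
+constexpr int findTap()
+{
+    if constexpr (I == std::tuple_size_v<Taps>) {
+        static_assert(I < std::tuple_size_v<Taps>, "requested tap is not in Taps");
+        return -1;  // unreachable; keeps the terminal branch well-formed
+    } else {
+        using T = std::tuple_element_t<I, Taps>;
+        if constexpr (T::di == DI && T::dj == DJ && T::dk == DK)
+            return static_cast<int>(I);
+        else
+            return findTap<DI, DJ, DK, I + 1>();
+    }
+}
+
+// Indices valid for the truncated Taps above.
+static_assert(findTap<0, 0, 0>()  == 0);
+static_assert(findTap<-3, 0, 0>() == 1);
+```
+
+A public `tapIndex<di, dj, dk>()` would then be a thin wrapper over a
+helper of this shape.
+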
+---
+
+## 6. Ownership boundaries
+
+```
+┌──────────────────────────────────────────────────────────────────
+│ Caller
+│   alignas(64) float raw_values[SIZE][W];    ← fill-side buffers
+│   alignas(64) bool  raw_active[SIZE][W];
+│
+│   for k: stencil.values[k]   = FloatV(raw_values[k], ...);
+│          stencil.isActive[k] = MaskV (raw_active[k], ...);
+│   ═══════════════════════════════════════════════════ Simd border
+├──────────────────────────────────────────────────────────────────
+│ WenoStencil<W>
+│   FloatV values[19];   MaskV isActive[19];
+│   float mDx2, mInvDx2;
+│   extrapolate() / normSqGrad() — Simd-in / Simd-out, pure compute
+├──────────────────────────────────────────────────────────────────
+│ Caller
+│   ═══════════════════════════════════════════════════ Simd border
+│   util::store(normSq, normSq_lanes, util::element_aligned);
+└──────────────────────────────────────────────────────────────────
+```
+
+Array↔Simd bridges exist only at the two explicit boundaries where
+scalar-indexed I/O meets SIMD-parallel compute. Inside `WenoStencil`
+everything is Simd; outside the class the caller chooses whatever
+scalar pattern fits its source/sink.
+
+---
+
+## 7. Future work
+
+### 7.1 Alternative stencils
+
+If/when Weno7 or a non-axis-aligned stencil is needed, the class
+would specialise on a stencil-policy template parameter rather than
+hardcoding the 19-tap WENO5 shape:
+
+```cpp
+template<typename StencilPolicy, int W>
+class AxisAlignedStencil { /* derive kPairs at compile time */ };
+```
+
+The `kPairs` table would be generated from `StencilPolicy::Taps` via a
+constexpr pass that finds, for each tap, the same-axis neighbour with
+|Δ| = |tap.Δ| − 1. Not needed until a second axis-aligned stencil
+exists.
+
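+A sketch of that constexpr pass, under the simplifying assumption that
+the tap list is available as a flat constexpr array of integer offsets
+(the real `StencilPolicy::Taps` is a type list, so the production
+version would walk types instead of values; all names below are
+illustrative):
+
+```cpp
+#include <cstddef>
+
+struct TapOffset { int di, dj, dk; };
+
+constexpr int absInt(int d) { return d < 0 ? -d : d; }
+constexpr int shrink(int d) { return d > 0 ? d - 1 : (d < 0 ? d + 1 : 0); }
+
+template<std::size_t N>
+struct PairTable { int outer[N - 1]; int inner[N - 1]; };  // center tap excluded
+
+// For each non-center tap, find the same-axis neighbour with |Δ| one smaller.
+// Emitting pairs level by level preserves the ascending-|Δ| cascade order.
+template<std::size_t N>
+constexpr PairTable<N> derivePairs(const TapOffset (&taps)[N])
+{
+    PairTable<N> t{};
+    std::size_t p = 0;
+    for (int level = 1; level < int(N); ++level) {
+        for (std::size_t k = 0; k < N; ++k) {
+            const int mag = absInt(taps[k].di) + absInt(taps[k].dj) + absInt(taps[k].dk);
+            if (mag != level) continue;  // axis-aligned: exactly one nonzero component
+            const TapOffset inner{shrink(taps[k].di), shrink(taps[k].dj), shrink(taps[k].dk)};
+            for (std::size_t j = 0; j < N; ++j)
+                if (taps[j].di == inner.di && taps[j].dj == inner.dj && taps[j].dk == inner.dk) {
+                    t.outer[p] = int(k);
+                    t.inner[p] = int(j);
+                    ++p;
+                }
+        }
+    }
+    return t;
+}
+```
+
+A `static_assert` that `derivePairs` over the WENO5 offsets reproduces
+the hand-written `kPairs` table would be the natural first consumer,
+keeping the explicit table as the readable source of truth until a
+second stencil exists.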