diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000000..7dc163fd70 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,119 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Build System + +OpenVDB uses CMake (minimum 3.24) and requires out-of-source builds. + +**Minimal core build:** +```bash +mkdir build && cd build +cmake .. -DOPENVDB_BUILD_UNITTESTS=ON +make -j$(nproc) +``` + +**Using the CI build script (recommended for full builds):** +```bash +./ci/build.sh --build-type=Release \ + --components="core,test" \ + --cargs="-DOPENVDB_ABI_VERSION_NUMBER=13" +``` + +Component flags for `--components`: `core`, `python`, `bin`, `view`, `render`, `test`, `hou`, `axcore`, `nano`, `nanotest` + +**Key CMake options:** +| Option | Default | Description | +|--------|---------|-------------| +| `OPENVDB_BUILD_CORE` | ON | Core library | +| `OPENVDB_BUILD_UNITTESTS` | OFF | Unit tests | +| `OPENVDB_BUILD_NANOVDB` | OFF | NanoVDB | +| `OPENVDB_BUILD_AX` | OFF | OpenVDB AX | +| `OPENVDB_ABI_VERSION_NUMBER` | 13 | ABI version (6–13) | +| `OPENVDB_CXX_STRICT` | OFF | Strict warnings | +| `NANOVDB_USE_CUDA` | OFF | CUDA support for NanoVDB | + +## Running Tests + +```bash +cd build +ctest -V # all tests +ctest -V -R TestGrid # single test by name +``` + +To build only specific unit tests (avoids full rebuild): +```bash +cmake .. -DOPENVDB_TESTS="Grid;Tree;LeafNode" +``` + +Tests use Google Test (minimum 1.10). Test sources live in: +- `openvdb/openvdb/unittest/` — core library tests (`TestFoo.cc` pattern) +- `nanovdb/nanovdb/unittest/` — NanoVDB tests + +## Code Architecture + +### Repository Layout + +``` +openvdb/openvdb/ Core OpenVDB library + tree/ Tree node hierarchy (RootNode, InternalNode, LeafNode) + tools/ Algorithm implementations (level sets, CSG, smoothing, etc.) + math/ Math primitives (Vec, Mat, Quat, Transform, BBox) + io/ VDB file format I/O + points/ Point data grids + python/ Python bindings (nanobind) + unittest/ Unit tests + +nanovdb/nanovdb/ NanoVDB — compact, GPU-friendly VDB subset + tools/ CPU algorithms + tools/cuda/ CUDA kernels + examples/ Standalone example programs + +openvdb_ax/openvdb_ax/ OpenVDB AX — JIT expression language for VDB operations + ast/ Abstract syntax tree + codegen/ LLVM code generation + compiler/ Compilation pipeline + +openvdb_cmd/ Command-line tools (vdb_print, vdb_lod, vdb_tool, vdb_view, vdb_render) +openvdb_houdini/ Houdini plugin +openvdb_maya/ Maya plugin +cmake/ CMake find-modules and configuration +ci/ CI build/install scripts +``` + +### Core Data Model + +OpenVDB uses a **B+tree-like hierarchical sparse data structure**: +- `Grid` — top-level container with transform and metadata +- `Tree` — composed of `RootNode → InternalNode(s) → LeafNode` +- Leaf nodes are 8×8×8 voxel blocks; internal nodes are 16³ and 32³ by default +- `ValueAccessor` caches tree traversal paths for repeated access patterns +- `GridBase` / `TypedGrid` provide the runtime-polymorphic/compile-time-typed split + +### NanoVDB vs OpenVDB + +NanoVDB is a read-optimized, single-allocation, GPU-portable subset of OpenVDB. It cannot be modified after construction. The `nanovdb/tools/CreateNanoGrid.h` and adjacent files handle conversion from OpenVDB grids to NanoVDB grids. The current branch (`mesh-to-grid`) adds `tools/cuda/MeshToGrid.cuh` for direct CUDA mesh-to-NanoVDB conversion. 
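+
+A minimal host-side conversion sketch (assumes a build with `NANOVDB_USE_OPENVDB`; exact template parameters and defaults may vary between NanoVDB versions):
+
+```cpp
+#include <openvdb/openvdb.h>
+#include <nanovdb/tools/CreateNanoGrid.h>
+
+// Convert an OpenVDB float grid into a single-buffer, read-only NanoVDB grid.
+uint64_t convertToNanoVDB(openvdb::FloatGrid& srcGrid)
+{
+    auto handle = nanovdb::tools::createNanoGrid(srcGrid);   // nanovdb::GridHandle<>
+    const nanovdb::NanoGrid<float>* grid = handle.grid<float>();
+    // The grid is immediately usable on the CPU; copy the handle's buffer to the
+    // device (or build with a device-capable buffer type) for GPU access.
+    return grid->activeVoxelCount();
+}
+```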
+ +### OpenVDB AX + +AX compiles a domain-specific expression language to LLVM IR for execution over OpenVDB volumes and point grids. The pipeline is: source string → AST (`ast/`) → typed analysis → LLVM codegen (`codegen/`) → JIT execution via `compiler/`. + +## C++ Standard and ABI + +- Requires C++17 minimum +- ABI version is set at compile time via `OPENVDB_ABI_VERSION_NUMBER`; the current version is 13 +- Headers are in `openvdb/openvdb/` and installed to `include/openvdb/` + +## Dependencies + +Core: Boost ≥ 1.82, TBB ≥ 2020.3, Blosc ≥ 1.17, OpenEXR/Imath ≥ 3.2, zlib ≥ 1.2.7 +Tests: GTest ≥ 1.10 +Python bindings: Python ≥ 3.11, nanobind ≥ 2.5.0 +NanoVDB GPU: CUDA toolkit +AX: LLVM + +On Linux, ASWF Docker containers (used by CI) bundle most dependencies. See `ci/install_macos.sh` and `ci/install_windows.ps1` for platform-specific setup. + +## Coding Standards + +Follow the style guide at https://www.openvdb.org/documentation/doxygen/codingStyle.html. Contributions require a Developer Certificate of Origin sign-off (`git commit -s`) and a CLA on file — see CONTRIBUTING.md. diff --git a/nanovdb/nanovdb/NanoVDB.h b/nanovdb/nanovdb/NanoVDB.h index 78955cd251..8ed16899c4 100644 --- a/nanovdb/nanovdb/NanoVDB.h +++ b/nanovdb/nanovdb/NanoVDB.h @@ -4136,15 +4136,26 @@ struct NANOVDB_ALIGN(NANOVDB_DATA_ALIGNMENT) LeafDatahasStats() ? this->lastOffset() + 2u : 0u; } __hostdev__ uint64_t getAvg() const { return this->hasStats() ? this->lastOffset() + 3u : 0u; } __hostdev__ uint64_t getDev() const { return this->hasStats() ? this->lastOffset() + 4u : 0u; } + // Default branchless; define NANOVDB_USE_BRANCHY_GETVALUE to restore the + // pre-2026 branchy form. __hostdev__ uint64_t getValue(uint32_t i) const { - //return mValueMask.isOn(i) ? mOffset + mValueMask.countOn(i) : 0u;// for debugging +#ifdef NANOVDB_USE_BRANCHY_GETVALUE uint32_t n = i >> 6; const uint64_t w = BaseT::mValueMask.words()[n], mask = uint64_t(1) << (i & 63u); - if (!(w & mask)) return uint64_t(0); // if i'th value is inactive return offset to background value + if (!(w & mask)) return uint64_t(0); uint64_t sum = BaseT::mOffset + util::countOn(w & (mask - 1u)); if (n--) sum += BaseT::mPrefixSum >> (9u * n) & 511u; return sum; +#else + const uint32_t n = i >> 6; + const uint64_t w = BaseT::mValueMask.words()[n]; + const uint64_t bit = uint64_t(1) << (i & 63u); + const uint64_t prefix = n == 0u ? uint64_t(0) + : (BaseT::mPrefixSum >> (9u * (n - 1u))) & 511u; + const uint64_t sum = BaseT::mOffset + prefix + util::countOn(w & (bit - 1u)); + return ((w & bit) ? ~uint64_t(0) : uint64_t(0)) & sum; +#endif } }; // LeafData diff --git a/nanovdb/nanovdb/examples/CMakeLists.txt b/nanovdb/nanovdb/examples/CMakeLists.txt index 780a4f59bd..1c337b7f48 100644 --- a/nanovdb/nanovdb/examples/CMakeLists.txt +++ b/nanovdb/nanovdb/examples/CMakeLists.txt @@ -114,6 +114,22 @@ nanovdb_example(NAME "ex_merge_nanovdb_cuda" OPENVDB) nanovdb_example(NAME "ex_refine_nanovdb_cuda" OPENVDB) nanovdb_example(NAME "ex_coarsen_nanovdb_cuda" OPENVDB) +nanovdb_example(NAME "ex_voxelBlockManager_host_cuda") +if(TARGET ex_voxelBlockManager_host_cuda) + target_compile_options(ex_voxelBlockManager_host_cuda PRIVATE + $<$:-Xcompiler=-mavx2,-fopenmp-simd> + $<$:-mavx2 -fopenmp-simd>) +endif() + +# End-to-end CPU WENO5 norm-square-gradient on a narrow-band level set, +# with a scalar reference for correctness validation. 
+nanovdb_example(NAME "ex_weno_nanovdb_cpu" OPENVDB) +if(TARGET ex_weno_nanovdb_cpu) + target_compile_options(ex_weno_nanovdb_cpu PRIVATE -march=native -fopenmp-simd) + target_include_directories(ex_weno_nanovdb_cpu PRIVATE + ${CMAKE_CURRENT_SOURCE_DIR}/../../..) +endif() + if(CUDAToolkit_FOUND) nanovdb_example(NAME "ex_make_mgpu_nanovdb") # requires cuRAND target_link_libraries(ex_make_mgpu_nanovdb PRIVATE CUDA::curand) diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VBMImplementationKnowledge.md b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VBMImplementationKnowledge.md new file mode 100644 index 0000000000..105e769b9a --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VBMImplementationKnowledge.md @@ -0,0 +1,337 @@ +# VBM Implementation Knowledge Base + +This document captures non-obvious design decisions, rejected alternatives, invariants, +and performance phenomenology for the VoxelBlockManager (VBM) subsystem and its CPU +port of `decodeInverseMaps`. It is written as dense, structured, agent-consumable +facts rather than narrative. It complements `VoxelBlockManagerContext.md` (semantics +and API). + +--- + +## 1. NanoVDB Leaf Mask Layout — Ambient Space + +A `Mask<3>` stores 512 bits as 8 `uint64_t` words (`maskWords[x]`, x = 0..7). +The NanoVDB linear voxel index within a leaf is `i = x*64 + y*8 + z`, where: +- x = word index (0..7) +- y = byte index within the word (0..7) — bits 8y..8y+7 of `maskWords[x]` +- z = bit index within the byte (0..7) + +**Invariant**: word x covers leaf-local positions x*64 .. x*64+63. The bit at +position `y*8+z` within word x corresponds to leaf-local index `x*64 + y*8 + z`. + +**Why this matters**: every algorithm that operates on the mask must respect this +layout. The "fast" dimension (z, innermost) is the bit dimension within a byte. +The "slow" dimension (y) is the byte dimension within a word. The "word" dimension +(x) is independent across the 8 words. + +--- + +## 2. mPrefixSum Encoding + +`LeafIndexBase::mPrefixSum` (a `uint64_t`) encodes 7 exclusive cumulative popcounts +at the word boundaries, packed as 7 x 9-bit fields: + + bits 9*(x-1) .. 9*(x-1)+8 = total active voxels in words 0..x-1, x = 1..7 + +- Field x-1 is the **exclusive** prefix at word x — i.e., how many active voxels + precede word x in the leaf. +- xOffset[0] = 0 always (implicit, not stored). +- Maximum value per field: 512 (all 512 voxels in preceding words active) — fits in + 9 bits. +- **Precondition for buildMaskPrefixSums**: the caller must pass the leaf's own + `mPrefixSum` field, not a recomputed value. The field is set during grid + construction and is authoritative. + +--- + +## 3. buildMaskPrefixSums — Output Semantics + +`util::buildMaskPrefixSums(mask, prefixSum, offsets[512])` produces: + + offsets[i] = number of active voxels at leaf-local positions 0..i (inclusive) + +- Output is **inclusive** and **1-based**: for an active voxel at position i, + `offsets[i]` is its 1-based rank among all active voxels in the leaf. +- Output is **leaf-local**: cross-word offsets from `mPrefixSum` are folded in, + but the global leaf-first-offset (`leaf.data()->firstOffset()`) is NOT added. + `offsets[i]` is in range [1, 512]. +- For an **inactive** voxel at position i: `offsets[i] == offsets[i-1]` (no + increment). These values are valid and load-bearing for the `shfl_down` + compaction approach (see §6). +- To recover the **exclusive** (0-based) rank of an active voxel at i: + `offsets[i] - 1`. 
+- To recover the **global sequential index** of an active voxel at position i: + `leafFirstOffset - 1 + offsets[i]` + where `leafFirstOffset = leaf.data()->firstOffset()` is the 1-based global index + of the leaf's first active voxel. + +--- + +## 4. transposeByteRow — What It Does and Why + +`transposeByteRow(src)` treats the low 8 bits of `src` as the first row of an 8x8 +bit matrix and returns the result of transposing it: + + output.ui8[z] = (src >> z) & 1 for z = 0..7 + +It is the single-row specialization of `transposeBits8x8`: if only the first row of +an 8x8 bit matrix is non-zero, the full transpose reduces to this operation. + +**Why it is used in buildMaskPrefixSums (Step 1 — indicator fill)**: +The algorithm needs `data[x][y].ui8[z] = indicator(x, y, z)` — a 0/1 value per +(x, y, z) triple. The indicator for (x, y, z) is bit `y*8+z` of `maskWords[x]`, +which is bit z of byte y of `maskWords[x]`. `transposeByteRow(maskWords[x] >> (y*8))` +extracts byte y and places bit z into byte z of the result — exactly the required +indicator layout. + +**Equivalent hardware instruction**: `_pdep_u64(src & 0xFF, kSpread)` on x86, where +`kSpread = 0x0101010101010101`. The software implementation is used for portability. + +**Why not `& kSpread` (the simpler alternative)**: `(maskWords[x] >> z) & kSpread` +would extract bit z from each byte — which is what Plan #2 needed (data[x][z] layout). +Plan #1 uses data[x][y] layout, requiring the byte dimension to be the output axis, +not the input axis. See §5 for the Plan #1 vs Plan #2 decision. + +--- + +## 5. Plan #1 vs Plan #2 — The Rejected Alternative + +**Plan #2** would have used layout `data[x][z].ui8[y]`: +- Simpler indicator fill: `data[x][z].ui64 = (maskWords[x] >> z) & kSpread` + (just a shift and mask — no `transposeByteRow` needed). +- After the z-pass (Hillis-Steele within-uint64 over y), the output is in + `data[x][z]` order, not `data[x][y]` order. +- **Fatal cost**: before zero-extending to uint16_t, a `transposeBytes8x8` call + per x-slice is required to reorder from `data[x][z].ui8[y]` to the required + linear output order. This is ~200 instructions per call x 8 calls = ~1600 + instructions of transpose overhead, dominating the ~14-cycle z+y passes. + +**Plan #1** (chosen) uses layout `data[x][y].ui8[z]`: +- Indicator fill requires `transposeByteRow` — slightly more expensive than `& kSpread`. +- After the z-pass and y-pass, the output is already in linear order (`x*64 + y*8 + z`). +- **No output transpose**: zero-extension to uint16_t is a straight vpmovzxbw over + contiguous memory. + +**Decision criterion**: Plan #1 eliminates the expensive output transpose at the cost +of a cheaper input transformation. The output transpose (1024 bytes) is intrinsically +more expensive than the input transformation (64 bytes). + +--- + +## 6. Compaction Approaches — Inclusive vs Exclusive, Active vs Inactive + +Two approaches exist for building `leafLocalOffsets[j]` (local position of j-th active +voxel) from `offsets`: + +**Scatter approach** (simple, used in decodeInverseMaps): + For each active position i: `leafLocalOffsets[offsets[i] - 1] = i` + Only requires `offsets[i]` at active positions. Inactive positions are not read. + +**shfl_down approach** (deeper SIMD, see DecodeInverseMapsCPUPlan §4e): + Requires `shifts[i] = i - (offsets[i] - isActive(i))` for ALL positions, active + and inactive. The `move[i]` predicate for each pass depends on `shifts[i]` even + for inactive voxels. 
`buildMaskPrefixSums` correctly provides values at all 512 + positions for this purpose. + +**Key invariant**: `offsets` from `buildMaskPrefixSums` is valid at all 512 positions, +not just active ones. This is intentional and necessary for the shfl_down path. + +--- + +## 7. decodeInverseMaps CPU — Current Implementation and Performance History + +**Current implementation** (`VoxelBlockManager.h`, branch `vbm-cpu-port`): +For each leaf overlapping the block: +1. Early-exit `break` if `leafFirstOffset >= blockFirstOffset + BlockWidth` (monotonicity + invariant: all subsequent leaves also fall outside the block). +2. Build `shifts[513]` directly via `buildMaskPrefixSums`: `shifts[0]=0`, + `buildMaskPrefixSums(..., shifts+1)` writes inclusive 0-bit counts into + `shifts[1..512]`, giving `shifts[i]` = exclusive count of inactive voxels in [0..i-1]. + `leafValueCount = 512 - shifts[512]` as a free by-product. +3. Run 9 in-place shuffle-down passes (Shift=1,2,4,...,256) via `util::shuffleDownMask` with a single buffer. +4. Range fill `leafIndex[pStart..pEnd)` and contiguous copy from `leafLocalOffsets`. + +**Eliminated from earlier design**: the separate `prefixSums[513]` array and the explicit +`shifts[i] = i - prefixSums[i]` subtraction loop. `buildMaskPrefixSums` writes +`shifts[]` directly in one pass by running the prefix sum over the inverted mask words and +adjusting the cross-word offsets as `64*x - ones` instead of `ones`. + +**Performance history (2M voxels / 16384 blocks / 25% occupancy / 24 OMP threads / AVX2)**: + +All figures below are wall-clock time for the full `decodeInverseMaps` loop over all +16384 blocks. Entries marked *unoptimized* were measured with `CMAKE_BUILD_TYPE` unset +(effectively `-O0` host compilation via NVCC); entries marked *Release* used +`-DCMAKE_BUILD_TYPE=Release` (`-O3 -DNDEBUG`). + +- Original `getValue()` loop (unoptimized): ~77 ms +- `buildMaskPrefixSums` + bit-scan scatter (unoptimized): ~65 ms (~15% improvement) +- `shuffleDownMask` without vectorization (-fopenmp missing in CUDA host flags, unoptimized): ~250 ms +- `shuffleDownMask` with vectorization (two-buffer `__restrict__`, -fopenmp fixed, unoptimized): ~15-20 ms +- `shuffleDownMask` single-buffer + `buildMaskPrefixSums` (unoptimized): ~15-17 ms (no regression) +- `shuffleDownMask` single-buffer + `buildMaskPrefixSums` (**Release**): **~0.9-1.6 ms (~15x speedup)** + +**Critical finding — build type**: with `CMAKE_BUILD_TYPE` unset, NVCC compiles host +code without optimization. Template functions (`shuffleDownMask`, `buildMaskPrefixSums`) +are not inlined, the `#pragma omp simd` loops are not vectorized, and all locals are +spilled to the stack. The resulting ~15-17 ms figure is not representative of real +performance. Always benchmark with `-DCMAKE_BUILD_TYPE=Release`. + +**Critical finding — -fopenmp**: `#pragma omp simd` is silently ignored for CUDA host +code unless `-Xcompiler -fopenmp` is explicitly added. Linking `OpenMP::OpenMP_CXX` +does NOT automatically propagate compile flags to CUDA sources via CMake. Without +-fopenmp, the shuffle-down passes compile as scalar loops. + +**Why shuffleDownMask beats bit-scan with vectorization**: +- Bit-scan is inherently scalar (data-dependent BSF/BSR instruction, variable trip count). +- shuffleDownMask's 9 passes are fixed-width, data-independent loops over 512 elements. +- With AVX2 (16 uint16_t per register), each pass takes ~32 vector ops vs ~128 scalar ops + for the bit-scan at 25% occupancy. 
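+
+For orientation, a scalar sketch of what either compaction path must produce, following the contracts in §1, §3, and §6 (illustration only -- the shipped path uses the SIMD `buildMaskPrefixSums` + `shuffleDownMask` routines described above):
+
+```cpp
+#include <cstdint>
+
+// maskWords[8]  : the leaf's 512-bit value mask (§1 layout).
+// offsets[512]  : inclusive, 1-based active-voxel ranks from buildMaskPrefixSums (§3).
+// leafLocalOffsets[j] receives the leaf-local position of the j-th active voxel (§6).
+void scatterCompact(const uint64_t maskWords[8],
+                    const uint16_t offsets[512],
+                    uint16_t       leafLocalOffsets[512])
+{
+    for (uint32_t i = 0; i < 512; ++i) {
+        const bool active = (maskWords[i >> 6] >> (i & 63u)) & 1u;
+        if (active)
+            leafLocalOffsets[offsets[i] - 1] = (uint16_t)i;
+    }
+}
+```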
+ +**Recommended CMake invocation** (from the `build/` directory): +```bash +cmake .. \ + -DCMAKE_BUILD_TYPE=Release \ + -DCMAKE_CUDA_ARCHITECTURES=89 \ + -DOpenVDB_ROOT=/home/esifakis/local \ + -DNANOVDB_USE_CUDA=ON \ + -DNANOVDB_USE_OPENVDB=ON \ + -DNANOVDB_USE_TBB=ON \ + -DNANOVDB_BUILD_EXAMPLES=ON \ + -DNANOVDB_BUILD_UNITTESTS=ON \ + -DNANOVDB_BUILD_TOOLS=ON +``` + +--- + +## 8. Critical Portability Notes + +**`UINT64_C` vs `UL` suffix**: always use `UINT64_C(...)` for 64-bit hex constants. +`UL` is 32 bits on MSVC/Windows. Occurrences of `UL`-suffixed 64-bit constants in +`MorphologyHelpers.h` were corrected to `UINT64_C`. + +**`__restrict__` portability**: `__restrict__` (GCC/Clang) vs `__restrict` (MSVC). +NanoVDB has no existing C++ portability macro for this. `CNanoVDB.h` defines +`RESTRICT` but only for the C API. A `NANOVDB_RESTRICT` macro should be added +if `__restrict__` is used in host-only headers. + +**`#pragma omp simd` is safe without OpenMP**: unknown pragmas are silently ignored +by standard-conforming C++17 compilers. All major compilers recognize the `omp` +namespace. The pragma is present in `buildMaskPrefixSums` for portability; GCC 13.3 +auto-vectorizes the loops correctly even without it. + +**`#pragma omp simd` defeated by hardware POPCNT**: under `-mavx2`, GCC replaces +software Hamming-weight expressions with scalar `popcntl` and ignores the simd pragma. +The `popcount32` + `#pragma omp simd` approach in `vbm_host_cuda_kernels.cu` (sanity +bench) requires `-mno-popcnt` to vectorize, which is unsuitable for production. +`buildMaskPrefixSums` avoids this by not using popcount at all. + +--- + +## 9. No __hostdev__ on buildMaskPrefixSums — Deliberate Decision + +`buildMaskPrefixSums` is CPU-only. A CUDA equivalent would be organized around the +32-thread warp (using `__ballot_sync`, warp-level prefix intrinsics, or cooperative +group reductions) and would look fundamentally different. Marking it `__hostdev__` +would be misleading about intended usage and would invite incorrect porting. + +The CUDA `decodeInverseMaps` already has its own highly optimized implementation. +The CPU and CUDA decode paths are expected to remain separate implementations +indefinitely. + +--- + +## 10. Two Essential VBM Invariants (Not Enforced by NanoVDB Itself) + +These invariants are not structurally guaranteed by the NanoVDB data format, but all +reasonable ways of constructing a VBM-compatible grid enforce them, and the VBM +algorithms depend on them for correctness and for simplifying early-exit logic. + +**Invariant A — No empty leaves**: every leaf node in a VBM grid has at least one active +voxel (`leafValueCount > 0`). This is essential for the JumpMap encoding to give an +unambiguous leafID per block: a leaf with zero active voxels would contribute nothing to +the sequential voxel index but would still occupy a slot in the leaf array, breaking the +1-to-1 mapping between sequential voxel ranges and leaf IDs. + +**Invariant B — Monotonically non-decreasing voxelID → leafID mapping, step ≤ 1**: the +global sequential active voxel index increases monotonically across leaves, and consecutive +leaves' active voxel ranges are contiguous (no gaps, no overlaps). Formally: + `leafFirstOffset[k+1] = leafFirstOffset[k] + leafValueCount[k]` +This means active voxel ranges partition the global index space without gaps. + +**Consequences for `decodeInverseMaps`**: +- The check `leafValueCount == 0` is dead code and can be omitted (Invariant A). 
+- The check `leafFirstOffset + leafValueCount <= blockFirstOffset` (leaf entirely before + block) is impossible for any leaf in the iteration range, given that `firstLeafID` is + chosen to be the leaf containing `blockFirstOffset` (Invariant B + correct firstLeafID). +- When `leafFirstOffset >= blockFirstOffset + BlockWidth` fires for one leaf, it holds for + all subsequent leaves (Invariant B), so `break` is correct instead of `continue`. +- The early-exit collapses to a single line: + `if (leafFirstOffset >= blockFirstOffset + BlockWidth) break;` + +--- + +## 11. decodeInverseMaps CPU — Threading and Calling Philosophy + +`decodeInverseMaps` (CPU) is **intentionally single-threaded and stateless**. It decodes +exactly one voxel block per call. The caller (`buildVoxelBlockManager` CPU path) distributes +blocks across threads via `util::forEach` (TBB or `std::thread`). There is no OpenMP thread +parallelism inside the function — `#pragma omp simd` is present only to hint the compiler +toward SIMD vectorization, not to spawn threads. + +**Consequence for callers**: the output arrays `leafIndex[BlockWidth]` and +`voxelOffset[BlockWidth]` must be caller-allocated. The natural pattern is stack allocation +inside the `util::forEach` lambda, one array per block per thread. Heap allocation (e.g., +`new[]` per block) is wasteful; thread-local storage is an option but introduces alignment +complications (see §12). + +**Contrast with GPU**: the GPU `decodeInverseMaps` is cooperative — a full CUDA thread block +(up to 512 threads) decodes one voxel block together, writing to shared memory. On CPU, one +thread decodes one block sequentially but with SIMD across the 512 voxel positions. + +--- + +## 12. Why mPrefixSum Is Not Used in decodeInverseMaps + +`LeafIndexBase::mPrefixSum` encodes per-word exclusive cumulative popcounts as 7 × 9-bit +packed fields (see §2). It is designed for **random access** — locating the global sequential +index of a single voxel in O(1) without scanning the whole mask. + +`decodeInverseMaps` needs the full 512-entry prefix-sum table for every active voxel in the +leaf. For this **bulk sequential access**, recomputing per-word popcounts from scratch via the +`buildMaskPrefixSums` SIMD algorithm (§3) is cheaper than unpacking the 9-bit fields. +Specifically: +- Unpacking 7 × 9-bit fields requires masked shifts and is awkward to vectorize. +- `buildMaskPrefixSums` operates on the raw 8 `uint64_t` words, uses only shifts and adds, and + vectorizes cleanly to 8 AVX2 `vpaddq`/`vpsllq` pairs. +- At 25% occupancy (typical level-set narrow band), `buildMaskPrefixSums` consumes a small + fraction of the total per-leaf work. + +`mPrefixSum` is still used indirectly: it is the `prefixSum` argument passed to +`buildMaskPrefixSums` to fold in the cross-word exclusive offsets (Step 5 of the algorithm). +What is bypassed is the idea of using `mPrefixSum` alone (without a full table build) to compute +individual voxel offsets one at a time. + +--- + +## 13. 
Output Fill Structure — Range Fill + Contiguous Copy, Not Scatter + +The CPU `decodeInverseMaps` fills its output arrays per leaf as: + +``` +leafIndex[pStart..pEnd) = leafID (range fill, constant value) +voxelOffset[pStart..pEnd) = leafLocalOffsets[jStart..jStart+(pEnd-pStart)) (contiguous copy) +``` + +**Why contiguous copy (not scatter)**: `shuffleDownMask` produces `leafLocalOffsets` as a +compacted array of active local positions in ascending order — position 0 holds the 0-th active +voxel's leaf-local offset, position 1 holds the 1-st, etc. Since the block's output range +`[pStart, pEnd)` maps to a contiguous slice `[jStart, jStart+len)` of this compacted array, a +single `std::copy` (or `memcpy`) suffices. No scatter is needed. + +**`std::fill`/`std::copy` vectorization caveat**: if the output pointers are stack-allocated or +derived from TLS, the compiler may not prove alignment and fall back to scalar or SSE stores even +under `-mavx2`. For the sentinel-fill initialization (filling `UnusedLeafIndex` / +`UnusedVoxelOffset` before the leaf loop), this can dominate runtime. Measurements on the test +machine showed `std::fill` via TLS taking ~1.5 ms vs ~0.22 ms with explicit `_mm256_store_si256` +over `alignas(64)` arrays (16384 blocks, 32 threads). Stack-allocate the output arrays with +`alignas(64)` and replace `std::fill` with explicit AVX2 stores when performance matters. diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VoxelBlockManagerContext.md b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VoxelBlockManagerContext.md new file mode 100644 index 0000000000..1ef2ab9bfb --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/VoxelBlockManagerContext.md @@ -0,0 +1,259 @@ +# VoxelBlockManager: Context and Design Guide + +This document captures the intent, semantics, and usage patterns of the +VoxelBlockManager (VBM) acceleration structure for NanoVDB OnIndex grids. +It is intended to give a complete mental model to anyone (human or AI assistant) +working on VBM development or writing SIMT kernels that consume it. + +--- + +## What problem does the VBM solve? + +A NanoVDB `ValueOnIndex` (OnIndex) grid assigns a unique sequential integer +index to each active voxel. These indices are dense: if the grid has N active +voxels, they are numbered 1..N (the index 0 is reserved). This sequential +layout enables efficient SIMT parallelism: a GPU kernel can launch one thread +per active voxel, with thread k processing voxel k. + +The challenge is *decoding*: given a voxel's sequential index, how do you find +its 3D coordinates and leaf node? The index space is flat but the tree is +hierarchical. Scanning the tree for each voxel is too expensive. + +The VBM solves this by precomputing two small metadata arrays that let any +thread, given only its sequential index, quickly determine which leaf node +contains that voxel and where within the leaf it lives. This decode is done +cooperatively at thread-block granularity (one thread block per "voxel block"), +using shared memory. + +--- + +## Core concepts + +### Voxel blocks + +The VBM partitions the active voxel index space into contiguous spans of +`BlockWidth` voxels called *voxel blocks*. `BlockWidth` is a compile-time +power of two (typically 128). Block `b` covers sequential indices: + + [firstOffset + b * BlockWidth, firstOffset + (b+1) * BlockWidth - 1] + +A GPU kernel processes one voxel block per thread block, with `BlockWidth` +threads. 
Thread `t` in block `b` is responsible for voxel index +`firstOffset + b * BlockWidth + t`. + +### firstOffset and lastOffset + +`firstOffset` is the base of the VBM's index range. It must satisfy +`firstOffset == 1 (mod BlockWidth)` (i.e., it is "block-aligned"). For a +single-GPU build covering the full grid, `firstOffset = 1` and +`lastOffset = activeVoxelCount()`. + +In a **multi-GPU** setting, each rank owns a contiguous slice of the active +voxel index space. Rank r uses `firstOffset_r` and `lastOffset_r` to build a +VBM covering only its slice, even though all ranks hold a copy of the full +grid topology. The VBM metadata is then sized and indexed relative to each +rank's own `firstOffset`. + +### blockCount + +`blockCount` is the **allocated capacity** of the VBM metadata buffers. It +must be >= `ceil((lastOffset - firstOffset + 1) / BlockWidth)` but may be +larger. This allows pre-allocating a larger handle and rebuilding in-place +for a range that grows over time, without reallocating. + +--- + +## Metadata arrays + +### firstLeafID (uint32_t[blockCount]) + +`firstLeafID[b]` is the index of the **first leaf node** that overlaps voxel +block `b`. A leaf overlaps block `b` if any of its active voxels fall in +`[firstOffset + b*BlockWidth, firstOffset + (b+1)*BlockWidth - 1]`. + +In a sequential NanoVDB grid, leaf nodes are laid out contiguously in memory +in ascending order of their first active voxel index. So `firstLeafID[b]` is +the smallest leaf index `i` such that leaf `i` has at least one active voxel +in block `b`. + +### jumpMap (uint64_t[blockCount * JumpMapLength]) + +`JumpMapLength = BlockWidth / 64`. The jumpMap for block `b` is a bitfield of +`BlockWidth` bits (stored as `JumpMapLength` uint64_t words) where bit `p` is +set if and only if a new leaf node **begins** at position `p` within block `b` +(i.e., some leaf's first active voxel has sequential index +`firstOffset + b*BlockWidth + p`, and `p > 0`). + +Bit 0 is never set (the leaf starting exactly at the block boundary is +recorded in `firstLeafID`, not the jumpMap). + +Together, `firstLeafID[b]` and the jumpMap for block `b` enumerate all leaf +nodes that overlap block `b`: start at `firstLeafID[b]`, then count the set +bits in the jumpMap to find how many additional leaves follow. + +--- + +## Build API + +### Handle: VoxelBlockManagerHandle + +A pure data holder. Owns two buffers (`firstLeafID` and `jumpMap`) and stores +`blockCount`, `firstOffset`, `lastOffset`. No allocation or build logic is in +the handle itself. A default-constructed handle is the canonical "null/empty" +state (`blockCount == 0`). + +Accessors: +- `blockCount()`, `firstOffset()`, `lastOffset()` +- `hostFirstLeafID()` / `hostJumpMap()` -- host-side pointers +- `deviceFirstLeafID()` / `deviceJumpMap()` -- device-side pointers (only when + BufferT has a device dual, e.g. 
`cuda::DeviceBuffer` or a unified buffer) + +### Two-overload pattern + +Both the CPU (`nanovdb::tools::`) and CUDA (`nanovdb::tools::cuda::`) build +functions follow the same two-overload pattern, mirroring the NodeManager +convention in NanoVDB: + +**Allocating overload** -- returns a new, fully-constructed handle: + + // CPU + auto handle = nanovdb::tools::buildVoxelBlockManager(grid); + + // CUDA + auto handle = nanovdb::tools::cuda::buildVoxelBlockManager(d_grid, stream); + +Optional parameters (with sentinel 0 meaning "derive from grid"): +- `firstOffset` -- defaults to 1 +- `lastOffset` -- defaults to `activeVoxelCount()` (CPU) or read from device + via `DeviceGridTraits::getActiveVoxelCount()` (CUDA) +- `nBlocks` -- defaults to `ceil((lastOffset - firstOffset + 1) / BlockWidth)` + +If `lastOffset < firstOffset` (e.g., empty grid), a default-constructed null +handle is returned immediately with no allocation attempted. + +**Rebuild-in-place overload** -- takes a pre-allocated handle by reference, +zeroes the jumpMap, and recomputes both arrays. No allocation: + + // CPU + nanovdb::tools::buildVoxelBlockManager(grid, handle); + + // CUDA + nanovdb::tools::cuda::buildVoxelBlockManager(d_grid, handle, stream); + +A null handle (`blockCount == 0`) is silently ignored (no-op). This overload +is the right choice for benchmarking or for rebuilding after a +topology-preserving update without paying allocation cost. + +The allocating overload delegates to the rebuild overload after allocating +buffers -- no logic duplication. + +--- + +## CPU vs GPU implementation asymmetry + +### GPU: launch at lower-node granularity + +A NanoVDB level-1 internal node ("lower node") has up to 4096 leaf child +slots (16^3). The GPU kernel launches one CTA per lower node (subdivided +into `SlicesPerLowerNode = 8` slices for additional parallelism), so each +thread handles approximately one leaf child slot per iteration. Threads check +the lower node's `childMask` to skip empty slots. + +This grouping is chosen because: +- It naturally sizes the CTA workload (4096 slots / 8 slices / 128 threads = 4 slots per thread) +- Threads in the same warp access leaves from the same lower node, improving + memory access locality +- The grid is `<<>>` + +The cost is wasted threads for sparse lower nodes (few active leaf children +out of 4096 slots). This is an acceptable trade-off on the GPU. + +### CPU: iterate leaves directly + +On the CPU there is no benefit to the lower-node grouping. The build uses +`std::for_each(std::execution::par, firstLeaf, firstLeaf + leafCount, ...)`, +iterating directly over the flat contiguous leaf array. Each task processes +exactly one leaf -- no child mask checks, no wasted iterations. + +Leaf index is computed by pointer arithmetic: `&leaf - firstLeaf`. + +### Why firstLeafID writes are race-free + +In a sequential NanoVDB grid, leaf offset ranges are non-overlapping and +ordered. A leaf that spans from block `a` to block `a+k` (max k=3 for +BlockWidth=128 and leaves with <=512 active voxels) "backward-fills" +`firstLeafID[a+1..a+k]`. No other leaf can start in a block before `a+k` +without its offset range overlapping leaf `a`'s range -- which is impossible. +Hence at most one leaf writes each `firstLeafID[b]` entry, so the writes +are non-atomic. + +The jumpMap writes, by contrast, require atomic OR because multiple leaves +from different parts of the tree can start at positions within the same block. 
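+
+As an illustration of how the two arrays combine on the consumer side (this is a hypothetical helper, not the shipped `decodeInverseMaps`), the leaf covering position `p` of block `b` is `firstLeafID[b]` plus the number of leaf-start bits at positions 1..p of the block's jumpMap. A host-side sketch, assuming `BlockWidth = 128`:
+
+    // Hypothetical helper, not part of the VBM API.
+    static uint32_t popcount64(uint64_t w) { uint32_t n = 0; for (; w; w &= w - 1) ++n; return n; }
+
+    uint32_t leafForPosition(const uint32_t* firstLeafID,
+                             const uint64_t* jumpMap,    // blockCount * JumpMapLength words
+                             uint32_t b, uint32_t p)     // 0 <= p < BlockWidth
+    {
+        constexpr uint32_t JumpMapLength = 2;            // BlockWidth / 64
+        const uint64_t* jm = jumpMap + (uint64_t)b * JumpMapLength;
+        uint32_t jumps = 0;
+        for (uint32_t w = 0; w <= (p >> 6); ++w) {
+            uint64_t word = jm[w];
+            if (w == (p >> 6) && (p & 63u) != 63u)       // keep only bits 0..p in p's word
+                word &= (uint64_t(1) << ((p & 63u) + 1u)) - 1u;
+            jumps += popcount64(word);                   // bit 0 is never set, so this counts bits 1..p
+        }
+        return firstLeafID[b] + jumps;
+    }
+
+The cooperative decode builds the full per-position map for a block instead of answering one query at a time, but the counting rule is the same.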
+ +--- + +## Kernel usage pattern (SIMT consumer) + +The typical VBM-powered kernel: + + __global__ void myKernel( + NanoGrid* grid, + const uint32_t* firstLeafID, + const uint64_t* jumpMap, + uint64_t firstOffset, uint64_t lastOffset, uint32_t nBlocks) + { + __shared__ uint32_t smem_leafIndex[BlockWidth]; + __shared__ uint16_t smem_voxelOffset[BlockWidth]; + + int blockID = blockIdx.x; + uint64_t blockFirstOffset = firstOffset + (uint64_t)blockID * BlockWidth; + + // Cooperative decode: all threads in the block participate + VoxelBlockManager::decodeInverseMaps( + grid, + firstLeafID[blockID], + &jumpMap[blockID * VoxelBlockManager::JumpMapLength], + blockFirstOffset, + smem_leafIndex, + smem_voxelOffset); + // smem_leafIndex[t] and smem_voxelOffset[t] now hold the leaf index + // and intra-leaf voxel offset for thread t's voxel. + // Entries beyond lastOffset are filled with UnusedLeafIndex / UnusedVoxelOffset. + + int t = threadIdx.x; + uint64_t globalIndex = blockFirstOffset + t; + if (globalIndex > lastOffset) return; + if (smem_leafIndex[t] == VoxelBlockManager::UnusedLeafIndex) return; + + // From here: access the voxel's 3D position, values, or stencil. + // VoxelBlockManager::computeBoxStencil(...) uses the same + // smem arrays to look up the 27 stencil neighbor indices. + } + + // Launch: one block per VBM block + myKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), + vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), + vbmHandle.lastOffset(), + (uint32_t)vbmHandle.blockCount()); + +Key points: +- `decodeInverseMaps` must be called by ALL threads in the block + (it uses `__syncthreads` internally). Do not call from divergent threads. +- `computeBoxStencil` does NOT synchronize and may be called per-thread. +- Voxels in the last partial block beyond `lastOffset` get sentinel values + (`UnusedLeafIndex = 0xffffffff`, `UnusedVoxelOffset = 0xffff`); always + guard with a `globalIndex <= lastOffset` check. + +--- + +## Files + +- `nanovdb/tools/VoxelBlockManager.h` -- Handle class, CPU build functions +- `nanovdb/tools/cuda/VoxelBlockManager.cuh` -- VoxelBlockManager device struct, + CUDA build functions +- `nanovdb/examples/ex_voxelBlockManager_host_cuda/` -- end-to-end example, + benchmarks CPU vs GPU build time, validates CPU/GPU metadata agreement, + validates full inverse map against grid structure diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda.cpp b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda.cpp new file mode 100644 index 0000000000..2ebc9370e0 --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda.cpp @@ -0,0 +1,96 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file vbm_host_cuda.cpp + + \brief Test harness for the VoxelBlockManager (CUDA reference implementation). + + Generates a random sparse domain at a configurable occupancy level using a + Morton-curve layout, builds a ValueOnIndex NanoVDB grid, constructs the + VoxelBlockManager on the GPU, decodes the full inverse map (leafIndex, + voxelOffset per active voxel), and validates the result on the host. 
+ + Usage: vbm_host_cuda [ambient_voxels [occupancy]] + ambient_voxels Total universe of voxel positions (default: 1048576) + occupancy Fraction of positions that are active in [0,1] (default: 0.5) +*/ + +#include + +#include +#include +#include +#include +#include + +void runVBMCudaTest(const std::vector& coords); + +/// @brief Unpack one component of a Morton-encoded index into a coordinate. +/// Keeps every third bit of the input, then packs them into a contiguous integer. +static uint32_t coordinate_bitpack(uint32_t x) +{ + x &= 0x49249249; + x |= (x >> 2); x &= 0xc30c30c3; + x |= (x >> 4); x &= 0x0f00f00f; + x |= (x >> 8); x &= 0xff0000ff; + x |= (x >> 16); x &= 0x0000ffff; + return x; +} + +/// @brief Generate active voxel coordinates at the requested occupancy level. +/// Voxels are drawn uniformly at random from a Morton-curve layout over +/// ambient_voxels positions, giving spatially coherent 3D coordinates. +static std::vector +generateDomain(int ambient_voxels, float occupancy, uint32_t seed = 42) +{ + const int target = (int)(occupancy * (float)ambient_voxels); + + std::mt19937 rng(seed); + std::uniform_int_distribution dist(0, ambient_voxels - 1); + + std::vector voxmap(ambient_voxels, false); + int active = 0; + while (active < target) { + int i = dist(rng); + if (!voxmap[i]) { voxmap[i] = true; ++active; } + } + + std::vector coords; + coords.reserve(active); + for (int i = 0; i < ambient_voxels; ++i) { + if (voxmap[i]) { + coords.emplace_back( + (int)coordinate_bitpack( i & 0x49249249), + (int)coordinate_bitpack((i >> 1) & 0x49249249), + (int)coordinate_bitpack((i >> 2) & 0x49249249)); + } + } + return coords; +} + +int main(int argc, char** argv) +{ + try { + int ambient_voxels = 8 * 1024 * 1024; + float occupancy = 0.25f; + + if (argc > 1) ambient_voxels = std::stoi(argv[1]); + if (argc > 2) occupancy = std::stof(argv[2]); + + occupancy = std::max(0.0f, std::min(1.0f, occupancy)); + + std::cout << "ambient_voxels = " << ambient_voxels << "\n" + << "occupancy = " << occupancy << "\n"; + + auto coords = generateDomain(ambient_voxels, occupancy); + std::cout << "Active voxels generated: " << coords.size() << "\n"; + + runVBMCudaTest(coords); + } + catch (const std::exception& e) { + std::cerr << "An exception occurred: \"" << e.what() << "\"\n"; + return 1; + } + return 0; +} diff --git a/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda_kernels.cu b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda_kernels.cu new file mode 100644 index 0000000000..91af736e9c --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_voxelBlockManager_host_cuda/vbm_host_cuda_kernels.cu @@ -0,0 +1,398 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file vbm_host_cuda_kernels.cu + + \brief CUDA implementation for the VoxelBlockManager test harness. + + Builds the VoxelBlockManager from a ValueOnIndex grid, decodes the full + inverse map (leafIndex, voxelOffset) for all active voxels, downloads the + result to the host, and validates it against the grid structure. + Also benchmarks CPU vs GPU VBM construction with repeated timed runs. 
+*/ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + + +#include +#include +#include +#include + +namespace { + +static constexpr int Log2BlockWidth = 7; +static constexpr int BlockWidth = 1 << Log2BlockWidth; // 128 + +using VBM = nanovdb::tools::cuda::VoxelBlockManager; +using CPUVBM = nanovdb::tools::VoxelBlockManager; + + +/// @brief For each VBM block, decode the inverse map and store +/// (leafIndex, voxelOffset) for every active voxel into global output arrays +/// indexed by (globalVoxelOffset - firstOffset). +/// Launch configuration: <<>> +__global__ void decodeAllBlocksKernel( + nanovdb::NanoGrid* grid, + const uint32_t* firstLeafID, + const uint64_t* jumpMap, + uint64_t firstOffset, + uint64_t lastOffset, + uint32_t nBlocks, + uint32_t* outLeafIndex, + uint16_t* outVoxelOffset) +{ + __shared__ uint32_t smem_leafIndex[BlockWidth]; + __shared__ uint16_t smem_voxelOffset[BlockWidth]; + + uint32_t bID = blockIdx.x; + if (bID >= nBlocks) return; + + uint64_t blockFirstOffset = firstOffset + (uint64_t)bID * BlockWidth; + + VBM::decodeInverseMaps( + grid, + firstLeafID[bID], + &jumpMap[(uint64_t)bID * VBM::JumpMapLength], + blockFirstOffset, + smem_leafIndex, + smem_voxelOffset); + + int tID = threadIdx.x; + uint64_t globalIdx = blockFirstOffset + tID; + + if (globalIdx <= lastOffset && smem_leafIndex[tID] != VBM::UnusedLeafIndex) { + uint64_t k = globalIdx - firstOffset; // 0-based + outLeafIndex[k] = smem_leafIndex[tID]; + outVoxelOffset[k] = smem_voxelOffset[tID]; + } +} + +/// @brief For each VBM block, decode the inverse map into shared memory and +/// write a token value per block to dummyOut to prevent dead code elimination. +/// No per-voxel output is written to global memory. +/// Launch configuration: <<>> +__global__ void benchDecodeKernel( + nanovdb::NanoGrid* grid, + const uint32_t* firstLeafID, + const uint64_t* jumpMap, + uint64_t firstOffset, + uint64_t lastOffset, + uint32_t nBlocks, + uint32_t* dummyOut) +{ + __shared__ uint32_t smem_leafIndex[BlockWidth]; + __shared__ uint16_t smem_voxelOffset[BlockWidth]; + + uint32_t bID = blockIdx.x; + if (bID >= nBlocks) return; + + uint64_t blockFirstOffset = firstOffset + (uint64_t)bID * BlockWidth; + + VBM::decodeInverseMaps( + grid, + firstLeafID[bID], + &jumpMap[(uint64_t)bID * VBM::JumpMapLength], + blockFirstOffset, + smem_leafIndex, + smem_voxelOffset); + + // Token write from thread 0 only to prevent dead code elimination + if (threadIdx.x == 0) { + dummyOut[bID] = + (smem_leafIndex[0] != VBM::UnusedLeafIndex ? 1u : 0u) + + (smem_voxelOffset[0] != VBM::UnusedVoxelOffset ? 
1u : 0u); + } +} + +} // anonymous namespace + +void runVBMCudaTest(const std::vector& coords) +{ + const uint64_t nCoords = coords.size(); + + // --- Build ValueOnIndex grid on GPU --- + nanovdb::Coord* d_coords = nullptr; + cudaCheck(cudaMalloc(&d_coords, nCoords * sizeof(nanovdb::Coord))); + cudaCheck(cudaMemcpy(d_coords, coords.data(), + nCoords * sizeof(nanovdb::Coord), cudaMemcpyHostToDevice)); + + auto handle = nanovdb::tools::cuda::voxelsToGrid( + d_coords, nCoords); + cudaCheck(cudaFree(d_coords)); + + auto* d_grid = handle.deviceGrid(); + if (!d_grid) throw std::runtime_error("Failed to create device grid"); + + // Download grid to host for validation and CPU build + handle.deviceDownload(); + auto* h_grid = handle.grid(); + if (!h_grid) throw std::runtime_error("Failed to download host grid"); + + const auto& tree = h_grid->tree(); + const uint64_t nVoxels = h_grid->activeVoxelCount(); + const uint32_t nBlocks = (uint32_t)((nVoxels + BlockWidth - 1) >> Log2BlockWidth); + + std::cout << "Active voxels (unique): " << nVoxels << "\n" + << "Leaf nodes : " << tree.nodeCount(0) << "\n" + << "Lower nodes : " << tree.nodeCount(1) << "\n" + << "Upper nodes : " << tree.nodeCount(2) << "\n" + << "VBM blocks : " << nBlocks + << " (BlockWidth=" << BlockWidth << ")\n\n"; + + // --- Benchmark VBM construction: GPU vs CPU --- + // Allocate handles once; timing runs reuse the buffers (memset + kernel only, + // no allocation overhead). First run per device serves as warmup - important + // for unified-memory buffers where the first access triggers page migration. + static constexpr int nRuns = 5; + + auto gpuHandle = nanovdb::tools::cuda::buildVoxelBlockManager(d_grid); + auto cpuHandle = nanovdb::tools::buildVoxelBlockManager(h_grid); + + // GPU build (cudaMemsetAsync + kernel, pre-allocated buffers) + { + float minMs = std::numeric_limits::max(); + for (int i = 0; i < nRuns; ++i) { + cudaCheck(cudaDeviceSynchronize()); // ensure stream is idle before timing + nanovdb::util::cuda::Timer gpuTimer; + nanovdb::tools::cuda::buildVoxelBlockManager(d_grid, gpuHandle); + float ms = gpuTimer.elapsed(); // records stop event and synchronizes + if (i > 0) minMs = std::min(minMs, ms); + } + std::cout << "GPU buildVoxelBlockManager (memset+kernel): min " << minMs + << " ms (over " << nRuns-1 << " post-warmup runs)\n"; + } + + // CPU build (std::memset + std::for_each(par), pre-allocated buffers) + { + float minMs = std::numeric_limits::max(); + for (int i = 0; i < nRuns; ++i) { + nanovdb::util::Timer cpuTimer; + cpuTimer.start(""); + nanovdb::tools::buildVoxelBlockManager(h_grid, cpuHandle); + float ms = (float)cpuTimer.elapsed() / 1000.0f; + if (i > 0) minMs = std::min(minMs, ms); + } + std::cout << "CPU buildVoxelBlockManager (memset+forEachPar): min " << minMs + << " ms (over " << nRuns-1 << " post-warmup runs)\n\n"; + } + + // --- Validate CPU build against GPU build --- + // Download GPU metadata to host and compare byte-for-byte with the CPU handle. 
+ { + const uint64_t firstLeafIDBytes = gpuHandle.blockCount() * sizeof(uint32_t); + const uint64_t jumpMapBytes = gpuHandle.blockCount() * (BlockWidth / 64) * sizeof(uint64_t); + + std::vector gpuFirstLeafID(gpuHandle.blockCount()); + std::vector gpuJumpMap(gpuHandle.blockCount() * (BlockWidth / 64)); + + cudaCheck(cudaMemcpy(gpuFirstLeafID.data(), gpuHandle.deviceFirstLeafID(), + firstLeafIDBytes, cudaMemcpyDeviceToHost)); + cudaCheck(cudaMemcpy(gpuJumpMap.data(), gpuHandle.deviceJumpMap(), + jumpMapBytes, cudaMemcpyDeviceToHost)); + + const bool firstLeafIDMatch = std::memcmp(gpuFirstLeafID.data(), + cpuHandle.hostFirstLeafID(), firstLeafIDBytes) == 0; + const bool jumpMapMatch = std::memcmp(gpuJumpMap.data(), + cpuHandle.hostJumpMap(), jumpMapBytes) == 0; + + if (firstLeafIDMatch && jumpMapMatch) + std::cout << "CPU/GPU metadata match: PASSED\n"; + else + std::cerr << "CPU/GPU metadata match: FAILED" + << (firstLeafIDMatch ? "" : " [firstLeafID mismatch]") + << (jumpMapMatch ? "" : " [jumpMap mismatch]") << "\n"; + } + + // gpuHandle is the last-built GPU VBM; use it for decode/validation below + auto& vbmHandle = gpuHandle; + + // --- Decode all blocks into unified memory --- + // Using thrust::universal_vector so the kernel writes directly into host-accessible + // memory, avoiding a separate cudaMemcpy download step. + thrust::universal_vector outLeafIndex(nVoxels); + thrust::universal_vector outVoxelOffset(nVoxels); + + decodeAllBlocksKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), + vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), vbmHandle.lastOffset(), nBlocks, + outLeafIndex.data().get(), outVoxelOffset.data().get()); + cudaCheckError(); + cudaCheck(cudaDeviceSynchronize()); + + // --- Validate GPU decodeInverseMaps on host --- + // For each active voxel index k+1, the decoded (leafIndex, voxelOffset) + // must satisfy: leaf[leafIndex].getValue(voxelOffset) == k+1 + { + const auto* firstLeaf = tree.getFirstNode<0>(); + uint64_t errors = 0; + + for (uint64_t k = 0; k < nVoxels; ++k) { + const uint32_t leafIdx = outLeafIndex[k]; + const uint16_t voxelOff = outVoxelOffset[k]; + const uint64_t expectedIdx = k + vbmHandle.firstOffset(); + const uint64_t decodedIdx = firstLeaf[leafIdx].getValue(voxelOff); + + if (decodedIdx != expectedIdx) { + if (errors < 5) + std::cerr << "ERROR at k=" << k + << ": expected index " << expectedIdx + << ", decoded " << decodedIdx + << " (leaf=" << leafIdx + << " voxelOff=" << voxelOff << ")\n"; + ++errors; + } + } + + if (errors == 0) + std::cout << "GPU decodeInverseMaps: PASSED (all " << nVoxels + << " entries validated)\n"; + else + std::cerr << "GPU decodeInverseMaps: FAILED (" << errors << " / " + << nVoxels << " entries incorrect)\n"; + } + + // --- Validate CPU decodeInverseMaps against GPU results --- + // cpuHandle metadata was already verified to agree with gpuHandle above. + // Run the CPU decode block-by-block and compare against the unified-memory + // GPU output (which is now host-accessible after the sync above). 
+ { + const uint32_t* firstLeafIDPtr = cpuHandle.hostFirstLeafID(); + const uint64_t* jumpMapPtr = cpuHandle.hostJumpMap(); + + std::vector cpuLeafIndex(BlockWidth); + std::vector cpuVoxelOffset(BlockWidth); + uint64_t errors = 0; + + for (uint32_t bID = 0; bID < nBlocks; ++bID) { + const uint64_t blockFirstOffset = + cpuHandle.firstOffset() + (uint64_t)bID * BlockWidth; + + CPUVBM::decodeInverseMaps( + h_grid, + firstLeafIDPtr[bID], + &jumpMapPtr[(uint64_t)bID * CPUVBM::JumpMapLength], + blockFirstOffset, + cpuLeafIndex.data(), + cpuVoxelOffset.data()); + + for (int tID = 0; tID < BlockWidth; ++tID) { + const uint64_t k = (uint64_t)bID * BlockWidth + tID; + if (k >= nVoxels) break; + if (cpuLeafIndex[tID] != outLeafIndex[k] || + cpuVoxelOffset[tID] != outVoxelOffset[k]) { + if (errors < 5) + std::cerr << "CPU/GPU decodeInverseMaps mismatch at k=" << k + << ": cpu=(" << cpuLeafIndex[tID] << "," + << cpuVoxelOffset[tID] << ") gpu=(" + << outLeafIndex[k] << "," << outVoxelOffset[k] + << ")\n"; + ++errors; + } + } + } + + if (errors == 0) + std::cout << "CPU decodeInverseMaps vs GPU: PASSED\n"; + else + std::cerr << "CPU decodeInverseMaps vs GPU: FAILED (" + << errors << " mismatches)\n"; + } + + // --- GPU decodeInverseMaps performance benchmark --- + // benchDecodeKernel decodes into shared memory only; one token uint32_t per + // block is written to dummyOut to prevent dead code elimination. + { + thrust::universal_vector dummyOut(nBlocks, 0); + + static constexpr int nPerfRuns = 20; + + // Warmup launch (first kernel invocation may incur JIT / page-migration cost) + benchDecodeKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), vbmHandle.lastOffset(), nBlocks, + dummyOut.data().get()); + cudaCheck(cudaDeviceSynchronize()); + + std::cout << "\nGPU decodeInverseMaps performance (" + << nBlocks << " blocks, " << nVoxels << " voxels):\n"; + + for (int run = 0; run < nPerfRuns; ++run) { + cudaCheck(cudaDeviceSynchronize()); + nanovdb::util::cuda::Timer gpuTimer; + benchDecodeKernel<<>>( + d_grid, + vbmHandle.deviceFirstLeafID(), vbmHandle.deviceJumpMap(), + vbmHandle.firstOffset(), vbmHandle.lastOffset(), nBlocks, + dummyOut.data().get()); + const float ms = gpuTimer.elapsed(); + std::cout << " run " << run << ": " << ms << " ms\n"; + } + + // Sum dummyOut on host (unified memory, already sync'd by last gpuTimer.elapsed()) + uint64_t dummy = 0; + for (uint32_t i = 0; i < nBlocks; ++i) dummy += dummyOut[i]; + std::cout << " (dummy=" << dummy << ")\n"; + } + + // --- CPU decodeInverseMaps performance benchmark --- + // Scratch arrays are reused across all blocks so the only memory traffic is + // reading VBM metadata and leaf data -- not writing nVoxels of output. + // Token reads of scratch[0] after each block defeat dead code elimination + // without adding meaningful cost. 
+ { + const uint32_t* firstLeafIDPtr = cpuHandle.hostFirstLeafID(); + const uint64_t* jumpMapPtr = cpuHandle.hostJumpMap(); + + static constexpr int nPerfRuns = 20; + + std::atomic dummy{0}; + + std::cout << "\nCPU decodeInverseMaps performance (" + << nBlocks << " blocks, " << nVoxels << " voxels):\n"; + + for (int run = 0; run < nPerfRuns; ++run) { + nanovdb::util::Timer timer; + timer.start(""); + + nanovdb::util::forEach(0, nBlocks, 1, + [&](const nanovdb::util::Range1D& range) { + alignas(64) uint32_t scratchLeafIndex[BlockWidth]; + alignas(64) uint16_t scratchVoxelOffset[BlockWidth]; + for (auto bID = range.begin(); bID < range.end(); ++bID) { + const uint64_t blockFirstOffset = + cpuHandle.firstOffset() + (uint64_t)bID * BlockWidth; + + CPUVBM::decodeInverseMaps( + h_grid, + firstLeafIDPtr[bID], + &jumpMapPtr[(uint64_t)bID * CPUVBM::JumpMapLength], + blockFirstOffset, + scratchLeafIndex, + scratchVoxelOffset); + + dummy += (scratchLeafIndex[0] != CPUVBM::UnusedLeafIndex) ? 1u : 0u; + dummy += (scratchVoxelOffset[0] != CPUVBM::UnusedVoxelOffset) ? 1u : 0u; + } + }); + + const float ms = + (float)timer.elapsed() / 1000.0f; + std::cout << " run " << run << ": " << ms << " ms\n"; + } + std::cout << " (dummy=" << dummy.load() << ")\n"; + } +} diff --git a/nanovdb/nanovdb/examples/ex_weno_nanovdb_cpu/weno_nanovdb_cpu.cpp b/nanovdb/nanovdb/examples/ex_weno_nanovdb_cpu/weno_nanovdb_cpu.cpp new file mode 100644 index 0000000000..64d20debfe --- /dev/null +++ b/nanovdb/nanovdb/examples/ex_weno_nanovdb_cpu/weno_nanovdb_cpu.cpp @@ -0,0 +1,530 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file weno_nanovdb_cpu.cpp + + \brief End-to-end CPU WENO5 norm-square-gradient on a narrow-band level + set, with a scalar reference for correctness validation. + + End-to-end pipeline: + + VBM decode -> per-batch sidecar value assembly -> out-of-band + sign-extrapolation -> SIMD Godunov WENO5 -> per-voxel |grad phi|^2 + output sidecar. + + Two passes run over the same .vdb input: + + reference : per-voxel scalar nanovdb::math::WenoStencil>::normSqGrad. + Tile values and in-leaf inactive values preserved through the + OpenVDB -> NanoVDB conversion carry correctly-signed + extrapolation "for free", matching our explicit extrapolate() + semantics on in-the-band-typical topology. + + fast : WenoStencil::gatherIndices() -> per-tap SIMD load -> + extrapolate() -> normSqGrad() -> per-lane scalar store. + + Both passes write to the same-shape output buffer, keyed by ValueOnIndex + slot; a histogram of |outputRef - outputFast| follows. + + Usage: + weno_nanovdb_cpu [--grid=] + [--threads=] + [--skip-validation] + + Build: + Configured via CMakeLists.txt in the parent examples/ directory. + Requires OpenVDB (for .vdb IO). No CUDA. 
+*/ + +#include +#include +#include +#include +#include +#include +#include +#include // scalar reference WenoStencil + +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +// ============================================================ +// Constants and type aliases +// ============================================================ + +static constexpr int Log2BlockWidth = 7; +static constexpr int BlockWidth = 1 << Log2BlockWidth; // 128 +static constexpr int SIMDw = 16; // float lane width + +using BuildT = nanovdb::ValueOnIndex; +using IndexGridT = nanovdb::NanoGrid; +using LeafT = nanovdb::NanoLeaf; +using FloatGridT = nanovdb::NanoGrid; +using CPUVBM = nanovdb::tools::VoxelBlockManager; + +// ============================================================ +// VDB loading and NanoVDB conversion +// ============================================================ + +static openvdb::FloatGrid::Ptr +loadFloatGridFromVdb(const std::string& path, const std::string& gridName) +{ + openvdb::io::File file(path); + file.open(false); // delayed loading off + + openvdb::GridBase::Ptr base; + if (!gridName.empty()) { + if (!file.hasGrid(gridName)) + throw std::runtime_error("no grid named \"" + gridName + "\" in " + path); + base = file.readGrid(gridName); + } else { + openvdb::GridPtrVecPtr grids = file.getGrids(); + for (auto& g : *grids) { + if (g && g->isType()) { base = g; break; } + } + if (!base) throw std::runtime_error("no openvdb::FloatGrid found in " + path); + } + file.close(); + + auto floatGrid = openvdb::gridPtrCast(base); + if (!floatGrid) throw std::runtime_error("grid is not an openvdb::FloatGrid"); + return floatGrid; +} + +/// NanoVDB conversion products shared across the two passes. +/// - floatHandle : NanoGrid -- tile values + in-leaf inactive values +/// preserved verbatim, used by the scalar reference stencil. +/// - indexHandle : NanoGrid -- the topology-only index grid. +/// - sidecar : float sidecar (slot 0 = background, slots 1..N = active +/// voxel values in NanoVDB indexing order). +struct ConvertedGrids { + nanovdb::GridHandle floatHandle; + nanovdb::GridHandle indexHandle; + std::vector sidecar; +}; + +static ConvertedGrids +convertFloatGrid(openvdb::FloatGrid& floatGrid) +{ + ConvertedGrids out; + + // Direct OpenVDB -> NanoVDB float conversion. No flags needed: + // - tile values at internal nodes are stored directly (mTable); + // - in-leaf inactive voxels share storage with active ones (dense 8^3). + // Both are preserved verbatim from the source grid. + out.floatHandle = nanovdb::tools::createNanoGrid(floatGrid); + + // Sidecar pipeline: build ValueOnIndex + float sidecar via the CreateNanoGrid builder. + nanovdb::tools::CreateNanoGrid builder(floatGrid); + out.indexHandle = builder.template getHandle< + nanovdb::ValueOnIndex, nanovdb::HostBuffer>( + /*channels =*/ 0u, + /*incStats =*/ false, + /*incTiles =*/ false); + out.sidecar.resize(builder.valueCount()); + builder.template copyValues(out.sidecar.data()); + + // NanoVDB convention: slot 0 is the "not-found / background" sentinel. + // copyValues doesn't touch it. Set it to the grid's background so the + // fast path can gather unconditionally (no per-lane branch on idx==0 during fill). 
+ if (!out.sidecar.empty()) out.sidecar[0] = floatGrid.background(); + + return out; +} + +// ============================================================ +// Reference pass -- scalar WenoStencil per active voxel +// ============================================================ +// +// Uses nanovdb::math::WenoStencil>. Its moveTo(ijk) +// populates 19 taps via the float grid's accessor; taps outside the +// narrow band hit either in-leaf inactive slots (stored +-background by +// the OpenVDB narrow-band builder) or tile values at internal nodes +// (same convention). Both cases yield correctly-signed extrapolated +// values, matching the semantics of our explicit extrapolate(). +// +// Cross-path indexing: both outputRef and outputFast are indexed by +// the ValueOnIndex slot of a voxel. For each active ijk we compute +// normSqGrad on the float grid, then resolve its output slot via the +// index grid's accessor. + +static double +runReference(const FloatGridT& floatGrid, + const IndexGridT& indexGrid, + std::vector& outputRef) +{ + std::fill(outputRef.begin(), outputRef.end(), 0.f); + + const uint32_t nLeaves = indexGrid.tree().nodeCount(0); + + std::ostringstream sink; + nanovdb::util::Timer timer; + auto timeIt = [&](auto&& body) -> double { + timer.start("", sink); + body(); + return static_cast(timer.elapsed()); + }; + + return timeIt([&] { + nanovdb::util::forEach(uint32_t(0), nLeaves, uint32_t(1), + [&](const nanovdb::util::Range1D& range) { + // One scalar stencil + one index accessor per TBB task. + nanovdb::math::WenoStencil stencil(floatGrid); + auto indexAcc = indexGrid.getAccessor(); + + const auto* firstFloatLeaf = + floatGrid.tree().template getFirstNode<0>(); + const auto* firstIndexLeaf = + indexGrid.tree().template getFirstNode<0>(); + + for (uint32_t lid = range.begin(); lid != range.end(); ++lid) { + // The two grids share topology, so leaf LID in the + // index grid aligns with leaf LID in the float grid + // (same order of insertion). Iterate the index grid's + // active voxels -- those are the slots we need to fill. 
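+                    // (This topology assumption can be spot-checked in a debug
+                    // build by comparing firstFloatLeaf[lid].origin() against
+                    // firstIndexLeaf[lid].origin(); no such check is needed for
+                    // the computation below.)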
+ const auto& indexLeaf = firstIndexLeaf[lid]; + (void)firstFloatLeaf; // stencil.moveTo uses its own acc + + for (auto it = indexLeaf.beginValueOn(); it; ++it) { + const nanovdb::Coord ijk = it.getCoord(); + stencil.moveTo(ijk); + const float r = stencil.normSqGrad(/*iso=*/0.f); + const uint64_t idx = indexAcc.getValue(ijk); + outputRef[idx] = r; + } + } + }); + }); +} + +// ============================================================ +// Fast pass -- WenoStencil::gatherIndices + WenoStencil compute +// ============================================================ +// +// Structure: +// for each VBM block: +// decodeInverseMaps -> leafIndex[128], voxelOffset[128] +// for each batch of SIMDw voxels: +// fill: scalar scatter from sidecar into raw_values[SIZE][SIMDw] +// via WenoStencil::gatherIndices() per voxel +// load: per-tap SIMD load into stencil.values[] / isActive[] +// extrapolate (sign-fix OOB lanes in-place, Simd) +// normSqGrad -> FloatV +// store: per-lane scalar write to outputFast[blockBase + p] + +static double +runFast(const IndexGridT& indexGrid, + const nanovdb::tools::VoxelBlockManagerHandle& vbmHandle, + const std::vector& sidecar, + std::vector& outputFast) +{ + std::fill(outputFast.begin(), outputFast.end(), 0.f); + + const LeafT* firstLeaf = indexGrid.tree().template getFirstNode<0>(); + const uint32_t nBlocks = (uint32_t)vbmHandle.blockCount(); + const uint32_t* firstLeafID = vbmHandle.hostFirstLeafID(); + const uint64_t* jumpMap = vbmHandle.hostJumpMap(); + const uint64_t firstOffset = vbmHandle.firstOffset(); + + const float dx = float(indexGrid.voxelSize()[0]); + + std::ostringstream sink; + nanovdb::util::Timer timer; + auto timeIt = [&](auto&& body) -> double { + timer.start("", sink); + body(); + return static_cast(timer.elapsed()); + }; + + return timeIt([&] { + nanovdb::util::forEach(size_t(0), size_t(nBlocks), size_t(1), + [&](const nanovdb::util::Range1D& range) { + using FloatV = nanovdb::util::experimental::Simd ; + using MaskV = nanovdb::util::experimental::SimdMask; + using StencilT = nanovdb::WenoStencil; + constexpr int SIZE = StencilT::size(); + + alignas(64) uint32_t leafIndex[BlockWidth]; + alignas(64) uint16_t voxelOffset[BlockWidth]; + + // Caller-owned fill-side scratch -- scalar scatter from the + // sidecar lands here, then a per-tap SIMD load moves the + // data into the stencil's Simd compute view. + alignas(64) float raw_values[SIZE][SIMDw]; + alignas(64) bool raw_active[SIZE][SIMDw]; + + StencilT stencil(dx); + + // One leaf-only ReadAccessor per TBB task; cache stays warm + // across the SIZE getValue calls in WenoStencil::gatherIndices(). 
+ nanovdb::ReadAccessor acc(indexGrid.tree().root()); + + const float* const scIn = sidecar.data(); + float* const scOut = outputFast.data(); + + for (size_t bID = range.begin(); bID != range.end(); ++bID) { + CPUVBM::decodeInverseMaps( + &indexGrid, firstLeafID[bID], + &jumpMap[bID * CPUVBM::JumpMapLength], + firstOffset + bID * BlockWidth, + leafIndex, voxelOffset); + + const uint64_t blockBase = + firstOffset + (uint64_t)bID * BlockWidth; + + for (int batchStart = 0; batchStart < BlockWidth; batchStart += SIMDw) { + // -------- Fill: per-voxel gatherIndices + sidecar lookup -------- + for (int i = 0; i < SIMDw; ++i) { + const int p = batchStart + i; + + if (leafIndex[p] == CPUVBM::UnusedLeafIndex) { + for (int k = 0; k < SIZE; ++k) { + raw_values[k][i] = 0.f; + raw_active[k][i] = false; + } + continue; + } + + const uint16_t vo = voxelOffset[p]; + const uint32_t li = leafIndex[p]; + const nanovdb::Coord cOrigin = firstLeaf[li].origin(); + const int lx = (vo >> 6) & 7, ly = (vo >> 3) & 7, lz = vo & 7; + const nanovdb::Coord center = + cOrigin + nanovdb::Coord(lx, ly, lz); + + uint64_t indices[SIZE]; + nanovdb::WenoStencil::gatherIndices(acc, center, indices); + for (int k = 0; k < SIZE; ++k) { + raw_values[k][i] = scIn[indices[k]]; + raw_active[k][i] = (indices[k] != 0); + } + } + + // -------- Load: per-tap SIMD load into stencil view -------- + for (int k = 0; k < SIZE; ++k) { + stencil.values [k] = FloatV(raw_values[k], nanovdb::util::experimental::element_aligned); + stencil.isActive[k] = MaskV (raw_active[k], nanovdb::util::experimental::element_aligned); + } + + // -------- Phase-3 arithmetic (in-place on Simd values) -------- + stencil.extrapolate(); + const FloatV result = stencil.normSqGrad(/*iso=*/0.f); + + // -------- Simd -> scalar bridge + per-lane store -------- + alignas(64) float result_lanes[SIMDw]; + result.copy_to(result_lanes, nanovdb::util::experimental::element_aligned); + for (int i = 0; i < SIMDw; ++i) { + const int p = batchStart + i; + if (leafIndex[p] == CPUVBM::UnusedLeafIndex) continue; + scOut[blockBase + p] = result_lanes[i]; + } + } + } + }); + }); +} + +// ============================================================ +// Histogram comparison +// ============================================================ +// +// Per-index |outputRef[i] - outputFast[i]| over all active voxels +// (index 1..N; slot 0 is the background/no-op). Log-decade bins +// from 0 to 1e+1 plus a tail bucket for anything >= 10. +// +// Expected shape: the two leftmost bins ([0,1e-8), [1e-8,1e-7)) hold +// the overwhelming majority -- FP-rounding / FMA-fusion differences. +// Anything to the right of [1e-5,1e-4) warrants investigation. 
+ +static void +reportHistogram(const std::vector& outputRef, + const std::vector& outputFast, + uint64_t nActive) +{ + // Bucket edges: 0, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e+0, 1e+1 + static constexpr int nBuckets = 12; + static constexpr double edges[nBuckets + 1] = { + 0.0, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, + 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0 + }; + static const char* labels[nBuckets] = { + "[0, 1e-10)", + "[1e-10, 1e-9 )", + "[1e-9, 1e-8 )", + "[1e-8, 1e-7 )", + "[1e-7, 1e-6 )", + "[1e-6, 1e-5 )", + "[1e-5, 1e-4 )", + "[1e-4, 1e-3 )", + "[1e-3, 1e-2 )", + "[1e-2, 1e-1 )", + "[1e-1, 1.0 )", + "[1.0, 10.0 )" + }; + + std::array counts{}; // last bucket = [10, inf) + double sumDelta = 0.0; + float maxDelta = 0.f; + uint64_t worstIdx = 0; + uint64_t counted = 0; + + for (uint64_t i = 1; i <= nActive; ++i) { + const float d = std::abs(outputRef[i] - outputFast[i]); + ++counted; + sumDelta += double(d); + if (d > maxDelta) { maxDelta = d; worstIdx = i; } + + int b = nBuckets; // overflow bucket + for (int k = 0; k < nBuckets; ++k) { + if (double(d) < edges[k + 1]) { b = k; break; } + } + ++counts[b]; + } + + std::printf("\n|Delta| histogram across %lu active voxels" + " (outputRef vs outputFast):\n", counted); + for (int k = 0; k < nBuckets; ++k) { + const double pct = counted ? 100.0 * double(counts[k]) / double(counted) : 0.0; + std::printf(" %-18s : %12lu (%6.2f%%)\n", + labels[k], counts[k], pct); + } + const double pctTail = counted ? 100.0 * double(counts[nBuckets]) / double(counted) : 0.0; + std::printf(" %-18s : %12lu (%6.2f%%)\n", + "[10.0, inf )", counts[nBuckets], pctTail); + + std::printf("\n max |Delta| = %.6g (at slot %lu:" + " ref=%.6g, fast=%.6g)\n", + double(maxDelta), worstIdx, + double(outputRef[worstIdx]), + double(outputFast[worstIdx])); + std::printf(" mean |Delta| = %.6g\n", + counted ? sumDelta / double(counted) : 0.0); +} + +// ============================================================ +// Entry point +// ============================================================ + +static void printUsage(const char* argv0) +{ + std::cerr + << "Usage: " << argv0 << " " + << " [--grid=] [--threads=]\n" + << "\n" + << " Input OpenVDB file (single FloatGrid narrow-band)\n" + << " --grid= Select grid by name (default: first FloatGrid)\n" + << " --threads= Limit TBB parallelism (0 = TBB default)\n"; +} + +int main(int argc, char** argv) +{ + try { + if (argc < 2 || std::string(argv[1]) == "--help" + || std::string(argv[1]) == "-h") { + printUsage(argv[0]); + return argc < 2 ? 1 : 0; + } + + std::string vdbPath = argv[1]; + std::string gridName = ""; + int nThreads = 0; + + for (int i = 2; i < argc; ++i) { + std::string a = argv[i]; + if (a.rfind("--grid=", 0) == 0) gridName = a.substr(7); + else if (a.rfind("--threads=", 0) == 0) nThreads = std::stoi(a.substr(10)); + else { printUsage(argv[0]); return 1; } + } + + std::cout << "vdb path = " << vdbPath << "\n" + << "grid name = " << (gridName.empty() ? "(first FloatGrid)" : gridName) << "\n" + << "threads = " << (nThreads > 0 ? std::to_string(nThreads) + : std::string("(TBB default)")) << "\n"; + + // ---- Load the .vdb and convert to both NanoVDB representations ---- + openvdb::initialize(); + auto floatGrid = loadFloatGridFromVdb(vdbPath, gridName); + + const auto bbox = floatGrid->evalActiveVoxelBoundingBox(); + const auto vsize = floatGrid->voxelSize(); + std::cout << "FloatGrid:\n" + << " active voxels = " << floatGrid->activeVoxelCount() << "\n" + << " bbox = [" << bbox.min() << " .. 
" << bbox.max() << "]\n" + << " voxel size = " << vsize << "\n" + << " background = " << floatGrid->background() << "\n"; + + auto payload = convertFloatGrid(*floatGrid); + auto* nanoFloatGrid = payload.floatHandle.grid(); + auto* indexGrid = payload.indexHandle.grid(); + if (!nanoFloatGrid || !indexGrid) + throw std::runtime_error("NanoVDB conversion failed"); + + const auto& tree = indexGrid->tree(); + std::cout << "NanoVDB:\n" + << " leaves = " << tree.nodeCount(0) << "\n" + << " active voxels = " << indexGrid->activeVoxelCount() << "\n" + << " sidecar size = " << payload.sidecar.size() << "\n"; + + // ---- VBM for the fast path ---- + auto vbmHandle = nanovdb::tools::buildVoxelBlockManager(indexGrid); + std::cout << "VBM:\n" + << " blocks = " << vbmHandle.blockCount() + << " (BlockWidth=" << BlockWidth << ")\n\n"; + + // ---- TBB thread cap for timings ---- + std::unique_ptr tbbLimit; + if (nThreads > 0) { + tbbLimit = std::make_unique( + tbb::global_control::max_allowed_parallelism, (size_t)nThreads); + } + + // ---- Output buffers ---- + std::vector outputRef (payload.sidecar.size(), 0.f); + std::vector outputFast(payload.sidecar.size(), 0.f); + + // ---- Run both passes (warm + timed) ---- + // Warm pass (ignored) for both, then one timed pass each. + (void)runReference(*nanoFloatGrid, *indexGrid, outputRef); + (void)runFast(*indexGrid, vbmHandle, payload.sidecar, outputFast); + + const double refUs = runReference(*nanoFloatGrid, *indexGrid, outputRef); + const double fastUs = runFast(*indexGrid, vbmHandle, payload.sidecar, outputFast); + + const uint64_t nActive = indexGrid->activeVoxelCount(); + const double refNs = refUs * 1e3 / double(nActive); + const double fastNs = fastUs * 1e3 / double(nActive); + + std::printf("\nEnd-to-end WENO5 |grad phi|^2 (%lu active voxels):\n", nActive); + std::printf(" reference (scalar): %9.1f ms (%7.1f ns/voxel)\n", + refUs / 1e3, refNs); + std::printf(" fast (VBM+SIMD) : %9.1f ms (%7.2f ns/voxel) speedup: %.1fx\n", + fastUs / 1e3, fastNs, refUs / std::max(fastUs, 1.0)); + + // ---- Histogram of discrepancies ---- + reportHistogram(outputRef, outputFast, nActive); + + } catch (const std::exception& e) { + std::cerr << "Exception: " << e.what() << "\n"; + return 1; + } + return 0; +} diff --git a/nanovdb/nanovdb/math/Math.h b/nanovdb/nanovdb/math/Math.h index 19b1beee81..a3ee0df993 100644 --- a/nanovdb/nanovdb/math/Math.h +++ b/nanovdb/nanovdb/math/Math.h @@ -172,6 +172,11 @@ __hostdev__ inline double Max(double a, double b) { return fmax(a, b); } +template +__hostdev__ inline T Select(bool m, T a, T b) +{ + return m ? a : b; +} __hostdev__ inline float Clamp(float x, float a, float b) { return Max(Min(x, b), a); diff --git a/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh b/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh index 02f5180f67..d53839cab3 100644 --- a/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh +++ b/nanovdb/nanovdb/tools/cuda/VoxelBlockManager.cuh @@ -30,6 +30,7 @@ #include #include #include +#include namespace nanovdb { @@ -166,6 +167,144 @@ struct VoxelBlockManager : nanovdb::tools::VoxelBlockManagerBase } } } + + /// @brief Auxiliary type holding the resolved neighbor leaf pointers for + /// the WENO5 stencil. ptrs[axis][0] is the lo neighbor along that axis + /// (nullptr if outside the narrow band), ptrs[axis][1] is always the + /// center leaf, and ptrs[axis][2] is the hi neighbor (nullptr if outside). 
+ template + struct WenoLeafPtrs { + const NanoLeaf* ptrs[3][3]; + }; + + /// @brief Resolve the neighbor leaf pointers needed by computeWenoStencil. + /// Performs exactly one probeLeaf call per axis (three total). Safe to call + /// per-thread; does not synchronize. + /// @tparam BuildT Build type of the grid (must be an index type) + /// @param grid Device-resident grid + /// @param leaf Center leaf node for the current voxel + /// @param voxelOffset Intra-leaf voxel offset for the current voxel + /// @return WenoLeafPtrs with center entries set to &leaf and lo/hi entries + /// set to the probeLeaf result (nullptr if outside the narrow band) + template + __device__ + static typename util::enable_if::is_index, WenoLeafPtrs>::type + resolveWenoLeafPtrs( + const NanoGrid* grid, + const NanoLeaf& leaf, + uint16_t voxelOffset) + { + WenoLeafPtrs result; + const auto coord = leaf.offsetToGlobalCoord(voxelOffset); + const auto localCoord = leaf.OffsetToLocalCoord(voxelOffset); + const auto& tree = grid->tree(); + + for (int axis = 0; axis < 3; ++axis) { + result.ptrs[axis][0] = nullptr; + result.ptrs[axis][1] = &leaf; + result.ptrs[axis][2] = nullptr; + + auto neighborCoord = coord; + neighborCoord[axis] += (localCoord[axis] & 0x4) ? 4 : -4; + result.ptrs[axis][(localCoord[axis] & 0x4) >> 1] = + tree.root().probeLeaf(neighborCoord); + } + return result; + } + + /// @brief Compute global sequential indices for the 19 WENO5 stencil + /// points of the given voxel, using pre-resolved leaf pointers. + /// + /// Output layout follows nanovdb::math::WenoPt::idx. Note that + /// this convention differs from OpenVDB's NineteenPt::idx. + /// + /// Entries for neighbors outside the narrow band are left unchanged; + /// the caller must zero-initialize data[] before calling this function. + /// Does not synchronize; safe to call from divergent threads. + /// + /// The voxelOffset arithmetic uses octal notation to exploit the fact that + /// the NanoVDB leaf layout encodes (x,y,z) as x*64 + y*8 + z, making x, y, + /// and z strides exactly 0100, 010, and 1 in octal respectively. + /// + /// @tparam BuildT Build type of the grid (must be an index type) + /// @param leaf Center leaf node for the current voxel + /// @param voxelOffset Intra-leaf voxel offset for the current voxel + /// @param leafPtrs Resolved neighbor leaf pointers from resolveWenoLeafPtrs + /// @param data Output array of length >= 19, caller-zero-initialized + template + __device__ + static typename util::enable_if::is_index, void>::type + computeWenoStencil( + const NanoLeaf& leaf, + uint16_t voxelOffset, + const WenoLeafPtrs& leafPtrs, + uint64_t* data) + { + using math::WenoPt; + const auto lc = leaf.OffsetToLocalCoord(voxelOffset); + + data[WenoPt< 0, 0, 0>::idx] = leaf.getValue(voxelOffset); + + // x-axis: stride per step = 64 = 0100 octal; cross-leaf wrap = ±8*64 = ±0700 + if (leafPtrs.ptrs[0][(lc.x() + 5) >> 3]) + data[WenoPt<-3, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 5) >> 3]->getValue( + voxelOffset + ((lc[0] < 3) ? 0500 : -0300)); + if (leafPtrs.ptrs[0][(lc.x() + 6) >> 3]) + data[WenoPt<-2, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 6) >> 3]->getValue( + voxelOffset + ((lc[0] < 2) ? 0600 : -0200)); + if (leafPtrs.ptrs[0][(lc.x() + 7) >> 3]) + data[WenoPt<-1, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 7) >> 3]->getValue( + voxelOffset + ((lc[0] < 1) ? 0700 : -0100)); + if (leafPtrs.ptrs[0][(lc.x() + 9) >> 3]) + data[WenoPt< 1, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 9) >> 3]->getValue( + voxelOffset + ((lc[0] < 7) ? 
0100 : -0700)); + if (leafPtrs.ptrs[0][(lc.x() + 10) >> 3]) + data[WenoPt< 2, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 10) >> 3]->getValue( + voxelOffset + ((lc[0] < 6) ? 0200 : -0600)); + if (leafPtrs.ptrs[0][(lc.x() + 11) >> 3]) + data[WenoPt< 3, 0, 0>::idx] = leafPtrs.ptrs[0][(lc.x() + 11) >> 3]->getValue( + voxelOffset + ((lc[0] < 5) ? 0300 : -0500)); + + // y-axis: stride per step = 8 = 010 octal; cross-leaf wrap = ±8*8 = ±070 + if (leafPtrs.ptrs[1][(lc.y() + 5) >> 3]) + data[WenoPt< 0,-3, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 5) >> 3]->getValue( + voxelOffset + ((lc[1] < 3) ? 0050 : -0030)); + if (leafPtrs.ptrs[1][(lc.y() + 6) >> 3]) + data[WenoPt< 0,-2, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 6) >> 3]->getValue( + voxelOffset + ((lc[1] < 2) ? 0060 : -0020)); + if (leafPtrs.ptrs[1][(lc.y() + 7) >> 3]) + data[WenoPt< 0,-1, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 7) >> 3]->getValue( + voxelOffset + ((lc[1] < 1) ? 0070 : -0010)); + if (leafPtrs.ptrs[1][(lc.y() + 9) >> 3]) + data[WenoPt< 0, 1, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 9) >> 3]->getValue( + voxelOffset + ((lc[1] < 7) ? 0010 : -0070)); + if (leafPtrs.ptrs[1][(lc.y() + 10) >> 3]) + data[WenoPt< 0, 2, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 10) >> 3]->getValue( + voxelOffset + ((lc[1] < 6) ? 0020 : -0060)); + if (leafPtrs.ptrs[1][(lc.y() + 11) >> 3]) + data[WenoPt< 0, 3, 0>::idx] = leafPtrs.ptrs[1][(lc.y() + 11) >> 3]->getValue( + voxelOffset + ((lc[1] < 5) ? 0030 : -0050)); + + // z-axis: stride per step = 1; cross-leaf wrap = ±8 + if (leafPtrs.ptrs[2][(lc.z() + 5) >> 3]) + data[WenoPt< 0, 0,-3>::idx] = leafPtrs.ptrs[2][(lc.z() + 5) >> 3]->getValue( + voxelOffset + ((lc[2] < 3) ? 0005 : -0003)); + if (leafPtrs.ptrs[2][(lc.z() + 6) >> 3]) + data[WenoPt< 0, 0,-2>::idx] = leafPtrs.ptrs[2][(lc.z() + 6) >> 3]->getValue( + voxelOffset + ((lc[2] < 2) ? 0006 : -0002)); + if (leafPtrs.ptrs[2][(lc.z() + 7) >> 3]) + data[WenoPt< 0, 0,-1>::idx] = leafPtrs.ptrs[2][(lc.z() + 7) >> 3]->getValue( + voxelOffset + ((lc[2] < 1) ? 0007 : -0001)); + if (leafPtrs.ptrs[2][(lc.z() + 9) >> 3]) + data[WenoPt< 0, 0, 1>::idx] = leafPtrs.ptrs[2][(lc.z() + 9) >> 3]->getValue( + voxelOffset + ((lc[2] < 7) ? 0001 : -0007)); + if (leafPtrs.ptrs[2][(lc.z() + 10) >> 3]) + data[WenoPt< 0, 0, 2>::idx] = leafPtrs.ptrs[2][(lc.z() + 10) >> 3]->getValue( + voxelOffset + ((lc[2] < 6) ? 0002 : -0006)); + if (leafPtrs.ptrs[2][(lc.z() + 11) >> 3]) + data[WenoPt< 0, 0, 3>::idx] = leafPtrs.ptrs[2][(lc.z() + 11) >> 3]->getValue( + voxelOffset + ((lc[2] < 5) ? 0003 : -0005)); + } }; /// @brief This functor calculates the firstLeafID and jumpMap for the diff --git a/nanovdb/nanovdb/util/ForEach.h b/nanovdb/nanovdb/util/ForEach.h index d71769c5ab..6ede538a10 100644 --- a/nanovdb/nanovdb/util/ForEach.h +++ b/nanovdb/nanovdb/util/ForEach.h @@ -45,7 +45,7 @@ inline void forEach(RangeT range, const FuncT &func) #ifdef NANOVDB_USE_TBB tbb::parallel_for(range, func); #else// naive and likely slow alternative based on std::thread - if (const size_t threadCount = std::thread::hardware_concurrency()>>1) { + if (const size_t threadCount = std::thread::hardware_concurrency()) { std::vector rangePool{ range }; while(rangePool.size() < threadCount) { const size_t oldSize = rangePool.size(); diff --git a/nanovdb/nanovdb/util/Simd.h b/nanovdb/nanovdb/util/Simd.h new file mode 100644 index 0000000000..363a307fac --- /dev/null +++ b/nanovdb/nanovdb/util/Simd.h @@ -0,0 +1,297 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! 
+    \file nanovdb/util/Simd.h
+
+    \author Efty Sifakis
+
+    \date April 28, 2026
+
+    \brief Minimal SIMD abstraction for NanoVDB stencil kernels.
+*/
+
+#ifndef NANOVDB_UTIL_SIMD_H_HAS_BEEN_INCLUDED
+#define NANOVDB_UTIL_SIMD_H_HAS_BEEN_INCLUDED
+
+#include <array>
+
+#include <nanovdb/util/Util.h> // __hostdev__
+
+// Two implementations, selected automatically at compile time:
+//
+//   stdx backend (default, when <experimental/simd> is available):
+//     Simd<T, W> and SimdMask<T, W> are aliases for
+//     std::experimental::fixed_size_simd<T, W> / fixed_size_simd_mask<T, W>.
+//     Internal flag: NANOVDB_USE_STDX_SIMD.
+//
+//   std::array backend (forced via NANOVDB_SIMD_ARRAY_BACKEND, or when
+//   <experimental/simd> is unavailable):
+//     Simd<T, W> wraps std::array<T, W> with element-wise operator loops.
+//
+// The interface is identical in both cases, so templated kernels
+// (T=float for GPU, T=Simd<float, W> for CPU) compile unmodified.
+
+// ---------------------------------------------------------------------------
+// Auto-detect std::experimental::simd (Parallelism TS v2)
+// ---------------------------------------------------------------------------
+#if !defined(NANOVDB_SIMD_ARRAY_BACKEND) && defined(__has_include) && __has_include(<experimental/simd>)
+#   include <experimental/simd>
+#   ifdef __cpp_lib_experimental_parallel_simd
+#       define NANOVDB_USE_STDX_SIMD 1
+#   endif
+#endif
+
+namespace nanovdb {
+namespace util {
+
+// ===========================================================================
+// nanovdb::util::experimental -- internal SIMD primitives. Names in this
+// nested namespace are unstable API by convention; external callers should
+// not depend on them.
+// ===========================================================================
+namespace experimental {
+
+// element_aligned_tag -- load/store alignment descriptor. Aliases
+// stdx::element_aligned_tag in the stdx backend; empty struct in the
+// array backend.
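+//
+// Example (sketch only): loads, arithmetic, and stores are spelled the same
+// way in either backend, e.g.
+//
+//   using FloatV = nanovdb::util::experimental::Simd<float, 16>;
+//   alignas(64) float in[16], out[16];
+//   FloatV v(in, nanovdb::util::experimental::element_aligned);          // load
+//   (v * v).copy_to(out, nanovdb::util::experimental::element_aligned);  // store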
+#ifdef NANOVDB_USE_STDX_SIMD +namespace stdx = std::experimental; +using element_aligned_tag = stdx::element_aligned_tag; +#else +struct element_aligned_tag {}; +#endif +inline constexpr element_aligned_tag element_aligned{}; + +// =========================================================================== +// Implementation A: std::experimental::simd -- pure type aliases +// =========================================================================== +#ifdef NANOVDB_USE_STDX_SIMD + +template +using SimdMask = stdx::fixed_size_simd_mask; + +template +using Simd = stdx::fixed_size_simd; + +// =========================================================================== +// Implementation B: std::array backend (default) +// =========================================================================== +#else + +template struct Simd; // fwd-decl so SimdMask::simd_type can name it + +template +struct SimdMask { + using value_type = bool; + using simd_type = Simd; + static constexpr size_t size() { return size_t(W); } + + std::array data{}; + SimdMask() = default; + __hostdev__ SimdMask(bool b) { data.fill(b); } // broadcast + __hostdev__ explicit SimdMask(const bool* p, element_aligned_tag) { // load (ctor) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ bool operator[](int i) const { return data[i]; } + __hostdev__ bool& operator[](int i) { return data[i]; } + __hostdev__ void copy_from(const bool* p, element_aligned_tag) { // load (member) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ void copy_to(bool* p, element_aligned_tag) const { // store (member) + for (int i = 0; i < W; i++) p[i] = data[i]; + } + __hostdev__ SimdMask operator!() const { + SimdMask r; for (int i = 0; i < W; i++) r.data[i] = !data[i]; return r; + } + __hostdev__ SimdMask operator&(SimdMask o) const { + SimdMask r; for (int i = 0; i < W; i++) r.data[i] = data[i] && o.data[i]; return r; + } + __hostdev__ SimdMask operator|(SimdMask o) const { + SimdMask r; for (int i = 0; i < W; i++) r.data[i] = data[i] || o.data[i]; return r; + } +}; + +template +struct Simd { + using value_type = T; + using mask_type = SimdMask; + static constexpr size_t size() { return size_t(W); } + + std::array data{}; + + Simd() = default; + __hostdev__ Simd(T scalar) { data.fill(scalar); } // broadcast + __hostdev__ explicit Simd(const T* p, element_aligned_tag) { // load (ctor) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ T operator[](int i) const { return data[i]; } + __hostdev__ T& operator[](int i) { return data[i]; } + __hostdev__ void copy_from(const T* p, element_aligned_tag) { // load (member) + for (int i = 0; i < W; i++) data[i] = p[i]; + } + __hostdev__ void copy_to(T* p, element_aligned_tag) const { // store (member) + for (int i = 0; i < W; i++) p[i] = data[i]; + } + __hostdev__ Simd operator-() const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = -data[i]; return r; + } + __hostdev__ Simd operator+(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] + o.data[i]; return r; + } + __hostdev__ Simd operator-(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] - o.data[i]; return r; + } + __hostdev__ Simd operator*(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] * o.data[i]; return r; + } + __hostdev__ Simd operator/(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] / o.data[i]; return r; + } + __hostdev__ Simd& operator+=(Simd o) { + for (int i = 0; i < W; i++) data[i] += o.data[i]; return 
*this; + } + __hostdev__ Simd& operator-=(Simd o) { + for (int i = 0; i < W; i++) data[i] -= o.data[i]; return *this; + } + __hostdev__ Simd& operator*=(Simd o) { + for (int i = 0; i < W; i++) data[i] *= o.data[i]; return *this; + } + __hostdev__ Simd& operator/=(Simd o) { + for (int i = 0; i < W; i++) data[i] /= o.data[i]; return *this; + } + __hostdev__ SimdMask operator<(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] < o.data[i]; return m; + } + __hostdev__ SimdMask operator<=(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] <= o.data[i]; return m; + } + __hostdev__ SimdMask operator>(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] > o.data[i]; return m; + } + __hostdev__ SimdMask operator>=(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] >= o.data[i]; return m; + } + __hostdev__ SimdMask operator==(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] == o.data[i]; return m; + } + __hostdev__ SimdMask operator!=(Simd o) const { + SimdMask m; for (int i = 0; i < W; i++) m.data[i] = data[i] != o.data[i]; return m; + } + // Bitwise and shift operators -- valid for integer element types. + __hostdev__ Simd operator|(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] | o.data[i]; return r; + } + __hostdev__ Simd operator&(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] & o.data[i]; return r; + } + __hostdev__ Simd operator^(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] ^ o.data[i]; return r; + } + // Per-lane variable shift (shift count from corresponding lane of o). + __hostdev__ Simd operator<<(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] << o.data[i]; return r; + } + __hostdev__ Simd operator>>(Simd o) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] >> o.data[i]; return r; + } + // Uniform shift: all lanes shifted by the same scalar count (vpsllw imm8 / vpsrlw imm8). + __hostdev__ Simd operator<<(T shift) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] << shift; return r; + } + __hostdev__ Simd operator>>(T shift) const { + Simd r; for (int i = 0; i < W; i++) r.data[i] = data[i] >> shift; return r; + } +}; + +template __hostdev__ +Simd operator+(T a, Simd b) { return Simd(a) + b; } +template __hostdev__ +Simd operator+(Simd a, T b) { return a + Simd(b); } +template __hostdev__ +Simd operator-(T a, Simd b) { return Simd(a) - b; } +template __hostdev__ +Simd operator-(Simd a, T b) { return a - Simd(b); } +template __hostdev__ +Simd operator*(T a, Simd b) { return Simd(a) * b; } +template __hostdev__ +Simd operator*(Simd a, T b) { return a * Simd(b); } +template __hostdev__ +Simd operator/(T a, Simd b) { return Simd(a) / b; } +template __hostdev__ +Simd operator/(Simd a, T b) { return a / Simd(b); } + +#endif // NANOVDB_USE_STDX_SIMD + +} // namespace experimental +} // namespace util + +// --------------------------------------------------------------------------- +// nanovdb::math::Min / Max / Select / Sqrt -- Simd overloads. Scalar +// overloads live in nanovdb/math/Math.h; defining the SIMD overloads here +// avoids a Math.h -> Simd.h dependency. 
+// --------------------------------------------------------------------------- +namespace math { +#ifdef NANOVDB_USE_STDX_SIMD +template +NANOVDB_FORCEINLINE util::experimental::Simd +Min(util::experimental::Simd a, util::experimental::Simd b) { + return std::experimental::min(a, b); +} +template +NANOVDB_FORCEINLINE util::experimental::Simd +Max(util::experimental::Simd a, util::experimental::Simd b) { + return std::experimental::max(a, b); +} +// 3-arg Select(mask, a, b): mask[i] ? a[i] : b[i], via TS v2's where() proxy. +template +NANOVDB_FORCEINLINE util::experimental::Simd +Select(util::experimental::SimdMask mask, + util::experimental::Simd a, + util::experimental::Simd b) { + auto result = b; + util::experimental::stdx::where(mask, result) = a; + return result; +} +template +NANOVDB_FORCEINLINE util::experimental::Simd +Sqrt(util::experimental::Simd a) { + return std::experimental::sqrt(a); +} +#else +template +__hostdev__ util::experimental::Simd +Min(util::experimental::Simd a, util::experimental::Simd b) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = a[i] < b[i] ? a[i] : b[i]; + return r; +} +template +__hostdev__ util::experimental::Simd +Max(util::experimental::Simd a, util::experimental::Simd b) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = a[i] > b[i] ? a[i] : b[i]; + return r; +} +template +__hostdev__ util::experimental::Simd +Select(util::experimental::SimdMask mask, + util::experimental::Simd a, + util::experimental::Simd b) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = mask[i] ? a[i] : b[i]; + return r; +} +template +__hostdev__ util::experimental::Simd +Sqrt(util::experimental::Simd a) { + util::experimental::Simd r; + for (int i = 0; i < W; i++) r[i] = Sqrt(a[i]); + return r; +} +#endif +} // namespace math + +} // namespace nanovdb + +#endif // end of NANOVDB_UTIL_SIMD_H_HAS_BEEN_INCLUDED diff --git a/nanovdb/nanovdb/util/Util.h b/nanovdb/nanovdb/util/Util.h index bdff640a97..4040e9bfd0 100644 --- a/nanovdb/nanovdb/util/Util.h +++ b/nanovdb/nanovdb/util/Util.h @@ -87,15 +87,22 @@ typedef unsigned long long uint64_t; #endif // if defined(__CUDACC__) || defined(__HIP__) -// NANOVDB_RESTRICT: cross-compiler no-alias hint for pointer parameters. -// GCC and Clang (including NVCC host compilation) spell it __restrict__, -// MSVC spells it __restrict. +// NANOVDB_RESTRICT: cross-compiler no-alias hint for pointer parameters #if defined(_MSC_VER) #define NANOVDB_RESTRICT __restrict #else #define NANOVDB_RESTRICT __restrict__ #endif +// NANOVDB_FORCEINLINE: force inlining at the call site +#if defined(_MSC_VER) +#define NANOVDB_FORCEINLINE __forceinline +#elif defined(__GNUC__) || defined(__clang__) +#define NANOVDB_FORCEINLINE inline __attribute__((always_inline)) +#else +#define NANOVDB_FORCEINLINE inline +#endif + // The following macro will suppress annoying warnings when nvcc // compiles functions that call (host) intrinsics (which is perfectly valid) #if defined(_MSC_VER) && defined(__CUDACC__) diff --git a/nanovdb/nanovdb/util/WenoStencil.h b/nanovdb/nanovdb/util/WenoStencil.h new file mode 100644 index 0000000000..5f5ea7fc58 --- /dev/null +++ b/nanovdb/nanovdb/util/WenoStencil.h @@ -0,0 +1,364 @@ +// Copyright Contributors to the OpenVDB Project +// SPDX-License-Identifier: Apache-2.0 + +/*! + \file WenoStencil.h + + \brief 19-point WENO5 stencil value container + out-of-band extrapolation + + fifth-order upwind Godunov's norm-square gradient. Templated on + a ValueType. 
+ + `WenoStencil` holds the per-point values and activity flags. + ValueType is typically a raw scalar `float` (e.g. for scalar / CUDA code) + or a SIMD vector `Simd` (CPU batch path). The companion mask + type is auto-deduced (bool for raw scalars, ValueType::mask_type + otherwise): + + ValueType values [19] + MaskType isActive[19] + + For raw scalars the class reads as plain scalar arithmetic; for SIMD + vectors the same source compiles to whole-vector ops via Simd.h. The + class is pure-compute -- the caller owns any fill-side C-array storage + and the per-point load step. + + Grid-spacing scalars `mDx2` and `mInvDx2` stay scalar `float` and are + broadcast to ValueType at the point of use. + + Operations provided: + - extrapolate(absBackground) + Repair out-of-band lanes (isActive[k] == false) via + copysign(absBackground, values[innerPoint]), processed in + ascending-|d| order so the inner point is already resolved. + - normSqGrad(isoValue = 0) + Godunov's norm-square of the fifth-order WENO upwind gradient. +*/ + +#pragma once + +#include +#include // Pow2 + +#include +#include +#include +#include + +namespace nanovdb { + +namespace detail { + +// --------------------------------------------------------------------------- +// mask_of::type -- auto-deduced predicate type for ValueType T: +// - T::mask_type if T has a nested mask_type (e.g. Simd); +// - bool otherwise (e.g. raw scalar T=float). +// --------------------------------------------------------------------------- +template +struct mask_of { using type = bool; }; + +template +struct mask_of> { + using type = typename T::mask_type; +}; + +// --------------------------------------------------------------------------- +// Generic WENO5 reconstruction -- templated on ValueType in {float, +// Simd}. Nominally fifth-order finite-difference WENO (Chi-Wang +// Shu, "High Order Finite Difference and Finite Volume WENO Schemes and +// Discontinuous Galerkin Methods for CFD", ICASE Report No 2001-11 page 6; +// see also ICASE 97-65 for a more complete reference, Shu 1997). +// +// Given v1=f(x-2dx), v2=f(x-dx), v3=f(x), v4=f(x+dx), v5=f(x+2dx), returns +// an interpolated f(x+dx/2) with the property that +// (f(x+dx/2)-f(x-dx/2))/dx = df/dx(x) + error, with error fifth-order in +// smooth regions: O(dx) <= error <= O(dx^5). +// +// Body uses only primitives common to scalar and Simd ValueType +// (operator+/-/*, math::Pow2), so the same source compiles in both modes +// via the Simd backend in Simd.h. Integer coefficients carry explicit +// ValueType(...) casts for SIMD typed-operator dispatch, and float +// literals carry .f suffix because stdx::simd's broadcast ctor rejects +// double->float narrowing. +// +// ScalarType (defaults to float, deduced from scale2 if specified) is the +// arithmetic precision of the reference-magnitude epsilon scaling. scale2 +// stays scalar so callers can pass plain float/double grid constants +// without broadcasting. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE ValueType +WENO5(const ValueType& v1, const ValueType& v2, const ValueType& v3, + const ValueType& v4, const ValueType& v5, + ScalarType scale2 = ScalarType(1)) // openvdb uses scale2 = 0.01 +{ + using math::Pow2; + const ValueType C = ValueType(13.f / 12.f); + // WENO is formulated for non-dimensional equations, here the optional scale2 + // is a reference value (squared) for the function being interpolated. 
For + // example if 'v' is of order 1000, then scale2 = 10^6 is ok. But in practice + // leave scale2 = 1. + const ValueType eps = ValueType(ScalarType(1.e-6) * scale2); + // {\tilde \omega_k} = \gamma_k / (\beta_k + \epsilon)^2 in Shu's ICASE report. + const ValueType A1 = ValueType(0.1f) / Pow2( + C * Pow2(v1 - ValueType(2)*v2 + v3) + + ValueType(0.25f) * Pow2(v1 - ValueType(4)*v2 + ValueType(3)*v3) + eps); + const ValueType A2 = ValueType(0.6f) / Pow2( + C * Pow2(v2 - ValueType(2)*v3 + v4) + + ValueType(0.25f) * Pow2(v2 - v4) + eps); + const ValueType A3 = ValueType(0.3f) / Pow2( + C * Pow2(v3 - ValueType(2)*v4 + v5) + + ValueType(0.25f) * Pow2(ValueType(3)*v3 - ValueType(4)*v4 + v5) + eps); + + return (A1 * (ValueType(2)*v1 - ValueType(7)*v2 + ValueType(11)*v3) + + A2 * (ValueType(5)*v3 - v2 + ValueType( 2)*v4) + + A3 * (ValueType(2)*v3 + ValueType(5)*v4 - v5)) + / (ValueType(6) * (A1 + A2 + A3)); +} + +// --------------------------------------------------------------------------- +// Generic Godunov's norm-square gradient -- templated on ValueType +// in {float, Simd} and a companion MaskType (bool for raw scalar, +// ValueType::mask_type for Simd). Differs from a textbook scalar form in +// shape: instead of an if/else on isOutside we compute both branches +// unconditionally and blend via math::Select, so the SIMD path has no +// control-flow divergence across lanes. At ValueType=float the scalar +// math::Select(bool, ValueType, ValueType) overload reduces this to the +// equivalent if/else semantics. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE ValueType +GodunovsNormSqrd(MaskType isOutside, + ValueType dP_xm, ValueType dP_xp, + ValueType dP_ym, ValueType dP_yp, + ValueType dP_zm, ValueType dP_zp) +{ + using math::Min; using math::Max; using math::Pow2; using math::Select; + const ValueType zero(0.f); + + const ValueType outside = Max(Pow2(Max(dP_xm, zero)), Pow2(Min(dP_xp, zero))) // (dP/dx)^2 + + Max(Pow2(Max(dP_ym, zero)), Pow2(Min(dP_yp, zero))) // (dP/dy)^2 + + Max(Pow2(Max(dP_zm, zero)), Pow2(Min(dP_zp, zero))); // (dP/dz)^2 + + const ValueType inside = Max(Pow2(Min(dP_xm, zero)), Pow2(Max(dP_xp, zero))) + + Max(Pow2(Min(dP_ym, zero)), Pow2(Max(dP_yp, zero))) + + Max(Pow2(Min(dP_zm, zero)), Pow2(Max(dP_zp, zero))); + + return Select(isOutside, outside, inside); // |\nabla\phi|^2 +} + +} // namespace detail + +// --------------------------------------------------------------------------- +// WenoStencil -- pure-compute container for a 19-point WENO5 +// stencil state. Holds ValueType-typed values + MaskType-typed activity +// flags + scalar grid constants. Fill-side responsibility (scalar writes +// into any raw buffers, followed by a per-point load into this stencil's +// values[] / isActive[]) lives in the caller. See WenoStencil.md +// for usage patterns. +// --------------------------------------------------------------------------- +template +class WenoStencil +{ +public: + using MaskType = typename detail::mask_of::type; + + // --- Stencil-point offset types (compile-time only) ------------------- + // StencilPoint carries the offset as a type. StencilPoints is + // the 19-point tuple in the canonical idx ordering: + // idx 0 : center < 0, 0, 0> + // idx 1.. 
6 : x-axis <-3,0,0> <-2,0,0> <-1,0,0> <+1,0,0> <+2,0,0> <+3,0,0> + // idx 7..12 : y-axis <0,-3,0> <0,-2,0> <0,-1,0> <0,+1,0> <0,+2,0> <0,+3,0> + // idx 13..18 : z-axis <0,0,-3> <0,0,-2> <0,0,-1> <0,0,+1> <0,0,+2> <0,0,+3> + template + struct StencilPoint { + static constexpr int di = i, dj = j, dk = k; + }; + + using StencilPoints = std::tuple< + StencilPoint< 0, 0, 0>, + StencilPoint<-3, 0, 0>, StencilPoint<-2, 0, 0>, StencilPoint<-1, 0, 0>, + StencilPoint<+1, 0, 0>, StencilPoint<+2, 0, 0>, StencilPoint<+3, 0, 0>, + StencilPoint< 0,-3, 0>, StencilPoint< 0,-2, 0>, StencilPoint< 0,-1, 0>, + StencilPoint< 0,+1, 0>, StencilPoint< 0,+2, 0>, StencilPoint< 0,+3, 0>, + StencilPoint< 0, 0,-3>, StencilPoint< 0, 0,-2>, StencilPoint< 0, 0,-1>, + StencilPoint< 0, 0,+1>, StencilPoint< 0, 0,+2>, StencilPoint< 0, 0,+3> + >; + + static constexpr int SIZE = int(std::tuple_size_v); + static constexpr int size() { return SIZE; } + + // Compute-side storage. At ValueType=float these are plain float / bool + // arrays; at ValueType=Simd they are whole-vector arrays. + ValueType values [SIZE]; + MaskType isActive[SIZE]; + + // Runtime grid-spacing constants -- plain scalars regardless of ValueType, + // broadcast inside normSqGrad(). + float mDx2{1.f}; // dx^2 -- fed to WENO5's epsilon via scale2 + float mInvDx2{1.f}; // 1 / dx^2 -- final normalisation in normSqGrad + + __hostdev__ WenoStencil() = default; + __hostdev__ explicit WenoStencil(float dx) + : mDx2(dx * dx), mInvDx2(1.f / (dx * dx)) {} + + // Compile-time named-point access: returns the index of point (i,j,k) + // in the StencilPoints tuple. + template + static constexpr int pointIndex() + { + constexpr int I = findPoint(std::make_index_sequence{}); + static_assert(I >= 0, "WenoStencil::pointIndex: point not in stencil"); + return I; + } + + // Resolve all SIZE stencil-point indices for the voxel at @a center via + // Acc::getValue(center + offset). Indices land in out[0..SIZE-1] in the + // StencilPoints tuple ordering. Acc is any NanoVDB accessor whose + // getValue() returns a value convertible to uint64_t (e.g. ValueOnIndex's + // sequential active-voxel indices). The accessor's path cache is reused + // across the SIZE getValue calls. + template + static void gatherIndices(Acc& acc, const Coord& center, uint64_t* out) + { + gatherIndicesImpl(acc, center, out, std::make_index_sequence{}); + } + + // ------------------------------------------------------------------ + // extrapolate -- sign-correct out-of-band lanes (isActive[k][i] == false) + // of values[k] by multiplying with Sign(values[innerPoint][i]). Active + // lanes are preserved. Center point (idx 0) is assumed always in-band + // and is not processed. + // + // Convention: the caller must pre-load inactive lanes of values[k] with + // the sidecar slot-0 background value (which the standard NanoVDB fill + // pattern produces automatically: a missing tap resolves to index 0, + // and sidecar[0] is the background). This routine then flips the sign + // when the parent (innerPoint) is negative, leaves it alone when the + // parent is positive, and zeros the lane when the parent is exactly 0. + // + // Processes 18 (point, innerPoint) pairs in ascending-|d| order so the + // inner point is already resolved when the outer point is touched; + // sign-inheritance through |d|=1 -> |d|=2 -> |d|=3 is automatic. 
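+    //
+    // Worked lane example (background magnitude assumed to be 0.3): with a
+    // center value of -0.05 and the <-1,0,0> and <-2,0,0> points inactive
+    // (hence pre-loaded with +0.3),
+    //   pair (<-1,0,0>, center  ) : Sign(-0.05) -> point becomes -0.3
+    //   pair (<-2,0,0>, <-1,0,0>) : Sign(-0.30) -> point becomes -0.3
+    // so out-of-band points inherit the sign of the nearest in-band value.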
+ // ------------------------------------------------------------------ + __hostdev__ NANOVDB_FORCEINLINE void extrapolate(); + + // ------------------------------------------------------------------ + // normSqGrad -- Godunov's norm-square of the fifth-order WENO upwind + // gradient at the stencil center. Returns |\nabla\phi|^2. + // + // Six axial WENO5 reconstructions (one pair +/-x, +/-y, +/-z), then + // Godunov's upwind combinator driven by the sign of (center - iso). + // + // Call only after the stencil has been populated (see usage pattern in + // WenoStencil.md). extrapolate() before normSqGrad() is the + // typical pipeline shape but is not required by this method. + // ------------------------------------------------------------------ + __hostdev__ NANOVDB_FORCEINLINE ValueType normSqGrad(float iso = 0.f) const; + +private: + // Compile-time inverse map: (i,j,k) -> slot index in StencilPoints. + // Returns -1 if no matching point exists; pointIndex() turns that into + // a static_assert. + template + static constexpr int findPoint(std::index_sequence) + { + int result = -1; + ((std::tuple_element_t::di == i && + std::tuple_element_t::dj == j && + std::tuple_element_t::dk == k && + result < 0 ? (result = int(Is)) : 0), ...); + return result; + } + + // Parameter-pack expansion driving gatherIndices(): unrolls SIZE getValue + // calls into a single fold expression. + template + static void gatherIndicesImpl(Acc& acc, const Coord& center, uint64_t* out, + std::index_sequence) + { + ((out[Is] = static_cast(acc.getValue(center + Coord( + std::tuple_element_t::di, + std::tuple_element_t::dj, + std::tuple_element_t::dk)))), ...); + } + + // Hardcoded (point, innerPoint) pairs for the 19-point StencilPoints + // tuple, ordered by ascending |d| so the inner point is always already + // resolved when the outer point is processed. Indices match the + // StencilPoints tuple above. + // + // idx 0 : center ( 0, 0, 0) + // idx 1.. 6 : x-axis (-3..+3 in the order -3,-2,-1,+1,+2,+3) + // idx 7..12 : y-axis (-3..+3) + // idx 13..18 : z-axis (-3..+3) + static constexpr int kNumPairs = 18; + static constexpr int kPairs[kNumPairs][2] = { + // |d|=1 (inner point = center, idx 0) + { 3, 0}, { 4, 0}, // x: -1, +1 + { 9, 0}, {10, 0}, // y: -1, +1 + {15, 0}, {16, 0}, // z: -1, +1 + // |d|=2 (inner point = |d|=1 on same axis) + { 2, 3}, { 5, 4}, // x: -2 <- (-1), +2 <- (+1) + { 8, 9}, {11, 10}, // y + {14, 15}, {17, 16}, // z + // |d|=3 (inner point = |d|=2 on same axis) + { 1, 2}, { 6, 5}, // x: -3 <- (-2), +3 <- (+2) + { 7, 8}, {12, 11}, // y + {13, 14}, {18, 17} // z + }; +}; + +// --------------------------------------------------------------------------- +// extrapolate -- single-source implementation. At ValueType=float this is +// scalar code; at ValueType=Simd the same source compiles to +// whole-SIMD blends via the math::Select dispatch. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE void +WenoStencil::extrapolate() +{ + const ValueType zero(0.f); + + for (int p = 0; p < kNumPairs; ++p) { + const int k = kPairs[p][0]; + const int kInner = kPairs[p][1]; + + // values[k] *= Sign(values[kInner]): + // parent > 0 -> values[k] (already pre-loaded with +background); + // parent < 0 -> -values[k]; + // parent == 0 -> 0. 
+ const MaskType isPosParent = values[kInner] > zero; + const MaskType isNegParent = values[kInner] < zero; + const ValueType signed_k = math::Select(isPosParent, values[k], + math::Select(isNegParent, -values[k], zero)); + + // Active lanes keep their own value; inactive lanes get the sign-corrected background. + values[k] = math::Select(isActive[k], values[k], signed_k); + } +} + +// --------------------------------------------------------------------------- +// normSqGrad -- Godunov's upwind WENO norm-square gradient. Six axial +// WENO5 reconstructions drive GodunovsNormSqrd; point indices 0..18 follow +// the StencilPoints tuple ordering above. mInvDx2 and iso are broadcast to +// ValueType at the final combinator. +// --------------------------------------------------------------------------- +template +__hostdev__ NANOVDB_FORCEINLINE ValueType +WenoStencil::normSqGrad(float iso) const +{ + const ValueType* v = values; + + const ValueType dP_xm = detail::WENO5(v[ 2]-v[ 1], v[ 3]-v[ 2], v[ 0]-v[ 3], v[ 4]-v[ 0], v[ 5]-v[ 4], mDx2); + const ValueType dP_xp = detail::WENO5(v[ 6]-v[ 5], v[ 5]-v[ 4], v[ 4]-v[ 0], v[ 0]-v[ 3], v[ 3]-v[ 2], mDx2); + const ValueType dP_ym = detail::WENO5(v[ 8]-v[ 7], v[ 9]-v[ 8], v[ 0]-v[ 9], v[10]-v[ 0], v[11]-v[10], mDx2); + const ValueType dP_yp = detail::WENO5(v[12]-v[11], v[11]-v[10], v[10]-v[ 0], v[ 0]-v[ 9], v[ 9]-v[ 8], mDx2); + const ValueType dP_zm = detail::WENO5(v[14]-v[13], v[15]-v[14], v[ 0]-v[15], v[16]-v[ 0], v[17]-v[16], mDx2); + const ValueType dP_zp = detail::WENO5(v[18]-v[17], v[17]-v[16], v[16]-v[ 0], v[ 0]-v[15], v[15]-v[14], mDx2); + + return ValueType(mInvDx2) * + detail::GodunovsNormSqrd(v[0] > ValueType(iso), + dP_xm, dP_xp, dP_ym, dP_yp, dP_zm, dP_zp); +} + +} // namespace nanovdb diff --git a/nanovdb/nanovdb/util/WenoStencil.md b/nanovdb/nanovdb/util/WenoStencil.md new file mode 100644 index 0000000000..cedd05f63a --- /dev/null +++ b/nanovdb/nanovdb/util/WenoStencil.md @@ -0,0 +1,434 @@ +# WenoStencil — Single-Source Scalar/SIMD WENO5 Stencil Value Container + +Design reference for `nanovdb/nanovdb/util/WenoStencil.h`. Captures the +rationale behind the templated-on-lane-width class, the out-of-band +extrapolation algorithm, and the Godunov norm-square-gradient method. + +--- + +## 1. Motivation + +The WENO5 CPU pipeline assembles, per voxel batch, a 19-tap value +matrix with per-tap activity flags: + +``` +float values[Ntaps][W] -- real sidecar value, or background for OOB lanes +bool isActive[Ntaps][W] -- true iff the tap voxel is in the narrow band +``` + +Downstream phases are (1) **extrapolation** to repair out-of-band lanes +with sign-corrected background, and (2) **WENO5 arithmetic** (the fifth- +order upwind Godunov norm-square-gradient) to produce a per-voxel +`|∇φ|²` scalar. + +`WenoStencil` encapsulates the compute state and both operations: + +1. Storage of the 19-tap × W-lane Simd values + activity masks. +2. `extrapolate()` — ascending-|Δ| cascade, hardcoded tap pairs. +3. `normSqGrad()` — six axial WENO5 reconstructions + Godunov combinator. +4. A single-source pattern that keeps GPU (W=1, per-thread) code + textually identical to CPU SIMD (W>1) code — only the compile-time + lane width changes. + +--- + +## 2. 
Single-source scalar/SIMD design + +### 2.1 `Simd` as the storage type + +Storage is `Simd` / `SimdMask` directly — *not* raw +`float[W]` / `bool[W]` arrays: + +```cpp +template +class WenoStencil +{ +public: + using FloatV = util::Simd ; + using MaskV = util::SimdMask; + + FloatV values [SIZE]; + MaskV isActive[SIZE]; + + float mDx2{1.f}; // dx² — scalar, broadcast on use + float mInvDx2{1.f}; // 1 / dx² — scalar, broadcast on use + /* ... extrapolate(), normSqGrad() ... */ +}; +``` + +At W=1 the Simd types collapse to plain `float` / `bool` under the +array backend. At W>1 they are the backend-native SIMD type (stdx or +array wrapper). Memory layout at W>1 is identical to `float[SIZE][W]` +directly, so there is no storage-cost penalty — just a cleaner +compute-side type. + +### 2.2 Why first-class Simd storage (vs raw C arrays + `addr()` bridge) + +An earlier version used `std::conditional_t` as +the element type, with an `addr()` helper to normalize the W=1 scalar- +reference vs W>1 array-decay at every SIMD load/store site. That +design was rejected in favour of Simd-typed storage for three reasons: + +- **W=1 ceremony**. Approach 1 forced the scalar case to read + `FloatV val(addr(mValues[k]), element_aligned)` — loading a + `Simd` from a scalar reference. On the CUDA per-thread + path (where the scalar case is the *production* pipeline, not a + degenerate convenience) this ceremony survives into every + `__hostdev__` method that reads the stencil. Under Simd-typed + storage, W=1 reads `FloatV val = values[k]` — pure scalar code. + +- **Arithmetic boundary ceremony**. Approach 1 made `extrapolate()` + and a prospective `normSqGrad()` bracket every read/write with an + explicit Simd load or store. With Simd-typed storage, the + arithmetic reads as if scalar (`values[k] = util::where(...)`) and + the load/store boundary moves out to the caller where the Simd + values meet raw fill-side buffers. + +- **Symmetric, explicit boundary placement**. The caller already owns + the fill-side scalar-scatter loop (sidecar `sidecar[idx]` gathers are + inherently scalar-indexed per lane). Making the array→Simd + conversion an explicit caller-side step (`FloatV(raw_values[k], + element_aligned)`) preserves that ownership — the arithmetic class + doesn't care where its data came from. + +### 2.3 Scalar runtime constants, broadcast on use + +`mDx2` and `mInvDx2` stay plain `float` at every W. They are +broadcast to `FloatV` at the point of use inside `normSqGrad()`: + +```cpp +return FloatV(mInvDx2) * detail::GodunovsNormSqrd(...); +``` + +`vbroadcastss` is free on x86 (folds into the FMA consumer); identity +at W=1. Storing these as `Simd` instead would cost 64 +bytes × 2 of storage and hold two YMM registers across the entire +kernel lifetime for no benefit. + +### 2.4 Caller-owned fill-side buffers + +The class has **no** fill-side storage — no `mValues`/`mIsActive` raw C +arrays. Callers own whatever shape of raw data is natural for them. +For the CPU SIMD case that's typically a pair of stack-local +`alignas(64)` C arrays sized `[SIZE][W]`; for the CUDA per-thread case +no intermediate buffer is needed at all. + +This preserves the arithmetic class's purity and gives callers flex- +ibility — a different Phase-2 path (e.g. a future hardware-gather +fill) can populate the stencil using whatever pattern fits, without +the class having to expose a "fill API" that bakes in one shape. + +--- + +## 3. 
Extrapolation semantics + +### 3.1 The out-of-band problem + +For a narrow-band SDF, only the center tap `<0,0,0>` is guaranteed to +be in the active narrow band. Every other tap may land outside the +band for some lanes — for those lanes the sidecar fill writes +`values[k][lane] = sidecar[0]` (magnitude of the background, since the +sidecar builder pre-sets slot 0 to `floatGrid.background()`) and +`isActive[k][lane] = false`. + +Applying WENO5 arithmetic directly to the unsigned `|background|` +magnitude produces wrong gradients at the band boundary: the +reconstructed field would not track the sign of the underlying signed- +distance function. The standard fix is to **extrapolate from the next +inner tap's sign**: + +``` +if (!isActive[k][lane]) + values[k][lane] = copysign(|background|, values[innerTap][lane]) +``` + +The `|background|` magnitude is preserved; the sign is copied from +whichever "inner" tap (one step closer to center along the same axis) +best represents which side of the surface the lane belongs to. + +### 3.2 Inner-tap cascade — ascending |Δ| order + +| Outer tap |Δ| | Inner tap (source of sign) | +|---|---| +| `<±1,0,0>`, `<0,±1,0>`, `<0,0,±1>` | center `<0,0,0>` | +| `<±2,0,0>`, `<0,±2,0>`, `<0,0,±2>` | `<±1,0,0>`, `<0,±1,0>`, `<0,0,±1>` | +| `<±3,0,0>`, `<0,±3,0>`, `<0,0,±3>` | `<±2,0,0>`, `<0,±2,0>`, `<0,0,±2>` | + +Processing taps in ascending-|Δ| order guarantees the inner tap is +already resolved (real value or previously extrapolated) when the +outer tap is processed. Sign propagation through a |Δ|=1 → |Δ|=2 → +|Δ|=3 chain is transitive — no special casing. + +### 3.3 The `kPairs[]` table + +The inner-tap relationship is WENO5-specific and hardcoded as a static +table inside the class: + +```cpp +static constexpr int kPairs[18][2] = { + // |Δ|=1 (inner = center, idx 0) + {3,0},{4,0},{9,0},{10,0},{15,0},{16,0}, + // |Δ|=2 (inner = |Δ|=1 on same axis) + {2,3},{5,4},{8,9},{11,10},{14,15},{17,16}, + // |Δ|=3 (inner = |Δ|=2 on same axis) + {1,2},{6,5},{7,8},{12,11},{13,14},{18,17} +}; +``` + +Indices match the `WenoStencil::Taps` tuple defined in +`WenoStencil.h` (same ordering as `WenoPt::idx` in +`nanovdb/math/Stencils.h`). Center tap (idx 0) is not processed — +assumed always in-band. + +**Why hardcoded, not template-derived:** a generic scheme would walk +`Taps` at compile time and derive inner-tap indices from |Δ| and axis +alignment. For a single stencil the table is 18 entries, reads +directly, and makes the cascade ordering self-documenting. Worth +revisiting if we add Weno7 or other axis-aligned WENO variants. + +### 3.4 `extrapolate()` implementation + +```cpp +template +void WenoStencil::extrapolate(float absBackground) +{ + const FloatV absBg(absBackground); + const FloatV zero (0.f); + + for (int p = 0; p < kNumPairs; ++p) { + const int k = kPairs[p][0]; + const int kInner = kPairs[p][1]; + + // copysign(absBg, inner): +absBg if inner >= 0, else -absBg. + const MaskV isNegInner = zero > values[kInner]; + const FloatV extrap = util::where(isNegInner, -absBg, absBg); + + // Active lanes keep their own value; inactive take the extrapolated sign-corrected background. + values[k] = util::where(isActive[k], values[k], extrap); + } +} +``` + +No `addr()`. No `element_aligned`. Reads `values[]` as Simd, +operates as Simd, writes Simd — the kernel body never drops to the +underlying scalar/array representation. + +**Per-pair cost (W=16, AVX2):** + +| Op | Cycles (est.) 
+**Per-pair cost (W=16, AVX2):**
+
+| Op | Cycles (est.) |
+|----|--------------:|
+| compare `zero > values[kInner]` | 1 |
+| `where(isNegInner, -absBg, absBg)` (vblendvps) | 1 |
+| `where(isActive[k], values[k], extrap)` (mask convert + vblendvps) | 1–2 |
+| **Total per pair** (`values[]` register-resident) | **≈ 4** |
+
+Total: 18 pairs × ~4 cycles ≈ 72 cycles per call — down from the
+Approach-1 estimate of ~7 cycles per pair, because there is no longer
+an explicit per-pair load. The Simd values live in YMM registers
+across the pair loop. Measured end-to-end cost (sidecar-stencil-extrap
+minus sidecar-stencil): ~0.14–0.19 ns/voxel on 24 threads.
+
+---
+
+## 4. Godunov norm-square-gradient
+
+### 4.1 Semantics — tracking the ground-truth scalar
+
+`nanovdb::math::WenoStencil::normSqGrad(isoValue)` in
+`nanovdb/math/Stencils.h` is the ground-truth scalar reference. Its
+body:
+
+```cpp
+const ValueType* v = mValues;
+const RealT
+    dP_xm = WENO5(v[2]-v[1], v[3]-v[2], v[0]-v[3], v[4]-v[0], v[5]-v[4], mDx2),
+    dP_xp = WENO5(v[6]-v[5], v[5]-v[4], v[4]-v[0], v[0]-v[3], v[3]-v[2], mDx2),
+    dP_ym = ..., dP_yp = ..., dP_zm = ..., dP_zp = ...;
+return mInvDx2 * GodunovsNormSqrd(v[0] > isoValue, dP_xm, dP_xp, dP_ym, dP_yp, dP_zm, dP_zp);
+```
+
+`WenoStencil<W>::normSqGrad(iso)` is a line-for-line transliteration of
+the same body, with three adaptations:
+
+1. `v = values` (Simd-typed storage, not the `mValues` scalar array).
+2. The local `dP_*` are `FloatV` rather than `RealT`.
+3. The final `mInvDx2 * ...` multiplication broadcasts the scalar
+   `mInvDx2` to `FloatV` (via `FloatV(mInvDx2)`); at W=1 this is a
+   no-op.
+
+### 4.2 `WENO5` — generic over scalar and Simd
+
+The six axial reconstructions are driven by a single free-function
+template `nanovdb::detail::WENO5` that mirrors
+`nanovdb::math::WENO5` exactly (Shu ICASE smoothness indicators,
+0.1/0.6/0.3 linear weights, the `static_cast` at the end dropped in
+favour of the trailing division). The structure is the same; only the
+literal constants are wrapped in `RealT(...)` constructors so they
+broadcast at W>1. It lives in `nanovdb::detail` to keep the naming
+convention close to the ground truth without colliding with the
+existing `nanovdb::math::WENO5`.
+
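+For reference, here is a sketch of what that generic body might look
+like. It is an illustration that follows the `nanovdb::math::WENO5`
+structure, not a quote of the actual `nanovdb::detail::WENO5`; the
+epsilon handling, the namespace name, and the local `Pow2` helper are
+assumptions:
+
+```cpp
+namespace nanovdb { namespace detail_sketch {
+
+template<typename RealT> inline RealT Pow2(RealT x) { return x * x; }  // lane-wise square
+
+// WENO5 reconstruction with Shu/ICASE smoothness indicators and
+// 0.1/0.6/0.3 linear weights; every literal is wrapped in RealT(...)
+// so it broadcasts when RealT is a W-lane Simd type.
+template<typename RealT>
+inline RealT WENO5(RealT v1, RealT v2, RealT v3, RealT v4, RealT v5, float scale2)
+{
+    const RealT C(13.f / 12.f);
+    const RealT eps(1.0e-6f * scale2);  // scale2 = dx², as passed in §4.1
+    const RealT A1 = RealT(0.1f) / Pow2(C * Pow2(v1 - RealT(2.f)*v2 + v3)
+                         + RealT(0.25f) * Pow2(v1 - RealT(4.f)*v2 + RealT(3.f)*v3) + eps),
+                A2 = RealT(0.6f) / Pow2(C * Pow2(v2 - RealT(2.f)*v3 + v4)
+                         + RealT(0.25f) * Pow2(v2 - v4) + eps),
+                A3 = RealT(0.3f) / Pow2(C * Pow2(v3 - RealT(2.f)*v4 + v5)
+                         + RealT(0.25f) * Pow2(v3 - RealT(4.f)*v4 + RealT(3.f)*v5) + eps);
+    // Trailing division instead of the scalar version's final static_cast.
+    return (A1 * (RealT(2.f)*v1 - RealT(7.f)*v2 + RealT(11.f)*v3) +
+            A2 * (-v2 + RealT(5.f)*v3 + RealT(2.f)*v4) +
+            A3 * (RealT(2.f)*v3 + RealT(5.f)*v4 - v5)) / (RealT(6.f) * (A1 + A2 + A3));
+}
+
+}} // namespace nanovdb::detail_sketch
+```
+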
+### 4.3 `GodunovsNormSqrd` — `where`-based, no control flow
+
+The scalar ground truth has a runtime `if (isOutside) { … } else { … }`.
+The generic-T version computes both branches unconditionally and
+blends via `util::where`:
+
+```cpp
+template<typename T, typename MaskT>
+inline T GodunovsNormSqrd(MaskT isOutside,
+                          T dP_xm, T dP_xp, T dP_ym, T dP_yp, T dP_zm, T dP_zp)
+{
+    const T zero(0.f);
+    const T outside = max(Pow2(max(dP_xm, zero)), Pow2(min(dP_xp, zero)))   // (dP/dx)²
+                    + max(Pow2(max(dP_ym, zero)), Pow2(min(dP_yp, zero)))
+                    + max(Pow2(max(dP_zm, zero)), Pow2(min(dP_zp, zero)));
+    const T inside  = max(Pow2(min(dP_xm, zero)), Pow2(max(dP_xp, zero)))
+                    + max(Pow2(min(dP_ym, zero)), Pow2(max(dP_yp, zero)))
+                    + max(Pow2(min(dP_zm, zero)), Pow2(max(dP_zp, zero)));
+    return where(isOutside, outside, inside);
+}
+```
+
+At `T = float, MaskT = bool` this compiles to scalar code with both
+branches speculatively evaluated — slightly slower than the ground
+truth's branchy form for scalar workloads in isolation, but identical
+in correctness. At `T = Simd<float, W>` with the matching `SimdMask`,
+the branches are unconditional SIMD compute plus a `vblendvps` —
+no lane-divergent branches, no scalarisation.
+
+Per-lane cost of the full `normSqGrad`:
+
+| Phase | Ops |
+|-------|-----|
+| 6× axial WENO5 | ~60 mul/add/fma + 6× reciprocals (the `0.N / Pow2(…)` terms) |
+| Godunov, both branches | 12× max/min + 12× mul + 5× add ≈ 29 ops |
+| Blend + final multiply by `FloatV(mInvDx2)` | 2 ops |
+
+Roughly ~100 arithmetic ops per voxel per `normSqGrad` call. At W=16
+on AVX2 that's ~100 / 16 ≈ 6.3 ops/voxel, times some FMA-throughput
+factor in cycles — call it ~2 ns/voxel single-threaded, ~0.1 ns/voxel
+on 24 threads. (To be validated by measurement; see §7, Future work.)
+
+---
+
+## 5. API usage
+
+### 5.1 CPU SIMD — caller-owned raw buffers, explicit load
+
+```cpp
+using FloatV = typename nanovdb::WenoStencil<W>::FloatV;
+using MaskV  = typename nanovdb::WenoStencil<W>::MaskV;
+
+// Caller owns its scalar-scatter target.
+alignas(64) float raw_values[SIZE][W];
+alignas(64) bool  raw_active[SIZE][W];
+
+nanovdb::WenoStencil<W> stencil(dx);
+
+// Fill — pure scalar stores, guaranteed fast codegen on all backends.
+for (int k = 0; k < SIZE; ++k) {
+    for (int i = 0; i < W; ++i) {
+        const uint64_t idx = /* sidecar index for tap k, lane i */;
+        raw_values[k][i] = sidecar[idx];
+        raw_active[k][i] = (idx != 0);
+    }
+}
+
+// Bridge — one SIMD load per tap.
+for (int k = 0; k < SIZE; ++k) {
+    stencil.values  [k] = FloatV(raw_values[k], util::element_aligned);
+    stencil.isActive[k] = MaskV (raw_active[k], util::element_aligned);
+}
+
+// Arithmetic — reads/writes stencil.values[] as Simd in place.
+stencil.extrapolate(std::abs(sidecar[0]));
+FloatV normSq = stencil.normSqGrad(/* iso = */ 0.f);
+
+// Simd → scalar bridge at the output side if downstream consumers are scalar.
+alignas(64) float normSq_lanes[W];
+util::store(normSq, normSq_lanes, util::element_aligned);
+```
+
+### 5.2 CUDA scalar — no intermediate buffer
+
+```cpp
+nanovdb::WenoStencil<1> stencil(dx);
+for (int k = 0; k < SIZE; ++k) {
+    const uint64_t idx = gather_index_for_tap(k);
+    stencil.values  [k] = sidecar[idx];
+    stencil.isActive[k] = (idx != 0);
+}
+
+stencil.extrapolate(fabsf(sidecar[0]));
+float normSq = stencil.normSqGrad();
+```
+
+`FloatV` is `float` at W=1; direct scalar assignment. `MaskV` is
+`bool`. No raw buffers, no `element_aligned`, no load loops — the
+per-thread path reads as pure scalar arithmetic.
+
+### 5.3 Compile-time named-tap access
+
+```cpp
+constexpr int ctr = WenoStencil<W>::tapIndex<0, 0, 0>();
+FloatV centerValue = stencil.values[ctr];
+
+constexpr int xm3 = WenoStencil<W>::tapIndex<-3, 0, 0>();
+FloatV xm3Value = stencil.values[xm3];
+```
+
+`tapIndex()` forwards to a private static `findTap` helper
+inside `WenoStencil`, static-asserting at compile time that the
+requested tap exists in the `Taps` tuple.
+
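+A sketch of how such a compile-time lookup can be written in C++17
+follows. The `Tap` type, the truncated `Taps` alias, and the free
+functions are assumptions made for illustration (the real `findTap` is
+a private static member and the real tap tuple lives in
+`WenoStencil.h`); the recursion-plus-`static_assert` pattern is the
+point:
+
+```cpp
+#include <tuple>
+
+template<int I, int J, int K>
+struct Tap { static constexpr int di = I, dj = J, dk = K; };
+
+// Truncated for illustration; the real list has all 19 WENO5 taps.
+using Taps = std::tuple<Tap<0,0,0>, Tap<-3,0,0>, Tap<-2,0,0>, Tap<-1,0,0>, Tap<1,0,0>>;
+
+template<int DI, int DJ, int DK, std::size_t I = 0>
+constexpr int findTap()
+{
+    if constexpr (I == std::tuple_size_v<Taps>) {
+        static_assert(I < std::tuple_size_v<Taps>, "requested tap is not in Taps");
+        return -1;  // unreachable; keeps the terminal branch well-formed
+    } else {
+        using T = std::tuple_element_t<I, Taps>;
+        if constexpr (T::di == DI && T::dj == DJ && T::dk == DK)
+            return static_cast<int>(I);
+        else
+            return findTap<DI, DJ, DK, I + 1>();
+    }
+}
+
+// Indices valid for the truncated Taps above.
+static_assert(findTap<0, 0, 0>()  == 0);
+static_assert(findTap<-3, 0, 0>() == 1);
+```
+
+A public `tapIndex<di, dj, dk>()` would then be a thin wrapper over a
+helper of this shape.
+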
+---
+
+## 6. Ownership boundaries
+
+```
+┌──────────────────────────────────────────────────────────────────
+│ Caller
+│   alignas(64) float raw_values[SIZE][W];    ← fill-side buffers
+│   alignas(64) bool  raw_active[SIZE][W];
+│
+│   for k: stencil.values[k]   = FloatV(raw_values[k], ...);
+│          stencil.isActive[k] = MaskV (raw_active[k], ...);
+│   ═══════════════════════════════════════════════════ Simd border
+├──────────────────────────────────────────────────────────────────
+│ WenoStencil<W>
+│   FloatV values[19];   MaskV isActive[19];
+│   float mDx2, mInvDx2;
+│   extrapolate() / normSqGrad() — Simd-in / Simd-out, pure compute
+├──────────────────────────────────────────────────────────────────
+│ Caller
+│   ═══════════════════════════════════════════════════ Simd border
+│   util::store(normSq, normSq_lanes, util::element_aligned);
+└──────────────────────────────────────────────────────────────────
+```
+
+Array↔Simd bridges exist only at the two explicit boundaries where
+scalar-indexed I/O meets SIMD-parallel compute. Inside `WenoStencil`
+everything is Simd; outside the class the caller chooses whatever
+scalar pattern fits its source/sink.
+
+---
+
+## 7. Future work
+
+### 7.1 Alternative stencils
+
+If/when Weno7 or a non-axis-aligned stencil is needed, the class
+would specialise on a stencil-policy template parameter rather than
+hardcoding the 19-tap WENO5 shape:
+
+```cpp
+template<typename StencilPolicy, int W>
+class AxisAlignedStencil { /* derive kPairs at compile time */ };
+```
+
+The `kPairs` table would be generated from `StencilPolicy::Taps` via a
+constexpr pass that finds, for each tap, the same-axis neighbour with
+|Δ| = |tap.Δ| − 1. Not needed until a second axis-aligned stencil
+exists.
+
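+A sketch of that constexpr pass, under the simplifying assumption that
+the tap list is available as a flat constexpr array of integer offsets
+(the real `StencilPolicy::Taps` is a type list, so the production
+version would walk types instead of values; all names below are
+illustrative):
+
+```cpp
+#include <cstddef>
+
+struct TapOffset { int di, dj, dk; };
+
+constexpr int absInt(int d) { return d < 0 ? -d : d; }
+constexpr int shrink(int d) { return d > 0 ? d - 1 : (d < 0 ? d + 1 : 0); }
+
+template<std::size_t N>
+struct PairTable { int outer[N - 1]; int inner[N - 1]; };  // center tap excluded
+
+// For each non-center tap, find the same-axis neighbour with |Δ| one smaller.
+// Emitting pairs level by level preserves the ascending-|Δ| cascade order.
+template<std::size_t N>
+constexpr PairTable<N> derivePairs(const TapOffset (&taps)[N])
+{
+    PairTable<N> t{};
+    std::size_t p = 0;
+    for (int level = 1; level < int(N); ++level) {
+        for (std::size_t k = 0; k < N; ++k) {
+            const int mag = absInt(taps[k].di) + absInt(taps[k].dj) + absInt(taps[k].dk);
+            if (mag != level) continue;  // axis-aligned: exactly one nonzero component
+            const TapOffset inner{shrink(taps[k].di), shrink(taps[k].dj), shrink(taps[k].dk)};
+            for (std::size_t j = 0; j < N; ++j)
+                if (taps[j].di == inner.di && taps[j].dj == inner.dj && taps[j].dk == inner.dk) {
+                    t.outer[p] = int(k);
+                    t.inner[p] = int(j);
+                    ++p;
+                }
+        }
+    }
+    return t;
+}
+```
+
+A `static_assert` that `derivePairs` over the WENO5 offsets reproduces
+the hand-written `kPairs` table would be the natural first consumer,
+keeping the explicit table as the readable source of truth until a
+second stencil exists.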