
vulkan hardware accel #39

Open
SimonDanisch wants to merge 51 commits into master from sd/vk-hw-accel


@SimonDanisch
Member

No description provided.

jkrumbiegel and others added 30 commits March 1, 2026 21:02
Two issues fixed:

1. Runtime NTuple indexing returns 0 on Metal. Replace PERMUTATIONS_4WAY
   (nested NTuples indexed at runtime) with PACKED_PERMUTATIONS_4WAY
   (bit-packed UInt32 values appended to the Sobol matrices GPU array).
   lookup_permutation now takes the matrices array and uses bit shifts.

2. @nexprs unrolled loops cause Metal compiler crashes or miscompilation
   when the loop body references non-constant kernel arguments. Replace
   with regular for loops in sobol_sample.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
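The bit-packing described in item 1 can be illustrated with a small host-side sketch. It assumes one permutation per `UInt32` with two bits per entry, stored after the matrix data in a flat array. The names `pack_permutation` and `lookup_permutation_sketch` and the `offset` argument are hypothetical and are not the actual Hikari layout or signature.

```julia
# Hypothetical sketch: pack a permutation of (0, 1, 2, 3) into one UInt32,
# two bits per entry, so a kernel can read it from a flat array with shifts
# instead of indexing a nested NTuple at runtime.
function pack_permutation(p::NTuple{4,Int})
    packed = UInt32(0)
    for (i, v) in enumerate(p)
        packed |= UInt32(v) << (2 * (i - 1))
    end
    return packed
end

# Read entry `digit` (1-based) of permutation `perm_idx` (1-based).
# `offset` is the number of array elements that precede the packed values.
function lookup_permutation_sketch(data::AbstractVector{UInt32}, offset::Int,
                                   perm_idx::Int, digit::Int)
    packed = data[offset + perm_idx]
    return Int((packed >> (2 * (digit - 1))) & UInt32(0x3))
end

# Stand-in for the matrices array: two dummy words, then two packed permutations.
perms = [(0, 1, 2, 3), (3, 1, 0, 2)]
data = vcat(UInt32[0xdeadbeef, 0x12345678], pack_permutation.(perms))
second_perm = [lookup_permutation_sketch(data, 2, 2, d) for d in 1:4]
```

Every lookup is a single array read followed by a shift and a mask, which avoids the runtime tuple indexing that the commit reports as failing on Metal.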
The D65 illuminant spectrum was stored as NTuple{107, Float32} and indexed
at runtime, which returns 0 on Metal (same root cause as the Sobol fix).

Add d65_values field to RGBToSpectrumTable as a GPU array, and add
GPU-compatible sample_d65/sample_d65_spectral overloads that read from
the array instead of the NTuple constant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- free!(::VolPathState) and free!(::WorkQueue) for proper GPU memory release
- VolPath accumulator clearing when iteration_index==0
- HW RT minor fix
- New test: caching & GC correctness
SamplerIntegrators (Whitted) don't have cached state to clear,
but RayMakie's render path calls clear!(integrator) generically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lava is now a [weakdeps] instead of [deps]. The hardware ray tracing
code (hw-rt.jl) is loaded via ext/HikariLavaExt.jl when Lava is
available, keeping Hikari usable without Vulkan installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clamp _gpu_ndrange to a minimum of 1 instead of 0. An ndrange of 0 causes a
DivideError in KernelAbstractions.partition on AMDGPU (and potentially other
backends). The kernel's bounds check (idx > queue_size) handles the empty case safely.
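A CPU-only sketch of why the clamp is safe: the launch size never drops below 1, and the bounds check turns the extra work item into a no-op. The helper `clamped_ndrange` and the loop standing in for a kernel launch are illustrative assumptions, not the actual `_gpu_ndrange` implementation.

```julia
# Hypothetical helper: never launch with an ndrange of 0.
clamped_ndrange(queue_size::Integer) = max(Int(queue_size), 1)

# CPU stand-in for a kernel launch: each loop iteration plays one GPU thread.
function double_queue!(out::Vector{Float32}, queue::Vector{Float32}, queue_size::Integer)
    for idx in 1:clamped_ndrange(queue_size)
        idx > queue_size && continue   # kernel-side bounds check
        out[idx] = 2 * queue[idx]
    end
    return out
end

# Empty queue: one idle "thread" runs, nothing is read or written.
empty_out = double_queue!(Float32[], Float32[], 0)
full_out = double_queue!(zeros(Float32, 3), Float32[1, 2, 3], 3)
```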
…tprocess

- Fix hw-rt volpath: expanded integrator with proper camera setup
- Film iteration tracking fix (isapprox tolerance for camera comparison)
- Denoiser: adapt to Lava texture API changes
- Postprocess: use updated film access patterns
Camera & film:
- Fix horizontal flip: bypass LookAt roundtrip in scene builder, pass
  pbrt transform directly as Transformation to PerspectiveCamera
- Fix 1-pixel offset: film coords x+0.5 → x-0.5 for 1-based pixels

Dielectric BSDF:
- Fix smooth dielectric f/pdf: return f=R*kr/cos, pdf=R (reflection)
  and f=T*kt/cos/etap², pdf=T (transmission) matching pbrt DielectricBxDF
- Implement rough dielectric microfacet sampling (was missing entirely —
  roughness parameter was ignored, always sampled as specular)
- Include Fresnel probability R or T in rough sampling PDF

Transport mode (1/etap² Radiance correction):
- Add radiance_mode parameter to sample_dielectric_interface and
  eval_dielectric_interface, applying ft/=etap² for transmission
- Fix evaluate_bsdf_spectral in LayeredBxDF: wos uses Radiance mode,
  wis uses Importance mode (matching pbrt !mode convention)
- Fix LayeredBxDF exit eta: return 1.0 (matching pbrt hardcoded value)

Spot light:
- Fix cone direction: Vec3f cast for world_to_light(-wi) — Point3f
  subtraction kept Point3f type, including translation in transform

Integrator parameters:
- Parse regularize (default false), russian_roulette_depth (default 1),
  maxcomponentvalue (default Inf) from pbrt scene files
- Remove dead multi-material-eval.jl (unused :per_type/:sorted modes)

Test suite:
- Rewrite with tile-based image comparison (Makie ReferenceTests approach)
- Threshold 0.07 tile score catches spatial errors like flipped images
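To illustrate why a tile-based score catches a flipped image when a global average would not, here is a minimal sketch. The scoring rule (worst per-tile mean absolute difference) and the name `tile_score` are assumptions for illustration; they are not the Makie ReferenceTests algorithm or the exact metric used in this test suite. Only the 0.07 threshold is taken from the commit text.

```julia
# Hypothetical score: mean absolute difference per tile, reduced with max.
function tile_score(a::AbstractMatrix{<:Real}, b::AbstractMatrix{<:Real}; tile::Int = 8)
    size(a) == size(b) || throw(DimensionMismatch("images differ in size"))
    worst = 0.0
    for r in 1:tile:size(a, 1), c in 1:tile:size(a, 2)
        rows = r:min(r + tile - 1, size(a, 1))
        cols = c:min(c + tile - 1, size(a, 2))
        err = sum(abs.(a[rows, cols] .- b[rows, cols])) / (length(rows) * length(cols))
        worst = max(worst, err)
    end
    return worst
end

# A horizontal gradient and its mirror image have the same global mean,
# but every tile differs strongly.
img = [j / 16 for i in 1:16, j in 1:16]
flipped = reverse(img, dims = 2)
same_score = tile_score(img, img)
flip_score = tile_score(img, flipped)
```

Under these assumptions, `flip_score` is well above 0.07 even though `sum(img)` equals `sum(flipped)`.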
The PDF function used lerp(0.9, 1/(4π), pdfSum), which computed
(1-pdfSum)*0.9 + pdfSum/(4π) instead of the intended
0.1/(4π) + 0.9*pdfSum (pbrt-v4's Lerp(0.9, 1/(4π), pdfSum)).

Hikari defines lerp(v1, v2, t) = (1-t)*v1 + t*v2, with the interpolation
weight last, whereas pbrt-v4's Lerp takes the weight first. The correct
call is therefore lerp(1/(4π), pdfSum, 0.9).

The wrong PDF caused MIS to massively overweight rare samples,
producing visible fireflies in coated material highlights.
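The argument-order mismatch can be checked numerically. The sketch below defines both conventions under illustrative names (`hikari_lerp`, `pbrt_lerp`) and uses an arbitrary `pdf_sum`; only the two formulas come from the commit text.

```julia
# Hikari's convention as stated above: interpolation weight last.
hikari_lerp(v1, v2, t) = (1 - t) * v1 + t * v2
# pbrt-v4's convention: interpolation weight first.
pbrt_lerp(t, a, b) = (1 - t) * a + t * b

uniform_pdf = 1 / (4π)
pdf_sum = 2.0   # arbitrary example value

intended = pbrt_lerp(0.9, uniform_pdf, pdf_sum)     # 0.1/(4π) + 0.9*pdf_sum
buggy    = hikari_lerp(0.9, uniform_pdf, pdf_sum)   # (1 - pdf_sum)*0.9 + pdf_sum/(4π)
fixed    = hikari_lerp(uniform_pdf, pdf_sum, 0.9)   # same value as `intended`
```

For this example value the buggy expression is negative, which is not a valid PDF at all.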
Match pbrt-v4's Frame::FromXZ(Normalize(dpdus), ns) for the BSDF
local frame. Pass dpdus through dispatch to all material functions.
This makes the local frame identical to pbrt's, so RNG-seeded
random walks inside coated materials produce the same noise patterns.
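A minimal sketch of the frame construction, following pbrt-v4's `Frame::FromXZ`, which builds the y axis as `Cross(z, x)`. The struct and function names are hypothetical, and the sketch assumes the normalized `dpdus` is perpendicular to `ns`, as FromXZ itself does.

```julia
using LinearAlgebra

# Orthonormal frame from an x axis and a z axis, with y = z × x.
# Assumes x and z are unit length and perpendicular.
struct FrameSketch
    x::Vector{Float64}
    y::Vector{Float64}
    z::Vector{Float64}
end

frame_from_xz(x, z) = FrameSketch(x, cross(z, x), z)

# World -> local: project onto each axis. Local -> world: recombine the axes.
to_local(f::FrameSketch, v) = [dot(v, f.x), dot(v, f.y), dot(v, f.z)]
from_local(f::FrameSketch, v) = v[1] * f.x + v[2] * f.y + v[3] * f.z

dpdus = [2.0, 0.0, 0.0]   # example shading tangent
ns = [0.0, 0.0, 1.0]      # example shading normal
frame = frame_from_xz(normalize(dpdus), ns)
v = [0.3, -0.4, 0.5]
roundtrip = from_local(frame, to_local(frame, v))
```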
…red interface functions

The pbrt parser for 'dielectric' material only parsed 'roughness',
missing 'uroughness'/'vroughness'. Scenes using anisotropic roughness
therefore got roughness=0 (smooth glass). Fixed by parsing both parameters,
matching the conductor parser.

Also simplified rough dielectric sample/evaluate to delegate to
sample_dielectric_interface and eval_dielectric_interface from
common.jl, which are proven correct in coated material tests.
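The parameter handling can be sketched as follows. This assumes parameters arrive as a dictionary and that 'uroughness'/'vroughness' fall back to 'roughness' when absent; both the function name and that fallback rule are assumptions for illustration, not the actual Hikari parser.

```julia
# Hypothetical parser fragment: anisotropic parameters take precedence,
# the isotropic 'roughness' is the fallback, and 0 means smooth glass.
function parse_dielectric_roughness(params::AbstractDict{String,<:Real})
    roughness = Float32(get(params, "roughness", 0))
    urough = Float32(get(params, "uroughness", roughness))
    vrough = Float32(get(params, "vroughness", roughness))
    return (urough, vrough)
end

anisotropic = parse_dielectric_roughness(Dict("uroughness" => 0.1, "vroughness" => 0.3))
isotropic = parse_dielectric_roughness(Dict("roughness" => 0.2))
smooth = parse_dielectric_roughness(Dict{String,Float64}())
```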
…peline

- Split CI into 4 parallel/sequential jobs: test, pbrt-comparison, docs-build, deploy-docs
- Install lavapipe for Vulkan GPU tests
- pbrt comparison renders 134 scenes and uploads gallery as artifact
- docs-build runs makedocs with @example blocks, uploads build artifact
- deploy-docs merges both artifacts and deploys to gh-pages
- Add test/pbrt/Project.toml for comparison script dependencies
- Remove deploydocs from make.jl (handled by deploy step)
- Fix FilmSensor removal in examples/cat_scene.jl
- Cache rays/results buffers on Film (aux_rays, aux_results RefValue slots)
  to eliminate 64 per-frame GPU allocations from fill_aux_buffers! in the
  HW RT path. Verified via allocation tracking: 0 allocs after warmup.

- Fix camera transform helpers to use Raycore.transform_point/direction
  instead of calling camera_to_world as a functor.

- Remove SOA overrides for VPMaterialEvalWorkItem, VPHitSurfaceWorkItem,
  VPMediumSampleWorkItem, VPMediumScatterWorkItem (only VPRayWorkItem,
  VPShadowRayWorkItem, VPEscapedRayWorkItem, VPRaySamples use SOA).

- GPU majorant grid building via build_majorant_kernel!.

- Update Adapt.adapt_structure and free! for new Film fields.
pbrt-v4 `Material "interface"` returns nullptr — the surface has no BSDF
and rays pass through transparently, only the medium swap fires. Hikari
was mapping it to `Dielectric(Kr=0, Kt=1, index=1)`, which still triggers
BSDF sampling and (worse) makes shadow rays treat the surface as opaque
at intersection.jl:351. Volume bounding meshes wrapped this way render
as a uniformly-shadowed cuboid that obscures the medium contents.

Fix:
- New `NullMaterial <: Material` marker type whose `push!` returns an
  invalid `SetKey()`, so the existing `!is_valid(mi.material)` checks
  in shadow- and primary-ray code identify the null boundary.
- Skip-intersection guard at the top of `evaluate_material_inner!` and
  `surface_direct_lighting_inner!`: when the material is invalid and the
  interface is a medium transition, no BSDF sample, no direct lighting,
  no depth increment — push the ray past the boundary with the swapped
  medium to next_ray_queue (mirrors pbrt cpu/integrators.cpp:420,568,681
  `if (!bsdf) SkipIntersection`).
- pbrt parser: `Material "interface"` -> `NullMaterial()`.

Reference test: `medium_null_interface_homog.pbrt` exercises the path
end-to-end (rgb-tinted homogeneous fog inside a null-interface cube).
`run_comparison.jl` now uses the Lava backend, and the `n_rendered` counter
update was hoisted into `render_all_missing` to fix a soft-scope bug, so
progress prints work.
…apt_scene!

The integrator's cache fields were all typed `::Any` with `(scene_id,
adapted) or nothing` / `(camera_pos, SetKey) or nothing` tuple-juggling,
and downstream packages (RayMakie.screen) poked at the `_`-prefixed
fields directly.

- `adapted_scene::Any` + `adapted_scene_id::UInt64` replace the
  `(scene_id, adapted) or nothing` tuple. `id == 0` means invalid.
- `initial_medium_camera_pos` + `initial_medium_key` replace the
  `(camera_pos, SetKey) or nothing` tuple.
- `filter_sampler_gpu` dropped the `_` prefix.
- New public function: `get_or_adapt_scene!(vp, backend, scene)` is the
  canonical way to obtain the adapted scene. External callers use it
  instead of reaching for the field directly.
- close(vp) clears all cache fields consistently (including id reset).
- Tests updated to read the new fields.
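The id-based cache can be sketched with plain Julia. The field names `adapted_scene` and `adapted_scene_id` follow the commit text; the struct, the use of `objectid` to derive the id, and the `adapt` callback are assumptions for illustration and do not reproduce the real `get_or_adapt_scene!(vp, backend, scene)`.

```julia
mutable struct SceneCacheSketch
    adapted_scene::Any
    adapted_scene_id::UInt64   # 0 means "no valid cached scene"
end
SceneCacheSketch() = SceneCacheSketch(nothing, UInt64(0))

# Re-adapt only when the scene identity changes; otherwise return the cached value.
function get_or_adapt_scene_sketch!(cache::SceneCacheSketch, adapt, scene)
    id = UInt64(objectid(scene))   # assumption: identity-based id
    if cache.adapted_scene_id != id
        cache.adapted_scene = adapt(scene)
        cache.adapted_scene_id = id
    end
    return cache.adapted_scene
end

# Mirrors the "clear all cache fields, including id reset" behavior.
function clear_cache!(cache::SceneCacheSketch)
    cache.adapted_scene = nothing
    cache.adapted_scene_id = UInt64(0)
    return cache
end

adapt_calls = Ref(0)
fake_adapt(scene) = (adapt_calls[] += 1; copy(scene))   # stand-in for GPU adaptation
cache = SceneCacheSketch()
scene = [1.0, 2.0, 3.0]
first_result = get_or_adapt_scene_sketch!(cache, fake_adapt, scene)
second_result = get_or_adapt_scene_sketch!(cache, fake_adapt, scene)
```

Keeping the id in its own typed field means callers can test validity with a single integer comparison instead of unpacking a tuple or checking for `nothing`.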
Hit shaders used to destructure closest_hit into 4 vars (dropping the
instance info) and look up material via `primitive.metadata.medium_interface_idx`
— forcing one BLAS per distinct material.  That's the meshscatter BLAS
explosion: a 100-point scatter produced 100 identical-geometry BLASes.

Now:

- intersection.jl gains `resolve_mi_idx(accel, inst_idx, primitive)`.
  If `accel.instances[inst_idx].instance_id != 0`, use it as the
  medium_interface_idx override; otherwise fall back to the triangle's
  per-face metadata.  Single lookup, same CPU cost as before.
- All 4 closest_hit sites in volpath/intersection.jl destructure
  `inst_idx` and route through `resolve_mi_idx`.
- surface_interaction.jl's legacy intersect! path now reads the correct
  instance descriptor via `accel.instances[inst_idx]` (it already
  expected these semantics, so it was always broken for multi-instance
  TLAS).  The SurfaceInteraction's `instance_id` field now stores the
  array index.

Verified end-to-end: 32x32 VolPath render through the Lava backend
produces 464 visible pixels, no DEVICE_LOST.
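The override rule described for `resolve_mi_idx` can be sketched in a few lines. The types and the `_sketch` names are hypothetical stand-ins; only the rule itself (a nonzero `instance_id` overrides, zero falls back to the per-face index) comes from the commit text.

```julia
struct InstanceDescriptorSketch
    instance_id::UInt32   # 0 = inherit per-face metadata, nonzero = override
end

# A nonzero per-instance id wins; otherwise use the triangle's own index.
function resolve_mi_idx_sketch(instances::Vector{InstanceDescriptorSketch},
                               inst_idx::Integer, per_face_idx::UInt32)
    override = instances[inst_idx].instance_id
    return override != 0 ? override : per_face_idx
end

instances = [InstanceDescriptorSketch(0), InstanceDescriptorSketch(7)]
inherited = resolve_mi_idx_sketch(instances, 1, UInt32(3))
overridden = resolve_mi_idx_sketch(instances, 2, UInt32(3))
```

Because the material choice moves from the triangle to the instance, many instances with different materials can share one piece of geometry.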
…nstancing

N instances share one BLAS.  Each material is registered as its own
MediumInterface and becomes the instance's interface override
(InstanceDescriptor.instance_id); the hit shader resolves it via
resolve_mi_idx.

Previously, meshscatter with N distinct colors forced N separate
build_blas + N per-triangle-tagged meshes — identical geometry
duplicated N times (~1 GB/frame memory growth for the dolphin demo
with ~100 arrow instances per scene update).

Rejects emissive materials explicitly: per-instance area lights would
need per-instance-transformed geometry, which is a separate feature.
Emitters still use the single-transform push!.

Verified: 4-color sphere scatter renders with 2 BLASes total (floor +
1 sphere BLAS), all 4 instance colors visible in the output.
…ndex

Mirrors the SW override fix (commit on the sd/vk-hw-accel branch) on the
hardware-adapted wavefront path:

- `process_shadow_round_kernel!`: triangle lookup now uses
  `result.instance_id` (gl_InstanceID → `off_gpu` is now keyed per
  instance, not per BLAS).  Material resolves via the override in
  `result.instance_custom_index` falling back to the triangle's
  per-face metadata — same rule as SW.
- `hw_raygen_shadow`: reads payload slot 6 (gl_InstanceID) and writes
  it into `RTHitResult.instance_id`.
- `hw_anyhit_shadow`: uses `rt_instance_id(accel)` for triangle lookup
  and applies the same override rule.
- `resolve_mi_idx` gets a second method on `::Any` for accelerators
  whose `closest_hit` 5th return is already the override (HW path),
  disambiguated from the `StaticTLAS` path that returns the array index.

Verified: 4-color sphere scatter through the Lava HW RT path renders
all 4 instance colors (R, G, B channels all > 1), device_lost=false.
Covers the Phase-C refactor surface:

- `InstanceDescriptor.instance_id` semantics (default=0 inherit,
  nonzero=override, round-trip through explicit push! kwargs).
- `Raycore.closest_hit` returns the 1-based instance array index.
- `push!(scene, mesh, materials, transforms)`: builds 1 BLAS + N
  instances (not N BLASes), per-instance material routed through
  `instance_id`.  Emissive materials rejected with a clear error.
- SW render end-to-end: 4 pure-R/G/B/yellow spheres all produce their
  distinct color channels in the output.
- HW TLAS structure: 1 BLAS + 4 scatter instances + distinct nonzero
  `instance_custom_indices` overrides.  (HW render works in isolation;
  a cross-run lifetime bug that DEVICE_LOSTs a second HW render's
  rt_indirect at ~dispatch 104 is tracked as follow-up — orthogonal
  to Phase C.)
- Rapid push/delete cycles: 10 * (push 8 / delete 8) -> BLAS count
  stays <= 2 (would have been 80 before the fix).
- update_instance_transform! preserves instance_id.
- instance_id=0 correctly falls through to per-triangle
  medium_interface_idx.

46 tests, all passing.
* Shadow round 2+ now traces all n_rays_gpu rays per round.  Previously
  dispatched only `active_counter`, but extract_shadow_rays2_kernel writes
  ray_buf in state-index order without compaction — actives at indices
  >= active_counter silently re-used round-1 results.  Degenerate rays
  for inactive states are cheap RT misses.
* finalize_shadow_kernel and process_shadow_round_kernel gate on
  queue_size[1] (not just `cap`) — fixes iter-6 GART PERMISSION_FAULT
  cascade caused by stale shadow_states past the live queue size.
* Remove dead count_active_shadows_kernel + hw_shadow_counter +
  hw_depth_ray_buf + hw_depth_result_buf fields.
* New tests: test_cascade_scene_switch_mwe, test_hw_accel_stability,
  test_volpath_per_iter_lifecycle, plus pbrt suite scaffolding.
VolPath integrator gains cull_mask::UInt32 = UInt32(0xFF) field.
hw-rt's vp_trace_rays! and vp_trace_shadow_rays! thread
integrator.cull_mask to batch_trace_indirect and
trace_closest_hits_indirect! respectively. fill_aux_buffers! gains a
cull_mask kwarg (default 0xFF) for callers that have integrator access.

Default 0xFF preserves all existing renders. Custom masks let callers
filter to specific instance subsets in a unified TLAS, e.g., 0x04 for
rendering instances when physics-only (0x02) instances coexist.
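The masking rule can be sketched on the CPU. In Vulkan ray tracing an instance is considered only when its instance mask shares at least one bit with the ray's cull mask. The mask values 0x04 and 0x02 follow the example above, while the instance list and function name are illustrative assumptions.

```julia
# An instance is considered only if its mask shares a bit with the cull mask.
is_visible(instance_mask::UInt32, cull_mask::UInt32) = (instance_mask & cull_mask) != 0

render_mask = UInt32(0x04)
physics_mask = UInt32(0x02)

# Hypothetical unified TLAS: render-only, physics-only, both, and default 0xFF.
instance_masks = UInt32[render_mask, physics_mask, render_mask | physics_mask, 0xFF]

visible_to_render = findall(m -> is_visible(m, render_mask), instance_masks)
visible_to_default = findall(m -> is_visible(m, UInt32(0xFF)), instance_masks)
```

With the default 0xFF cull mask every instance in this list stays visible, consistent with the statement that existing renders are unaffected.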

Also required: batch_trace_indirect in Lava gains cull_mask kwarg --
that change lands in dev/Lava/ alongside this commit.

P3.2 of the GPU rigid-body pipeline. Builds on Lava P3.1 (commit 27d951c).