Docs: ArrayView<T>.CopyToHostAsync partial readback (4.9.3)

LostBeard · LostBeard · commit 4430b676ab4a · 2026-04-29T09:48:50.000-04:00
Adds a section to Docs/memory-and-buffers.md covering the new partial-readback
extension shipped in 4.9.3: AV1 YUV plane separation example, per-backend
implementation table (WebGPU CopyBufferToBuffer + mapAsync, WebGL GL-worker
partial readback, Wasm SAB Uint8Array slice, CUDA / OpenCL / CPU view.CopyToCPU
native partial copy), both extension overloads (ArrayView<T> + ArrayView1D),
and guidance on when not to use it.

README "Recent Highlights" updated with a 4.9.3 entry pointing at the doc.
diff --git a/Docs/memory-and-buffers.md b/Docs/memory-and-buffers.md
@@ -190,6 +190,62 @@ float[] cachedResults = new float[bufferLength];
 await buffer.CopyToHostAsync(cachedResults);
 ```
 
+### ArrayView&lt;T&gt;.CopyToHostAsync — Partial Readback (4.9.3+)
+
+Reads only a sub-range of a GPU buffer into a host array. Bytes outside the view never cross the device-host boundary: this is a real per-backend partial copy, **not** a full-buffer readback followed by a CPU-side slice.
+
+Use this when a single GPU buffer holds multiple logical regions (per-channel image planes, per-tensor model outputs, per-frame audio chunks, etc.) and you need each region as its own host array:
+
+```csharp
+using SpawnDev.ILGPU;
+
+// One GPU buffer with three logical regions (Y / U / V planes for YUV 4:2:0):
+var y = await dRecon.View.SubView(0,            yLen ).CopyToHostAsync();
+var u = await dRecon.View.SubView(yLen,         uvLen).CopyToHostAsync();
+var v = await dRecon.View.SubView(yLen + uvLen, uvLen).CopyToHostAsync();
+```
+
+Each call only transfers its own slice's bytes. Compare with the full-buffer pattern, which reads the whole buffer and slices on the CPU:
+
+```csharp
+// AVOID — reads the entire dRecon buffer to host every call,
+// then slices on the CPU. Fine for small buffers, wasteful for large ones.
+var full = await dRecon.CopyToHostAsync<byte>();
+var y = new byte[yLen];  Buffer.BlockCopy(full, 0,             y, 0, yLen);
+var u = new byte[uvLen]; Buffer.BlockCopy(full, yLen,          u, 0, uvLen);
+var v = new byte[uvLen]; Buffer.BlockCopy(full, yLen + uvLen,  v, 0, uvLen);
+```
+
+**Per-backend implementation** (no fallback to full-buffer + slice on any backend):
+
+| Backend | Underlying primitive |
+|---|---|
+| **WebGPU** | `encoder.CopyBufferToBuffer(srcBuf, srcByteOffset, staging, 0, byteCount)` -> `mapAsync(Read, 0, byteCount)`. Staging is sized to the slice, not the parent buffer. |
+| **WebGL** | GL-worker `ReadbackAndGetUint8ArrayAsync(buf, sourceByteOffset, byteCount)` partial range path. |
+| **Wasm** | `new Uint8Array(SharedBuffer, byteOffset, byteCount)` window onto the SAB slot. The rest of wasm linear memory is not touched. |
+| **CUDA / OpenCL / CPU** | ILGPU's native `view.CopyToCPU(target)`. The view's start offset and length encode the partial range, so this is one `cudaMemcpy` / `clEnqueueReadBuffer` / direct memcpy of just the slice's bytes. |
+
+**Two overloads** are provided so that `MemoryBuffer1D.View.SubView(...)` resolves naturally without an explicit cast:
+
+```csharp
+public static Task<T[]> CopyToHostAsync<T>(this ArrayView<T> view)
+    where T : unmanaged;
+
+public static Task<T[]> CopyToHostAsync<T, TStride>(this ArrayView1D<T, TStride> view)
+    where T : unmanaged
+    where TStride : struct, IStride1D;
+```
+
+The `ArrayView1D` overload forwards to the `ArrayView<T>` overload via `view.BaseView`. For a 1D view produced by `SubView`, `BaseView` already covers only the sliced range.
+
+**Throws:**
+- `InvalidOperationException` if the view has no backing buffer.
+- `ArgumentOutOfRangeException` if the view's byte range exceeds the buffer's length.
+
+**When NOT to use this overload:**
+- You want the entire buffer's contents - use `buffer.CopyToHostAsync<T>()` directly. The `MemoryBuffer` overload exists for that case and avoids the SubView object construction.
+- You're writing into a pre-allocated array - use `buffer.CopyToHostAsync(targetArray)` for the per-frame render loop pattern. The partial-readback overload always allocates a fresh `T[]`.
+
 ### CopyToHostUint8ArrayAsync — JavaScript Interop
 
 Returns a JavaScript `Uint8Array` for direct use with browser APIs (Canvas, WebGL textures, etc.):
diff --git a/README.md b/README.md
@@ -9,7 +9,9 @@ Write parallel compute code in C# and let the library pick the best available ba
 
 ## Recent Highlights
 
-**4.9.2 (current):** OpenCL phi-binding-per-target codegen fix (Tuvok's `Av1RangeDecoderGpu.DecodeCdfQ15` round-trip green); rolls up the rc.7-rc.30 series (signed `Div by pow2` correctness, NaN/Inf codegen across WGSL/GLSL/Wasm/OpenCL, Wasm wait/notify-free + worker-headroom default, helper fn-definition emission for compile-cliff avoidance, `AcceleratorRequirements` capability gating, T4-drift + four-package version-sync CI guards).
+**4.9.3 (current):** New `ArrayView<T>.CopyToHostAsync()` extension - real per-backend partial readback for sub-views. One device buffer can be split into per-channel / per-plane host arrays without reading the full buffer back to the host. WebGPU `Half` NaN/Inf bit-pattern codegen fix (multi-compare paths now route f16 through `bitcast<u32>(vec2<f16>(x, 0.0h))` instead of the invalid `bitcast<u32>(f16)`). See [`Docs/memory-and-buffers.md` — Partial Readback](Docs/memory-and-buffers.md#arrayviewtcopytohostasync--partial-readback-493).
+
+**4.9.2:** OpenCL phi-binding-per-target codegen fix (Tuvok's `Av1RangeDecoderGpu.DecodeCdfQ15` round-trip green); rolls up the rc.7-rc.30 series (signed `Div by pow2` correctness, NaN/Inf codegen across WGSL/GLSL/Wasm/OpenCL, Wasm wait/notify-free + worker-headroom default, helper fn-definition emission for compile-cliff avoidance, `AcceleratorRequirements` capability gating, T4-drift + four-package version-sync CI guards).
 
 **4.9.0:** Complete sub-word data type support (`Int8`, `UInt8`, `Int16`, `UInt16`, `Float16`) across all 6 GPU backends + `CopyFromJS` zero-copy JS->GPU transfer.