Adds a section to Docs/memory-and-buffers.md covering the new partial-readback
extension shipped in 4.9.3: AV1 YUV plane separation example, per-backend
implementation table (WebGPU CopyBufferToBuffer + mapAsync, WebGL GL-worker
partial readback, Wasm SAB Uint8Array slice, CUDA / OpenCL / CPU view.CopyToCPU
native partial copy), both extension overloads (ArrayView<T> + ArrayView1D),
and guidance on when not to use it.
README "Recent Highlights" updated with a 4.9.3 entry pointing at the doc.
Reads only a sub-range of a GPU buffer to a host array. The byte range outside the view never crosses the device-host boundary - this is a real per-backend partial copy, **not** a full-buffer readback followed by a CPU-side slice.

Use this when a single GPU buffer holds multiple logical regions (per-channel image planes, per-tensor model outputs, per-frame audio chunks, etc.) and you need each region as its own host array:

```csharp
using SpawnDev.ILGPU;

// One GPU buffer with three logical regions (Y / U / V planes for YUV 4:2:0):
```

**Per-backend implementation** (no fallback to full-buffer + slice on any backend):

| Backend | Underlying primitive |
|---|---|
| **WebGPU** | `queue.CopyBufferToBuffer(srcBuf, srcByteOffset, staging, 0, byteCount)` -> `mapAsync(Read, 0, byteCount)`. Staging is sized to the slice, not the parent buffer. |
| **WebGL** | GL-worker `ReadbackAndGetUint8ArrayAsync(buf, sourceByteOffset, byteCount)` partial range path. |
| **Wasm** | `new Uint8Array(SharedBuffer, byteOffset, byteCount)` window onto the SAB slot. The rest of wasm linear memory is not touched. |
| **CUDA / OpenCL / CPU** | ILGPU's native `view.CopyToCPU(target)`. The view's start offset and length encode the partial range, so this is one `cudaMemcpy` / `clEnqueueReadBuffer` / direct memcpy of just the slice's bytes. |

**Two overloads** are provided so that `MemoryBuffer1D.View.SubView(...)` resolves naturally without an explicit cast:
The `ArrayView1D` overload forwards to the `ArrayView<T>` overload via `view.BaseView`, which is already the sliced range on a SubView'd 1D view.
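
The "sliced range" a view carries comes down to an element offset and a length, which every primitive in the table above consumes as a byte offset and a byte count. The sketch below works those numbers out for the YUV 4:2:0 case. It is plain C# with no GPU dependency; `Region` and `Yuv420Layout` are hypothetical names introduced here, not part of SpawnDev.ILGPU, and it assumes 8-bit samples packed tightly in Y, U, V order.

```csharp
using System;

// Hypothetical helper (not a SpawnDev.ILGPU type): one logical region of a
// packed buffer, expressed as an element offset and an element count.
public readonly record struct Region(string Name, long Offset, long Length)
{
    // Byte range that a partial copy of this region would move.
    public long ByteOffset(int elementSize) => Offset * elementSize;
    public long ByteCount(int elementSize) => Length * elementSize;
}

public static class Yuv420Layout
{
    // Assumed layout: full-resolution Y plane, then U and V planes subsampled
    // by two in each dimension (rounded up for odd frame sizes).
    public static Region[] Planes(int width, int height)
    {
        if (width <= 0) throw new ArgumentOutOfRangeException(nameof(width));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));

        long ySize = (long)width * height;
        long chromaSize = (long)((width + 1) / 2) * ((height + 1) / 2);

        return new[]
        {
            new Region("Y", 0, ySize),
            new Region("U", ySize, chromaSize),
            new Region("V", ySize + chromaSize, chromaSize),
        };
    }
}

public static class Program
{
    public static void Main()
    {
        // For 1920x1080 this prints:
        //   Y: byte offset 0, 2073600 bytes
        //   U: byte offset 2073600, 518400 bytes
        //   V: byte offset 2592000, 518400 bytes
        foreach (var plane in Yuv420Layout.Planes(1920, 1080))
        {
            Console.WriteLine(
                $"{plane.Name}: byte offset {plane.ByteOffset(sizeof(byte))}, " +
                $"{plane.ByteCount(sizeof(byte))} bytes");
        }
    }
}
```

Under these assumptions, reading back only the U plane moves 518,400 bytes instead of the 3,110,400 bytes of the whole frame. Each `Offset` / `Length` pair is the kind of range one would presumably pass to `SubView(...)` before calling `CopyToHostAsync()`.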

**Throws:**

- `InvalidOperationException` if the view has no backing buffer.
- `ArgumentOutOfRangeException` if the view's byte range exceeds the buffer's length.
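
The two conditions above can be pictured as a guard that runs before any copy is issued. The following is a minimal sketch of such a guard, not the library's actual code: `ReadbackGuards.EnsureReadable` and its parameters are hypothetical, and the real extension may perform its checks differently.

```csharp
using System;

public static class ReadbackGuards
{
    // Hypothetical guard (not SpawnDev.ILGPU code): rejects a view with no
    // backing buffer, or one whose byte range [byteOffset, byteOffset + byteCount)
    // does not fit inside a parent buffer of bufferByteLength bytes.
    public static void EnsureReadable(
        bool hasBackingBuffer, long bufferByteLength, long byteOffset, long byteCount)
    {
        if (!hasBackingBuffer)
            throw new InvalidOperationException("View has no backing buffer.");
        if (byteOffset < 0)
            throw new ArgumentOutOfRangeException(nameof(byteOffset));
        if (byteCount < 0)
            throw new ArgumentOutOfRangeException(nameof(byteCount));

        // Compare by subtraction so that offset + count cannot overflow.
        if (byteOffset > bufferByteLength || byteCount > bufferByteLength - byteOffset)
            throw new ArgumentOutOfRangeException(
                nameof(byteCount), "View's byte range exceeds the buffer's length.");
    }
}

public static class Program
{
    public static void Main()
    {
        // A 512-byte slice ending exactly at the end of a 1024-byte buffer is fine.
        ReadbackGuards.EnsureReadable(true, 1024, 512, 512);

        try
        {
            // This slice would run 256 bytes past the end of the buffer.
            ReadbackGuards.EnsureReadable(true, 1024, 768, 512);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("range rejected");
        }
    }
}
```

Checking the range up front keeps a bad sub-view from reaching the backend copy call, where an out-of-bounds request would surface as a backend-specific error instead.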

**When NOT to use this overload:**

- You want the entire buffer's contents - use `buffer.CopyToHostAsync<T>()` directly. The `MemoryBuffer` overload exists for that case and avoids the SubView object construction.
- You're writing into a pre-allocated array - use `buffer.CopyToHostAsync(targetArray)` for the per-frame render loop pattern. The partial-readback overload always allocates a fresh `T[]`.

README.md: 3 additions & 1 deletion

@@ -9,7 +9,9 @@ Write parallel compute code in C# and let the library pick the best available ba

 ## Recent Highlights

-**4.9.2 (current):** OpenCL phi-binding-per-target codegen fix (Tuvok's `Av1RangeDecoderGpu.DecodeCdfQ15` round-trip green); rolls up the rc.7-rc.30 series (signed `Div by pow2` correctness, NaN/Inf codegen across WGSL/GLSL/Wasm/OpenCL, Wasm wait/notify-free + worker-headroom default, helper fn-definition emission for compile-cliff avoidance, `AcceleratorRequirements` capability gating, T4-drift + four-package version-sync CI guards).
+**4.9.3 (current):** New `ArrayView<T>.CopyToHostAsync()` extension - real per-backend partial readback for sub-views. One device buffer can be split into per-channel / per-plane host arrays without the host iterating over the full buffer. WebGPU `Half` NaN/Inf bit-pattern codegen fix (multi-compare paths now route f16 through `bitcast<u32>(vec2<f16>(x, 0.0h))` instead of the invalid `bitcast<u32>(f16)`). See [`Docs/memory-and-buffers.md` - Partial Readback](Docs/memory-and-buffers.md#arrayviewtcopytohostasync--partial-readback-493).
+
+**4.9.2:** OpenCL phi-binding-per-target codegen fix (Tuvok's `Av1RangeDecoderGpu.DecodeCdfQ15` round-trip green); rolls up the rc.7-rc.30 series (signed `Div by pow2` correctness, NaN/Inf codegen across WGSL/GLSL/Wasm/OpenCL, Wasm wait/notify-free + worker-headroom default, helper fn-definition emission for compile-cliff avoidance, `AcceleratorRequirements` capability gating, T4-drift + four-package version-sync CI guards).

 **4.9.0:** Complete sub-word data type support (`Int8`, `UInt8`, `Int16`, `UInt16`, `Float16`) across all 6 GPU backends + `CopyFromJS` zero-copy JS->GPU transfer.