Commit 4430b67

Docs: ArrayView<T>.CopyToHostAsync partial readback (4.9.3)
Adds a section to Docs/memory-and-buffers.md covering the new partial-readback extension shipped in 4.9.3: AV1 YUV plane separation example, per-backend implementation table (WebGPU CopyBufferToBuffer + mapAsync, WebGL GL-worker partial readback, Wasm SAB Uint8Array slice, CUDA / OpenCL / CPU view.CopyToCPU native partial copy), both extension overloads (ArrayView<T> + ArrayView1D), and guidance on when not to use it. README "Recent Highlights" updated with a 4.9.3 entry pointing at the doc.
1 parent e558d4f commit 4430b67

2 files changed

Lines changed: 59 additions & 1 deletion

File tree

Docs/memory-and-buffers.md

Lines changed: 56 additions & 0 deletions
@@ -190,6 +190,62 @@ float[] cachedResults = new float[bufferLength];
await buffer.CopyToHostAsync(cachedResults);
```

### ArrayView&lt;T&gt;.CopyToHostAsync — Partial Readback (4.9.3+)

Reads only a sub-range of a GPU buffer into a host array. Bytes outside the view never cross the device-host boundary: every backend performs a true partial copy, **not** a full-buffer readback followed by a CPU-side slice.

Use this when a single GPU buffer holds multiple logical regions (per-channel image planes, per-tensor model outputs, per-frame audio chunks, etc.) and you need each region as its own host array:

```csharp
using SpawnDev.ILGPU;

// One GPU buffer with three logical regions (Y / U / V planes for YUV 4:2:0):
var y = await dRecon.View.SubView(0, yLen).CopyToHostAsync();
var u = await dRecon.View.SubView(yLen, uvLen).CopyToHostAsync();
var v = await dRecon.View.SubView(yLen + uvLen, uvLen).CopyToHostAsync();
```
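
The `yLen` and `uvLen` values in the example are not defined in the snippet. As a minimal sketch, assuming 8-bit samples (one byte per sample) and tightly packed planes with no row padding, they could be derived from the frame dimensions as follows. `Yuv420Layout` and `PlaneLengths` are hypothetical names used for illustration only and are not part of SpawnDev.ILGPU:

```csharp
using System;

// Hypothetical helper, not part of SpawnDev.ILGPU.
// Assumes 8-bit YUV 4:2:0 with tightly packed planes (no stride padding).
static class Yuv420Layout
{
    // Returns the element counts of the luma plane and of ONE chroma plane.
    public static (int YLen, int UvLen) PlaneLengths(int width, int height)
    {
        if (width <= 0) throw new ArgumentOutOfRangeException(nameof(width));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));

        // Luma is full resolution.
        int yLen = checked(width * height);

        // Chroma is subsampled by 2 in both directions; round up for odd sizes.
        int chromaWidth = (width + 1) / 2;
        int chromaHeight = (height + 1) / 2;
        int uvLen = checked(chromaWidth * chromaHeight);

        return (yLen, uvLen);
    }
}
```

For a 1920x1080 frame this gives `yLen = 2073600` and `uvLen = 518400`, so the U plane starts at element `yLen` and the V plane at `yLen + uvLen`, matching the `SubView` offsets used above.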

Each call transfers only its own slice's bytes. Compare this with the full-buffer pattern, which reads the whole buffer and slices on the CPU:

```csharp
// AVOID: reads the entire dRecon buffer to host every call,
// then slices on the CPU. Fine for small buffers, wasteful for large ones.
var full = await dRecon.CopyToHostAsync<byte>();
var y = new byte[yLen]; Buffer.BlockCopy(full, 0, y, 0, yLen);
var u = new byte[uvLen]; Buffer.BlockCopy(full, yLen, u, 0, uvLen);
var v = new byte[uvLen]; Buffer.BlockCopy(full, yLen + uvLen, v, 0, uvLen);
```

**Per-backend implementation** (no fallback to full-buffer + slice on any backend):

| Backend | Underlying primitive |
|---|---|
| **WebGPU** | `queue.CopyBufferToBuffer(srcBuf, srcByteOffset, staging, 0, byteCount)` -> `mapAsync(Read, 0, byteCount)`. Staging is sized to the slice, not the parent buffer. |
| **WebGL** | GL-worker `ReadbackAndGetUint8ArrayAsync(buf, sourceByteOffset, byteCount)` partial range path. |
| **Wasm** | `new Uint8Array(SharedBuffer, byteOffset, byteCount)` window onto the SAB slot. The rest of wasm linear memory is not touched. |
| **CUDA / OpenCL / CPU** | ILGPU's native `view.CopyToCPU(target)`. The view's start offset and length encode the partial range, so this is one `cudaMemcpy` / `clEnqueueReadBuffer` / direct memcpy of just the slice's bytes. |
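
The table speaks in bytes (`srcByteOffset`, `byteCount`), while `SubView` takes element offsets and counts. A minimal sketch of that conversion, assuming a densely packed view of `T`, is shown below. `ByteRange.Of` is a hypothetical helper written for illustration; the library's internal computation may differ:

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper, not part of SpawnDev.ILGPU.
// Converts an element range of T into the byte range a backend would copy,
// assuming densely packed elements.
static class ByteRange
{
    public static (long ByteOffset, long ByteCount) Of<T>(long elementOffset, long elementCount)
        where T : unmanaged
    {
        if (elementOffset < 0) throw new ArgumentOutOfRangeException(nameof(elementOffset));
        if (elementCount < 0) throw new ArgumentOutOfRangeException(nameof(elementCount));

        long elementSize = Unsafe.SizeOf<T>();

        // checked() turns silent overflow into an OverflowException.
        return (checked(elementOffset * elementSize), checked(elementCount * elementSize));
    }
}
```

For example, `ByteRange.Of<float>(256, 1024)` yields a byte offset of 1024 and a byte count of 4096, so only 4096 bytes would need to cross the device-host boundary regardless of the parent buffer's size.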

**Two overloads** are provided so that `MemoryBuffer1D.View.SubView(...)` resolves naturally without an explicit cast:

```csharp
public static Task<T[]> CopyToHostAsync<T>(this ArrayView<T> view)
    where T : unmanaged;

public static Task<T[]> CopyToHostAsync<T, TStride>(this ArrayView1D<T, TStride> view)
    where T : unmanaged
    where TStride : struct, IStride1D;
```

The `ArrayView1D` overload forwards to the `ArrayView<T>` overload via `view.BaseView`. For a 1D view produced by `SubView`, `BaseView` already covers only the sliced range.

**Throws:**
- `InvalidOperationException` if the view has no backing buffer.
- `ArgumentOutOfRangeException` if the view's byte range exceeds the buffer's length.
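
The two failure conditions above can be pictured as a pre-flight check run before any transfer is issued. The sketch below is an assumption about the shape of such a check, not the library's actual code. `ViewRangeCheck.Validate` and its parameters are hypothetical:

```csharp
using System;

// Hypothetical pre-flight check, not part of SpawnDev.ILGPU.
// Mirrors the two documented failure modes.
static class ViewRangeCheck
{
    public static void Validate(bool hasBackingBuffer, long byteOffset, long byteCount, long bufferByteLength)
    {
        if (!hasBackingBuffer)
            throw new InvalidOperationException("The view has no backing buffer.");

        // Written as a subtraction so that byteOffset + byteCount cannot overflow.
        if (byteOffset < 0 || byteCount < 0 || byteOffset > bufferByteLength - byteCount)
            throw new ArgumentOutOfRangeException(
                nameof(byteCount), "The view's byte range exceeds the buffer's length.");
    }
}
```

With a 4096-byte buffer, `Validate(true, 1024, 3072, 4096)` passes, while `Validate(true, 1024, 3073, 4096)` throws `ArgumentOutOfRangeException`.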

**When NOT to use this overload:**
- You want the entire buffer's contents: use `buffer.CopyToHostAsync<T>()` directly. The `MemoryBuffer` overload exists for that case and avoids the SubView object construction.
- You're writing into a pre-allocated array: use `buffer.CopyToHostAsync(targetArray)` for the per-frame render loop pattern. The partial-readback overload always allocates a fresh `T[]`.

### CopyToHostUint8ArrayAsync — JavaScript Interop

Returns a JavaScript `Uint8Array` for direct use with browser APIs (Canvas, WebGL textures, etc.):

README.md

Lines changed: 3 additions & 1 deletion
@@ -9,7 +9,9 @@ Write parallel compute code in C# and let the library pick the best available ba
 
 ## Recent Highlights
 
-**4.9.2 (current):** OpenCL phi-binding-per-target codegen fix (Tuvok's `Av1RangeDecoderGpu.DecodeCdfQ15` round-trip green); rolls up the rc.7-rc.30 series (signed `Div by pow2` correctness, NaN/Inf codegen across WGSL/GLSL/Wasm/OpenCL, Wasm wait/notify-free + worker-headroom default, helper fn-definition emission for compile-cliff avoidance, `AcceleratorRequirements` capability gating, T4-drift + four-package version-sync CI guards).
+**4.9.3 (current):** New `ArrayView<T>.CopyToHostAsync()` extension - real per-backend partial readback for sub-views. One device buffer can be split into per-channel / per-plane host arrays without the host iterating over the full buffer. WebGPU `Half` NaN/Inf bit-pattern codegen fix (multi-compare paths now route f16 through `bitcast<u32>(vec2<f16>(x, 0.0h))` instead of the invalid `bitcast<u32>(f16)`). See [`Docs/memory-and-buffers.md` — Partial Readback](Docs/memory-and-buffers.md#arrayviewtcopytohostasync--partial-readback-493).
+
+**4.9.2:** OpenCL phi-binding-per-target codegen fix (Tuvok's `Av1RangeDecoderGpu.DecodeCdfQ15` round-trip green); rolls up the rc.7-rc.30 series (signed `Div by pow2` correctness, NaN/Inf codegen across WGSL/GLSL/Wasm/OpenCL, Wasm wait/notify-free + worker-headroom default, helper fn-definition emission for compile-cliff avoidance, `AcceleratorRequirements` capability gating, T4-drift + four-package version-sync CI guards).
 
 **4.9.0:** Complete sub-word data type support (`Int8`, `UInt8`, `Int16`, `UInt16`, `Float16`) across all 6 GPU backends + `CopyFromJS` zero-copy JS->GPU transfer.
 