| 1 | +# D2D / D3D11 Hybrid Compute System |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +D2D's custom compute shader API (`ID2D1ComputeTransform`) has fundamental limitations that prevent full-image reduction operations: |
| 6 | + |
| 7 | +| Limitation | Impact | |
| 8 | +|-----------|--------| |
| 9 | +| **Per-tile UAV clearing** | D2D clears the output `RWTexture2D<float4>` before each tile dispatch. Scatter writes don't accumulate across tiles. | |
| 10 | +| **No custom UAV binding** | `ID2D1ComputeInfo::SetResourceTexture` binds read-only `ID2D1ResourceTexture` (register t), not UAVs (register u). | |
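| 11 | +| **No uint atomics on output** | The output UAV is `RWTexture2D<float4>`. `InterlockedMin`/`InterlockedMax`/`InterlockedAdd` operate only on `int`/`uint` resources or groupshared memory (e.g. `RWBuffer<uint>`), so they cannot target the float4 output. | |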
| 12 | +| **No input as D3D11 texture** | `PrepareForRender` doesn't expose the input image as a D3D11 surface. The effect context is deliberately isolated from the device. | |
| 13 | + |
| 14 | +The built-in `CLSID_D2D1Histogram` effect works around these via private D2D internals not exposed through the public API. |
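The "no uint atomics" row raises a practical question: how can a float minimum or maximum be computed with integer-only atomics at all? One widely used technique, sketched below on the CPU, maps each float to a `uint` key whose unsigned ordering matches the float ordering, so an integer min/max over keys yields the float min/max. This is an illustration only: the source does not say whether this project uses the trick, and the helper names `FloatToOrderedUint` / `OrderedUintToFloat` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: map a float to a uint32 key so that
// (a < b) implies (key(a) < key(b)) under unsigned comparison.
// NaN has no meaningful ordering, so it is rejected up front.
inline std::uint32_t FloatToOrderedUint(float f) {
    assert(!std::isnan(f));
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));   // same idea as HLSL asuint()
    // Negative floats: flip every bit (reverses their order, clears the top bit).
    // Non-negative floats: set the top bit so they sort above all negatives.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Inverse mapping, applied after reading the key back on the CPU.
inline float OrderedUintToFloat(std::uint32_t key) {
    std::uint32_t bits = (key & 0x80000000u) ? (key & 0x7FFFFFFFu) : ~key;
    float f;
    std::memcpy(&f, &bits, sizeof(f));      // same idea as HLSL asfloat()
    return f;
}
```

In a shader, the key would be the value passed to `InterlockedMin` / `InterlockedMax` on a `uint` UAV, and the host would decode the result after readback.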
| 15 | + |
| 16 | +## Solution: Evaluator-Owned D3D11 Dispatch |
| 17 | + |
| 18 | +The graph evaluator owns a **raw D3D11 compute dispatch path** that bypasses D2D's tiling entirely. D2D handles the effect graph wiring (input/output connections), while D3D11 handles the actual computation. |
| 19 | + |
| 20 | +## COM Class Hierarchy |
| 21 | + |
| 22 | +```mermaid |
| 23 | +classDiagram |
| 24 | + class ID2D1EffectImpl { |
| 25 | + <<interface>> |
| 26 | + +Initialize(effectContext, transformGraph) |
| 27 | + +PrepareForRender(changeType) |
| 28 | + +SetGraph(transformGraph) |
| 29 | + } |
| 30 | +
|
| 31 | + class ID2D1DrawTransform { |
| 32 | + <<interface>> |
| 33 | + +SetDrawInfo(drawInfo) |
| 34 | + +MapInputRectsToOutputRect() |
| 35 | + +MapOutputRectToInputRects() |
| 36 | + +MapInvalidRect() |
| 37 | + +GetInputCount() |
| 38 | + } |
| 39 | +
|
| 40 | + class D3D11ComputeRunner { |
| 41 | + <<RWStructuredBuffer<float4> path>> |
| 42 | + -ID3D11ComputeShader* m_shader |
| 43 | + -ID3D11Buffer* m_resultBuffer |
| 44 | + +CompileShader(hlsl) |
| 45 | + +Dispatch(input, cbuffer, resultCount) vector~float~ |
| 46 | + } |
| 47 | +
|
| 48 | + class CustomPixelShaderEffect { |
| 49 | + <<PixelShader path>> |
| 50 | + +LoadShaderBytecode() |
| 51 | + +SetConstantBufferData() |
| 52 | + } |
| 53 | +
|
| 54 | + class CustomComputeShaderEffect { |
| 55 | + <<ComputeShader / D3D11ComputeShader paths>> |
| 56 | + +SetThreadGroupSize() |
| 57 | + +CalculateThreadgroups() |
| 58 | + } |
| 59 | +
|
| 60 | + ID2D1EffectImpl <|.. CustomPixelShaderEffect |
| 61 | + ID2D1DrawTransform <|.. CustomPixelShaderEffect |
| 62 | + ID2D1EffectImpl <|.. CustomComputeShaderEffect |
| 63 | +
|
| 64 | + note for CustomComputeShaderEffect "D2D-tiled compute\nUAV cleared per tile\nNo atomics\n+ D3D11 hybrid mode dispatched\n by GraphEvaluator via D3D11ComputeRunner" |
| 65 | +``` |
| 66 | + |
| 67 | +## Data Flow: D2D → D3D11 Handoff |
| 68 | + |
| 69 | +```mermaid |
| 70 | +flowchart TD |
| 71 | + subgraph D2D_Graph["D2D Effect Graph (Evaluator)"] |
| 72 | + SRC[Source Node<br/>ID2D1Image*] --> FX[Upstream Effect<br/>Gamut Map / Delta E / etc.] |
| 73 | + FX --> CACHE["cachedOutput<br/>(deferred ID2D1Image*)"] |
| 74 | + end |
| 75 | +
|
| 76 | + subgraph Realize["Realize to D3D11 Texture"] |
| 77 | + CACHE --> CREATE["dc->CreateBitmap()<br/>DXGI_FORMAT_R32G32B32A32_FLOAT"] |
| 78 | + CREATE --> DRAW["dc->SetTarget(bitmap)<br/>dc->DrawImage(cachedOutput)<br/>dc->SetTarget(prev)"] |
| 79 | + DRAW --> FLUSH["dc->Flush()<br/>⚠ Required — D2D batches<br/>commands until Flush/EndDraw"] |
| 80 | + FLUSH --> SURFACE["bitmap->GetSurface()<br/>→ IDXGISurface"] |
| 81 | + SURFACE --> QI["surface->QueryInterface()<br/>→ ID3D11Texture2D"] |
| 82 | + end |
| 83 | +
|
| 84 | + subgraph D3D11_Compute["D3D11 Compute Dispatch"] |
| 85 | + QI --> SRV["CreateShaderResourceView()<br/>register(t0)"] |
| 86 | + SRV --> CBUF["Update Constant Buffer<br/>(Width, Height, Channel, NonzeroOnly)"] |
| 87 | + CBUF --> CLEAR["ClearUnorderedAccessViewUint()<br/>(reset result buffer)"] |
| 88 | + CLEAR --> DISPATCH["ctx->Dispatch(1, 1, 1)<br/>32×32 = 1024 threads"] |
| 89 | + end |
| 90 | +
|
| 91 | + subgraph GPU_Reduction["GPU Reduction (groupshared)"] |
| 92 | + DISPATCH --> STRIDE["Each thread strides<br/>across entire image"] |
| 93 | + STRIDE --> LOCAL["Per-thread accumulators<br/>min, max, sum, count"] |
| 94 | + LOCAL --> SHARED["groupshared parallel reduction<br/>log2(1024) = 10 steps"] |
| 95 | + SHARED --> WRITE["Thread 0 writes<br/>8 uints to RWBuffer"] |
| 96 | + end |
| 97 | +
|
| 98 | + subgraph Readback["Result Readback (32 bytes)"] |
| 99 | + WRITE --> COPY["CopyResource()<br/>→ staging buffer"] |
| 100 | + COPY --> MAP["Map() + read 8 uints"] |
| 101 | + MAP --> STATS["ImageStats struct<br/>min, max, mean, samples, nonzero"] |
| 102 | + STATS --> ANALYSIS["node.analysisOutput.fields<br/>(data pins on graph)"] |
| 103 | + end |
| 104 | +``` |
| 105 | + |
| 106 | +## Three Effect Types Compared |
| 107 | + |
| 108 | +| | D2D Pixel Shader | D2D Compute Shader | D3D11 Hybrid Compute | |
| 109 | +|---|---|---|---| |
| 110 | +| **COM class** | `CustomPixelShaderEffect` | `CustomComputeShaderEffect` (D2D-tiled mode) | `CustomComputeShaderEffect` (D3D11 mode) | |
| 111 | +| **D2D interface** | `ID2D1DrawTransform` | `ID2D1ComputeTransform` | `ID2D1DrawTransform` (pass-through) | |
| 112 | +| **Shader target** | `ps_5_0` | `cs_5_0` | `cs_5_0` (dispatched by host) | |
| 113 | +| **Execution** | D2D renders directly | D2D dispatches per-tile | Evaluator dispatches via D3D11 | |
| 114 | +| **Tiling** | D2D-managed | D2D-managed (UAV cleared) | **None** — single dispatch | |
| 115 | +| **Atomics** | N/A | No (float4 UAV only) | **Yes** (RWStructuredBuffer / RWBuffer) | |
| 116 | +| **groupshared** | N/A | Yes (per-tile only) | **Yes** (full image) | |
| 117 | +| **Shader linking** | Yes (D2D optimizes) | No | No | |
| 118 | +| **Image output** | Yes | Yes | Optional (pass-through or none) | |
| 119 | +| **Analysis output** | Via pixel readback | Via pixel readback | Via `RWStructuredBuffer<float4> Result` | |
| 120 | +| **`CustomShaderType`** | `PixelShader` | `ComputeShader` | `D3D11ComputeShader` | |
| 121 | + |
| 122 | +The `D3D11ComputeShader` mode is what powers Channel / Luminance / Chromaticity Statistics, the gamut analysis effects, and any user-authored "analyze the whole image" shader created via the Effect Designer. Internally it dispatches through `Rendering::D3D11ComputeRunner`. |
| 123 | + |
| 124 | +## Usage: ShaderLab Evaluator (Optimized Path) |
| 125 | + |
| 126 | +```cpp |
| 127 | +// In GraphEvaluator::ProcessDeferredCompute(), for D3D11ComputeShader nodes: |
| 128 | + |
| 129 | +// 1. Render upstream D2D output to FP32 bitmap |
| 130 | +winrt::com_ptr<ID2D1Bitmap1> gpuTarget; |
| 131 | +winrt::check_hresult(dc->CreateBitmap(D2D1::SizeU(w, h), nullptr, 0, fp32Props, gpuTarget.put())); |
| 132 | +winrt::com_ptr<ID2D1Image> prevTarget; |
| 133 | +dc->GetTarget(prevTarget.put()); |
| 134 | +dc->SetTarget(gpuTarget.get()); |
| 135 | +dc->Clear(D2D1::ColorF(0, 0, 0, 0)); |
| 136 | +dc->DrawImage(upstreamNode->cachedOutput); |
| 137 | +dc->SetTarget(prevTarget.get()); |
| 138 | + |
| 139 | +// 2. Flush D2D command batch — CRITICAL for D2D→D3D11 handoff. |
| 140 | +// D2D batches DrawImage commands until EndDraw() or Flush(). |
| 141 | +// Without this, D3D11 reads uninitialized zeros from the texture. |
| 142 | +winrt::check_hresult(dc->Flush()); |
| 143 | + |
| 144 | +// 3. Get D3D11 texture (zero-copy — same DXGI surface) |
| 145 | +winrt::com_ptr<IDXGISurface> surface; |
| 146 | +winrt::check_hresult(gpuTarget->GetSurface(surface.put())); |
| 147 | +// as<>() wraps QueryInterface and throws if the surface is not a D3D11 texture. |
| 148 | +auto d3dTexture = surface.as<ID3D11Texture2D>(); |
| 149 | + |
| 150 | +// 4. Dispatch GPU reduction (single call) |
| 151 | +auto stats = m_gpuReduction.Reduce(d3dCtx, d3dTexture.get(), channel, nonzeroOnly); |
| 152 | + |
| 153 | +// 5. Populate analysis output for graph data pins |
| 154 | +node->analysisOutput.fields = { {"Min", stats.min}, {"Max", stats.max}, ... }; |
| 155 | +``` |
| 156 | +
|
| 157 | +## Known Limitations |
| 158 | +
|
| 159 | +- **D2D→D3D11 flush required**: When rendering a D2D effect chain to a bitmap and then reading it with D3D11, `dc->Flush()` **must** be called between `DrawImage` and any D3D11 access to the underlying texture. D2D batches draw commands until `EndDraw()` or `Flush()` — without an explicit flush, D3D11 reads zeros from the texture. Applied in `DispatchUserD3D11Compute` in `GraphEvaluator`. |
| 160 | +- **D2D draw session required**: `ProcessDeferredCompute` must run inside an active `BeginDraw`/`EndDraw` session because `DispatchUserD3D11Compute` calls `dc->DrawImage` internally to pre-render the upstream chain into an FP32 bitmap. Outside a draw session, that `DrawImage` call silently no-ops and the compute shader reads a black input texture. The GUI's `RenderFrame`, the headless host's `runEval` / `RunRender`, and the test bench all wrap the call accordingly. |
| 161 | +- **No shader linking**: D3D11 compute shaders are opaque to D2D. They don't participate in D2D's shader linking optimization for chained pixel shader effects. |
| 162 | +- **Single thread group per dispatch**: `D3D11ComputeRunner` dispatches `(1,1,1)`: one group of 1024 threads, each striding across the whole image. At ~33 megapixels every thread already processes about 32,768 pixels (2^25 / 1024); for larger images, a multi-dispatch pyramid would be needed. |
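To make the single-group limitation concrete, the per-thread workload can be estimated from the image size. The helper below is a hypothetical sketch (the name `PixelsPerThread` is not from the codebase) and assumes pixels are spread evenly across the 1024 threads of one 32×32 group.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kThreadsPerGroup = 32 * 32;   // 1024 threads

// Hypothetical helper: upper bound on pixels each thread visits when one
// thread group strides across a width x height image.
inline std::uint64_t PixelsPerThread(std::uint64_t width, std::uint64_t height) {
    assert(width > 0 && height > 0);
    const std::uint64_t total = width * height;
    return (total + kThreadsPerGroup - 1) / kThreadsPerGroup;   // round up
}
```

Under these assumptions a 3840×2160 frame costs 8,100 pixels per thread, while an 8192×4096 image (2^25 pixels, roughly 33.5 megapixels) costs 32,768, which is the point where the list above suggests splitting the work across multiple dispatches.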
| 163 | +
|
| 164 | +
|
| 165 | +--- |
| 166 | +
|
| 167 | +Back to [docs/](../README.md) • [Repo root](../../README.md) |