Skip to content

Commit 056c02f

Browse files
authored
fix(daemon): match D3D12 adapter via real chunk name + exact DeviceId (#225) (#230)
PR #226's structured-data adapter guard never fired on real hardware. Defects fixed here: 1. Chunk-name guard: the real renderdoc 1.42 chunk is "Internal::Driver Initialisation Parameters" (with a space), which none of the legacy markers ("DriverInit"/"EnumAdapters"/ "CreateDXGIFactory") match. Add "Driver Initialisation Parameters" to d3d_markers; keep the legacy markers (commented as version-fallback) for other versions/paths. 2. Tree depth: the real Description lives at depth 3 (chunk -> InitParams -> AdapterDesc -> Description), but the old walker only inspected depth 1/2. Replace it with a depth-bounded recursive descent (cap 5) returning an AdapterMatch carrying (description, device_id); DeviceId read via AsInt(). VendorId is deliberately not parsed (PCI id vs GPUVendor enum are unrelated, so it has no consumer). 3. Disambiguation: name-substring matching is fragile for same-vendor multi-GPU. _match_capture_gpu now selects in three tiers: exact DeviceId == g.deviceID (primary), name-substring (secondary), vendor-priority (tertiary, unchanged). 4. WARP/software sentinel: DXGI_ADAPTER_DESC.DeviceId == 0 on software/WARP/virtualized captures. Tier-1 is gated on truthiness (not "is not None") and skips any candidate with deviceID == 0, so a zero id can never bind the WARP adapter. 5. Version-drift diagnostic: a distinct _log.warning fires when a multi-GPU capture has structured data but no recognized vk/d3d adapter chunk, surfacing future renderdoc chunk renames instead of silently falling back to vendor priority. Adds 14 unit tests (5 red-first defect proofs + 9 regression guards, incl. the WARP-sentinel guard).
1 parent 88781c2 commit 056c02f

6 files changed

Lines changed: 767 additions & 28 deletions

File tree

Lines changed: 235 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,235 @@
1+
# Fix #225 (fix-forward): D3D12 Chunk Name, Depth, and Exact-ID Matching
2+
3+
## Summary
4+
5+
PR #226 (squash 3735f2b) shipped the D3D12 adapter-matching logic but the
6+
structured-data guard never fires on real hardware. Hardware verification by
7+
issue reporter @YunHsiao (renderdoc 1.42, Windows multi-GPU D3D12 box) confirmed
8+
that the user-facing crash is resolved by the vendor-priority fallback introduced
9+
in #226, but the structured-data path — the primary, deterministic signal — is
10+
silently skipped on every real capture. Three distinct defects prevent it from
11+
working. This change is a fix-forward on the merged commit; issue #225 remains
12+
open until hardware re-verification passes.
13+
14+
## Problem
15+
16+
### Canonical dump (ground truth)
17+
18+
Reporter provided the structured-data tree from renderdoc 1.42's bundled Python
19+
module on a real D3D12 capture. The ONLY chunk carrying adapter identity is
20+
`chunk[0]`, whose name is literally:
21+
22+
```
23+
Internal::Driver Initialisation Parameters
24+
```
25+
26+
Its subtree:
27+
28+
```
29+
Internal::Driver Initialisation Parameters [Chunk] children=1
30+
InitParams [D3D12InitParams] children=7
31+
MinimumFeatureLevel [D3D_FEATURE_LEVEL]
32+
AdapterDesc [DXGI_ADAPTER_DESC] children=9
33+
Description [string] = 'NVIDIA RTX 4500 Ada Generation'
34+
VendorId [uint32_t]
35+
DeviceId [uint32_t]
36+
SubSysId [uint32_t]
37+
Revision [uint32_t]
38+
DedicatedVideoMemory [uint64_t]
39+
DedicatedSystemMemory [uint64_t]
40+
SharedSystemMemory [uint64_t]
41+
AdapterLuid [LUID]
42+
usedDXIL, VendorExtensions, VendorUAV, VendorUAVSpace, SDKVersion
43+
```
44+
45+
All remaining chunks are `ID3D12Device::*` / `ID3D12Resource::SetName` API calls
46+
and 4834× `Internal::Initial Contents`. There is no `EnumAdapters`, no
47+
`CreateDXGIFactory`, no `DriverInit` chunk anywhere in the capture.
48+
49+
Available GPUs on reporter's machine:
50+
- `AMD Radeon(TM) Graphics` vendor=2 deviceID=5056
51+
- `NVIDIA RTX 4500 Ada Generation` vendor=6 deviceID=10161
52+
- `WARP Rasterizer` vendor=9 deviceID=0
53+
54+
### Defect 1 — outer chunk guard never matches
55+
56+
`d3d_markers = ("DriverInit", "EnumAdapters", "CreateDXGIFactory")`
57+
58+
None of the three substrings occur in `"Internal::Driver Initialisation Parameters"`.
59+
The space between "Driver" and "Initialisation" defeats `"DriverInit"`. The guard
60+
falls through on every chunk and `_find_adapter_description` is never called.
61+
62+
### Defect 2 — `_find_adapter_description` only walks 2 levels
63+
64+
The real `Description` field is at depth 3 relative to the chunk root:
65+
66+
```
67+
chunk (depth 0)
68+
InitParams (depth 1)
69+
AdapterDesc (depth 2)
70+
Description (depth 3) ← required
71+
```
72+
73+
The current walker checks direct children (depth 1) for `Description`/`DeviceName`
74+
and direct children of `AdapterDesc`/`pAdapter`/`adapter` (depth 2 for the
75+
container, depth 3 only if the container is at depth 1). Because `AdapterDesc` is
76+
at depth 2 (a grandchild), the current inner loop is never reached.
77+
78+
### Defect 3 — name-substring match is fragile for same-vendor multi-GPU
79+
80+
The name-substring comparison `desc_l in name_l or name_l in desc_l` is the
81+
only disambiguation signal once a Description is found. On a system with two
82+
NVIDIA cards whose names differ only by suffix, substring matching can produce a
83+
false-positive match. The `AdapterDesc` already exposes `DeviceId` (PCI Device
84+
ID, uint32) and `VendorId` (PCI Vendor ID, uint32). The reporter explicitly
85+
recommended hardening to exact numeric match.
86+
87+
`GPUDevice.deviceID` in the renderdoc Python API carries the PCI Device ID, which
88+
is what `DXGI_ADAPTER_DESC.DeviceId` encodes. An exact `DeviceId` match is
89+
unambiguous and free of string-normalization risk.
90+
91+
## Design
92+
93+
### Part 1 — extend chunk name markers
94+
95+
Add `"Driver Initialisation Parameters"` to `d3d_markers`. Retain the three
96+
existing substrings (`"DriverInit"`, `"EnumAdapters"`, `"CreateDXGIFactory"`) for
97+
compatibility with other renderdoc versions or API paths that may use them.
98+
99+
```python
100+
d3d_markers = (
101+
"Driver Initialisation Parameters", # renderdoc 1.42 D3D12
102+
"DriverInit",
103+
"EnumAdapters",
104+
"CreateDXGIFactory",
105+
)
106+
```
107+
108+
### Part 2 — bounded recursive descent in `_find_adapter_description`
109+
110+
Replace the 2-level manual walk with a depth-bounded recursive descent (cap at
111+
depth 4 or 5). At each node:
112+
113+
1. If the node name is `Description` or `DeviceName`, return its string value and
114+
a paired `DeviceId` if available at the same level.
115+
2. If the node name is `AdapterDesc`, `pAdapter`, or `adapter`, recurse into it
116+
with depth decremented — this finds `Description` at depth 3 regardless of
117+
intermediate wrapper levels.
118+
3. Otherwise recurse into all children with depth decremented.
119+
120+
Return type changes to a small named tuple carrying
121+
`(description, device_id)` so the caller can use the numeric `DeviceId` for the
122+
primary match. `VendorId` is deliberately NOT parsed (see Risks — it cannot be
123+
compared against the renderdoc `GPUVendor` enum, so carrying it would be dead).
124+
125+
`DeviceId` is a `uint32_t` structured-data scalar. The renderdoc `SDObject`
126+
interface used everywhere else in this code reads strings via `AsString()`; the
127+
integer-valued counterpart is `AsInt()` (there is no `AsUInt()` on `SDObject`).
128+
The implementation MUST read `DeviceId` via `child.AsInt()`, consistent with the
129+
existing `AsString()` usage on the same interface. There is no top-level integer
130+
`.data` attribute to read directly.
131+
132+
### Part 3 — exact numeric DeviceId match as primary signal
133+
134+
In `_match_capture_gpu`, after obtaining the result from
135+
`_find_adapter_description`, apply a three-tier selection:
136+
137+
1. **Primary**: exact `DeviceId` match — `match.device_id == g.deviceID` for
138+
each GPU `g`. The first matching GPU wins. A `DeviceId` of `0` is treated as
139+
a software/WARP sentinel (software, WARP and virtualized adapters report
140+
`DXGI_ADAPTER_DESC.DeviceId == 0`): the tier-1 gate is truthiness, not
141+
`is not None`, so a present-but-zero id falls through to tier 2; and any
142+
candidate GPU whose `deviceID == 0` is skipped inside the exact-match loop so
143+
a real non-zero parsed id can never coincidentally bind the WARP adapter.
144+
2. **Secondary**: name-substring match (existing logic) — used when no GPU's
145+
`deviceID` matches the parsed value, when `DeviceId` is absent, or when the
146+
parsed `DeviceId` is the `0` sentinel.
147+
3. **Tertiary**: vendor-priority fallback (existing logic, unchanged).
148+
149+
### Known limitations / risks
150+
151+
**Why VendorId is not parsed.** `DXGI_ADAPTER_DESC.VendorId` is the raw PCI
152+
vendor ID (e.g., NVIDIA = 0x10DE = 4318). The renderdoc `GPUVendor` enum uses
153+
ordinals (nVidia = 6). These values are in unrelated spaces, so `VendorId`
154+
cannot be compared against `g.vendor` and there is no other consumer for it.
155+
The field is therefore deliberately NOT read from the chunk (carrying it would
156+
be dead state). `DeviceId``g.deviceID` is the only exact key used.
157+
158+
**(RISK-A) `deviceID` vs PCI `DeviceId` same-space assumption.** Tier-1 assumes
159+
the renderdoc `GPUDevice.deviceID` is the same integer as the PCI
160+
`DXGI_ADAPTER_DESC.DeviceId`. This holds on the D3D12 path per the reporter
161+
dump but is unverifiable on Linux/Vulkan from this repo. A mismatch silently
162+
demotes tier-1 to name-substring (correctness preserved, determinism lost).
163+
*Reporter verification question:* "On the #225 capture, print
164+
`GetAvailableGPUs()[i].deviceID` for the NVIDIA RTX 4500 entry AND
165+
`AdapterDesc.DeviceId` parsed from the same capture; confirm they are
166+
bitwise-equal (expected 10161)."
167+
168+
**(RISK-B) Chunk-name version pinning.** The substring
169+
`"Driver Initialisation Parameters"` is renderdoc-1.42-specific (British
170+
spelling). A future renderdoc renaming it makes the guard miss every chunk and
171+
silently reverts to vendor-priority — exactly the #225 failure class. Mitigated
172+
by the new FIX-3 diagnostic `_log.warning` ("no recognized adapter chunk … —
173+
renderdoc version may have changed chunk naming") which fires only on the
174+
multi-GPU + structured-data + no-marker-chunk path, making the regression
175+
loud and diagnosable instead of silent. *Reporter verification question:* "After
176+
a renderdoc upgrade, if multi-GPU replay picks the wrong card, grep the daemon
177+
log for `renderdoc version may have changed chunk naming`; its presence confirms
178+
the chunk was renamed and the new name must be added to `d3d_markers`."
179+
180+
**Bounded-recursion depth.** A depth cap of 4 is sufficient for the observed tree
181+
(Description at depth 3). Setting the cap to 5 provides one level of headroom.
182+
Setting it higher risks traversing large subtrees (4834× `Initial Contents` chunks
183+
are siblings at the top level, not descendants, so they are not a concern; the
184+
guard in Part 1 prevents entering them).
185+
186+
**Chunk-name variance across renderdoc versions.** The real chunk name contains
187+
"Initialisation" (British spelling). Future renderdoc versions may rename it.
188+
Adding the new marker as an additional entry (not replacing the old ones) hedges
189+
against this. If the chunk name is ever confirmed to differ, the new entry should
190+
be updated in a follow-up.
191+
192+
**`GPUDevice` field names.** The field names `g.deviceID` and `g.name` are
193+
inferred from the existing code and the reported dump. The implementer must
194+
confirm these names against the bundled `renderdoc.pyi` or equivalent before
195+
finalising the implementation.
196+
197+
## Spec Delta
198+
199+
The existing scenario "Multi-GPU capture replay" under **Requirement: Replay
200+
lifecycle** in `openspec/specs/daemon/spec.md` (added by #226) requires the
201+
following amendment:
202+
203+
The target scenario in `openspec/specs/daemon/spec.md` (HEAD lines 38–45) uses
204+
bold `**WHEN**` / `**AND**` / `**THEN**` markers. The delta below reproduces
205+
those bytes verbatim so it applies cleanly:
206+
207+
```diff
208+
#### Scenario: Multi-GPU capture replay
209+
- **WHEN** the capture was taken on a multi-GPU system
210+
- **AND** multiple GPUs are available for replay
211+
-- **THEN** the daemon selects the GPU whose name matches structured-data adapter
212+
- metadata extracted from the capture
213+
+- **THEN** the daemon walks structured-data chunks whose name contains any of
214+
+ "Driver Initialisation Parameters", "DriverInit", "EnumAdapters", or
215+
+ "CreateDXGIFactory"
216+
+- **AND** within each such chunk descends recursively (depth ≤ 5) to locate an
217+
+ AdapterDesc/pAdapter/adapter subtree
218+
+- **AND** selects the GPU whose PCI DeviceId exactly matches
219+
+ DXGI_ADAPTER_DESC.DeviceId when that field is present
220+
+- **AND** falls back to case-insensitive name-substring match when no exact
221+
+ DeviceId match is found
222+
- **AND** if no structured-data match exists, Software/WARP adapters are excluded
223+
and the highest-ranked discrete GPU is preferred (nVidia > AMD > Intel)
224+
- **AND** a single available GPU is always returned directly without inspection
225+
```
226+
227+
This delta has been applied verbatim to `openspec/specs/daemon/spec.md` as
228+
part of this change.
229+
230+
## Files Changed
231+
232+
- `src/rdc/daemon_server.py``d3d_markers` tuple; `_find_adapter_description`
233+
return type and walk algorithm; `_match_capture_gpu` selection logic
234+
- `openspec/specs/daemon/spec.md` — scenario amendment (see Spec Delta above)
235+
- `tests/unit/test_daemon_server_unit.py` — new unit tests (see `test-plan.md`)
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
# Fix #225 (fix-forward): Tasks
2+
3+
- [ ] Opus review of `proposal.md` and `test-plan.md`; revise as needed
4+
- [ ] `daemon_server.py`: add `"Driver Initialisation Parameters"` to `d3d_markers` (keep existing three entries)
5+
- [ ] `daemon_server.py`: rewrite `_find_adapter_description` with bounded recursive descent (depth ≤ 5); return `(description, device_id)` named tuple (`VendorId` deliberately not parsed)
6+
- [ ] `daemon_server.py`: update `_match_capture_gpu` selection logic — exact `DeviceId` match primary (with `0` treated as software/WARP sentinel: tier-1 gated on truthiness and `deviceID == 0` candidates skipped), name-substring secondary, vendor-priority tertiary
7+
- [ ] `daemon_server.py`: add FIX-3 diagnostic `_log.warning` for chunk-name version drift (multi-GPU + sd present + no recognized vk/d3d marker chunk seen)
8+
- [ ] `openspec/specs/daemon/spec.md`: apply Spec Delta from `proposal.md` to the "Multi-GPU capture replay" scenario
9+
- [ ] `tests/unit/test_daemon_server_unit.py`: implement all 14 test cases from `test-plan.md` (5 red-first defect proofs + 9 regression guards, incl. the WARP-sentinel guard); confirm each red-first test fails against current `daemon_server.py` BEFORE editing source; the existing `_gpu` helper hardcodes `deviceID=0`, so DeviceId tests must build GPUs with explicit distinct `deviceID` values
10+
- [ ] `pixi run lint && pixi run test` passes
11+
- [ ] Fresh Opus review of implementation diff
12+
- [ ] Adversarial review: walk the real canonical dump tree by hand against the new code path; confirm guard fires on `"Internal::Driver Initialisation Parameters"` and `DeviceId=10161` selects NVIDIA
13+
- [ ] Base/target branch is `master` — verified, not assumed: `git -C <repo> symbolic-ref refs/remotes/origin/HEAD` resolves to `refs/remotes/origin/master`, and `git remote show origin` reports `HEAD branch: master` (rdc-cli project convention). Branch off `master`; do not assume `main`
14+
- [ ] Open PR targeting `master`; tag @YunHsiao for hardware re-verification

0 commit comments

Comments
 (0)