Commit 0ae9823

Merge branch 'microsoft:main' into fix/cmake-artifact-issue-28468
2 parents 6ecde97 + 1472c16 commit 0ae9823

641 files changed

Lines changed: 39853 additions & 1275729 deletions

.agents/skills/cuda-attention-kernel-patterns/SKILL.md

Lines changed: 171 additions & 24 deletions
Large diffs are not rendered by default.
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
---
name: cuda-cutlass-fmha-incremental-rebuild
description: >
  Use when rebuilding ONNX Runtime CUDA after editing CUTLASS fused-MHA headers
  (onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/*.h such as kernel_forward.h or
  fmha_launch_template.h), or when a header edit "passed" an incremental build but
  test behavior did not change. Explains the nvcc depfile gotcha that produces stale
  Memory-Efficient-Attention (MEA) kernels and binaries, and how to force a correct
  recompile. Also covers disk-space frugality on shared GPU dev boxes.
---

# Incremental rebuilds silently use STALE CUTLASS fused-MHA kernels

> The **general** false-green principles (stale binary, wrong-artifact mtime) are summarised
> in the `ort-test` skill's "False-green taxonomy". This skill is the CUDA/CUTLASS-specific
> detail.

## The gotcha (verification-integrity bug)

`nvcc`-generated depfiles do **not** track the CUTLASS fused-MHA headers under
`onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/` (e.g. `kernel_forward.h`,
`fmha_launch_template.h`). These headers are `#include`d by the `fmha_sm*.cu`
translation units, but the build system does not record that dependency.

Consequence: after you edit one of those headers, an **incremental** `build.sh`:

- does **not** recompile `fmha_sm*.cu`,
- reports `[100%] Built target ...` and exits 0,
- leaves the artifacts that should have been rebuilt (the `fmha_sm*.cu.o` objects and
  the `libonnxruntime_providers_cuda.so` they link into) **unchanged**, with the same
  `mtime` as the pre-edit build.

(Do **not** use the gtest test-exe mtime as the stale symptom: in the shared-provider
build the exe `dlopen`s the `.so` and is **not** relinked, so its mtime stays old even
after a *correct* rebuild; see "How to confirm" below. The reliable diagnostic signal
is the `fmha_sm*.cu.o` / `.so` mtime.)

So your "successful" rebuild is running the **old** kernel. Tests that should now
pass (or fail) reflect the previous code, not your edit. This silently invalidates
any FAIL→PASS / PASS→FAIL verification.

## The fix: force recompile the .cu units

Before rebuilding after editing any `cutlass_fmha/*.h` header:

```bash
touch onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/*.cu
```

Then run the normal build command. This forces the `fmha_sm*.cu` translation units
(and downstream binaries) to recompile against your header change.

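The manual `touch` is easy to forget, so it can be wrapped in a small guard. The sketch below is a hypothetical helper, not part of ONNX Runtime: the function name `touch_fmha_units_if_stale` and the choice of a reference artifact (for example the provider `.so`) are assumptions, and it relies on GNU `find`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ONNX Runtime): touch the fused-MHA .cu files
# when any header in the same directory is newer than a reference artifact.
# Usage: touch_fmha_units_if_stale <cutlass_fmha_dir> <reference_artifact>
touch_fmha_units_if_stale() {
  local fmha_dir="$1" artifact="$2"
  # No artifact yet means a first build, which compiles everything anyway.
  if [ ! -e "$artifact" ]; then
    echo "no-artifact"
    return 0
  fi
  # find -newer matches headers whose mtime is later than the artifact's;
  # -print -quit (GNU find) stops at the first match.
  if [ -n "$(find "$fmha_dir" -maxdepth 1 -name '*.h' -newer "$artifact" -print -quit)" ]; then
    touch "$fmha_dir"/*.cu
    echo "touched"
  else
    echo "up-to-date"
  fi
}
```

A call might look like `touch_fmha_units_if_stale onnxruntime/contrib_ops/cuda/bert/cutlass_fmha build/<dir>/<cfg>/libonnxruntime_providers_cuda.so`, run just before the normal build command.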
## How to confirm the rebuild was real (don't trust "[100%] Built")

Confirm that the artifact which actually **links** the recompiled `fmha_sm*.cu.o`
is newer than your header edit.

⚠️ **Do NOT just check the test EXE mtime: it can falsely flag a good build as
stale.** In the shared-provider build configuration (the default here), the CUDA
execution provider is a **shared module**: the recompiled `fmha_sm*.cu.o` objects link
into `libonnxruntime_providers_cuda.so`, and the `onnxruntime_provider_test` executable
**dlopens** that `.so`; it is **not relinked**. So after a *correct* rebuild the
test exe `mtime` stays **old** while the `.so` advances. Checking the exe alone
would wrongly conclude the build was stale.

Check the right artifact for your link mode:

- **Shared-provider build (default):** the `.so` that links the recompiled `.o`:
  `build/<dir>/<cfg>/libonnxruntime_providers_cuda.so`
- **Statically-linked provider:** the test exe itself (`onnxruntime_provider_test`)

Safest check: `stat` both the recompiled object and the `.so`, and confirm BOTH
are newer than the header edit:

```bash
stat -c '%y %n' onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h
# in your build dir, e.g. build/Debug_quickbuild/Debug/:
stat -c '%y %n' libonnxruntime_providers_cuda.so
# and the actual recompiled object (path varies by build dir):
find . -name 'fmha_sm80.cu.o' -exec stat -c '%y %n' {} +
```

If the `.so` (and the `fmha_sm*.cu.o`) timestamps are older than (or equal to) the
header edit, the build was stale; `touch` the `.cu` files and rebuild. The most
reliable signal of all is behavioral: a test that was failing now passes (a stale
binary cannot flip its result).

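Comparing `stat` output by eye is error-prone, so the timestamp rule above can be expressed as a scripted check. This is a minimal sketch under stated assumptions: `rebuilt_after` is a hypothetical name, and it uses the shell's `-nt` file test, which is false for equal mtimes and for missing artifacts, so both cases are reported as stale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every artifact is strictly newer than the header.
# Usage: rebuilt_after <header> <artifact>...
rebuilt_after() {
  local header="$1"; shift
  if [ ! -e "$header" ]; then
    echo "header not found: $header" >&2
    return 2
  fi
  local artifact
  for artifact in "$@"; do
    # -nt is false when mtimes are equal or the artifact is missing,
    # so both situations count as a stale build.
    if [ ! "$artifact" -nt "$header" ]; then
      echo "STALE: $artifact is not newer than $header" >&2
      return 1
    fi
  done
  echo "OK: all artifacts are newer than $header"
}
```

Pass the edited header first, then the `.so` and the `fmha_sm*.cu.o` path reported by `find`; a non-zero exit status means the `.cu` files need a `touch` and another build.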
## Related: pick the right test binary

This is the **CUDA/CUTLASS instance of false-green mode 1** (zero-match / wrong binary);
see the `ort-test` skill's "False-green taxonomy" for the general principle. In short:
attention/MEA/Flash boundary gtests (e.g. `FlashStructuralEmptyRows*`,
`Attention_Causal_NonPadKVSeqLen_MEA_*`) live in **`onnxruntime_provider_test`**, which CI
runs; `onnxruntime_test_all` does not contain them and gives a false green. Verify the
MEA/Flash boundary fix against `onnxruntime_provider_test`.

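A zero-match filter can be caught before running anything by asking the binary to list what the filter selects. The sketch below uses GoogleTest's `--gtest_list_tests` and `--gtest_filter` flags; the helper name `count_gtests` is hypothetical, and it assumes the binary prints nothing else indented by two spaces while listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the gtests in a binary that match a filter, without running them.
# Usage: count_gtests <test_binary> <gtest_filter>
count_gtests() {
  local binary="$1" filter="$2"
  # --gtest_list_tests prints suite names flush left and test names indented
  # by two spaces, so counting indented lines counts matching tests.
  # grep -c exits 1 on zero matches; "|| true" keeps the helper's status at 0.
  "$binary" --gtest_list_tests --gtest_filter="$filter" | grep -c '^  ' || true
}
```

For example, `count_gtests ./onnxruntime_provider_test '*FlashStructuralEmptyRows*'` printing `0` would mean the filter selects nothing in that binary, so a "pass" from it proves nothing.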
## Related: disk frugality on shared GPU dev boxes

Full ORT CUDA builds are large (test binaries ~1 GB each; a build dir can reach
tens of GB). On a shared box, `/home` filling to 100% makes builds fail in
non-obvious places, e.g. `git submodule sync` reporting `No space left on device`
or a `config.lock` error, rather than an obvious "disk full" at the compile step.

Before a big rebuild, check free space and clean only clearly-stale, regenerable
build directories (old dated experiment dirs). Never delete another agent's active
build dir or anything ambiguous:

```bash
df -h /home
du -sh build/* | sort -h
```
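
The free-space check can also be made a hard precondition instead of something to eyeball. This is a sketch only: `free_gib` and `require_free_gib` are hypothetical names, and any threshold you pass is your own estimate, not a documented ORT requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the free space, in whole GiB, on the filesystem holding <path>.
free_gib() {
  # df -P gives the portable one-line-per-filesystem format; -k reports 1024-byte blocks.
  # Column 4 is the available space.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical guard: fail when fewer than <min_gib> GiB are free on <path>.
# Usage: require_free_gib <path> <min_gib>
require_free_gib() {
  local path="$1" min_gib="$2" free
  free="$(free_gib "$path")"
  if [ "$free" -lt "$min_gib" ]; then
    echo "only ${free} GiB free on ${path}, need ${min_gib}" >&2
    return 1
  fi
}
```

Chaining it in front of the build, as in `require_free_gib /home 50 && ./build.sh ...`, stops early with a clear message rather than failing later in an unrelated step; the `50` is an arbitrary example value.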
