[Bug][TMA] Skip OOB gate for 1D TMA bulk-copy eligibility (#2180)#2235
[Bug][TMA] Skip OOB gate for 1D TMA bulk-copy eligibility (#2180)#2235mygitljf wants to merge 1 commit into
Conversation
When the global tensor has a dynamic outer shape and the index cannot be statically proven within bounds, `buffer_oob` is set to true. The existing logic in `AnalyzeCopyFacts` then masked out 1D bulk-copy eligibility along with the descriptor-based 2D path, causing `InferLayout` to install a swizzle-shaped shared layout via `InferBulkLayout` and forcing `Copy::Lower` to fall through to the 2D `LowerBulk`. There the `inner_box_dim > 256` branch issues `(K / 256)` separate `tl::tma_load` calls instead of a single 1D bulk copy. The 1D bulk-copy path emits `cp.async.bulk` (not `cp.async.bulk.tensor`) and has the same OOB semantics as a plain `T.copy()` - it does not need the descriptor-only OOB gate. Drop `!ctx.buffer_oob` from the 1D eligibility check so dynamic-outer-shape 1D copies that already satisfy `CheckBulkCopy1D` (contiguous innermost slice, full-extent trailing dim, element count match) keep the single-issue path. Repro: see issue tile-ai#2180 - `T.tma_copy(A[var, 0:K], a_shared, barrier=mbar)` with `M=T.dynamic('M')`, `K=T.const('K')` on `A[M, K]` was lowering to 4 split `tl::tma_load` calls; with this patch it lowers to a single `tl::tma_load(smem, gmem, mbar, total_bytes)` and no `CUtensorMap` descriptor. Verified locally on A100 at the codegen-string level (end-to-end launch requires sm_90+ and is left to CI). Tests: 3 new regression cases in `testing/python/language/test_tilelang_language_tma_1d.py` covering fp32/K=1024, fp32/K=512, fp16/K=1024 with `T.dynamic` outer shape. All existing TMA tests pass without regression.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughFixed TMA availability gating in ChangesIssue
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Fixes #2180.
T.tma_copy(A[var, 0:K], a_shared, barrier=mbar)on a[M, K]tensor withM = T.dynamic('M'),K = T.const('K')lowered to 4 splittl::tma_loadcalls (offsets0/256/512/768) using a 2DCUtensorMapdescriptor, instead of a single 1Dcp.async.bulkissuing onetl::tma_loadfor the full row.Root cause
In
AnalyzeCopyFacts(src/backend/cuda/op/copy_analysis.cc), the 1D and 2D TMA eligibility checks shared a single gate that included!ctx.buffer_oob:facts.layout_dependent_tma_available = facts.has_layout_map && !is_cutedsl && !ctx.buffer_oob; if (facts.layout_dependent_tma_available) { facts.can_bulk_load_1d = CheckBulkLoad1D(...); facts.can_bulk_store_1d = CheckBulkStore1D(...); }When
Mis dynamic andvar = T.alloc_var(init=0), the analyzer cannot provevar + 1 <= M, solayout_inference.ccsetsbuffer_oob = true. That correctly disqualifies the descriptor-based 2D path, but it also (incorrectly) suppresses 1D eligibility, so:SelectInstfalls back tokBulkLoad.InferBulkLayoutinstalls a swizzle-shaped (FloorDiv/FloorMod 256) shared layout.Copy::Lowerre-runsSelectInstwithbuffer_oob = false, but the shared layout is no longer linear, soCheckBulkCopy1Dreturns false again.LowerBulkruns, hitsinner_box_dim = 1024 > 256, and emitsFor(i, 0, 4) tma_load(...)— the four split loads in the issue.The 1D bulk-copy path emits
cp.async.bulk(notcp.async.bulk.tensor) and has the same OOB semantics as a plainT.copy(). It does not need the descriptor-only OOB gate.Fix
Drop
!ctx.buffer_oobfrom the 1D eligibility check.CheckBulkCopy1D(contiguous innermost slice, full-extent trailing dim, element count match) is the only contract the 1D path needs. The 2D descriptor path is unaffected: it depends onfacts.can_bulk_load/facts.can_bulk_store, which are computed independently below and are not gated onbuffer_oob.Verification
Pre-patch generated source for the issue's repro (extracted from the codegen string):
Post-patch:
A single 1D
cp.async.bulkissue, noCUtensorMapdescriptor — matches the expected behavior in the issue.Tests
3 new regression cases in
testing/python/language/test_tilelang_language_tma_1d.py, asserting onkernel_sourceproduced bytilelang.lower(target={"kind": "cuda", "arch": "sm_90a"}):test_issue_2180_full_row_fp32_k1024test_issue_2180_full_row_fp32_k512test_issue_2180_full_row_fp16_k1024Each asserts exactly one
tl::tma_loadand noCUtensorMapsubstring in the generated source. All three pass locally.Existing tests checked locally (A100, sm_80):
test_tilelang_language_tma_1d.py— 3 passedtest_tilelang_language_tma_copy.py— 0 failed (5 skipped, require sm_90)test_tilelang_language_tma_store.py— 0 failed (4 skipped, require sm_90)Notes
legalize_pairwise_extentsrank handling, threadinglayout_mapthroughClassify*) were considered but are not needed to fix this specific issue and are out of scope here.Summary by CodeRabbit
Bug Fixes
Tests