[Feature][BugFix] Support named barrier arrive and fix multiple bugs#2194
[Feature][BugFix] Support named barrier arrive and fix multiple bugs#2194Rachmanino wants to merge 4 commits into
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a ChangesNamed Barrier Arrive Feature and Sync Optimization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/backend/cuda/codegen/codegen_cuda.cc`:
- Around line 2345-2352: The current tl::named_barrier_arrive() branch
dereferences op->args[0]/[1] as IntImmNode without validation and doesn't check
ranges; update the branch to first verify op->args.size() == 2 and that both
op->args[i].as<IntImmNode>() are non-null before accessing ->value, and add
domain checks: ensure barrier_id is within the valid barrier ID range (e.g.,
non-negative and within any known max) and that thread_count is positive (>=1).
If any check fails, emit an appropriate error/log (ICHECK or stream error) and
avoid printing the tl::__named_barrier_arrive call; otherwise proceed to use
barrier_id and thread_count as before.
In `@src/transform/thread_storage_sync.cc`:
- Around line 1698-1718: The early disjointness check is unsound because it
proves Not(prev_constr && curr_constr) using a single shared thread-variable
space (prev, curr bound to the same iv->var), which can miss inter-instance
RAW/WAR conflicts; instead, construct a pairwise check over two distinct thread
instances: create fresh copies of the thread loop variables for the "prev" and
"curr" instances (or use two analyzers / fresh var renaming) and bind each copy
to the same iv->dom before calling analyzer.z3_prover.CanProve on
Not(prev_constr(prev_vars) && curr_constr(curr_vars)); only return false when
that pairwise proof succeeds. Ensure you update the code paths around prev,
curr, prev_constr, curr_constr and the loop over prev.threads so the solver
reasons about two separate thread instances rather than a single shared space.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d78fd24d-c8eb-4efe-b801-6967318b4da9
📒 Files selected for processing (7)
src/backend/cuda/codegen/codegen_cuda.ccsrc/op/builtin.ccsrc/op/builtin.hsrc/tl_templates/cuda/common.hsrc/transform/thread_storage_sync.cctesting/python/issue/test_tilelang_issue_2188_2192.pytilelang/language/builtin.py
| } else if (op->op.same_as(tl::named_barrier_arrive())) { | ||
| ICHECK_EQ(op->args.size(), 2U) | ||
| << "tl.named_barrier_arrive expects <barrier_id, thread_count>."; | ||
| auto barrier_id = op->args[0].as<IntImmNode>()->value; | ||
| auto thread_count = op->args[1].as<IntImmNode>()->value; | ||
| this->PrintIndent(); | ||
| this->stream << "tl::__named_barrier_arrive<" << barrier_id << ", " | ||
| << thread_count << ">();\n"; |
There was a problem hiding this comment.
Validate and guard named_barrier_arrive arguments before dereference.
This path can null-deref when either argument is non-IntImm, and it currently skips basic domain checks (barrier_id range, positive thread_count).
🔧 Proposed fix
} else if (op->op.same_as(tl::named_barrier_arrive())) {
ICHECK_EQ(op->args.size(), 2U)
<< "tl.named_barrier_arrive expects <barrier_id, thread_count>.";
- auto barrier_id = op->args[0].as<IntImmNode>()->value;
- auto thread_count = op->args[1].as<IntImmNode>()->value;
+ const auto* barrier_id_imm = op->args[0].as<IntImmNode>();
+ const auto* thread_count_imm = op->args[1].as<IntImmNode>();
+ ICHECK(barrier_id_imm && thread_count_imm)
+ << "tl.named_barrier_arrive expects compile-time IntImm "
+ << "<barrier_id, thread_count>.";
+ auto barrier_id = barrier_id_imm->value;
+ auto thread_count = thread_count_imm->value;
+ ICHECK_GE(barrier_id, 0);
+ ICHECK_LT(barrier_id, 16) << "barrier_id must be in [0, 15].";
+ ICHECK_GT(thread_count, 0) << "thread_count must be > 0.";
this->PrintIndent();
this->stream << "tl::__named_barrier_arrive<" << barrier_id << ", "
<< thread_count << ">();\n";🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/backend/cuda/codegen/codegen_cuda.cc` around lines 2345 - 2352, The
current tl::named_barrier_arrive() branch dereferences op->args[0]/[1] as
IntImmNode without validation and doesn't check ranges; update the branch to
first verify op->args.size() == 2 and that both op->args[i].as<IntImmNode>() are
non-null before accessing ->value, and add domain checks: ensure barrier_id is
within the valid barrier ID range (e.g., non-negative and within any known max)
and that thread_count is positive (>=1). If any check fails, emit an appropriate
error/log (ICHECK or stream error) and avoid printing the
tl::__named_barrier_arrive call; otherwise proceed to use barrier_id and
thread_count as before.
| // If thread constraints are mutually exclusive (e.g. tx>=128 vs tx<128), | ||
| // the two accesses belong to disjoint warp groups. Cross-group ordering | ||
| // is NOT ThreadSyncPlanner's responsibility: __syncthreads() serializes | ||
| // all threads and destroys pipeline overlap. The correct tool is mbarrier | ||
| // (T.mbarrier_arrive / T.mbarrier_wait_parity), which the caller is | ||
| // assumed to have placed. Analogous to is_async_copy: we trust the caller | ||
| // to manage synchronization between disjoint thread groups explicitly. | ||
| { | ||
| PrimExpr prev_constr = prev.cset.ToConjunction(); | ||
| PrimExpr curr_constr = curr.cset.ToConjunction(); | ||
| arith::Analyzer analyzer; | ||
| for (const auto &iv : prev.threads) { | ||
| if (iv->dom.defined()) { | ||
| analyzer.Bind(iv->var, iv->dom); | ||
| } | ||
| } | ||
| if (analyzer.z3_prover.CanProve( | ||
| tir::Not(tir::And(prev_constr, curr_constr)))) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Unsound early disjointness check can skip required RAW/WAR synchronization.
At Line 1698, Not(prev_constr && curr_constr) is proven with a single shared thread-variable space. For RAW/WAR, conflict must be checked over two thread instances; this early return can incorrectly suppress barriers for cross-thread producer/consumer communication.
Suggested fix
- // If thread constraints are mutually exclusive (e.g. tx>=128 vs tx<128),
- // the two accesses belong to disjoint warp groups. Cross-group ordering
- // is NOT ThreadSyncPlanner's responsibility: __syncthreads() serializes
- // all threads and destroys pipeline overlap. The correct tool is mbarrier
- // (T.mbarrier_arrive / T.mbarrier_wait_parity), which the caller is
- // assumed to have placed. Analogous to is_async_copy: we trust the caller
- // to manage synchronization between disjoint thread groups explicitly.
- {
- PrimExpr prev_constr = prev.cset.ToConjunction();
- PrimExpr curr_constr = curr.cset.ToConjunction();
- arith::Analyzer analyzer;
- for (const auto &iv : prev.threads) {
- if (iv->dom.defined()) {
- analyzer.Bind(iv->var, iv->dom);
- }
- }
- if (analyzer.z3_prover.CanProve(
- tir::Not(tir::And(prev_constr, curr_constr)))) {
- return false;
- }
- }
+ // Safe only when conflict semantics use a shared thread variable space
+ // (RAR/WAW). RAW/WAR need two-thread reasoning and must not short-circuit.
+ bool same_access_type = (prev.type == kWrite && curr.type == kWrite) ||
+ (prev.type == kRead && curr.type == kRead);
+ if (same_access_type) {
+ PrimExpr prev_constr = prev.cset.ToConjunction();
+ PrimExpr curr_constr = curr.cset.ToConjunction();
+ arith::Analyzer analyzer;
+ for (const auto &iv : prev.threads) {
+ if (iv->dom.defined()) {
+ analyzer.Bind(iv->var, iv->dom);
+ }
+ }
+ if (analyzer.z3_prover.CanProve(
+ tir::Not(tir::And(prev_constr, curr_constr)))) {
+ return false;
+ }
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/transform/thread_storage_sync.cc` around lines 1698 - 1718, The early
disjointness check is unsound because it proves Not(prev_constr && curr_constr)
using a single shared thread-variable space (prev, curr bound to the same
iv->var), which can miss inter-instance RAW/WAR conflicts; instead, construct a
pairwise check over two distinct thread instances: create fresh copies of the
thread loop variables for the "prev" and "curr" instances (or use two analyzers
/ fresh var renaming) and bind each copy to the same iv->dom before calling
analyzer.z3_prover.CanProve on Not(prev_constr(prev_vars) &&
curr_constr(curr_vars)); only return false when that pairwise proof succeeds.
Ensure you update the code paths around prev, curr, prev_constr, curr_constr and
the loop over prev.threads so the solver reasons about two separate thread
instances rather than a single shared space.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tilelang/language/builtin.py (1)
940-947: ⚡ Quick winAdd fast-fail validation for literal barrier arguments.
At Line 940 and Line 950, integer literals are forwarded without range/domain checks. Validating literal
barrier_id(0 <= id < 16) andthread_count(> 0) in the Python wrapper will catch invalid usage much earlier.💡 Proposed patch
+def _normalize_named_barrier_arg(name: str, value): + v = convert(value) + if isinstance(v, tir.IntImm): + iv = int(v.value) + if name == "barrier_id" and not (0 <= iv < 16): + raise ValueError(f"{name} must be in [0, 15], got {iv}.") + if name == "thread_count" and iv <= 0: + raise ValueError(f"{name} must be > 0, got {iv}.") + return v + def sync_threads(barrier_id=None, arrive_count=None): """Synchronize all threads in a block.""" args = [] if barrier_id is not None: - args.append(barrier_id) + args.append(_normalize_named_barrier_arg("barrier_id", barrier_id)) if arrive_count is not None: - args.append(arrive_count) + args.append(_normalize_named_barrier_arg("thread_count", arrive_count)) return tir.call_intrin("int32", "tir.tvm_storage_sync", "shared", *args) @@ def named_barrier_arrive(barrier_id, thread_count): @@ - return tir.call_intrin("handle", tir.op.Op.get("tl.named_barrier_arrive"), barrier_id, thread_count) + barrier_id = _normalize_named_barrier_arg("barrier_id", barrier_id) + thread_count = _normalize_named_barrier_arg("thread_count", thread_count) + return tir.call_intrin("handle", tir.op.Op.get("tl.named_barrier_arrive"), barrier_id, thread_count)Also applies to: 950-977
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/language/builtin.py` around lines 940 - 947, The wrapper function sync_threads should fast-fail on invalid literal arguments: if barrier_id is an int, validate 0 <= barrier_id < 16 and raise a ValueError with a clear message if out of range; if arrive_count is an int, validate arrive_count > 0 and raise a ValueError if not; keep existing behavior for non-literal/expressions (only validate when isinstance(..., int)). Apply the same pattern to any adjacent wrapper(s) handling barrier_id/thread_count (the same checks for integer literals) so invalid literal usage is caught early.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@testing/python/issue/test_tilelang_issue_2192.py`:
- Around line 3-4: The docstring/comments in test_tilelang_issue_2192.py use an
en dash (–) in the lines "#2188 – Feature: native T.named_barrier_arrive
(bar.arrive) API" and "#2192 – Bug: unexpected __syncthreads() inserted between
T.ws producer/consumer"; replace the en dash characters with a plain ASCII
hyphen ("-") so the lines read "#2188 - Feature: ..." and "#2192 - Bug: ..." to
satisfy Ruff RUF002 linting.
- Around line 77-85: The loop-detection is too fragile because it only triggers
when a line starts with "for (int ko" and it also ignores braces on the same
line; update the detection to a more robust test (e.g., use a regex like
r'^\s*for\s*\(.*\bint\s+ko\b' or simply check 'if stripped.startswith("for") and
"int ko" in stripped') to enter the loop-scan, initialize loop_depth using the
current line's brace counts (loop_depth = stripped.count("{") -
stripped.count("}")) when setting in_loop = True, and treat the loop as finished
whenever loop_depth <= 0 (remove the extra stripped == "}" check) so braces on
the same line or formatting changes won't cause false negatives when scanning
in_loop, in_loop, stripped, loop_depth.
---
Nitpick comments:
In `@tilelang/language/builtin.py`:
- Around line 940-947: The wrapper function sync_threads should fast-fail on
invalid literal arguments: if barrier_id is an int, validate 0 <= barrier_id <
16 and raise a ValueError with a clear message if out of range; if arrive_count
is an int, validate arrive_count > 0 and raise a ValueError if not; keep
existing behavior for non-literal/expressions (only validate when
isinstance(..., int)). Apply the same pattern to any adjacent wrapper(s)
handling barrier_id/thread_count (the same checks for integer literals) so
invalid literal usage is caught early.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4746834d-5377-438d-9ede-1a6747361001
📒 Files selected for processing (4)
src/backend/cuda/codegen/codegen_cuda.ccsrc/tl_templates/cuda/common.htesting/python/issue/test_tilelang_issue_2192.pytilelang/language/builtin.py
| #2188 – Feature: native T.named_barrier_arrive (bar.arrive) API | ||
| #2192 – Bug: unexpected __syncthreads() inserted between T.ws producer/consumer |
There was a problem hiding this comment.
Replace EN DASH with ASCII hyphen to satisfy Ruff RUF002.
Use - instead of – in the docstring lines to avoid lint noise/failures.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 3-3: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
[warning] 4-4: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@testing/python/issue/test_tilelang_issue_2192.py` around lines 3 - 4, The
docstring/comments in test_tilelang_issue_2192.py use an en dash (–) in the
lines "#2188 – Feature: native T.named_barrier_arrive (bar.arrive) API" and
"#2192 – Bug: unexpected __syncthreads() inserted between T.ws
producer/consumer"; replace the en dash characters with a plain ASCII hyphen
("-") so the lines read "#2188 - Feature: ..." and "#2192 - Bug: ..." to satisfy
Ruff RUF002 linting.
| for line in source.split("\n"): | ||
| stripped = line.strip() | ||
| if not in_loop and stripped.startswith("for (int ko"): | ||
| in_loop = True | ||
| loop_depth = 0 | ||
| if in_loop: | ||
| loop_depth += stripped.count("{") - stripped.count("}") | ||
| if loop_depth <= 0 and stripped == "}": | ||
| in_loop = False |
There was a problem hiding this comment.
Harden loop detection to avoid false-negative regression passes.
The parser only enters loop-scanning when a line starts with for (int ko. Small codegen formatting/type changes can skip detection entirely, letting spurious __syncthreads() pass unnoticed.
Proposed fix
+ import re
in_loop = False
loop_depth = 0
spurious_syncs = []
for line in source.split("\n"):
stripped = line.strip()
- if not in_loop and stripped.startswith("for (int ko"):
+ if not in_loop and re.search(r"\bfor\s*\([^)]*\bko\b", stripped):
in_loop = True
loop_depth = 0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for line in source.split("\n"): | |
| stripped = line.strip() | |
| if not in_loop and stripped.startswith("for (int ko"): | |
| in_loop = True | |
| loop_depth = 0 | |
| if in_loop: | |
| loop_depth += stripped.count("{") - stripped.count("}") | |
| if loop_depth <= 0 and stripped == "}": | |
| in_loop = False | |
| import re | |
| in_loop = False | |
| loop_depth = 0 | |
| spurious_syncs = [] | |
| for line in source.split("\n"): | |
| stripped = line.strip() | |
| if not in_loop and re.search(r"\bfor\s*\([^)]*\bko\b", stripped): | |
| in_loop = True | |
| loop_depth = 0 | |
| if in_loop: | |
| loop_depth += stripped.count("{") - stripped.count("}") | |
| if loop_depth <= 0 and stripped == "}": | |
| in_loop = False |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@testing/python/issue/test_tilelang_issue_2192.py` around lines 77 - 85, The
loop-detection is too fragile because it only triggers when a line starts with
"for (int ko" and it also ignores braces on the same line; update the detection
to a more robust test (e.g., use a regex like r'^\s*for\s*\(.*\bint\s+ko\b' or
simply check 'if stripped.startswith("for") and "int ko" in stripped') to enter
the loop-scan, initialize loop_depth using the current line's brace counts
(loop_depth = stripped.count("{") - stripped.count("}")) when setting in_loop =
True, and treat the loop as finished whenever loop_depth <= 0 (remove the extra
stripped == "}" check) so braces on the same line or formatting changes won't
cause false negatives when scanning in_loop, in_loop, stripped, loop_depth.
__syncthreads()There was a problem hiding this comment.
♻️ Duplicate comments (2)
testing/python/issue/test_tilelang_issue_2192.py (2)
3-3:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReplace EN DASH with ASCII hyphen to satisfy Ruff.
Line 3 still contains
–, which triggersRUF002and can fail lint checks.Proposed fix
-#2192 – Bug: unexpected __syncthreads() inserted between T.ws producer/consumer +#2192 - Bug: unexpected __syncthreads() inserted between T.ws producer/consumer🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/issue/test_tilelang_issue_2192.py` at line 3, Update the header comment in test_tilelang_issue_2192.py: replace the EN DASH character in the line "#2192 – Bug: unexpected __syncthreads() inserted between T.ws producer/consumer" with a standard ASCII hyphen, producing "#2192 - Bug: unexpected __syncthreads() inserted between T.ws producer/consumer" to satisfy Ruff RUF002.
78-84:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden loop detection so the regression check doesn’t false-pass.
Lines 78-84 depend on exact formatting (
startswith("for (int ko")) and a brittle loop-exit condition. Small codegen changes can skip scanning and miss spurious__syncthreads();.Proposed fix
+ import re in_loop = False loop_depth = 0 spurious_syncs = [] for line in source.split("\n"): stripped = line.strip() - if not in_loop and stripped.startswith("for (int ko"): + if not in_loop and re.search(r"\bfor\s*\([^)]*\bko\b", stripped): in_loop = True - loop_depth = 0 + loop_depth = stripped.count("{") - stripped.count("}") if in_loop: loop_depth += stripped.count("{") - stripped.count("}") - if loop_depth <= 0 and stripped == "}": + if loop_depth <= 0: in_loop = False continue if "__syncthreads();" in stripped: spurious_syncs.append(line)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/issue/test_tilelang_issue_2192.py` around lines 78 - 84, The loop detection is brittle: replace the startswith("for (int ko") check with a regex or string parsing that matches a for-header more robustly (e.g. match leading whitespace then 'for' and a parenthesized header containing 'int ko' as a word) and update the loop-exit logic so you track brace balance regardless of exact line content (use loop_depth += stripped.count("{") - stripped.count("}") when in_loop and consider the loop closed when loop_depth <= 0 without requiring stripped == "}") — modify the variables in_loop and loop_depth handling in the same block (where in_loop is set and where loop_depth is adjusted) to handle braces on the same line as the header and single-line bodies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@testing/python/issue/test_tilelang_issue_2192.py`:
- Line 3: Update the header comment in test_tilelang_issue_2192.py: replace the
EN DASH character in the line "#2192 – Bug: unexpected __syncthreads() inserted
between T.ws producer/consumer" with a standard ASCII hyphen, producing "#2192 -
Bug: unexpected __syncthreads() inserted between T.ws producer/consumer" to
satisfy Ruff RUF002.
- Around line 78-84: The loop detection is brittle: replace the startswith("for
(int ko") check with a regex or string parsing that matches a for-header more
robustly (e.g. match leading whitespace then 'for' and a parenthesized header
containing 'int ko' as a word) and update the loop-exit logic so you track brace
balance regardless of exact line content (use loop_depth += stripped.count("{")
- stripped.count("}") when in_loop and consider the loop closed when loop_depth
<= 0 without requiring stripped == "}") — modify the variables in_loop and
loop_depth handling in the same block (where in_loop is set and where loop_depth
is adjusted) to handle braces on the same line as the header and single-line
bodies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 830adf06-65ba-442f-80eb-df0af63155c1
📒 Files selected for processing (1)
testing/python/issue/test_tilelang_issue_2192.py
…2192) and add T.named_barrier_arrive (tile-ai#2188) Issue tile-ai#2192 - BugFix: Unexpected CTA-wide __syncthreads() between T.ws producer/consumer - Root cause: FindConflict() in thread_storage_sync.cc did not check for mutually exclusive thread constraints. When T.ws(0) (tx<128) reads and T.ws(1) (tx>=128) writes the same shared buffer, no thread satisfies both constraints, so no __syncthreads() is needed — mbarriers already handle the synchronization. - Fix: add mutual-exclusivity check (prove NOT(P AND C)) early in FindConflict(). If the access constraints are mutually exclusive (disjoint thread sets), return false (no conflict) immediately, before any index-range analysis. - Measured improvement: barrierpipe_stage2 gemm latency 44ms → 25ms on H200. Issue tile-ai#2188 - Feature: expose native T.named_barrier_arrive(barrier_id, thread_count) - Adds tl.named_barrier_arrive intrinsic op (bar.arrive PTX, one-sided arrive). - Registers op in builtin.cc, declares in builtin.h. - Adds tl::__named_barrier_arrive<id, cnt>() template in common.h. - Adds codegen case in codegen_cuda.cc. - Exposes T.named_barrier_arrive() in tilelang/language/builtin.py. - Adds regression test testing/python/issue/test_tilelang_issue_2188_2192.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ires_cuda Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2a0667a to
1ae77e3
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tilelang/language/builtin.py`:
- Line 970: The comment for barrier_id contains an en dash (0–15) causing Ruff
RUF002; edit the text associated with barrier_id in tilelang/language/builtin.py
(the doc/comment containing "barrier_id: Named barrier index (0–15). May be a
variable (PrimExpr).") and replace the en dash with an ASCII hyphen ("0-15") so
the linter warning is resolved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8f937d63-16a8-4150-82b5-e73d83bb2184
📒 Files selected for processing (7)
src/backend/cuda/codegen/codegen_cuda.ccsrc/op/builtin.ccsrc/op/builtin.hsrc/tl_templates/cuda/common.hsrc/transform/thread_storage_sync.cctesting/python/issue/test_tilelang_issue_2192.pytilelang/language/builtin.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/op/builtin.h
- src/op/builtin.cc
- src/transform/thread_storage_sync.cc
- src/backend/cuda/codegen/codegen_cuda.cc
| T.sync_threads(ready_barrier, total_threads) | ||
|
|
||
| Args: | ||
| barrier_id: Named barrier index (0–15). May be a variable (PrimExpr). |
There was a problem hiding this comment.
Replace EN DASH with ASCII hyphen to avoid Ruff RUF002.
Line 970 uses –, which triggers the lint warning.
Proposed fix
- barrier_id: Named barrier index (0–15). May be a variable (PrimExpr).
+ barrier_id: Named barrier index (0-15). May be a variable (PrimExpr).📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| barrier_id: Named barrier index (0–15). May be a variable (PrimExpr). | |
| barrier_id: Named barrier index (0-15). May be a variable (PrimExpr). |
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 970-970: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tilelang/language/builtin.py` at line 970, The comment for barrier_id
contains an en dash (0–15) causing Ruff RUF002; edit the text associated with
barrier_id in tilelang/language/builtin.py (the doc/comment containing
"barrier_id: Named barrier index (0–15). May be a variable (PrimExpr).") and
replace the en dash with an ASCII hyphen ("0-15") so the linter warning is
resolved.
fix #2188
fix #2192
Summary by CodeRabbit
New Features
named_barrier_arrivesynchronization primitive for CUDA kernels, enabling one-sided barrier arrival without blocking.Improvements
Tests