Skip to content

[Feature][BugFix] Support named barrier arrive and fix multiple bugs#2194

Open
Rachmanino wants to merge 4 commits into
tile-ai:mainfrom
Rachmanino:fix/issue-2188-2192
Open

[Feature][BugFix] Support named barrier arrive and fix multiple bugs#2194
Rachmanino wants to merge 4 commits into
tile-ai:mainfrom
Rachmanino:fix/issue-2188-2192

Conversation

@Rachmanino
Copy link
Copy Markdown
Collaborator

@Rachmanino Rachmanino commented May 13, 2026

fix #2188
fix #2192

Summary by CodeRabbit

  • New Features

    • Added named_barrier_arrive synchronization primitive for CUDA kernels, enabling one-sided barrier arrival without blocking.
    • Thread barrier synchronization now supports runtime parameters instead of compile-time only.
  • Improvements

    • Optimized synchronization for non-overlapping thread participation scenarios, reducing unnecessary barriers.
  • Tests

    • Added regression test for named barrier synchronization functionality.

Review Change Stack

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a named_barrier_arrive() intrinsic for one-sided CTA named barrier arrival, refactors barrier sync helpers from compile-time templates to runtime parameters, and optimizes away spurious synchronization barriers via Z3-based thread participation analysis. A regression test validates pipeline correctness without unwanted __syncthreads() calls.

Changes

Named Barrier Arrive Feature and Sync Optimization

Layer / File(s) Summary
Builtin operation definition and registration
src/op/builtin.h, src/op/builtin.cc
named_barrier_arrive() TIR builtin is declared with exported TVM_DLL signature and registered as a 2-input TL intrinsic with opaque call effect.
CUDA runtime barrier templates
src/tl_templates/cuda/common.h
__sync_thread_partial() converted from template-based to runtime-parameter helper accepting barrier_id and thread_count arguments; new __named_barrier_arrive(barrier_id, thread_count) helper added to emit bar.arrive inline PTX.
CUDA code generation visitor
src/backend/cuda/codegen/codegen_cuda.cc
VisitExpr_ case added for tl::named_barrier_arrive() that validates 2-argument arity and emits tl::__named_barrier_arrive(barrier_id, thread_count); call.
Python API wrapper
tilelang/language/builtin.py
New public named_barrier_arrive(barrier_id, thread_count) function wraps the TIR intrinsic for kernel authoring.
Thread sync conflict detection optimization
src/transform/thread_storage_sync.cc
FindConflict adds Z3-based early check: when thread participation constraints from two accesses are provably disjoint (via CanProve(Not(And(...)))), the function returns false to skip barrier insertion, enabling warp-specialized producer/consumer patterns.
Regression test for pipeline and sync correctness
testing/python/issue/test_tilelang_issue_2192.py
CUDA-only JIT test validates parity-based pipeline (T.ws scopes with mbarrier coordination) generates no spurious __syncthreads() in the main loop and produces correct matmul results.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tile-ai/tilelang#1760: Both PRs modify FindConflict logic in src/transform/thread_storage_sync.cc to use Z3-based constraint reasoning for barrier insertion decisions.
  • tile-ai/tilelang#2197: Both PRs update barrier sync plumbing to accept runtime barrier_id and thread_count parameters, affecting CUDA codegen and template emission.
  • tile-ai/tilelang#1146: Both PRs modify FindConflict and synchronization planning logic in src/transform/thread_storage_sync.cc.

Suggested labels

enhancement

Suggested reviewers

  • LeiWang1999

Poem

🐰 A barrier that arrives, then waits no more,
Now threads can split their labor, floor by floor,
Producer and consumer, side by side,
With named-barrier sync to synchronize their stride,
The pipeline flows, no spurious locks in sight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.82% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main changes: adding support for named barrier arrive (Feature) and fixing synchronization bugs (#2192, #2193).
Linked Issues check ✅ Passed All three linked issues (#2188, #2192, #2193) are addressed: named_barrier_arrive API added, CTA-wide __syncthreads() between T.ws scopes prevented, and variable barrier IDs now supported.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the three linked issues: barrier arrive support, thread storage sync optimization, runtime barrier IDs, and test coverage.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Rachmanino Rachmanino requested a review from LeiWang1999 May 13, 2026 06:45
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/cuda/codegen/codegen_cuda.cc`:
- Around line 2345-2352: The current tl::named_barrier_arrive() branch
dereferences op->args[0]/[1] as IntImmNode without validation and doesn't check
ranges; update the branch to first verify op->args.size() == 2 and that both
op->args[i].as<IntImmNode>() are non-null before accessing ->value, and add
domain checks: ensure barrier_id is within the valid barrier ID range (e.g.,
non-negative and within any known max) and that thread_count is positive (>=1).
If any check fails, emit an appropriate error/log (ICHECK or stream error) and
avoid printing the tl::__named_barrier_arrive call; otherwise proceed to use
barrier_id and thread_count as before.

In `@src/transform/thread_storage_sync.cc`:
- Around line 1698-1718: The early disjointness check is unsound because it
proves Not(prev_constr && curr_constr) using a single shared thread-variable
space (prev, curr bound to the same iv->var), which can miss inter-instance
RAW/WAR conflicts; instead, construct a pairwise check over two distinct thread
instances: create fresh copies of the thread loop variables for the "prev" and
"curr" instances (or use two analyzers / fresh var renaming) and bind each copy
to the same iv->dom before calling analyzer.z3_prover.CanProve on
Not(prev_constr(prev_vars) && curr_constr(curr_vars)); only return false when
that pairwise proof succeeds. Ensure you update the code paths around prev,
curr, prev_constr, curr_constr and the loop over prev.threads so the solver
reasons about two separate thread instances rather than a single shared space.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d78fd24d-c8eb-4efe-b801-6967318b4da9

📥 Commits

Reviewing files that changed from the base of the PR and between bcb2da3 and f892cab.

📒 Files selected for processing (7)
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/tl_templates/cuda/common.h
  • src/transform/thread_storage_sync.cc
  • testing/python/issue/test_tilelang_issue_2188_2192.py
  • tilelang/language/builtin.py

Comment on lines +2345 to +2352
} else if (op->op.same_as(tl::named_barrier_arrive())) {
ICHECK_EQ(op->args.size(), 2U)
<< "tl.named_barrier_arrive expects <barrier_id, thread_count>.";
auto barrier_id = op->args[0].as<IntImmNode>()->value;
auto thread_count = op->args[1].as<IntImmNode>()->value;
this->PrintIndent();
this->stream << "tl::__named_barrier_arrive<" << barrier_id << ", "
<< thread_count << ">();\n";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Validate and guard named_barrier_arrive arguments before dereference.

This path can null-deref when either argument is non-IntImm, and it currently skips basic domain checks (barrier_id range, positive thread_count).

🔧 Proposed fix
   } else if (op->op.same_as(tl::named_barrier_arrive())) {
     ICHECK_EQ(op->args.size(), 2U)
         << "tl.named_barrier_arrive expects <barrier_id, thread_count>.";
-    auto barrier_id = op->args[0].as<IntImmNode>()->value;
-    auto thread_count = op->args[1].as<IntImmNode>()->value;
+    const auto* barrier_id_imm = op->args[0].as<IntImmNode>();
+    const auto* thread_count_imm = op->args[1].as<IntImmNode>();
+    ICHECK(barrier_id_imm && thread_count_imm)
+        << "tl.named_barrier_arrive expects compile-time IntImm "
+        << "<barrier_id, thread_count>.";
+    auto barrier_id = barrier_id_imm->value;
+    auto thread_count = thread_count_imm->value;
+    ICHECK_GE(barrier_id, 0);
+    ICHECK_LT(barrier_id, 16) << "barrier_id must be in [0, 15].";
+    ICHECK_GT(thread_count, 0) << "thread_count must be > 0.";
     this->PrintIndent();
     this->stream << "tl::__named_barrier_arrive<" << barrier_id << ", "
                  << thread_count << ">();\n";
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/cuda/codegen/codegen_cuda.cc` around lines 2345 - 2352, The
current tl::named_barrier_arrive() branch dereferences op->args[0]/[1] as
IntImmNode without validation and doesn't check ranges; update the branch to
first verify op->args.size() == 2 and that both op->args[i].as<IntImmNode>() are
non-null before accessing ->value, and add domain checks: ensure barrier_id is
within the valid barrier ID range (e.g., non-negative and within any known max)
and that thread_count is positive (>=1). If any check fails, emit an appropriate
error/log (ICHECK or stream error) and avoid printing the
tl::__named_barrier_arrive call; otherwise proceed to use barrier_id and
thread_count as before.

Comment on lines +1698 to +1718
// If thread constraints are mutually exclusive (e.g. tx>=128 vs tx<128),
// the two accesses belong to disjoint warp groups. Cross-group ordering
// is NOT ThreadSyncPlanner's responsibility: __syncthreads() serializes
// all threads and destroys pipeline overlap. The correct tool is mbarrier
// (T.mbarrier_arrive / T.mbarrier_wait_parity), which the caller is
// assumed to have placed. Analogous to is_async_copy: we trust the caller
// to manage synchronization between disjoint thread groups explicitly.
{
PrimExpr prev_constr = prev.cset.ToConjunction();
PrimExpr curr_constr = curr.cset.ToConjunction();
arith::Analyzer analyzer;
for (const auto &iv : prev.threads) {
if (iv->dom.defined()) {
analyzer.Bind(iv->var, iv->dom);
}
}
if (analyzer.z3_prover.CanProve(
tir::Not(tir::And(prev_constr, curr_constr)))) {
return false;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Unsound early disjointness check can skip required RAW/WAR synchronization.

At Line 1698, Not(prev_constr && curr_constr) is proven with a single shared thread-variable space. For RAW/WAR, conflict must be checked over two thread instances; this early return can incorrectly suppress barriers for cross-thread producer/consumer communication.

Suggested fix
-    // If thread constraints are mutually exclusive (e.g. tx>=128 vs tx<128),
-    // the two accesses belong to disjoint warp groups. Cross-group ordering
-    // is NOT ThreadSyncPlanner's responsibility: __syncthreads() serializes
-    // all threads and destroys pipeline overlap. The correct tool is mbarrier
-    // (T.mbarrier_arrive / T.mbarrier_wait_parity), which the caller is
-    // assumed to have placed. Analogous to is_async_copy: we trust the caller
-    // to manage synchronization between disjoint thread groups explicitly.
-    {
-      PrimExpr prev_constr = prev.cset.ToConjunction();
-      PrimExpr curr_constr = curr.cset.ToConjunction();
-      arith::Analyzer analyzer;
-      for (const auto &iv : prev.threads) {
-        if (iv->dom.defined()) {
-          analyzer.Bind(iv->var, iv->dom);
-        }
-      }
-      if (analyzer.z3_prover.CanProve(
-              tir::Not(tir::And(prev_constr, curr_constr)))) {
-        return false;
-      }
-    }
+    // Safe only when conflict semantics use a shared thread variable space
+    // (RAR/WAW). RAW/WAR need two-thread reasoning and must not short-circuit.
+    bool same_access_type = (prev.type == kWrite && curr.type == kWrite) ||
+                            (prev.type == kRead && curr.type == kRead);
+    if (same_access_type) {
+      PrimExpr prev_constr = prev.cset.ToConjunction();
+      PrimExpr curr_constr = curr.cset.ToConjunction();
+      arith::Analyzer analyzer;
+      for (const auto &iv : prev.threads) {
+        if (iv->dom.defined()) {
+          analyzer.Bind(iv->var, iv->dom);
+        }
+      }
+      if (analyzer.z3_prover.CanProve(
+              tir::Not(tir::And(prev_constr, curr_constr)))) {
+        return false;
+      }
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/transform/thread_storage_sync.cc` around lines 1698 - 1718, The early
disjointness check is unsound because it proves Not(prev_constr && curr_constr)
using a single shared thread-variable space (prev, curr bound to the same
iv->var), which can miss inter-instance RAW/WAR conflicts; instead, construct a
pairwise check over two distinct thread instances: create fresh copies of the
thread loop variables for the "prev" and "curr" instances (or use two analyzers
/ fresh var renaming) and bind each copy to the same iv->dom before calling
analyzer.z3_prover.CanProve on Not(prev_constr(prev_vars) &&
curr_constr(curr_vars)); only return false when that pairwise proof succeeds.
Ensure you update the code paths around prev, curr, prev_constr, curr_constr and
the loop over prev.threads so the solver reasons about two separate thread
instances rather than a single shared space.

@Rachmanino Rachmanino removed the request for review from LeiWang1999 May 13, 2026 06:49
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tilelang/language/builtin.py (1)

940-947: ⚡ Quick win

Add fast-fail validation for literal barrier arguments.

At Line 940 and Line 950, integer literals are forwarded without range/domain checks. Validating literal barrier_id (0 <= id < 16) and thread_count (> 0) in the Python wrapper will catch invalid usage much earlier.

💡 Proposed patch
+def _normalize_named_barrier_arg(name: str, value):
+    v = convert(value)
+    if isinstance(v, tir.IntImm):
+        iv = int(v.value)
+        if name == "barrier_id" and not (0 <= iv < 16):
+            raise ValueError(f"{name} must be in [0, 15], got {iv}.")
+        if name == "thread_count" and iv <= 0:
+            raise ValueError(f"{name} must be > 0, got {iv}.")
+    return v
+
 def sync_threads(barrier_id=None, arrive_count=None):
     """Synchronize all threads in a block."""
     args = []
     if barrier_id is not None:
-        args.append(barrier_id)
+        args.append(_normalize_named_barrier_arg("barrier_id", barrier_id))
     if arrive_count is not None:
-        args.append(arrive_count)
+        args.append(_normalize_named_barrier_arg("thread_count", arrive_count))
     return tir.call_intrin("int32", "tir.tvm_storage_sync", "shared", *args)
@@
 def named_barrier_arrive(barrier_id, thread_count):
@@
-    return tir.call_intrin("handle", tir.op.Op.get("tl.named_barrier_arrive"), barrier_id, thread_count)
+    barrier_id = _normalize_named_barrier_arg("barrier_id", barrier_id)
+    thread_count = _normalize_named_barrier_arg("thread_count", thread_count)
+    return tir.call_intrin("handle", tir.op.Op.get("tl.named_barrier_arrive"), barrier_id, thread_count)

Also applies to: 950-977

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/builtin.py` around lines 940 - 947, The wrapper function
sync_threads should fast-fail on invalid literal arguments: if barrier_id is an
int, validate 0 <= barrier_id < 16 and raise a ValueError with a clear message
if out of range; if arrive_count is an int, validate arrive_count > 0 and raise
a ValueError if not; keep existing behavior for non-literal/expressions (only
validate when isinstance(..., int)). Apply the same pattern to any adjacent
wrapper(s) handling barrier_id/thread_count (the same checks for integer
literals) so invalid literal usage is caught early.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@testing/python/issue/test_tilelang_issue_2192.py`:
- Around line 3-4: The docstring/comments in test_tilelang_issue_2192.py use an
en dash (–) in the lines "#2188 – Feature: native T.named_barrier_arrive
(bar.arrive) API" and "#2192 – Bug: unexpected __syncthreads() inserted between
T.ws producer/consumer"; replace the en dash characters with a plain ASCII
hyphen ("-") so the lines read "#2188 - Feature: ..." and "#2192 - Bug: ..." to
satisfy Ruff RUF002 linting.
- Around line 77-85: The loop-detection is too fragile because it only triggers
when a line starts with "for (int ko" and it also ignores braces on the same
line; update the detection to a more robust test (e.g., use a regex like
r'^\s*for\s*\(.*\bint\s+ko\b' or simply check 'if stripped.startswith("for") and
"int ko" in stripped') to enter the loop-scan, initialize loop_depth using the
current line's brace counts (loop_depth = stripped.count("{") -
stripped.count("}")) when setting in_loop = True, and treat the loop as finished
whenever loop_depth <= 0 (remove the extra stripped == "}" check) so braces on
the same line or formatting changes won't cause false negatives when scanning
in_loop, in_loop, stripped, loop_depth.

---

Nitpick comments:
In `@tilelang/language/builtin.py`:
- Around line 940-947: The wrapper function sync_threads should fast-fail on
invalid literal arguments: if barrier_id is an int, validate 0 <= barrier_id <
16 and raise a ValueError with a clear message if out of range; if arrive_count
is an int, validate arrive_count > 0 and raise a ValueError if not; keep
existing behavior for non-literal/expressions (only validate when
isinstance(..., int)). Apply the same pattern to any adjacent wrapper(s)
handling barrier_id/thread_count (the same checks for integer literals) so
invalid literal usage is caught early.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4746834d-5377-438d-9ede-1a6747361001

📥 Commits

Reviewing files that changed from the base of the PR and between f892cab and 5fd3cec.

📒 Files selected for processing (4)
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/common.h
  • testing/python/issue/test_tilelang_issue_2192.py
  • tilelang/language/builtin.py

Comment on lines +3 to +4
#2188 – Feature: native T.named_barrier_arrive (bar.arrive) API
#2192 – Bug: unexpected __syncthreads() inserted between T.ws producer/consumer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace EN DASH with ASCII hyphen to satisfy Ruff RUF002.

Use - instead of in the docstring lines to avoid lint noise/failures.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 3-3: Docstring contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)


[warning] 4-4: Docstring contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/issue/test_tilelang_issue_2192.py` around lines 3 - 4, The
docstring/comments in test_tilelang_issue_2192.py use an en dash (–) in the
lines "#2188 – Feature: native T.named_barrier_arrive (bar.arrive) API" and
"#2192 – Bug: unexpected __syncthreads() inserted between T.ws
producer/consumer"; replace the en dash characters with a plain ASCII hyphen
("-") so the lines read "#2188 - Feature: ..." and "#2192 - Bug: ..." to satisfy
Ruff RUF002 linting.

Comment on lines +77 to +85
for line in source.split("\n"):
stripped = line.strip()
if not in_loop and stripped.startswith("for (int ko"):
in_loop = True
loop_depth = 0
if in_loop:
loop_depth += stripped.count("{") - stripped.count("}")
if loop_depth <= 0 and stripped == "}":
in_loop = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden loop detection to avoid false-negative regression passes.

The parser only enters loop-scanning when a line starts with for (int ko. Small codegen formatting/type changes can skip detection entirely, letting spurious __syncthreads() pass unnoticed.

Proposed fix
+    import re
     in_loop = False
     loop_depth = 0
     spurious_syncs = []
     for line in source.split("\n"):
         stripped = line.strip()
-        if not in_loop and stripped.startswith("for (int ko"):
+        if not in_loop and re.search(r"\bfor\s*\([^)]*\bko\b", stripped):
             in_loop = True
             loop_depth = 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for line in source.split("\n"):
stripped = line.strip()
if not in_loop and stripped.startswith("for (int ko"):
in_loop = True
loop_depth = 0
if in_loop:
loop_depth += stripped.count("{") - stripped.count("}")
if loop_depth <= 0 and stripped == "}":
in_loop = False
import re
in_loop = False
loop_depth = 0
spurious_syncs = []
for line in source.split("\n"):
stripped = line.strip()
if not in_loop and re.search(r"\bfor\s*\([^)]*\bko\b", stripped):
in_loop = True
loop_depth = 0
if in_loop:
loop_depth += stripped.count("{") - stripped.count("}")
if loop_depth <= 0 and stripped == "}":
in_loop = False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/issue/test_tilelang_issue_2192.py` around lines 77 - 85, The
loop-detection is too fragile because it only triggers when a line starts with
"for (int ko" and it also ignores braces on the same line; update the detection
to a more robust test (e.g., use a regex like r'^\s*for\s*\(.*\bint\s+ko\b' or
simply check 'if stripped.startswith("for") and "int ko" in stripped') to enter
the loop-scan, initialize loop_depth using the current line's brace counts
(loop_depth = stripped.count("{") - stripped.count("}")) when setting in_loop =
True, and treat the loop as finished whenever loop_depth <= 0 (remove the extra
stripped == "}" check) so braces on the same line or formatting changes won't
cause false negatives when scanning in_loop, in_loop, stripped, loop_depth.

@Rachmanino Rachmanino changed the title [Feature][BugFix] Support named barrier arrive and remove unnecessary __syncthreads() [Feature][BugFix] Support named barrier arrive and fix multiple bugs May 13, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
testing/python/issue/test_tilelang_issue_2192.py (2)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace EN DASH with ASCII hyphen to satisfy Ruff.

Line 3 still contains , which triggers RUF002 and can fail lint checks.

Proposed fix
-#2192 – Bug: unexpected __syncthreads() inserted between T.ws producer/consumer
+#2192 - Bug: unexpected __syncthreads() inserted between T.ws producer/consumer
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/issue/test_tilelang_issue_2192.py` at line 3, Update the
header comment in test_tilelang_issue_2192.py: replace the EN DASH character in
the line "#2192 – Bug: unexpected __syncthreads() inserted between T.ws
producer/consumer" with a standard ASCII hyphen, producing "#2192 - Bug:
unexpected __syncthreads() inserted between T.ws producer/consumer" to satisfy
Ruff RUF002.

78-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden loop detection so the regression check doesn’t false-pass.

Lines 78-84 depend on exact formatting (startswith("for (int ko")) and a brittle loop-exit condition. Small codegen changes can skip scanning and miss spurious __syncthreads();.

Proposed fix
+    import re
     in_loop = False
     loop_depth = 0
     spurious_syncs = []
     for line in source.split("\n"):
         stripped = line.strip()
-        if not in_loop and stripped.startswith("for (int ko"):
+        if not in_loop and re.search(r"\bfor\s*\([^)]*\bko\b", stripped):
             in_loop = True
-            loop_depth = 0
+            loop_depth = stripped.count("{") - stripped.count("}")
         if in_loop:
             loop_depth += stripped.count("{") - stripped.count("}")
-            if loop_depth <= 0 and stripped == "}":
+            if loop_depth <= 0:
                 in_loop = False
                 continue
             if "__syncthreads();" in stripped:
                 spurious_syncs.append(line)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/issue/test_tilelang_issue_2192.py` around lines 78 - 84, The
loop detection is brittle: replace the startswith("for (int ko") check with a
regex or string parsing that matches a for-header more robustly (e.g. match
leading whitespace then 'for' and a parenthesized header containing 'int ko' as
a word) and update the loop-exit logic so you track brace balance regardless of
exact line content (use loop_depth += stripped.count("{") - stripped.count("}")
when in_loop and consider the loop closed when loop_depth <= 0 without requiring
stripped == "}") — modify the variables in_loop and loop_depth handling in the
same block (where in_loop is set and where loop_depth is adjusted) to handle
braces on the same line as the header and single-line bodies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@testing/python/issue/test_tilelang_issue_2192.py`:
- Line 3: Update the header comment in test_tilelang_issue_2192.py: replace the
EN DASH character in the line "#2192 – Bug: unexpected __syncthreads() inserted
between T.ws producer/consumer" with a standard ASCII hyphen, producing "#2192 -
Bug: unexpected __syncthreads() inserted between T.ws producer/consumer" to
satisfy Ruff RUF002.
- Around line 78-84: The loop detection is brittle: replace the startswith("for
(int ko") check with a regex or string parsing that matches a for-header more
robustly (e.g. match leading whitespace then 'for' and a parenthesized header
containing 'int ko' as a word) and update the loop-exit logic so you track brace
balance regardless of exact line content (use loop_depth += stripped.count("{")
- stripped.count("}") when in_loop and consider the loop closed when loop_depth
<= 0 without requiring stripped == "}") — modify the variables in_loop and
loop_depth handling in the same block (where in_loop is set and where loop_depth
is adjusted) to handle braces on the same line as the header and single-line
bodies.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 830adf06-65ba-442f-80eb-df0af63155c1

📥 Commits

Reviewing files that changed from the base of the PR and between 2438a92 and 2a0667a.

📒 Files selected for processing (1)
  • testing/python/issue/test_tilelang_issue_2192.py

Rachmanino and others added 4 commits May 18, 2026 15:34
…2192) and add T.named_barrier_arrive (tile-ai#2188)

Issue tile-ai#2192 - BugFix: Unexpected CTA-wide __syncthreads() between T.ws producer/consumer
  - Root cause: FindConflict() in thread_storage_sync.cc did not check for mutually
    exclusive thread constraints. When T.ws(0) (tx<128) reads and T.ws(1) (tx>=128)
    writes the same shared buffer, no thread satisfies both constraints, so no
    __syncthreads() is needed — mbarriers already handle the synchronization.
  - Fix: add mutual-exclusivity check (prove NOT(P AND C)) early in FindConflict().
    If the access constraints are mutually exclusive (disjoint thread sets), return
    false (no conflict) immediately, before any index-range analysis.
  - Measured improvement: barrierpipe_stage2 gemm latency 44ms → 25ms on H200.

Issue tile-ai#2188 - Feature: expose native T.named_barrier_arrive(barrier_id, thread_count)
  - Adds tl.named_barrier_arrive intrinsic op (bar.arrive PTX, one-sided arrive).
  - Registers op in builtin.cc, declares in builtin.h.
  - Adds tl::__named_barrier_arrive<id, cnt>() template in common.h.
  - Adds codegen case in codegen_cuda.cc.
  - Exposes T.named_barrier_arrive() in tilelang/language/builtin.py.
  - Adds regression test testing/python/issue/test_tilelang_issue_2188_2192.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ires_cuda

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Rachmanino Rachmanino force-pushed the fix/issue-2188-2192 branch from 2a0667a to 1ae77e3 Compare May 18, 2026 07:39
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tilelang/language/builtin.py`:
- Line 970: The comment for barrier_id contains an en dash (0–15) causing Ruff
RUF002; edit the text associated with barrier_id in tilelang/language/builtin.py
(the doc/comment containing "barrier_id:   Named barrier index (0–15). May be a
variable (PrimExpr).") and replace the en dash with an ASCII hyphen ("0-15") so
the linter warning is resolved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f937d63-16a8-4150-82b5-e73d83bb2184

📥 Commits

Reviewing files that changed from the base of the PR and between 2a0667a and 1ae77e3.

📒 Files selected for processing (7)
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/tl_templates/cuda/common.h
  • src/transform/thread_storage_sync.cc
  • testing/python/issue/test_tilelang_issue_2192.py
  • tilelang/language/builtin.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/op/builtin.h
  • src/op/builtin.cc
  • src/transform/thread_storage_sync.cc
  • src/backend/cuda/codegen/codegen_cuda.cc

T.sync_threads(ready_barrier, total_threads)

Args:
barrier_id: Named barrier index (0–15). May be a variable (PrimExpr).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace EN DASH with ASCII hyphen to avoid Ruff RUF002.

Line 970 uses , which triggers the lint warning.

Proposed fix
-        barrier_id:   Named barrier index (0–15). May be a variable (PrimExpr).
+        barrier_id:   Named barrier index (0-15). May be a variable (PrimExpr).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
barrier_id: Named barrier index (015). May be a variable (PrimExpr).
barrier_id: Named barrier index (0-15). May be a variable (PrimExpr).
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 970-970: Docstring contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/builtin.py` at line 970, The comment for barrier_id
contains an en dash (0–15) causing Ruff RUF002; edit the text associated with
barrier_id in tilelang/language/builtin.py (the doc/comment containing
"barrier_id:   Named barrier index (0–15). May be a variable (PrimExpr).") and
replace the en dash with an ASCII hyphen ("0-15") so the linter warning is
resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant