Skip to content

[BugFix] Consider private SMEM in WS prelude#2217

Open
Rachmanino wants to merge 2 commits into
tile-ai:mainfrom
Rachmanino:fix/ws-prelude-shared-liveness
Open

[BugFix] Consider private SMEM in WS prelude#2217
Rachmanino wants to merge 2 commits into
tile-ai:mainfrom
Rachmanino:fix/ws-prelude-shared-liveness

Conversation

@Rachmanino
Copy link
Copy Markdown
Collaborator

@Rachmanino Rachmanino commented May 18, 2026

Fix #2209.

This PR fixes a warp-specialization prelude classification issue exposed by the BF16 MXFP4 Hopper dequant GEMM with bias.

After the tiled warp-specialization pass was moved before layout inference, layout inference started seeing the WS-expanded thread extent, i.e. consumer_extent + producer_extent, instead of the original consumer
thread extent. In the failing kernel, the bias initialization sequence:

T.copy(Bias, Bias_shared)
T.copy(Bias_shared, C_local)

is consumer-only prelude work. C_local is only used by the consumer-side GEMM/epilogue, and the producer side does not need either Bias_shared or C_local.

However, the previous WS prelude classifier only tracked local/fragment buffers as branch-private dependencies. It did not track shared-memory buffers. As a result, Bias_shared was invisible to the prelude liveness
propagation:

- Bias_shared -> C_local was classified as consumer-only because C_local is a fragment buffer.
- Bias -> Bias_shared was incorrectly kept in the common prelude because Bias_shared is shared memory and was not tracked.

Keeping Bias -> Bias_shared in the common prelude made layout inference infer this SIMT copy with the full WS thread extent. For this case, the copy iteration space is 256 x 128, while the WS-expanded extent is
384, which cannot form a bijective loop layout. This eventually caused LayoutInference to fail with no available layout found.

The fix separates two concepts that were previously conflated:

1. Prelude classification/liveness now tracks local, fragment, and shared buffers. This lets shared-memory values that are only consumed by one WS branch pull their definitions into that branch.
2. Producer-side buffer cloning remains restricted to local/fragment buffers. Shared memory is not cloned by default because shared buffers often represent producer-consumer communication, pipeline storage, layout
   annotations, and barrier-managed state.

With this change, consumer-only shared-memory prelude dataflow such as Bias -> Bias_shared -> C_local is moved into the consumer branch, so layout inference sees the correct consumer thread extent instead of the
full WS thread extent.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Refactor**
* Improved internal buffer-tracking to include shared buffers alongside local/fragment buffers.
* Updated prelude and liveness analysis so shared-buffer usage is considered when classifying and moving computations.
* Results: more accurate dataflow classification and safer extraction/sinking decisions, reducing incorrect code motion and improving reliability of generated pipelines.

<!-- review_stack_entry_start -->

[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/tile-ai/tilelang/pull/2217?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b9f566d-56ad-4ac5-a82a-eb190f5b8080

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0eb44 and 3e877fd.

📒 Files selected for processing (1)
  • src/transform/producer_consumer_ws.cc
💤 Files with no reviewable changes (1)
  • src/transform/producer_consumer_ws.cc

📝 Walkthrough

Walkthrough

LocalAccessCollector gains optional shared-memory buffer tracking via CollectWithShared(...) and track_shared_, access recording is unified under IsTrackedBuffer(...), and prelude classification/liveness now use the shared-aware collector.

Changes

Shared-Buffer Tracking in Prelude Analysis

Layer / File(s) Summary
LocalAccessCollector shared-buffer tracking infrastructure
src/transform/producer_consumer_ws.cc
LocalAccessCollector adds CollectWithShared(...), a track_shared_ member, and a unified IsTrackedBuffer(...) predicate controlling whether shared buffers are treated as tracked.
Access recording sites switched to IsTrackedBuffer
src/transform/producer_consumer_ws.cc
Buffer load/store recording, tile-op CopyNode/FillNode, tl::access_ptr() and builtin::tvm_access_ptr() base-buffer cases now use IsTrackedBuffer(...) so shared buffers contribute to the access summary when enabled.
Prelude analysis adoption
src/transform/producer_consumer_ws.cc
ClassifyPreludeStmt and ReplacePipelineLoopInStmt (backward liveness over prelude) now compute per-statement summaries with LocalAccessCollector::CollectWithShared(...), expanding def/use and liveness classification to include shared-buffer accesses.

Possibly Related PRs

  • tile-ai/tilelang#2055: Modifies prelude access/liveness handling in ReplacePipelineLoopInStmt and ClassifyPreludeStmt with shared-prelude handling.
  • tile-ai/tilelang#1973: Refactors LocalAccessCollector and related def/use/liveness logic across prelude/producer-consumer boundaries.

Suggested reviewers

  • LeiWang1999

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A shared buffer hops into view,
The collector now knows what to do,
Prelude truths and liveness sing,
Tracked paths lead to the right thing! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[BugFix] Consider private SMEM in WS prelude' clearly describes the main change: extending WS prelude classification to consider shared memory in addition to local/fragment buffers.
Linked Issues check ✅ Passed The PR directly addresses issue #2209 by fixing WS prelude classification to track shared-memory dataflow, enabling correct layout inference and resolving the 'no available layout found' compilation failure.
Out of Scope Changes check ✅ Passed All changes focus on extending LocalAccessCollector and prelude classification to track shared buffers, which is directly required to fix the linked issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] fail to run dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py on H20

2 participants