docs: expand comm gemm overlap guidance by omribz156 · Pull Request #3043 · NVIDIA/TransformerEngine

omribz156 · 2026-05-26T04:04:49Z

Description

Adds source-backed setup guidance to the comm_gemm_overlap example README so users can see the userbuffer initialization flow, the relationship between ub_tp_comm_overlap and individual overlap flags, and the key Hugging Face replacement considerations.

Fixes #2618

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Documented the setup order: environment, process group, initialize_ub, overlap-enabled layer construction, and destroy_ub
Added a minimal TransformerLayer setup sketch based on the existing example flow
Clarified that ub_tp_comm_overlap gates the lower-level overlap flags on TransformerLayer
Added cautious Hugging Face replacement guidance without changing example code

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Verification:

rg -n "initialize_ub|destroy_ub|ub_tp_comm_overlap|ub_overlap_ag|Hugging Face|CUDA_DEVICE_MAX_CONNECTIONS" examples\pytorch\comm_gemm_overlap\README.md examples\pytorch\comm_gemm_overlap\te_layer_with_overlap.py transformer_engine\pytorch\transformer.py
git diff --check

No runtime tests were run because this is a docs-only change for a distributed CUDA example.

This was prepared with Codex assistance, with the final docs reviewed manually against the existing example and TransformerLayer source.

Signed-off-by: Omri SirComp <omribz156@gmail.com>

greptile-apps · 2026-05-26T04:06:55Z

Greptile Summary

This PR adds a new "Enabling overlap in your own module" section to the comm_gemm_overlap README, providing a five-step setup sequence and a self-contained Python sketch that illustrates userbuffer initialization, TransformerLayer construction with all overlap flags, and teardown order.

Adds placeholder variable definitions (num_heads, head_dim, etc.) and an explicit quantization_modes argument to initialize_ub, addressing gaps that would have caused NameError or silent FP8 misconfiguration for users copying the snippet verbatim.
Documents the ub_tp_comm_overlap gating behavior over per-path flags and adds guidance for Hugging Face module-replacement workflows.

Confidence Score: 5/5

Documentation-only change with no runtime code modifications; safe to merge.

All changes are confined to a single README file. The code sketch is accurate against the actual example and the TransformerLayer source, and the two prose issues noted do not affect the existing example or any library code.

No files require special attention.

Important Files Changed

Filename	Overview
examples/pytorch/comm_gemm_overlap/README.md	Adds a new 'Enabling overlap in your own module' section with a numbered setup sequence and a runnable Python sketch; minor prose imprecision in step 1 timing and use of dist.group.WORLD sentinel vs. an explicit NCCL subgroup.

Sequence Diagram

sequenceDiagram
    participant User
    participant Env as OS Environment
    participant Dist as torch.distributed
    participant TE as transformer_engine

    User->>Env: "Set CUDA_DEVICE_MAX_CONNECTIONS=1"
    User->>Dist: "init_process_group(backend="nccl")"
    User->>Dist: "new_group(backend="nccl") → tp_group"
    User->>TE: initialize_ub([batched_size, hidden_size], tp_size, ...)
    User->>TE: "TransformerLayer(..., tp_group, ub_tp_comm_overlap=True, ...)"
    loop Training iterations
        User->>TE: forward / backward / optimizer step
    end
    User->>TE: destroy_ub()
    User->>Dist: destroy_process_group()

_{Reviews (2): Last reviewed commit: "docs: clarify comm overlap snippet defau..." | Re-trigger Greptile}

greptile-apps · 2026-05-26T04:06:59Z

+hidden_size = num_heads * head_dim
+batched_size = seq_length * batch_size
+
+te.module.base.initialize_ub(
+    [batched_size, hidden_size],
+    tp_size,
+    dtype=torch.bfloat16,
+    bootstrap_backend="nccl",
+)
+
+layer = te.TransformerLayer(
+    hidden_size,
+    4 * hidden_size,
+    num_heads,
+    tp_group=tp_group,
+    tp_size=tp_size,
+    sequence_parallel=True,
+    fuse_qkv_params=True,
+    ub_tp_comm_overlap=True,
+    ub_overlap_ag=True,
+    ub_overlap_rs=True,
+    ub_bulk_wgrad=True,
+    ub_bulk_dgrad=True,
+    seq_length=seq_length,
+)


Undefined variables in runnable code snippet

The sketch uses num_heads, head_dim, seq_length, and batch_size before they are ever assigned, so any user who copies this block verbatim will immediately hit NameError: name 'num_heads' is not defined. The actual example in te_layer_with_overlap.py derives these from parsed arguments (config.num_heads, config.head_dim, config.seq_length, config.batch_size). The README snippet should either define these variables with representative placeholder values (e.g. num_heads = 64; head_dim = 128; seq_length = 2048; batch_size = 2) or add a clear comment that they must be set by the caller before this block.

greptile-apps · 2026-05-26T04:07:00Z

+te.module.base.initialize_ub(
+    [batched_size, hidden_size],
+    tp_size,
+    dtype=torch.bfloat16,
+    bootstrap_backend="nccl",
+)


The initialize_ub call in the sketch omits the quantization_modes argument that the actual example passes. For FP8 workloads the default (None) differs from the FP8 mode used in te_layer_with_overlap.py. A brief comment noting that quantization_modes should be set for FP8 training would prevent silent misconfiguration.

Suggested change

te.module.base.initialize_ub(

[batched_size, hidden_size],

tp_size,

dtype=torch.bfloat16,

bootstrap_backend="nccl",

)

te.module.base.initialize_ub(

[batched_size, hidden_size],

tp_size,

dtype=torch.bfloat16,

# quantization_modes=[te.module.base.UserBufferQuantizationMode.FP8] # add for FP8 training

bootstrap_backend="nccl",

)

Signed-off-by: Omri SirComp <omribz156@gmail.com>

omribz156 · 2026-05-26T04:39:12Z

Thanks, that was a good catch. I pushed a follow-up that defines the dimensions used by the sketch and passes quantization_modes=[te.module.base.UserBufferQuantizationMode.NONE] explicitly, matching the source example's non-FP8 default path.

Verification:

rg -n "num_heads|head_dim|seq_length|micro_batch_size|batch_size|quantization_modes|UserBufferQuantizationMode" examples/pytorch/comm_gemm_overlap/README.md
git diff --check

This follow-up was Codex-assisted and manually checked before pushing.

docs: expand comm gemm overlap guidance

7494499

Signed-off-by: Omri SirComp <omribz156@gmail.com>

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 26, 2026

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

docs: clarify comm overlap snippet defaults

cd3eecd

Signed-off-by: Omri SirComp <omribz156@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: expand comm gemm overlap guidance#3043

docs: expand comm gemm overlap guidance#3043
omribz156 wants to merge 2 commits into
NVIDIA:mainfrom
omribz156:codex/comm-gemm-overlap-docs

omribz156 commented May 26, 2026

Uh oh!

greptile-apps Bot commented May 26, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 26, 2026

Uh oh!

greptile-apps Bot May 26, 2026

Uh oh!

omribz156 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

omribz156 commented May 26, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

greptile-apps Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

omribz156 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 26, 2026 •

edited

Loading