Skip to content

docs: expand comm gemm overlap guidance#3043

Open
omribz156 wants to merge 2 commits into
NVIDIA:mainfrom
omribz156:codex/comm-gemm-overlap-docs
Open

docs: expand comm gemm overlap guidance#3043
omribz156 wants to merge 2 commits into
NVIDIA:mainfrom
omribz156:codex/comm-gemm-overlap-docs

Conversation

@omribz156
Copy link
Copy Markdown

Description

Adds source-backed setup guidance to the comm_gemm_overlap example README so users can see the userbuffer initialization flow, the relationship between ub_tp_comm_overlap and individual overlap flags, and the key Hugging Face replacement considerations.

Fixes #2618

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Documented the setup order: environment, process group, initialize_ub, overlap-enabled layer construction, and destroy_ub
  • Added a minimal TransformerLayer setup sketch based on the existing example flow
  • Clarified that ub_tp_comm_overlap gates the lower-level overlap flags on TransformerLayer
  • Added cautious Hugging Face replacement guidance without changing example code

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Verification:

rg -n "initialize_ub|destroy_ub|ub_tp_comm_overlap|ub_overlap_ag|Hugging Face|CUDA_DEVICE_MAX_CONNECTIONS" examples\pytorch\comm_gemm_overlap\README.md examples\pytorch\comm_gemm_overlap\te_layer_with_overlap.py transformer_engine\pytorch\transformer.py
git diff --check

No runtime tests were run because this is a docs-only change for a distributed CUDA example.

This was prepared with Codex assistance, with the final docs reviewed manually against the existing example and TransformerLayer source.

Signed-off-by: Omri SirComp <omribz156@gmail.com>
@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 26, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR adds a new "Enabling overlap in your own module" section to the comm_gemm_overlap README, providing a five-step setup sequence and a self-contained Python sketch that illustrates userbuffer initialization, TransformerLayer construction with all overlap flags, and teardown order.

  • Adds placeholder variable definitions (num_heads, head_dim, etc.) and an explicit quantization_modes argument to initialize_ub, addressing gaps that would have caused NameError or silent FP8 misconfiguration for users copying the snippet verbatim.
  • Documents the ub_tp_comm_overlap gating behavior over per-path flags and adds guidance for Hugging Face module-replacement workflows.

Confidence Score: 5/5

Documentation-only change with no runtime code modifications; safe to merge.

All changes are confined to a single README file. The code sketch is accurate against the actual example and the TransformerLayer source, and the two prose issues noted do not affect the existing example or any library code.

No files require special attention.

Important Files Changed

Filename Overview
examples/pytorch/comm_gemm_overlap/README.md Adds a new 'Enabling overlap in your own module' section with a numbered setup sequence and a runnable Python sketch; minor prose imprecision in step 1 timing and use of dist.group.WORLD sentinel vs. an explicit NCCL subgroup.

Sequence Diagram

sequenceDiagram
    participant User
    participant Env as OS Environment
    participant Dist as torch.distributed
    participant TE as transformer_engine

    User->>Env: "Set CUDA_DEVICE_MAX_CONNECTIONS=1"
    User->>Dist: "init_process_group(backend="nccl")"
    User->>Dist: "new_group(backend="nccl") → tp_group"
    User->>TE: initialize_ub([batched_size, hidden_size], tp_size, ...)
    User->>TE: "TransformerLayer(..., tp_group, ub_tp_comm_overlap=True, ...)"
    loop Training iterations
        User->>TE: forward / backward / optimizer step
    end
    User->>TE: destroy_ub()
    User->>Dist: destroy_process_group()
Loading

Reviews (2): Last reviewed commit: "docs: clarify comm overlap snippet defau..." | Re-trigger Greptile

Comment on lines +37 to +61
hidden_size = num_heads * head_dim
batched_size = seq_length * batch_size

te.module.base.initialize_ub(
[batched_size, hidden_size],
tp_size,
dtype=torch.bfloat16,
bootstrap_backend="nccl",
)

layer = te.TransformerLayer(
hidden_size,
4 * hidden_size,
num_heads,
tp_group=tp_group,
tp_size=tp_size,
sequence_parallel=True,
fuse_qkv_params=True,
ub_tp_comm_overlap=True,
ub_overlap_ag=True,
ub_overlap_rs=True,
ub_bulk_wgrad=True,
ub_bulk_dgrad=True,
seq_length=seq_length,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Undefined variables in runnable code snippet

The sketch uses num_heads, head_dim, seq_length, and batch_size before they are ever assigned, so any user who copies this block verbatim will immediately hit NameError: name 'num_heads' is not defined. The actual example in te_layer_with_overlap.py derives these from parsed arguments (config.num_heads, config.head_dim, config.seq_length, config.batch_size). The README snippet should either define these variables with representative placeholder values (e.g. num_heads = 64; head_dim = 128; seq_length = 2048; batch_size = 2) or add a clear comment that they must be set by the caller before this block.

Comment on lines +40 to +45
te.module.base.initialize_ub(
[batched_size, hidden_size],
tp_size,
dtype=torch.bfloat16,
bootstrap_backend="nccl",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The initialize_ub call in the sketch omits the quantization_modes argument that the actual example passes. For FP8 workloads the default (None) differs from the FP8 mode used in te_layer_with_overlap.py. A brief comment noting that quantization_modes should be set for FP8 training would prevent silent misconfiguration.

Suggested change
te.module.base.initialize_ub(
[batched_size, hidden_size],
tp_size,
dtype=torch.bfloat16,
bootstrap_backend="nccl",
)
te.module.base.initialize_ub(
[batched_size, hidden_size],
tp_size,
dtype=torch.bfloat16,
# quantization_modes=[te.module.base.UserBufferQuantizationMode.FP8] # add for FP8 training
bootstrap_backend="nccl",
)

Signed-off-by: Omri SirComp <omribz156@gmail.com>
@omribz156
Copy link
Copy Markdown
Author

Thanks, that was a good catch. I pushed a follow-up that defines the dimensions used by the sketch and passes quantization_modes=[te.module.base.UserBufferQuantizationMode.NONE] explicitly, matching the source example's non-FP8 default path.

Verification:

  • rg -n "num_heads|head_dim|seq_length|micro_batch_size|batch_size|quantization_modes|UserBufferQuantizationMode" examples/pytorch/comm_gemm_overlap/README.md
  • git diff --check

This follow-up was Codex-assisted and manually checked before pushing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve comm_gemm_overlap documentation and HuggingFace integration guidance

1 participant