docs: expand comm gemm overlap guidance#3043
Conversation
Signed-off-by: Omri SirComp <omribz156@gmail.com>
Greptile SummaryThis PR adds a new "Enabling overlap in your own module" section to the
Confidence Score: 5/5Documentation-only change with no runtime code modifications; safe to merge. All changes are confined to a single README file. The code sketch is accurate against the actual example and the TransformerLayer source, and the two prose issues noted do not affect the existing example or any library code. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Env as OS Environment
participant Dist as torch.distributed
participant TE as transformer_engine
User->>Env: "Set CUDA_DEVICE_MAX_CONNECTIONS=1"
User->>Dist: "init_process_group(backend="nccl")"
User->>Dist: "new_group(backend="nccl") → tp_group"
User->>TE: initialize_ub([batched_size, hidden_size], tp_size, ...)
User->>TE: "TransformerLayer(..., tp_group, ub_tp_comm_overlap=True, ...)"
loop Training iterations
User->>TE: forward / backward / optimizer step
end
User->>TE: destroy_ub()
User->>Dist: destroy_process_group()
Reviews (2): Last reviewed commit: "docs: clarify comm overlap snippet defau..." | Re-trigger Greptile |
| hidden_size = num_heads * head_dim | ||
| batched_size = seq_length * batch_size | ||
|
|
||
| te.module.base.initialize_ub( | ||
| [batched_size, hidden_size], | ||
| tp_size, | ||
| dtype=torch.bfloat16, | ||
| bootstrap_backend="nccl", | ||
| ) | ||
|
|
||
| layer = te.TransformerLayer( | ||
| hidden_size, | ||
| 4 * hidden_size, | ||
| num_heads, | ||
| tp_group=tp_group, | ||
| tp_size=tp_size, | ||
| sequence_parallel=True, | ||
| fuse_qkv_params=True, | ||
| ub_tp_comm_overlap=True, | ||
| ub_overlap_ag=True, | ||
| ub_overlap_rs=True, | ||
| ub_bulk_wgrad=True, | ||
| ub_bulk_dgrad=True, | ||
| seq_length=seq_length, | ||
| ) |
There was a problem hiding this comment.
Undefined variables in runnable code snippet
The sketch uses num_heads, head_dim, seq_length, and batch_size before they are ever assigned, so any user who copies this block verbatim will immediately hit NameError: name 'num_heads' is not defined. The actual example in te_layer_with_overlap.py derives these from parsed arguments (config.num_heads, config.head_dim, config.seq_length, config.batch_size). The README snippet should either define these variables with representative placeholder values (e.g. num_heads = 64; head_dim = 128; seq_length = 2048; batch_size = 2) or add a clear comment that they must be set by the caller before this block.
| te.module.base.initialize_ub( | ||
| [batched_size, hidden_size], | ||
| tp_size, | ||
| dtype=torch.bfloat16, | ||
| bootstrap_backend="nccl", | ||
| ) |
There was a problem hiding this comment.
The
initialize_ub call in the sketch omits the quantization_modes argument that the actual example passes. For FP8 workloads the default (None) differs from the FP8 mode used in te_layer_with_overlap.py. A brief comment noting that quantization_modes should be set for FP8 training would prevent silent misconfiguration.
| te.module.base.initialize_ub( | |
| [batched_size, hidden_size], | |
| tp_size, | |
| dtype=torch.bfloat16, | |
| bootstrap_backend="nccl", | |
| ) | |
| te.module.base.initialize_ub( | |
| [batched_size, hidden_size], | |
| tp_size, | |
| dtype=torch.bfloat16, | |
| # quantization_modes=[te.module.base.UserBufferQuantizationMode.FP8] # add for FP8 training | |
| bootstrap_backend="nccl", | |
| ) |
Signed-off-by: Omri SirComp <omribz156@gmail.com>
|
Thanks, that was a good catch. I pushed a follow-up that defines the dimensions used by the sketch and passes Verification:
This follow-up was Codex-assisted and manually checked before pushing. |
Description
Adds source-backed setup guidance to the
comm_gemm_overlapexample README so users can see the userbuffer initialization flow, the relationship betweenub_tp_comm_overlapand individual overlap flags, and the key Hugging Face replacement considerations.Fixes #2618
Type of change
Changes
initialize_ub, overlap-enabled layer construction, anddestroy_ubTransformerLayersetup sketch based on the existing example flowub_tp_comm_overlapgates the lower-level overlap flags onTransformerLayerChecklist:
Verification:
No runtime tests were run because this is a docs-only change for a distributed CUDA example.
This was prepared with Codex assistance, with the final docs reviewed manually against the existing example and
TransformerLayersource.