75 changes: 75 additions & 0 deletions examples/pytorch/comm_gemm_overlap/README.md
@@ -9,6 +9,81 @@
- Devices older than compute capability 9.0 require `UB_SKIPMC=1` in the environment in order to fall
back on a less performant implementation based on CUDA Inter-Process Communication (IPC) handles.
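
As a rough illustration of the requirement above, the fallback could be selected programmatically. The helper names `needs_ub_skipmc` and `configure_ub_env` are hypothetical (they are not part of Transformer Engine), and the sketch assumes the capability is supplied as a `(major, minor)` tuple such as the one returned by `torch.cuda.get_device_capability()`.

```python
import os


def needs_ub_skipmc(capability):
    """Return True when the device is older than compute capability 9.0.

    `capability` is a (major, minor) tuple, e.g. the value returned by
    torch.cuda.get_device_capability(). Hypothetical helper, not a TE API.
    """
    return tuple(capability) < (9, 0)


def configure_ub_env(capability, env=None):
    """Set UB_SKIPMC=1 in `env` (default: os.environ) for pre-9.0 devices."""
    env = os.environ if env is None else env
    if needs_ub_skipmc(capability):
        env["UB_SKIPMC"] = "1"
    return env
```

Calling `configure_ub_env(torch.cuda.get_device_capability())` early in the launch script would then leave the environment untouched on newer devices.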

## Enabling overlap in your own module

The example follows the same setup sequence that user code should use:

1. Set `CUDA_DEVICE_MAX_CONNECTIONS=1` before creating the layer.
2. Initialize `torch.distributed` and create the tensor-parallel process group.
3. Call `te.module.base.initialize_ub(...)` with the local activation shape and tensor-parallel
size before constructing TE layers with userbuffer overlap enabled.
4. Pass the tensor-parallel group, tensor-parallel size, and overlap flags to the TE layer.
5. Call `te.module.base.destroy_ub()` before shutting down the process group.

Minimal setup sketch:

```python
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

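# Step 1: set before creating the layer.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

# Step 2: assumes a torchrun launch, which sets LOCAL_RANK for each process.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group(backend="nccl")
tp_group = dist.group.WORLD  # every rank belongs to one tensor-parallel group
tp_size = dist.get_world_size(tp_group)

# Representative placeholder values; replace with your model's configuration.
num_heads = 16
head_dim = 128
seq_length = 2048
micro_batch_size = 4

hidden_size = num_heads * head_dim
batched_size = seq_length * micro_batch_size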

te.module.base.initialize_ub(
    [batched_size, hidden_size],
    tp_size,
    quantization_modes=[te.module.base.UserBufferQuantizationMode.NONE],
    dtype=torch.bfloat16,
    bootstrap_backend="nccl",
)
# Review comment (Contributor, on lines +45 to +51):
# P2: The initialize_ub call in the sketch omits the quantization_modes argument that the
# actual example passes. For FP8 workloads the default (None) differs from the FP8 mode used
# in te_layer_with_overlap.py. A brief comment noting that quantization_modes should be set
# for FP8 training would prevent silent misconfiguration.
#
# Suggested change:
#     te.module.base.initialize_ub(
#         [batched_size, hidden_size],
#         tp_size,
#         dtype=torch.bfloat16,
#   +     # quantization_modes=[te.module.base.UserBufferQuantizationMode.FP8]  # add for FP8 training
#         bootstrap_backend="nccl",
#     )


layer = te.TransformerLayer(
    hidden_size,
    4 * hidden_size,
    num_heads,
    tp_group=tp_group,
    tp_size=tp_size,
    sequence_parallel=True,
    fuse_qkv_params=True,
    ub_tp_comm_overlap=True,
    ub_overlap_ag=True,
    ub_overlap_rs=True,
    ub_bulk_wgrad=True,
    ub_bulk_dgrad=True,
    seq_length=seq_length,
)
# Review comment (Contributor, on lines +42 to +67):
# P1: Undefined variables in runnable code snippet.
# The sketch uses num_heads, head_dim, seq_length, and batch_size before they are ever
# assigned, so any user who copies this block verbatim will immediately hit
# `NameError: name 'num_heads' is not defined`. The actual example in te_layer_with_overlap.py
# derives these from parsed arguments (config.num_heads, config.head_dim, config.seq_length,
# config.batch_size). The README snippet should either define these variables with
# representative placeholder values (e.g. num_heads = 64; head_dim = 128; seq_length = 2048;
# batch_size = 2) or add a clear comment that they must be set by the caller before this block.

# ... run forward/backward/optimizer steps ...

te.module.base.destroy_ub()
```
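
Step 5 also matters on error paths. One way to preserve the teardown order, sketched here with hypothetical stand-in callables rather than the real `te.module.base.destroy_ub` and `torch.distributed.destroy_process_group`, is a context manager whose `finally` block tears down the userbuffers first. The name `ub_session` is an assumption made for illustration.

```python
from contextlib import contextmanager


@contextmanager
def ub_session(init_ub, destroy_ub, destroy_process_group):
    """Run init_ub on entry; on exit tear down userbuffers, then the process group.

    All three arguments are caller-supplied callables (hypothetical wiring).
    In real code they would wrap te.module.base.initialize_ub(...),
    te.module.base.destroy_ub() and torch.distributed.destroy_process_group().
    """
    init_ub()
    try:
        yield
    finally:
        try:
            destroy_ub()
        finally:
            # Runs even if destroy_ub raises, and always after it.
            destroy_process_group()


# Demonstration with recording stubs instead of real distributed calls.
calls = []
with ub_session(
    lambda: calls.append("initialize_ub"),
    lambda: calls.append("destroy_ub"),
    lambda: calls.append("destroy_process_group"),
):
    calls.append("train")
```

Because the teardown lives in `finally`, an exception raised inside the `with` body still triggers both teardown calls in the same order.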

`ub_tp_comm_overlap` is the top-level gate on `TransformerLayer`: when it is `False`, the
layer disables the individual userbuffer overlap paths even if the per-path flags are `True`.
For lower-level layers such as `Linear`, `LayerNormLinear`, `LayerNormMLP`, or
`MultiheadAttention`, enable the relevant per-path flags directly (for example
`ub_overlap_ag`, `ub_overlap_rs`, `ub_bulk_wgrad`, and `ub_bulk_dgrad`) and set the `ub_name`
where the layer requires one.
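
The gating described above can be pictured as a logical AND between the top-level flag and each per-path flag. The snippet below only illustrates that documented behavior and is not Transformer Engine's implementation; `effective_overlap_flags` is a hypothetical name.

```python
def effective_overlap_flags(ub_tp_comm_overlap, **per_path_flags):
    """AND each requested per-path flag with the top-level gate."""
    return {
        name: bool(ub_tp_comm_overlap and value)
        for name, value in per_path_flags.items()
    }


requested = dict(
    ub_overlap_ag=True,
    ub_overlap_rs=True,
    ub_bulk_wgrad=True,
    ub_bulk_dgrad=False,
)

# With the gate off, every path is disabled regardless of the request.
gated_off = effective_overlap_flags(False, **requested)

# With the gate on, the per-path flags pass through unchanged.
gated_on = effective_overlap_flags(True, **requested)
```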

When replacing modules in a Hugging Face model, run the userbuffer initialization once before
constructing the replacement TE modules. The replacement modules need the same tensor-parallel
group, tensor-parallel size, sequence-parallel setting, and overlap flags shown above; the
activation shape passed to `initialize_ub` should match the sequence length, micro-batch size,
and hidden size used by the replaced blocks.
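
To keep the shape passed to `initialize_ub` consistent with the replaced blocks, the arithmetic from the setup sketch can be isolated in a small helper. `ub_activation_shape` is a hypothetical name, and the input values are assumed to come from the configuration of the model being patched.

```python
def ub_activation_shape(seq_length, micro_batch_size, hidden_size):
    """Return [seq_length * micro_batch_size, hidden_size], as in the setup sketch."""
    dims = {
        "seq_length": seq_length,
        "micro_batch_size": micro_batch_size,
        "hidden_size": hidden_size,
    }
    for name, value in dims.items():
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return [seq_length * micro_batch_size, hidden_size]


# Same placeholder values as the setup sketch: 16 heads of dimension 128.
shape = ub_activation_shape(seq_length=2048, micro_batch_size=4, hidden_size=16 * 128)
# shape == [8192, 2048]
```

The resulting list is what would be passed as the first argument to `initialize_ub`.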

## Examples

### Single node, tensor-parallel LayerNormMLP: