refactor: optimize broadcast #850

Merged
Vinkle-hzt merged 1 commit into alibaba:main from Vinkle-hzt:op_broadcast
Apr 8, 2026
Conversation

@Vinkle-hzt
Collaborator

Pack tensors so that fewer broadcast kernels are launched.

@Vinkle-hzt Vinkle-hzt marked this pull request as ready for review April 1, 2026 09:38
@Vinkle-hzt Vinkle-hzt requested a review from LLLLKKKK as a code owner April 1, 2026 09:38
@LLLLKKKK
Collaborator

LLLLKKKK commented Apr 1, 2026

Code Review: PR #850 — refactor: optimize broadcast

Author: Vinkle-hzt
Review version: v1
Head SHA: da3a6d6
Verdict: LGTM

Summary

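Optimizes the NCCL broadcast in tpSyncModelInputs: instead of broadcasting N independent tensors one by one, the tensors are packed into at most 2 packed buffers (CPU pinned + GPU) and broadcast in one go, reducing the number of NCCL calls.

Core changes:

  1. Collect pointers to all tensors that take part in the broadcast (the collect lambda)
  2. Classify them by device type (CPU/GPU) and compute the packed sizes
  3. The root rank copies each tensor into the packed buffer via memcpy/copy_
  4. Perform 1-2 broadcasts (instead of ~20)
  5. Non-root ranks unpack the packed buffer back into the individual tensors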
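
The steps above can be sketched with plain byte buffers. This is only an illustration of the pack / broadcast / unpack pattern, not the PR's code: `Bytes`, `Entry`, `planLayout`, `pack`, and `unpack` are hypothetical names, the NCCL broadcast itself is left out, and the sketch assumes that non-root ranks already hold buffers of the right size. Offsets are packed tightly here to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one tensor's raw bytes; not a type from the PR.
using Bytes = std::vector<std::uint8_t>;

struct Entry {
    Bytes*      data;    // storage to pack from (root) or unpack into (non-root)
    std::size_t offset;  // byte offset of this entry inside the packed buffer
};

// Assign each non-empty buffer a slot in the packed buffer and return the total size.
// Every rank must pass the same buffers in the same order so the layouts agree.
std::size_t planLayout(const std::vector<Bytes*>& buffers, std::vector<Entry>& entries) {
    std::size_t total = 0;
    entries.clear();
    for (Bytes* b : buffers) {
        if (b == nullptr || b->empty()) continue;  // skip undefined or empty inputs
        entries.push_back({b, total});
        total += b->size();
    }
    return total;
}

// Root rank: copy every entry into its slot of the packed buffer.
void pack(const std::vector<Entry>& entries, Bytes& packed) {
    for (const Entry& e : entries) {
        std::memcpy(packed.data() + e.offset, e.data->data(), e.data->size());
    }
}

// Non-root ranks: copy every slot of the packed buffer back into its entry.
void unpack(const std::vector<Entry>& entries, const Bytes& packed) {
    for (const Entry& e : entries) {
        std::memcpy(e.data->data(), packed.data() + e.offset, e.data->size());
    }
}
```

With this shape, a single broadcast of `packed` between `pack` on the root and `unpack` on the other ranks would replace one broadcast per entry. The layout is derived independently on every rank, so only the payload needs to travel.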

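Assessment

The direction of the change is right: reducing the number of NCCL broadcast calls gives a clear benefit in multi-TP scenarios. The code is clearly structured, and the pack/unpack logic is symmetric.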

P2 Suggestions

1. GPU pack uses from_blob + copy_, which may be less efficient than cudaMemcpyAsync

auto src_bytes = torch::from_blob(contig.data_ptr(), {e.nbytes}, ...);
gpu_packed.narrow(0, e.offset, e.nbytes).copy_(src_bytes);

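Each entry creates a temporary tensor view and calls copy_. If there are many entries, consider replacing this with a single cudaMemcpyAsync, or concatenating all GPU tensors into one contiguous buffer beforehand. That said, the current implementation is correct.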

2. The contiguous() call may produce an unnecessary copy

auto contig = e.tensor->contiguous();

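If the tensor is already contiguous (which should be the case most of the time), contiguous() is a no-op. If it is not, an extra copy is made. Consider adding an assert or a debug log to detect non-contiguous tensors.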
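
To make the suggested check concrete, here is a minimal standalone sketch of what "contiguous" means in terms of sizes and strides. `isRowMajorContiguous` is a hypothetical helper written for illustration; in libtorch code the equivalent question would normally be answered by the tensor's own `is_contiguous()` method rather than by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when `strides` (measured in elements) describe a dense
// row-major layout for a tensor with the given `sizes`.
bool isRowMajorContiguous(const std::vector<std::int64_t>& sizes,
                          const std::vector<std::int64_t>& strides) {
    if (sizes.size() != strides.size()) return false;
    // A tensor with no elements has nothing to copy, so treat it as contiguous.
    for (std::int64_t s : sizes) {
        if (s == 0) return true;
    }
    std::int64_t expected = 1;  // stride a dense layout would have at this dimension
    for (std::size_t i = sizes.size(); i-- > 0;) {
        // Dimensions of size 1 are skipped: their stride never affects addressing.
        if (sizes[i] != 1 && strides[i] != expected) return false;
        expected *= sizes[i];
    }
    return true;
}
```

A debug-only assert or log line guarded by a check like this would flag the inputs for which contiguous() silently allocates and copies, without changing behavior in release builds.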

@LLLLKKKK
Collaborator

LLLLKKKK commented Apr 2, 2026

🤖 Code Review (v2 incremental) — LGTM

Verdict: LGTM ✅

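Core increment: tpSyncModelInputs changes from per-tensor broadcast to packed-buffer broadcast.

Optimization effect: the original implementation broadcast each tensor separately (N NCCL launches); the new one packs them into 1-2 buffers (CPU pinned + GPU), so at most 2 broadcasts are needed.

Implementation correctness:

  • 16-byte alignment keeps typed access safe and preserves GPU memory coalescing
  • collect filters with defined() && numel() > 0, and every rank collects tensors in the same order
  • The CPU buffer uses pin_memory() to satisfy NCCL's requirements
  • In DistributedComm.cc, .to() now passes non_blocking=true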
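
As an illustration of the 16-byte alignment point, the offset arithmetic can be sketched as follows. The names `kAlign`, `alignUp`, and `alignedOffsets` are hypothetical, and padding the total size up to a multiple of 16 is an assumption of this sketch, not something the review states about the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kAlign = 16;  // alignment named in the review

// Round n up to the next multiple of kAlign (works because kAlign is a power of two).
constexpr std::size_t alignUp(std::size_t n) {
    return (n + kAlign - 1) & ~(kAlign - 1);
}

// Given per-tensor byte sizes, return the aligned start offset of each tensor.
// `total` receives the size of the packed buffer, including trailing padding.
std::vector<std::size_t> alignedOffsets(const std::vector<std::size_t>& nbytes,
                                        std::size_t& total) {
    std::vector<std::size_t> offsets;
    offsets.reserve(nbytes.size());
    std::size_t cursor = 0;
    for (std::size_t n : nbytes) {
        offsets.push_back(cursor);     // cursor is always a multiple of kAlign here
        cursor = alignUp(cursor + n);  // pad so the next tensor starts aligned
    }
    total = cursor;
    return offsets;
}
```

For sizes of 4, 20, and 16 bytes this yields offsets 0, 16, and 48 with a 64-byte buffer. Provided the buffer's base address is itself at least 16-byte aligned, each slot can then be reinterpreted as any element type up to 16 bytes wide.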

Automated review by CI Bot

@Vinkle-hzt Vinkle-hzt enabled auto-merge (rebase) April 2, 2026 07:07
@Vinkle-hzt Vinkle-hzt merged commit d7118b9 into alibaba:main Apr 8, 2026
2 checks passed