Commit 2fef374
fix: auto-compute dp_replicate_size from world_size (#1302)
## Summary
- When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs
across 2 nodes), `ParallelismConfig` raises the error `total_size (4)
does not match num_processes (8)`, because `dp_replicate_size` defaults to 1
- Auto-compute `dp_replicate_size = world_size // (dp_shard_size *
cp_size)` so intra-node FSDP2 sharding + inter-node data-parallel
replication works without manual config
- This enables `dp_shard_size` to be set to per-node GPU count (better
NVLink utilization) while automatically creating replicas across nodes
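The derivation above can be sketched as follows. This is a minimal illustration of the divisibility arithmetic only, with a hypothetical helper name; it is not the actual `ParallelismConfig` implementation, and the real code may validate or name things differently:

```python
def compute_dp_replicate_size(world_size: int, dp_shard_size: int,
                              cp_size: int = 1) -> int:
    """Derive dp_replicate_size so that
    dp_replicate_size * dp_shard_size * cp_size == world_size."""
    denom = dp_shard_size * cp_size
    if world_size % denom != 0:
        raise ValueError(
            f"world_size ({world_size}) is not divisible by "
            f"dp_shard_size * cp_size ({denom})"
        )
    return world_size // denom

# 8 GPUs across 2 nodes, FSDP2 sharding within each 4-GPU node:
# two replica groups, one per node.
assert compute_dp_replicate_size(world_size=8, dp_shard_size=4) == 2

# Single-node case unchanged: dp_shard_size == world_size gives 1 replica.
assert compute_dp_replicate_size(world_size=8, dp_shard_size=8) == 1
```

Setting `dp_shard_size` to the per-node GPU count keeps the FSDP2 all-gathers on NVLink, while the quotient above places one replica group per node for inter-node data parallelism.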
## Test plan
- [ ] Verify single-node training (dp_shard_size == world_size,
dp_replicate_size == 1) unchanged
- [ ] Verify multi-node with dp_shard_size < world_size creates correct
replica groups
- [ ] Verify existing EAGLE3/DFlash configs still work
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary by CodeRabbit
* **Refactor**
* Enhanced parallelism configuration initialization in the speculative
decoding example to better handle distributed training scenarios.
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent: 355c6b7
1 file changed: 16 additions & 1 deletion
(Diff body not captured in this export; only line numbers survived. The change adds new lines 215–227 and 229–231 and removes original line 216, with surrounding context at original lines 212–214 and 217–219.)