When using any model with mamba attention, the `--mcp` flag must be set to the context parallel degree. Further, hybrid models that use a combination of mamba and SDPA attention should set both the `--mcp` flag and the `parallelism_config_cp_size` option to the same context parallel degree, as sketched below.
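For illustration, here is a minimal sketch of how the two settings line up; the context parallel degree of 4 is only an assumed example value, and the keys follow the accelerate config format used later in this section.

```
# accelerate config fragment (example values only)
use_parallelism_config: "true"  # required to turn on the parallelism feature
parallelism_config_cp_size: 4   # context parallel degree
# the training command must then pass the same degree via the flag: --mcp 4
```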
#### Enabling Context Parallel with Data Parallel
Context parallel can be combined with data parallel using the `parallelism_config_dp_shard_size` parameter.
```
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_version: "2" # turn on v2 of FSDP
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
use_parallelism_config: "true" # required to turn on parallelism feature
# parallelism_config_cp_size should also be set to the desired context parallel degree (see the note above)
parallelism_config_dp_shard_size: 8 # data parallel degree
machine_rank: 0
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
```
Note that the context parallel degree multiplied by the data parallel degree must equal the total number of GPUs being used.
#### Enabling Mixed Precision
Mixed precision must be provided using the `fsdp_mixed_precision_policy` parameter only. Do not use direct flags like `--bf16` or the `mixed_precision` accelerate config parameter.
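As a minimal sketch of where the policy sits in the accelerate config shown above, the `bf16` value below is only an assumed example, not a prescribed setting.

```
fsdp_config:
  # ... same FSDP settings as in the config above ...
  fsdp_mixed_precision_policy: bf16 # example value; set mixed precision here rather than via --bf16 or mixed_precision
```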
#### Enabling Context Parallel with Data Parallel and Expert Parallel
For MoE models, expert parallelism with MoE kernels can be enabled using the `--fast_moe` flag alongside context and data parallelism. The expert parallel degree is independent of the context parallel degree, so the flag can be used as described [here](./tuning-techniques.md#fms-acceleration).
### Recommendations
1. Keeping context parallelism within a node is usually optimal unless extremely long sequences (for example, 256k) are needed. Given that, choose a context parallel degree that is a multiple of 2, starting from 2 and up to 8.
2. The data parallel degree multiplied by the context parallel degree should equal the total number of GPUs being used.
3. The context parallel degree determines the number of chunks a sequence is divided into and distributed across GPUs, so it should be chosen as the minimum needed to accommodate the sequence length (see the sketch after this list).
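As an illustrative sketch of the rules above (the degrees here are assumed example values, not prescribed settings): on a single node with 8 GPUs, a context parallel degree of 4 combined with a data parallel degree of 2 works, since 4 x 2 = 8.

```
# assumed example for a single 8-GPU node
parallelism_config_cp_size: 4       # context parallel degree (each sequence is split into 4 chunks)
parallelism_config_dp_shard_size: 2 # data parallel degree; 4 x 2 = 8 GPUs
num_processes: 8
```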
### Known Limitations
1. Load balancing is removed given the limited support in the mamba context parallel implementation. This could lead to throughput drops for training runs that use a causal mask.
2. Padding-free and flash attention are not supported.