diff --git a/docs/guides/optimization/sharding.md b/docs/guides/optimization/sharding.md
index 7176c18613..1dfb685bdf 100644
--- a/docs/guides/optimization/sharding.md
+++ b/docs/guides/optimization/sharding.md
@@ -47,11 +47,11 @@ Explanation: Both the activations ($BM$) and weights ($ME$) are sharded on the M
 | $M$ | mlp_dim (aka intermediate dim) |
 | $X$ | expert |
 
-Note for the feed forward computation the batch and sequence dimensions act the same and thus we use only one $B$ axis (which you can think of as a token batch dimension, a reshaping of batch and sequence into one axis), but for context and sequence parallelism they act differently and thus we use both a $B$ and $S$ dimension and the $B$ dimension is really batch in sequences. For example a matmul with an explicit sequence dimension might look like
+Note for the feed forward computation the batch and sequence dimensions act the same and thus we use only one $B$ axis (which you can think of as a token batch dimension, a reshaping of batch and sequence into one axis), but for context parallelism they act differently and thus we use both a $B$ and $S$ dimension and the $B$ dimension is really batch in sequences. For example a matmul with an explicit sequence dimension might look like
 
 $$BSE \times EM = BSM$$
 
-But for arithmetic intensity roofline analysis purposes the $B$ and $S$ axis act as one, and generally we omit the $S$ axis except for when its needed (context/sequence parallelism), thus we only write
+But for arithmetic intensity roofline analysis purposes the $B$ and $S$ axis act as one, and generally we omit the $S$ axis except for when its needed (context parallelism), thus we only write
 
 $$BE \times EM = BM$$
 
@@ -113,7 +113,7 @@ $$B_yM_x \times M_xE = B_yE \rightarrow \text{RS over x } \rightarrow B_yE_x $$
 
 **Ratio (Arithmetic Intensity)** = $|M_x|$ Flops/byte
 
-This "independence" of sharding strategies is true for the main four parallelisms (data, model (tensor), pipeline, and expert). Note that data, fsdp, context and sequence parallelism are all roughly the same for the purpose of
+This "independence" of sharding strategies is true for the main four parallelisms (data, model (tensor), pipeline, and expert). Note that data, fsdp, and context parallelism are all roughly the same for the purpose of arithmetic intensity analysis since they shard the batch, as we will illustrate in the individual sections below. In addition both data and pipeline parallelism (microbatches) shard the batch which decreases the HBM arithmetic intensity.
 
 ## Code implementation of sharding in MaxText
 
@@ -270,28 +270,6 @@ The extra cost of all gathering of keys and values is small, especially for long
 
 **Ratio**: `seq_len * query_heads / (kv_heads * |CP|)`
 
-## Sequence Parallelism (SP)
-
-Sequence parallelism is very similar to context parallelism - we shard the layer inputs and feed forward activations along the sequence dimension. The difference is for attention - we shard the queries, keys, and values along the head dimension instead of sequence dimension (this is fairly MaxText specific, you might not see this in other codebases). This is because the head dimension is easy to shard on for attention (it is not a contracting dimension), and thus can be more efficient than context parallelism as long as there are enough heads. Both sequence parallelism and tensor parallelism shard the heads, so we are constrained by `tensor_parallelism * sequence_parallelism < kv_heads`. E.g. if there are only 8 `kv_heads` as for llama3 and we use `tensor_parallelism=8`, then we cannot use any `sequence_parallelism` (e.g. `sequence_parallelism=1`)
-
-Sequence parallelism is currently only supported with TPUs attention kernel, for GPUs we recommend context parallelism above.
-
-### SP Arithmetic Intensity
-
-The main communications are the same as `FSDP` (all gather weights and synchronize gradients), with an arithmetic intensity of `local_batch` / `sparsity`
-
-#### SP Extra A2A cost
-
-Sequence parallelism has an additional cost of transferring the sharding from sequence to heads (and back again) for attention. This is executed via and all-to-all which are generally cheap operations, analyzed below:
-
-**Compute**: Attention ($4 * batch * seq_len^2 * heads * head_dim / |SP|$)
-
-**Communicate:** A2A QKV activations and output activations (roughly $4 * batch * seq_len * heads * head_dim$)
-
-**Ratio (Arithmetic Intensity)**: Proportional to $seq_len / |SP|$
-
-The exact ratio depends on MHA vs GQA, how many kv heads there are and the efficiency of an all-to-all on the given hardware.
-
 ## Tensor Parallelism (TP)
 
 Shard the activations along the feature dimensions (e.g. model or `embed` dimension and intermediate or `mlp` dimension) instead of the batch dimension. Tensor parallelism communicates the activations as opposed to the weights as in DP/FSDP. Tensor parallelism can be used to replace some amount of DP/FSDP when the batch size is small and/or when the model is large (when the `mlp` dim is large). Tensor parallelism is needed to run with small batches, such as fraction `per_device_batch_size` < 1. For instance if we use `TP=4` then we can use the rest with FSDP and set `per_device_batch_size=0.25` since the `global_batch = per_device_batch_size * TP * FSDP = 0.25 * 4 * FSDP = FSDP`, and this is shardable among `FSDP` devices (each device will get a shard of `FSDP/FSDP = 1` of the batch axis in this case). For the attention activations (query, key, value), we shard the heads on `TP` since that is the easiest dimension to shard on and use an attention kernel like flash attention (the heads are not a contracting dimension during the attention computation).
diff --git a/src/maxtext/configs/base.yml b/src/maxtext/configs/base.yml index c88fb4767c..39ef5b2dee 100644 --- a/src/maxtext/configs/base.yml +++ b/src/maxtext/configs/base.yml @@ -445,29 +445,27 @@ compile_xla_flags: "" # Compiler options e.g. compile_xla_flags="--xla_tpu_num_s # Parallelism shard_mode: "auto" # can be either auto or explicit custom_mesh_and_rule: "" # replace default mesh and logical rule by specifying yml name under config/mesh_and_rule/. -mesh_axes: ['diloco', 'data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive'] +mesh_axes: ['diloco', 'data', 'stage', 'fsdp', 'fsdp_transpose', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive'] logical_axis_rules: [ # ========================================== # Vocabulary Embedding # ========================================== # Vocab Activations ['activation_embed_and_logits_batch', ['data', 'stage', 'fsdp', 'fsdp_transpose', 'expert']], - ['activation_embed_and_logits_batch_sequence', ['data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context', 'expert']], + ['activation_embed_and_logits_batch_sequence', ['data', 'stage', 'fsdp', 'fsdp_transpose', 'context', 'expert']], ['activation_vocab', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_vocab', ['tensor', 'tensor_transpose']], ['activation_vocab', 'tensor_sequence'], - ['activation_vocab', ['sequence', 'context']], # Vocab Weights ['vocab', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], - ['embed_vocab', ['fsdp', 'fsdp_transpose', 'sequence', 'context', 'expert']], + ['embed_vocab', ['fsdp', 'fsdp_transpose', 'context', 'expert']], # ========================================== # Attention # ========================================== # Attention Activations ['activation_batch_attn', ['data', 'fsdp', 'fsdp_transpose', 'expert']], - 
['activation_heads', ['tensor', 'tensor_transpose', 'sequence', 'tensor_sequence', 'autoregressive']], - ['activation_kv_heads', ['tensor', 'tensor_transpose', 'sequence', 'tensor_sequence']], - ['activation_length_attn', ['sequence', 'context']], + ['activation_heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], + ['activation_kv_heads', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_length_attn', ['context']], ['activation_q_length', ['context']], ['activation_kv_length', []], @@ -482,34 +480,33 @@ logical_axis_rules: [ ['qkv', []], ['kv', []], ['kv_head_dim', []], - ['q_lora', ['fsdp', 'fsdp_transpose', 'sequence', 'context', 'tensor_transpose', 'expert']], - ['q_lora', ['fsdp', 'sequence', 'context', 'tensor_transpose', 'expert']], - ['q_lora', ['fsdp', 'fsdp_transpose', 'sequence', 'context', 'expert']], - ['q_lora', ['fsdp', 'sequence', 'context', 'expert']], + ['q_lora', ['fsdp', 'fsdp_transpose', 'context', 'tensor_transpose', 'expert']], + ['q_lora', ['fsdp', 'context', 'tensor_transpose', 'expert']], + ['q_lora', ['fsdp', 'fsdp_transpose', 'context', 'expert']], + ['q_lora', ['fsdp', 'context', 'expert']], ["q_lora_up_proj", []], - ['kv_lora', ['fsdp', 'fsdp_transpose', 'sequence', 'context', 'tensor_transpose', 'expert']], - ['kv_lora', ['fsdp', 'sequence', 'context', 'tensor_transpose', 'expert']], - ['kv_lora', ['fsdp', 'fsdp_transpose', 'sequence', 'context', 'expert']], - ['kv_lora', ['fsdp', 'sequence', 'context', 'expert']], + ['kv_lora', ['fsdp', 'fsdp_transpose', 'context', 'tensor_transpose', 'expert']], + ['kv_lora', ['fsdp', 'context', 'tensor_transpose', 'expert']], + ['kv_lora', ['fsdp', 'fsdp_transpose', 'context', 'expert']], + ['kv_lora', ['fsdp', 'context', 'expert']], ["kv_lora_up_proj", []], # ========================================== # Mixture of Experts (MoE) # ========================================== # MoE Activations ['activation_batch_moe', ['data', 'fsdp', 'fsdp_transpose']], - 
['activation_length_moe', ['sequence', 'context']], ['activation_length_moe', ['context']], - ['activation_norm_length_moe', ['tensor_sequence', 'context', 'sequence']], + ['activation_norm_length_moe', ['tensor_sequence', 'context']], ['activation_embed_moe', ['tensor', 'tensor_transpose']], ['activation_mlp_moe', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_exp', ['expert']], # MoE Weights ['exp', 'expert'], ['mlp_moe', ['fsdp_transpose', 'tensor', 'tensor_sequence', 'autoregressive']], - ['embed_moe', ['fsdp', 'fsdp_transpose', 'sequence', 'tensor_transpose', 'context']], - ['embed_moe', ['fsdp', 'sequence', 'tensor_transpose', 'context']], - ['embed_moe', ['fsdp', 'fsdp_transpose', 'sequence', 'context']], - ['embed_moe', ['fsdp', 'sequence', 'context']], + ['embed_moe', ['fsdp', 'fsdp_transpose', 'tensor_transpose', 'context']], + ['embed_moe', ['fsdp', 'tensor_transpose', 'context']], + ['embed_moe', ['fsdp', 'fsdp_transpose', 'context']], + ['embed_moe', ['fsdp', 'context']], # ========================================== # Standard MLP / Dense Layers / Model Structure # ========================================== @@ -517,17 +514,16 @@ logical_axis_rules: [ ['activation_mlp', ['tensor', 'tensor_transpose', 'tensor_sequence']], # Note activation batch and length also get used in vocab ['activation_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], - ['activation_length', ['sequence', 'context']], ['activation_length', ['context']], - ['activation_norm_length', ['tensor_sequence', 'context', 'sequence']], + ['activation_norm_length', ['tensor_sequence', 'context']], ['activation_embed', ['tensor', 'tensor_transpose']], ['activation_stage', 'stage'], # General Weights ['mlp', ['fsdp_transpose', 'tensor', 'tensor_sequence', 'autoregressive']], - ['embed', ['fsdp', 'fsdp_transpose', 'sequence', 'tensor_transpose', 'context', 'expert']], - ['embed', ['fsdp', 'sequence', 'tensor_transpose', 'context', 'expert']], - ['embed', ['fsdp', 
'fsdp_transpose', 'sequence', 'context', 'expert']], - ['embed', ['fsdp', 'sequence', 'context', 'expert']], + ['embed', ['fsdp', 'fsdp_transpose', 'tensor_transpose', 'context', 'expert']], + ['embed', ['fsdp', 'tensor_transpose', 'context', 'expert']], + ['embed', ['fsdp', 'fsdp_transpose', 'context', 'expert']], + ['embed', ['fsdp', 'context', 'expert']], ['norm', ['tensor', 'tensor_transpose']], ['layers', 'stage'], ['diloco', 'diloco'], @@ -538,11 +534,11 @@ logical_axis_rules: [ # ========================================== # Inference(Prefill, Decode, Cache) # ========================================== - ['prefill_activation_length', ['sequence', 'context']], - ['prefill_activation_norm_length', ['tensor_sequence', 'context', 'sequence']], + ['prefill_activation_length', ['context']], + ['prefill_activation_norm_length', ['tensor_sequence', 'context']], ['activation_prefill_kv_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], ['decode_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], - ['decode_length', ['sequence']], + ['decode_length', []], ['cache_heads', ['autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence']], ['cache_heads', ['autoregressive', 'tensor', 'tensor_sequence']], ['paged_kv_heads', ['tensor']], @@ -562,7 +558,7 @@ logical_axis_rules: [ ['exp_with_fsdp', 'fsdp'], ] # Axes used for DCN must be earlier in this list than ICI, see (b/339009148) for details -data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']] +data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']] input_data_sharding_logical_axes: ['activation_embed_and_logits_batch', 'activation_norm_length'] # Determines which physical axis plays the role of context parallelism for input data processing and load balancing # 
only supports "context" or "expert" (when custom_mesh_and_rule=ep-as-cp) diff --git a/src/maxtext/configs/inference/inference.yml b/src/maxtext/configs/inference/inference.yml index 22c020f091..7a263cc282 100644 --- a/src/maxtext/configs/inference/inference.yml +++ b/src/maxtext/configs/inference/inference.yml @@ -3,13 +3,13 @@ base_config: "base.yml" logical_axis_rules: [ ['activation_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], ['activation_embed_and_logits_batch', ['data', 'stage', 'fsdp', 'fsdp_transpose', 'expert']], - ['activation_heads', ['tensor', 'tensor_transpose', 'sequence','tensor_sequence']], - ['activation_kv_heads', ['tensor', 'tensor_transpose', 'sequence','tensor_sequence']], - ['activation_length', ['context_autoregressive', 'sequence']], + ['activation_heads', ['tensor', 'tensor_transpose', 'tensor_sequence']], + ['activation_kv_heads', ['tensor', 'tensor_transpose', 'tensor_sequence']], + ['activation_length', ['context_autoregressive']], ['activation_length', ['context_autoregressive']], ['activation_q_length', ['context_autoregressive']], ['activation_kv_length', ['context_autoregressive']], - ['activation_norm_length', ['tensor_sequence', 'sequence']], + ['activation_norm_length', ['tensor_sequence']], ['activation_embed', ['tensor_transpose']], ['activation_mlp', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_mlp_moe', ['tensor', 'tensor_transpose', 'tensor_sequence']], @@ -17,10 +17,10 @@ logical_axis_rules: [ ['activation_prefill_kv_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], ['activation_kv_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert', 'context_autoregressive']], ['activation_kv_head_dim', ['tensor', 'tensor_transpose', 'tensor_sequence']], - ['activation_vocab', ['tensor', 'tensor_transpose', 'sequence', 'tensor_sequence']], + ['activation_vocab', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_vocab', ['tensor', 'tensor_transpose']], ['activation_vocab', 'tensor_sequence'], 
- ['activation_vocab', ['sequence', 'context_autoregressive']], + ['activation_vocab', ['context_autoregressive']], ['activation_stage', 'stage'], ['activation_exp', ['expert', 'context_autoregressive']], ['decode_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert', 'context_autoregressive']], @@ -32,18 +32,18 @@ logical_axis_rules: [ ['heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], ['q_heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], ['kv_heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], - ['embed', ['fsdp', 'fsdp_transpose', 'sequence', 'tensor_transpose', 'expert']], - ['embed', ['fsdp', 'sequence', 'tensor_transpose', 'expert']], - ['embed', ['fsdp', 'fsdp_transpose', 'sequence', 'expert']], - ['embed', ['fsdp', 'sequence', 'expert']], - ['embed_vocab', ['fsdp', 'fsdp_transpose', 'sequence', 'tensor_transpose', 'expert']], - ['embed_vocab', ['fsdp', 'sequence', 'tensor_transpose', 'expert']], - ['embed_vocab', ['fsdp', 'fsdp_transpose', 'sequence', 'expert']], - ['embed_vocab', ['fsdp', 'sequence', 'expert']], - ['embed_moe', ['fsdp', 'fsdp_transpose', 'sequence', 'context_autoregressive', 'tensor_transpose']], - ['embed_moe', ['fsdp', 'sequence', 'context_autoregressive', 'tensor_transpose']], - ['embed_moe', ['fsdp', 'fsdp_transpose', 'sequence', 'context_autoregressive']], - ['embed_moe', ['fsdp', 'sequence', 'context_autoregressive']], + ['embed', ['fsdp', 'fsdp_transpose', 'tensor_transpose', 'expert']], + ['embed', ['fsdp', 'tensor_transpose', 'expert']], + ['embed', ['fsdp', 'fsdp_transpose', 'expert']], + ['embed', ['fsdp', 'expert']], + ['embed_vocab', ['fsdp', 'fsdp_transpose', 'tensor_transpose', 'expert']], + ['embed_vocab', ['fsdp', 'tensor_transpose', 'expert']], + ['embed_vocab', ['fsdp', 'fsdp_transpose', 'expert']], + ['embed_vocab', ['fsdp', 'expert']], + ['embed_moe', ['fsdp', 'fsdp_transpose', 'context_autoregressive', 'tensor_transpose']], + 
['embed_moe', ['fsdp', 'context_autoregressive', 'tensor_transpose']], + ['embed_moe', ['fsdp', 'fsdp_transpose', 'context_autoregressive']], + ['embed_moe', ['fsdp', 'context_autoregressive']], ['norm', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['layers', 'stage'], ['kv', []], @@ -62,4 +62,4 @@ logical_axis_rules: [ ['paged_kv_head_dim_size', []], ] # Axes used for DCN must be earlier in this list than ICI, see (b/339009148) for details -data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']] \ No newline at end of file +data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']] \ No newline at end of file diff --git a/src/maxtext/configs/post_train/rl_mt_jt.yml b/src/maxtext/configs/post_train/rl_mt_jt.yml index 34829fbc19..e9d5108e23 100644 --- a/src/maxtext/configs/post_train/rl_mt_jt.yml +++ b/src/maxtext/configs/post_train/rl_mt_jt.yml @@ -19,23 +19,23 @@ logical_axis_rules: [ ['prefill_activation_norm_length', ['data']], ['activation_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], ['activation_embed_and_logits_batch', ['data', 'stage', 'fsdp', 'fsdp_transpose', 'expert']], - ['activation_heads', ['tensor', 'tensor_transpose', 'sequence','tensor_sequence']], - ['activation_kv_heads', ['tensor', 'tensor_transpose', 'sequence','tensor_sequence']], - ['activation_length', ['context_autoregressive', 'sequence']], + ['activation_heads', ['tensor', 'tensor_transpose','tensor_sequence']], + ['activation_kv_heads', ['tensor', 'tensor_transpose','tensor_sequence']], + ['activation_length', ['context_autoregressive']], ['activation_length', ['context_autoregressive']], ['activation_q_length', ['context_autoregressive']], ['activation_kv_length', ['context_autoregressive']], - ['activation_norm_length', ['tensor_sequence', 
'sequence']], + ['activation_norm_length', ['tensor_sequence']], ['activation_embed', ['tensor_transpose']], ['activation_mlp', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_kv', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_prefill_kv_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert']], ['activation_kv_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert', 'context_autoregressive']], ['activation_kv_head_dim', ['tensor', 'tensor_transpose', 'tensor_sequence']], - ['activation_vocab', ['tensor', 'tensor_transpose', 'sequence', 'tensor_sequence']], + ['activation_vocab', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['activation_vocab', ['tensor', 'tensor_transpose']], ['activation_vocab', 'tensor_sequence'], - ['activation_vocab', ['sequence', 'context_autoregressive']], + ['activation_vocab', ['context_autoregressive']], ['activation_stage', 'stage'], ['activation_exp', ['expert', 'context_autoregressive']], ['decode_batch', ['data', 'fsdp', 'fsdp_transpose', 'expert', 'context_autoregressive']], @@ -46,14 +46,14 @@ logical_axis_rules: [ ['heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], ['q_heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], ['kv_heads', ['tensor', 'tensor_transpose', 'tensor_sequence', 'autoregressive']], - ['embed', ['fsdp', 'fsdp_transpose', 'sequence', 'tensor_transpose', 'expert']], - ['embed', ['fsdp', 'sequence', 'tensor_transpose', 'expert']], - ['embed', ['fsdp', 'fsdp_transpose', 'sequence', 'expert']], - ['embed', ['fsdp', 'sequence', 'expert']], - ['embed_moe', ['fsdp', 'fsdp_transpose', 'sequence', 'context_autoregressive', 'tensor_transpose']], - ['embed_moe', ['fsdp', 'sequence', 'context_autoregressive', 'tensor_transpose']], - ['embed_moe', ['fsdp', 'fsdp_transpose', 'sequence', 'context_autoregressive']], - ['embed_moe', ['fsdp', 'sequence', 'context_autoregressive']], + ['embed', ['fsdp', 'fsdp_transpose', 'tensor_transpose', 
'expert']], + ['embed', ['fsdp', 'tensor_transpose', 'expert']], + ['embed', ['fsdp', 'fsdp_transpose', 'expert']], + ['embed', ['fsdp', 'expert']], + ['embed_moe', ['fsdp', 'fsdp_transpose', 'context_autoregressive', 'tensor_transpose']], + ['embed_moe', ['fsdp', 'context_autoregressive', 'tensor_transpose']], + ['embed_moe', ['fsdp', 'fsdp_transpose', 'context_autoregressive']], + ['embed_moe', ['fsdp', 'context_autoregressive']], ['norm', ['tensor', 'tensor_transpose', 'tensor_sequence']], ['layers', 'stage'], ['kv', []], @@ -72,6 +72,6 @@ logical_axis_rules: [ ['paged_kv_head_dim_size', []], ] # Axes used for DCN must be earlier in this list than ICI, see (b/339009148) for details -data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']] +data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']] return_log_prob: True \ No newline at end of file diff --git a/tests/unit/train_compile_test.py b/tests/unit/train_compile_test.py index 4230c46174..09098b8f0b 100644 --- a/tests/unit/train_compile_test.py +++ b/tests/unit/train_compile_test.py @@ -182,25 +182,6 @@ def test_save_compiled_tpu7x_two_slices(self): ) ) - @pytest.mark.cpu_only - def test_sequence_parallelism(self): - temp_dir = gettempdir() - compiled_trainstep_file = os.path.join(temp_dir, "test_compiled.pickle") - train_compile_main( - ( - "", - get_test_config_path(), - f"compiled_trainstep_file={compiled_trainstep_file}", - "compile_topology=v5p-64", - "use_iota_embed=true", - "compile_topology_num_slices=1", - "ici_sequence_parallelism=16", - "global_parameter_scale=32", - "per_device_batch_size=0.0625", - "max_target_length=65536", - ) - ) - @pytest.mark.cpu_only def test_remat_save_dot_except_mlpwi(self): temp_dir = gettempdir() @@ -305,7 +286,7 @@ def test_custom_64x4_mesh(self): 
"compile_topology=v6e-256", "use_iota_embed=true", "compile_topology_num_slices=1", - "ici_sequence_parallelism=4", + "ici_context_parallelism=4", "global_parameter_scale=32", "per_device_batch_size=0.25", "max_target_length=65536", diff --git a/tests/utils/sharding_info/deepseek2-16b/tpu7x-16/slice_1/rule_default/named_shardings.json b/tests/utils/sharding_info/deepseek2-16b/tpu7x-16/slice_1/rule_default/named_shardings.json index 2c9e622ff4..4549ab46c5 100644 --- a/tests/utils/sharding_info/deepseek2-16b/tpu7x-16/slice_1/rule_default/named_shardings.json +++ b/tests/utils/sharding_info/deepseek2-16b/tpu7x-16/slice_1/rule_default/named_shardings.json @@ -7,7 +7,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -22,7 +21,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -43,7 +41,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -58,7 +55,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -86,7 +82,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -101,7 +96,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -114,7 +108,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -141,7 +134,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -156,7 +148,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -169,7 +160,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -196,7 +186,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -211,7 
+200,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -231,7 +219,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -251,7 +238,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -266,7 +252,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -296,7 +281,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -311,7 +295,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -341,7 +324,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -356,7 +338,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -386,7 +367,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -401,7 +381,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -423,7 +402,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -443,7 +421,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -458,7 +435,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -472,7 +448,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -500,7 +475,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -515,7 +489,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -529,7 +502,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", 
"context", "expert" @@ -551,7 +523,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -566,7 +537,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -580,7 +550,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -608,7 +577,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -623,7 +591,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -637,7 +604,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -661,7 +627,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -676,7 +641,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -690,7 +654,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context" ], @@ -711,7 +674,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -726,7 +688,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -741,7 +702,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -767,7 +727,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -782,7 +741,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -797,7 +755,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -823,7 +780,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -838,7 +794,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -859,7 +814,6 @@ ], [ 
"fsdp", - "sequence", "tensor_transpose", "context" ] @@ -879,7 +833,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -894,7 +847,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -907,7 +859,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -934,7 +885,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -949,7 +899,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -962,7 +911,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -989,7 +937,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1004,7 +951,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1024,7 +970,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1044,7 +989,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1059,7 +1003,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1089,7 +1032,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1104,7 +1046,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1134,7 +1075,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1149,7 +1089,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1179,7 +1118,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", 
"tensor", @@ -1194,7 +1132,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1216,7 +1153,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1236,7 +1172,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1251,7 +1186,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1265,7 +1199,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1293,7 +1226,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1308,7 +1240,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1322,7 +1253,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1344,7 +1274,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1359,7 +1288,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1373,7 +1301,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1401,7 +1328,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1416,7 +1342,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1436,7 +1361,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1454,7 +1378,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1469,7 +1392,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1490,7 +1412,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", 
"tensor", @@ -1505,7 +1426,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1533,7 +1453,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1548,7 +1467,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1561,7 +1479,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1588,7 +1505,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1603,7 +1519,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1616,7 +1531,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1643,7 +1557,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1658,7 +1571,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1678,7 +1590,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1698,7 +1609,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1713,7 +1623,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1743,7 +1652,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1758,7 +1666,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1788,7 +1695,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1803,7 +1709,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 
1, @@ -1833,7 +1738,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1848,7 +1752,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1870,7 +1773,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1890,7 +1792,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1905,7 +1806,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1919,7 +1819,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1947,7 +1846,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1962,7 +1860,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1976,7 +1873,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1998,7 +1894,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2013,7 +1908,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2027,7 +1921,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2055,7 +1948,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2070,7 +1962,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2084,7 +1975,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2108,7 +1998,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2123,7 +2012,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 
1, @@ -2137,7 +2025,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context" ], @@ -2158,7 +2045,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2173,7 +2059,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2188,7 +2073,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2214,7 +2098,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2229,7 +2112,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2244,7 +2126,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2270,7 +2151,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2285,7 +2165,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2306,7 +2185,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -2326,7 +2204,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2341,7 +2218,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2354,7 +2230,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2381,7 +2256,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2396,7 +2270,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2409,7 +2282,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2436,7 +2308,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2451,7 +2322,6 @@ 
"stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2471,7 +2341,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2491,7 +2360,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2506,7 +2374,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2536,7 +2403,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2551,7 +2417,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2581,7 +2446,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2596,7 +2460,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2626,7 +2489,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2641,7 +2503,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2663,7 +2524,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2683,7 +2543,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2698,7 +2557,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2712,7 +2570,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2740,7 +2597,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2755,7 +2611,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2769,7 +2624,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", 
"tensor_transpose", "context", "expert" @@ -2791,7 +2645,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2806,7 +2659,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2820,7 +2672,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2848,7 +2699,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2863,7 +2713,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2883,7 +2732,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2901,7 +2749,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2916,7 +2763,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2944,7 +2790,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2959,7 +2804,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2972,7 +2816,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2999,7 +2842,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3014,7 +2856,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3027,7 +2868,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3054,7 +2894,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3069,7 +2908,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3089,7 +2927,6 @@ 
null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3109,7 +2946,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3124,7 +2960,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3154,7 +2989,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3169,7 +3003,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3199,7 +3032,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3214,7 +3046,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3244,7 +3075,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3259,7 +3089,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3281,7 +3110,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -3301,7 +3129,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3316,7 +3143,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3330,7 +3156,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3358,7 +3183,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3373,7 +3197,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3387,7 +3210,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -3409,7 +3231,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", 
"context_autoregressive", "tensor", @@ -3424,7 +3245,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3438,7 +3258,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3466,7 +3285,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3481,7 +3299,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3495,7 +3312,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3519,7 +3335,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3534,7 +3349,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3548,7 +3362,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context" ], @@ -3569,7 +3382,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3584,7 +3396,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3599,7 +3410,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3625,7 +3435,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3640,7 +3449,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3655,7 +3463,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3681,7 +3488,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3696,7 +3502,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3717,7 +3522,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -3737,7 
+3541,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3752,7 +3555,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3765,7 +3567,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3792,7 +3593,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3807,7 +3607,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3820,7 +3619,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3847,7 +3645,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3862,7 +3659,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3882,7 +3678,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3902,7 +3697,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3917,7 +3711,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3947,7 +3740,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3962,7 +3754,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3992,7 +3783,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4007,7 +3797,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4037,7 +3826,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4052,7 +3840,6 @@ "stage": 1, 
"fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4074,7 +3861,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4094,7 +3880,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4109,7 +3894,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4123,7 +3907,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4151,7 +3934,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4166,7 +3948,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4180,7 +3961,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -4202,7 +3982,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4217,7 +3996,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4231,7 +4009,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4259,7 +4036,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4274,7 +4050,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4294,7 +4069,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4312,7 +4086,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4327,7 +4100,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, diff --git a/tests/utils/sharding_info/deepseek2-16b/v6e-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=4/named_shardings.json 
b/tests/utils/sharding_info/deepseek2-16b/v6e-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=4/named_shardings.json index d62417df12..ca994b50f6 100644 --- a/tests/utils/sharding_info/deepseek2-16b/v6e-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=4/named_shardings.json +++ b/tests/utils/sharding_info/deepseek2-16b/v6e-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=4/named_shardings.json @@ -7,7 +7,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -22,7 +21,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -43,7 +41,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -58,7 +55,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -86,7 +82,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -101,7 +96,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -114,7 +108,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -141,7 +134,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -156,7 +148,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -169,7 +160,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -196,7 +186,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -211,7 +200,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -231,7 +219,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", 
"context", "expert" @@ -251,7 +238,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -266,7 +252,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -296,7 +281,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -311,7 +295,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -341,7 +324,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -356,7 +338,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -386,7 +367,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -401,7 +381,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -423,7 +402,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -443,7 +421,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -458,7 +435,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -472,7 +448,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -500,7 +475,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -515,7 +489,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -529,7 +502,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -551,7 +523,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -566,7 +537,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - 
"sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -580,7 +550,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -608,7 +577,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -623,7 +591,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -637,7 +604,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -661,7 +627,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -676,7 +641,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -690,7 +654,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context" ], @@ -711,7 +674,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -726,7 +688,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -741,7 +702,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -767,7 +727,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -782,7 +741,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -797,7 +755,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -823,7 +780,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -838,7 +794,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -859,7 +814,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -879,7 +833,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -894,7 +847,6 @@ "stage": 1, 
"fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -907,7 +859,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -934,7 +885,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -949,7 +899,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -962,7 +911,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -989,7 +937,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1004,7 +951,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1024,7 +970,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1044,7 +989,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1059,7 +1003,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1089,7 +1032,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1104,7 +1046,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1134,7 +1075,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1149,7 +1089,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1179,7 +1118,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1194,7 +1132,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1216,7 +1153,6 @@ [ "fsdp", "fsdp_transpose", - 
"sequence", "context", "expert" ] @@ -1236,7 +1172,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1251,7 +1186,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1265,7 +1199,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1293,7 +1226,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1308,7 +1240,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1322,7 +1253,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1344,7 +1274,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1359,7 +1288,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1373,7 +1301,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1401,7 +1328,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1416,7 +1342,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1436,7 +1361,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1454,7 +1378,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1469,7 +1392,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1490,7 +1412,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1505,7 +1426,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1533,7 +1453,6 @@ "stage", "fsdp", "fsdp_transpose", - 
"sequence", "context", "context_autoregressive", "tensor", @@ -1548,7 +1467,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1561,7 +1479,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1588,7 +1505,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1603,7 +1519,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1616,7 +1531,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1643,7 +1557,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1658,7 +1571,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1678,7 +1590,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1698,7 +1609,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1713,7 +1623,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1743,7 +1652,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1758,7 +1666,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1788,7 +1695,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1803,7 +1709,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1833,7 +1738,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1848,7 +1752,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 
1, "context_autoregressive": 1, "tensor": 1, @@ -1870,7 +1773,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1890,7 +1792,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1905,7 +1806,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1919,7 +1819,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1947,7 +1846,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1962,7 +1860,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1976,7 +1873,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1998,7 +1894,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2013,7 +1908,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2027,7 +1921,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2055,7 +1948,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2070,7 +1962,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2084,7 +1975,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2108,7 +1998,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2123,7 +2012,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2137,7 +2025,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context" ], @@ -2158,7 +2045,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ 
-2173,7 +2059,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2188,7 +2073,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2214,7 +2098,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2229,7 +2112,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2244,7 +2126,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2270,7 +2151,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2285,7 +2165,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2306,7 +2185,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -2326,7 +2204,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2341,7 +2218,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2354,7 +2230,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2381,7 +2256,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2396,7 +2270,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2409,7 +2282,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2436,7 +2308,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2451,7 +2322,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2471,7 +2341,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2491,7 +2360,6 @@ 
"stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2506,7 +2374,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2536,7 +2403,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2551,7 +2417,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2581,7 +2446,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2596,7 +2460,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2626,7 +2489,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2641,7 +2503,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2663,7 +2524,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2683,7 +2543,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2698,7 +2557,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2712,7 +2570,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2740,7 +2597,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2755,7 +2611,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2769,7 +2624,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -2791,7 +2645,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2806,7 +2659,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, 
"context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2820,7 +2672,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2848,7 +2699,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2863,7 +2713,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2883,7 +2732,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2901,7 +2749,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2916,7 +2763,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2944,7 +2790,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2959,7 +2804,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2972,7 +2816,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -2999,7 +2842,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3014,7 +2856,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3027,7 +2868,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3054,7 +2894,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3069,7 +2908,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3089,7 +2927,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3109,7 +2946,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3124,7 +2960,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 
1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3154,7 +2989,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3169,7 +3003,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3199,7 +3032,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3214,7 +3046,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3244,7 +3075,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3259,7 +3089,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3281,7 +3110,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -3301,7 +3129,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3316,7 +3143,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3330,7 +3156,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3358,7 +3183,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3373,7 +3197,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3387,7 +3210,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -3409,7 +3231,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3424,7 +3245,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3438,7 +3258,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3466,7 
+3285,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3481,7 +3299,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3495,7 +3312,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3519,7 +3335,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3534,7 +3349,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3548,7 +3362,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context" ], @@ -3569,7 +3382,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3584,7 +3396,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3599,7 +3410,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3625,7 +3435,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3640,7 +3449,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3655,7 +3463,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3681,7 +3488,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3696,7 +3502,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3717,7 +3522,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -3737,7 +3541,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3752,7 +3555,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3765,7 +3567,6 @@ 
"partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3792,7 +3593,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3807,7 +3607,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3820,7 +3619,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3847,7 +3645,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3862,7 +3659,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3882,7 +3678,6 @@ null, [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -3902,7 +3697,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3917,7 +3711,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3947,7 +3740,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3962,7 +3754,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3992,7 +3783,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4007,7 +3797,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4037,7 +3826,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4052,7 +3840,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4074,7 +3861,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4094,7 +3880,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", 
"context_autoregressive", "tensor", @@ -4109,7 +3894,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4123,7 +3907,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4151,7 +3934,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4166,7 +3948,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4180,7 +3961,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -4202,7 +3982,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4217,7 +3996,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4231,7 +4009,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4259,7 +4036,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4274,7 +4050,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4294,7 +4069,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4312,7 +4086,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4327,7 +4100,6 @@ "stage": 1, "fsdp": 4, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, diff --git a/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default/named_shardings.json b/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default/named_shardings.json index 78e42a8848..7e1b2785ae 100644 --- a/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default/named_shardings.json +++ b/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default/named_shardings.json @@ -7,7 +7,6 
@@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -22,7 +21,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -43,7 +41,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -58,7 +55,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -86,7 +82,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -101,7 +96,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -135,7 +129,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -150,7 +143,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -164,7 +156,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -192,7 +183,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -207,7 +197,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -221,7 +210,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -241,7 +229,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -256,7 +243,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -278,7 +264,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -298,7 +283,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -313,7 +297,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -347,7 +330,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -362,7 +344,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -376,7 +357,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -404,7 +384,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -419,7 +398,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -446,7 +424,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -461,7 +438,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -495,7 +471,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -510,7 +485,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -524,7 +498,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -552,7 +525,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -567,7 +539,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -594,7 +565,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -609,7 +579,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -623,7 +592,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -645,7 +613,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -660,7 +627,6 @@ "stage": 1, "fsdp": 16, 
"fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -675,7 +641,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -701,7 +666,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -716,7 +680,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -749,7 +712,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -764,7 +726,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -779,7 +740,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -805,7 +765,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -820,7 +779,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -853,7 +811,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -868,7 +825,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -889,7 +845,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -909,7 +864,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -924,7 +878,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -956,7 +909,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -971,7 +923,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1001,7 +952,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1016,7 
+966,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1046,7 +995,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1061,7 +1009,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1095,7 +1042,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1110,7 +1056,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1124,7 +1069,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1152,7 +1096,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1167,7 +1110,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1181,7 +1123,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1201,7 +1142,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1216,7 +1156,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1238,7 +1177,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1258,7 +1196,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1273,7 +1210,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1307,7 +1243,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1322,7 +1257,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1336,7 +1270,6 @@ [ "fsdp", 
"fsdp_transpose", - "sequence", "context", "expert" ], @@ -1364,7 +1297,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1379,7 +1311,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1406,7 +1337,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1421,7 +1351,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1455,7 +1384,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1470,7 +1398,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1484,7 +1411,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1512,7 +1438,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1527,7 +1452,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1554,7 +1478,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1569,7 +1492,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1583,7 +1505,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1605,7 +1526,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1620,7 +1540,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1635,7 +1554,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -1661,7 +1579,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", 
@@ -1676,7 +1593,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1709,7 +1625,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1724,7 +1639,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1739,7 +1653,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -1765,7 +1678,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1780,7 +1692,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1813,7 +1724,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1828,7 +1738,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1849,7 +1758,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -1869,7 +1777,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1884,7 +1791,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1916,7 +1822,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1931,7 +1836,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1961,7 +1865,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1976,7 +1879,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2006,7 +1908,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2021,7 +1922,6 @@ 
"stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2035,7 +1935,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2059,7 +1958,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2074,7 +1972,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2094,7 +1991,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2112,7 +2008,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2127,7 +2022,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2148,7 +2042,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2163,7 +2056,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2191,7 +2083,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2206,7 +2097,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2240,7 +2130,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2255,7 +2144,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2269,7 +2157,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2297,7 +2184,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2312,7 +2198,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2326,7 +2211,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", 
"tensor_transpose", "context", "expert" @@ -2346,7 +2230,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2361,7 +2244,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2383,7 +2265,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2403,7 +2284,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2418,7 +2298,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2452,7 +2331,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2467,7 +2345,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2481,7 +2358,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2509,7 +2385,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2524,7 +2399,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2551,7 +2425,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2566,7 +2439,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2600,7 +2472,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2615,7 +2486,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2629,7 +2499,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2657,7 +2526,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2672,7 +2540,6 @@ "stage": 1, 
"fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2699,7 +2566,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2714,7 +2580,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2728,7 +2593,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -2750,7 +2614,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2765,7 +2628,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2780,7 +2642,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2806,7 +2667,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2821,7 +2681,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2854,7 +2713,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2869,7 +2727,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2884,7 +2741,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2910,7 +2766,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2925,7 +2780,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2958,7 +2812,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2973,7 +2826,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2994,7 +2846,6 @@ ], [ "fsdp", - "sequence", 
"tensor_transpose", "context" ] @@ -3014,7 +2865,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3029,7 +2879,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3061,7 +2910,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3076,7 +2924,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3106,7 +2953,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3121,7 +2967,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3151,7 +2996,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3166,7 +3010,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3200,7 +3043,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3215,7 +3057,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3229,7 +3070,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3257,7 +3097,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3272,7 +3111,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3286,7 +3124,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -3306,7 +3143,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3321,7 +3157,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -3343,7 +3178,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -3363,7 +3197,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3378,7 +3211,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3412,7 +3244,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3427,7 +3258,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3441,7 +3271,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3469,7 +3298,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3484,7 +3312,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3511,7 +3338,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3526,7 +3352,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3560,7 +3385,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3575,7 +3399,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3589,7 +3412,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3617,7 +3439,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3632,7 +3453,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3659,7 +3479,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3674,7 +3493,6 @@ "stage": 
1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3688,7 +3506,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -3710,7 +3527,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3725,7 +3541,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3740,7 +3555,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3766,7 +3580,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3781,7 +3594,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3814,7 +3626,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3829,7 +3640,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3844,7 +3654,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3870,7 +3679,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3885,7 +3693,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3918,7 +3725,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3933,7 +3739,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3954,7 +3759,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -3974,7 +3778,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3989,7 +3792,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -4021,7 +3823,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4036,7 +3837,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4066,7 +3866,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4081,7 +3880,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4111,7 +3909,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4126,7 +3923,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4140,7 +3936,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4164,7 +3959,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4179,7 +3973,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4199,7 +3992,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4217,7 +4009,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4232,7 +4023,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4260,7 +4050,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4275,7 +4064,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4309,7 +4097,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4324,7 +4111,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -4338,7 +4124,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4366,7 +4151,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4381,7 +4165,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4395,7 +4178,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -4415,7 +4197,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4430,7 +4211,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4452,7 +4232,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4472,7 +4251,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4487,7 +4265,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4521,7 +4298,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4536,7 +4312,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4550,7 +4325,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4578,7 +4352,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4593,7 +4366,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4620,7 +4392,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4635,7 +4406,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4669,7 +4439,6 @@ "stage", "fsdp", 
"fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4684,7 +4453,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4698,7 +4466,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4726,7 +4493,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4741,7 +4507,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4768,7 +4533,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4783,7 +4547,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4797,7 +4560,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -4819,7 +4581,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4834,7 +4595,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4849,7 +4609,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -4875,7 +4634,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4890,7 +4648,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4923,7 +4680,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4938,7 +4694,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4953,7 +4708,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -4979,7 +4733,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ 
-4994,7 +4747,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5027,7 +4779,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5042,7 +4793,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5063,7 +4813,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -5083,7 +4832,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5098,7 +4846,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5130,7 +4877,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5145,7 +4891,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5175,7 +4920,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5190,7 +4934,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5220,7 +4963,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5235,7 +4977,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5269,7 +5010,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5284,7 +5024,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5298,7 +5037,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -5326,7 +5064,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5341,7 +5078,6 @@ 
"stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5355,7 +5091,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -5375,7 +5110,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5390,7 +5124,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5412,7 +5145,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -5432,7 +5164,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5447,7 +5178,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5481,7 +5211,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5496,7 +5225,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5510,7 +5238,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -5538,7 +5265,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5553,7 +5279,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5580,7 +5305,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5595,7 +5319,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5629,7 +5352,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5644,7 +5366,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5658,7 +5379,6 @@ [ "fsdp", "fsdp_transpose", - 
"sequence", "context", "expert" ], @@ -5686,7 +5406,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5701,7 +5420,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5728,7 +5446,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5743,7 +5460,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5757,7 +5473,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -5779,7 +5494,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5794,7 +5508,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5809,7 +5522,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -5835,7 +5547,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5850,7 +5561,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5883,7 +5593,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5898,7 +5607,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5913,7 +5621,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -5939,7 +5646,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5954,7 +5660,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5987,7 +5692,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6002,7 +5706,6 
@@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6023,7 +5726,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -6043,7 +5745,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6058,7 +5759,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6090,7 +5790,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6105,7 +5804,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6135,7 +5833,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6150,7 +5847,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6180,7 +5876,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6195,7 +5890,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6209,7 +5903,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -6233,7 +5926,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6248,7 +5940,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6268,7 +5959,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -6286,7 +5976,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6301,7 +5990,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, diff --git 
a/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=2/named_shardings.json b/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=2/named_shardings.json index f1fc91b4b1..b3a0c7d967 100644 --- a/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=2/named_shardings.json +++ b/tests/utils/sharding_info/gpt-oss-20b/tpu7x-16/slice_1/rule_default_ici_fsdp_parallelism=-1_ici_expert_parallelism=2/named_shardings.json @@ -7,7 +7,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -22,7 +21,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -43,7 +41,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -58,7 +55,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -86,7 +82,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -101,7 +96,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -135,7 +129,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -150,7 +143,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -164,7 +156,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -192,7 +183,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -207,7 +197,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -221,7 +210,6 @@ [ "fsdp", "fsdp_transpose", - 
"sequence", "tensor_transpose", "context", "expert" @@ -241,7 +229,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -256,7 +243,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -278,7 +264,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -298,7 +283,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -313,7 +297,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -347,7 +330,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -362,7 +344,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -376,7 +357,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -404,7 +384,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -419,7 +398,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -446,7 +424,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -461,7 +438,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -495,7 +471,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -510,7 +485,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -524,7 +498,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -552,7 +525,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -567,7 +539,6 @@ "stage": 1, "fsdp": 8, 
"fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -594,7 +565,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -609,7 +579,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -623,7 +592,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -645,7 +613,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -660,7 +627,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -675,7 +641,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -701,7 +666,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -716,7 +680,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -749,7 +712,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -764,7 +726,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -779,7 +740,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -805,7 +765,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -820,7 +779,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -853,7 +811,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -868,7 +825,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -889,7 +845,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -909,7 +864,6 @@ "stage", 
"fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -924,7 +878,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -956,7 +909,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -971,7 +923,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1001,7 +952,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1016,7 +966,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1046,7 +995,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1061,7 +1009,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1095,7 +1042,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1110,7 +1056,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1124,7 +1069,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1152,7 +1096,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1167,7 +1110,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1181,7 +1123,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1201,7 +1142,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1216,7 +1156,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1238,7 +1177,6 @@ [ "fsdp", "fsdp_transpose", 
- "sequence", "context", "expert" ] @@ -1258,7 +1196,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1273,7 +1210,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1307,7 +1243,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1322,7 +1257,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1336,7 +1270,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1364,7 +1297,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1379,7 +1311,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1406,7 +1337,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1421,7 +1351,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1455,7 +1384,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1470,7 +1398,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1484,7 +1411,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1512,7 +1438,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1527,7 +1452,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1554,7 +1478,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1569,7 +1492,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, 
"tensor": 1, @@ -1583,7 +1505,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -1605,7 +1526,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1620,7 +1540,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1635,7 +1554,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -1661,7 +1579,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1676,7 +1593,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1709,7 +1625,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1724,7 +1639,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1739,7 +1653,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -1765,7 +1678,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1780,7 +1692,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1813,7 +1724,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1828,7 +1738,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1849,7 +1758,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -1869,7 +1777,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1884,7 +1791,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1916,7 +1822,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", 
"context_autoregressive", "tensor", @@ -1931,7 +1836,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1961,7 +1865,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1976,7 +1879,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2006,7 +1908,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2021,7 +1922,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2035,7 +1935,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2059,7 +1958,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2074,7 +1972,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2094,7 +1991,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2112,7 +2008,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2127,7 +2022,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2148,7 +2042,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2163,7 +2056,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2191,7 +2083,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2206,7 +2097,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2240,7 +2130,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", 
"tensor", @@ -2255,7 +2144,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2269,7 +2157,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2297,7 +2184,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2312,7 +2198,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2326,7 +2211,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -2346,7 +2230,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2361,7 +2244,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2383,7 +2265,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2403,7 +2284,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2418,7 +2298,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2452,7 +2331,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2467,7 +2345,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2481,7 +2358,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2509,7 +2385,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2524,7 +2399,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2551,7 +2425,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2566,7 +2439,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, 
"context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2600,7 +2472,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2615,7 +2486,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2629,7 +2499,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2657,7 +2526,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2672,7 +2540,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2699,7 +2566,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2714,7 +2580,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2728,7 +2593,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -2750,7 +2614,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2765,7 +2628,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2780,7 +2642,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2806,7 +2667,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2821,7 +2681,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2854,7 +2713,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2869,7 +2727,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2884,7 +2741,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -2910,7 +2766,6 @@ "stage", 
"fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2925,7 +2780,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2958,7 +2812,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2973,7 +2826,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2994,7 +2846,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -3014,7 +2865,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3029,7 +2879,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3061,7 +2910,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3076,7 +2924,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3106,7 +2953,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3121,7 +2967,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3151,7 +2996,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3166,7 +3010,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3200,7 +3043,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3215,7 +3057,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3229,7 +3070,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3257,7 +3097,6 @@ "stage", "fsdp", "fsdp_transpose", - 
"sequence", "context", "context_autoregressive", "tensor", @@ -3272,7 +3111,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3286,7 +3124,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -3306,7 +3143,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3321,7 +3157,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3343,7 +3178,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -3363,7 +3197,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3378,7 +3211,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3412,7 +3244,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3427,7 +3258,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3441,7 +3271,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3469,7 +3298,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3484,7 +3312,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3511,7 +3338,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3526,7 +3352,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3560,7 +3385,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3575,7 +3399,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -3589,7 +3412,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -3617,7 +3439,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3632,7 +3453,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3659,7 +3479,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3674,7 +3493,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3688,7 +3506,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -3710,7 +3527,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3725,7 +3541,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3740,7 +3555,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3766,7 +3580,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3781,7 +3594,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3814,7 +3626,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3829,7 +3640,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3844,7 +3654,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -3870,7 +3679,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3885,7 +3693,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3918,7 +3725,6 @@ "stage", "fsdp", 
"fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3933,7 +3739,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -3954,7 +3759,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -3974,7 +3778,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -3989,7 +3792,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4021,7 +3823,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4036,7 +3837,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4066,7 +3866,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4081,7 +3880,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4111,7 +3909,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4126,7 +3923,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4140,7 +3936,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4164,7 +3959,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4179,7 +3973,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4199,7 +3992,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4217,7 +4009,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4232,7 +4023,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 
1, "tensor": 1, @@ -4260,7 +4050,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4275,7 +4064,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4309,7 +4097,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4324,7 +4111,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4338,7 +4124,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4366,7 +4151,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4381,7 +4165,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4395,7 +4178,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -4415,7 +4197,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4430,7 +4211,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4452,7 +4232,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -4472,7 +4251,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4487,7 +4265,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4521,7 +4298,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4536,7 +4312,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4550,7 +4325,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4578,7 +4352,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", 
"context", "context_autoregressive", "tensor", @@ -4593,7 +4366,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4620,7 +4392,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4635,7 +4406,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4669,7 +4439,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4684,7 +4453,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4698,7 +4466,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -4726,7 +4493,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4741,7 +4507,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4768,7 +4533,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4783,7 +4547,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4797,7 +4560,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -4819,7 +4581,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4834,7 +4595,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4849,7 +4609,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -4875,7 +4634,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4890,7 +4648,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, 
"tensor": 1, @@ -4923,7 +4680,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4938,7 +4694,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -4953,7 +4708,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -4979,7 +4733,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -4994,7 +4747,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5027,7 +4779,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5042,7 +4793,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5063,7 +4813,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -5083,7 +4832,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5098,7 +4846,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5130,7 +4877,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5145,7 +4891,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5175,7 +4920,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5190,7 +4934,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5220,7 +4963,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5235,7 +4977,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5269,7 +5010,6 
@@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5284,7 +5024,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5298,7 +5037,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -5326,7 +5064,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5341,7 +5078,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5355,7 +5091,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -5375,7 +5110,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5390,7 +5124,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5412,7 +5145,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -5432,7 +5164,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5447,7 +5178,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5481,7 +5211,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5496,7 +5225,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5510,7 +5238,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -5538,7 +5265,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5553,7 +5279,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5580,7 +5305,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", 
"tensor", @@ -5595,7 +5319,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5629,7 +5352,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5644,7 +5366,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5658,7 +5379,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -5686,7 +5406,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5701,7 +5420,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5728,7 +5446,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5743,7 +5460,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5757,7 +5473,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "tensor_transpose", "context", "expert" @@ -5779,7 +5494,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5794,7 +5508,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5809,7 +5522,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -5835,7 +5547,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5850,7 +5561,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5883,7 +5593,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5898,7 +5607,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5913,7 +5621,6 @@ 
"stage", [ "fsdp", - "sequence", "tensor_transpose", "context" ], @@ -5939,7 +5646,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -5954,7 +5660,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -5987,7 +5692,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6002,7 +5706,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6023,7 +5726,6 @@ ], [ "fsdp", - "sequence", "tensor_transpose", "context" ] @@ -6043,7 +5745,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6058,7 +5759,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6090,7 +5790,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6105,7 +5804,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6135,7 +5833,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6150,7 +5847,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6180,7 +5876,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6195,7 +5890,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -6209,7 +5903,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -6233,7 +5926,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6248,7 +5940,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -6268,7 +5959,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -6286,7 +5976,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -6301,7 +5990,6 @@ "stage": 1, "fsdp": 8, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, diff --git a/tests/utils/sharding_info/qwen3-0.6b/tpu7x-16/slice_1/rule_default/named_shardings.json b/tests/utils/sharding_info/qwen3-0.6b/tpu7x-16/slice_1/rule_default/named_shardings.json index 6208b4ba80..b86911af98 100644 --- a/tests/utils/sharding_info/qwen3-0.6b/tpu7x-16/slice_1/rule_default/named_shardings.json +++ b/tests/utils/sharding_info/qwen3-0.6b/tpu7x-16/slice_1/rule_default/named_shardings.json @@ -7,7 +7,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -22,7 +21,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -43,7 +41,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -58,7 +55,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -86,7 +82,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -101,7 +96,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -114,7 +108,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -141,7 +134,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -156,7 +148,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -169,7 +160,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", 
"context", "expert" @@ -196,7 +186,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -211,7 +200,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -231,7 +219,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -251,7 +238,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -266,7 +252,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -296,7 +281,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -311,7 +295,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -341,7 +324,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -356,7 +338,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -370,7 +351,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -398,7 +378,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -413,7 +392,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -443,7 +421,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -458,7 +435,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -480,7 +456,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -500,7 +475,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -515,7 +489,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - 
"sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -529,7 +502,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -557,7 +529,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -572,7 +543,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -602,7 +572,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -617,7 +586,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -631,7 +599,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -659,7 +626,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -674,7 +640,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -694,7 +659,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -712,7 +676,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -727,7 +690,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -748,7 +710,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -763,7 +724,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -791,7 +751,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -806,7 +765,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -819,7 +777,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -846,7 +803,6 @@ "stage", "fsdp", 
"fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -861,7 +817,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -874,7 +829,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -901,7 +855,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -916,7 +869,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -936,7 +888,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -956,7 +907,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -971,7 +921,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1001,7 +950,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1016,7 +964,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1046,7 +993,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1061,7 +1007,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1075,7 +1020,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1103,7 +1047,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1118,7 +1061,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1148,7 +1090,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1163,7 +1104,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, 
"context_autoregressive": 1, "tensor": 1, @@ -1185,7 +1125,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1205,7 +1144,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1220,7 +1158,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1234,7 +1171,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1262,7 +1198,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1277,7 +1212,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1307,7 +1241,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1322,7 +1255,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1336,7 +1268,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1364,7 +1295,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1379,7 +1309,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1399,7 +1328,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1417,7 +1345,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1432,7 +1359,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1460,7 +1386,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1475,7 +1400,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1488,7 +1412,6 @@ "partition_spec": [ [ "fsdp", - 
"sequence", "tensor_transpose", "context", "expert" @@ -1515,7 +1438,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1530,7 +1452,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1543,7 +1464,6 @@ "partition_spec": [ [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1570,7 +1490,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1585,7 +1504,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1605,7 +1523,6 @@ "stage", [ "fsdp", - "sequence", "tensor_transpose", "context", "expert" @@ -1625,7 +1542,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1640,7 +1556,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1670,7 +1585,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1685,7 +1599,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1715,7 +1628,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1730,7 +1642,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1744,7 +1655,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1772,7 +1682,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1787,7 +1696,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1817,7 +1725,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", 
"tensor", @@ -1832,7 +1739,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1854,7 +1760,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -1874,7 +1779,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1889,7 +1793,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1903,7 +1806,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -1931,7 +1833,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1946,7 +1847,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -1976,7 +1876,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -1991,7 +1890,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2005,7 +1903,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ], @@ -2033,7 +1930,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2048,7 +1944,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1, @@ -2068,7 +1963,6 @@ [ "fsdp", "fsdp_transpose", - "sequence", "context", "expert" ] @@ -2086,7 +1980,6 @@ "stage", "fsdp", "fsdp_transpose", - "sequence", "context", "context_autoregressive", "tensor", @@ -2101,7 +1994,6 @@ "stage": 1, "fsdp": 16, "fsdp_transpose": 1, - "sequence": 1, "context": 1, "context_autoregressive": 1, "tensor": 1,