Is your feature request related to a problem? Please describe.
When working with long-context models, memory and communication overhead quickly become a bottleneck.
Context Parallelism (CP) is a useful approach to scale sequence length by partitioning activations across GPUs.
In the NeMo framework, CP is supported via the Megatron strategy (e.g., setting context_parallel_size > 1), which can significantly improve long-context training efficiency (per the NVIDIA NeMo documentation).
However, it is unclear whether this is supported for Gemma 4 models, especially given that some variants (e.g., MoE) may have additional constraints.
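For reference, this is a minimal sketch of how CP is typically enabled in NeMo 2.x through MegatronStrategy. The parallelism arguments are the documented NeMo knobs; whether they are actually honored for Gemma 4 (and its MoE variants) is exactly what this issue is asking.

```python
# Sketch: enabling Context Parallelism in NeMo 2.x via MegatronStrategy.
# Parallel sizes are illustrative; devices must be divisible by TP * CP.
from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=2,    # TP degree
    pipeline_model_parallel_size=1,  # PP degree
    context_parallel_size=2,         # CP degree: shards the sequence dimension across GPUs
    sequence_parallel=True,          # optional, complements TP for activation memory
)

trainer = nl.Trainer(
    devices=8,
    num_nodes=1,
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)
```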
Describe the solution you'd like
I would like to understand:
- Does Gemma 4 support Context Parallelism (CP) in NeMo / AutoModel?
- If not, are there plans to support it in the future?
- Are there any limitations (e.g., MoE variants not supporting CP)?
Describe alternatives you've considered
Currently, alternatives include:
- Tensor Parallelism / Pipeline Parallelism
- Sequence Parallelism
However, these approaches are not as effective as CP for scaling long context lengths.
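For completeness, this is roughly the fallback configuration we use today without CP (values are illustrative, not a recommendation): tensor, pipeline, and sequence parallelism only.

```python
# Sketch of the current fallback (no CP): TP + PP + SP via MegatronStrategy.
from nemo import lightning as nl

fallback_strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=4,    # shard attention/MLP weights across 4 GPUs
    pipeline_model_parallel_size=2,  # split layers into 2 pipeline stages
    sequence_parallel=True,          # shard LayerNorm/dropout activations along TP ranks
    context_parallel_size=1,         # CP disabled; this is the setting we would like to raise for Gemma 4
)
```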