Does Gemma 4 support Context Parallelism (CP)? #1912

@liyandong001

Description

Is your feature request related to a problem? Please describe.

When working with long-context models, memory and communication overhead quickly become a bottleneck.
Context Parallelism (CP) is a useful approach to scale sequence length by partitioning activations across GPUs.
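To illustrate the idea, here is a toy NumPy sketch (not NeMo code) of what CP does: the activation tensor is sharded along the sequence dimension, so each rank holds only `seq_len / cp_size` tokens, and attention then requires communication (e.g. a ring exchange of keys/values) across ranks:

```python
import numpy as np

# Toy illustration of context parallelism: shard the sequence (context)
# dimension of an activation tensor across a hypothetical CP group.
batch, seq_len, hidden = 2, 8, 4
cp_size = 2  # analogous to context_parallel_size

activations = np.arange(batch * seq_len * hidden, dtype=np.float32)
activations = activations.reshape(batch, seq_len, hidden)

# Each of the cp_size ranks holds seq_len / cp_size tokens.
shards = np.split(activations, cp_size, axis=1)
```

Only activations and attention communication are sharded this way; weights are still replicated or sharded by TP/PP as usual.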

In the NeMo framework, CP is supported via the Megatron strategy (e.g., setting context_parallel_size > 1), which can significantly improve long-context training efficiency.
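For reference, enabling CP via the Megatron strategy looks roughly like the following. This is a hedged sketch based on the NeMo 2.x-style API; the exact class and argument names (`MegatronStrategy`, `context_parallel_size`) may differ by version:

```python
from nemo import lightning as nl

# Sketch only: combine context parallelism with other parallelism dims.
strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
    context_parallel_size=2,  # shard the sequence dimension across 2 GPUs
)
```

The question is whether this flag is honored (and correct) for Gemma 4 models.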

However, it is unclear whether this is supported for Gemma 4 models, especially given that some variants (e.g., MoE) may have additional constraints.


Describe the solution you'd like

I would like to understand:

  • Does Gemma 4 support Context Parallelism (CP) in NeMo / AutoModel?
  • If not, are there plans to support it in the future?
  • Are there any limitations (e.g., MoE variants not supporting CP)?

Describe alternatives you've considered

Currently, alternatives include:

  • Tensor Parallelism / Pipeline Parallelism
  • Sequence Parallelism

However, these approaches do not scale sequence length as effectively as CP, since they do not partition activations along the sequence dimension.
