Finetune Gemma4 family of models with NeMo Automodel #2005
athitten started this conversation in Show and tell
NeMo Automodel added day-0 support for the Gemma4 family of models. Gemma4 is the latest family of open-source models from Google, targeting different classes of hardware, from phones and laptops to data-center GPUs, depending on the use case. The small models google/gemma-4-E2B-it and google/gemma-4-E4B-it are trimodal, supporting audio, image, and text inputs, while the larger ones, google/gemma-4-31B-it and google/gemma-4-26B-A4B-it, support image and text inputs.
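As a quick sanity check before fine-tuning, the smaller trimodal checkpoints can be exercised through the standard Hugging Face image-text-to-text interface. The snippet below is a minimal sketch, assuming the gemma-4 checkpoints expose an `AutoProcessor` and load under `AutoModelForImageTextToText`; the exact model classes, chat-template keys, and the image URL are illustrative, not confirmed against the released checkpoints.

```python
# Minimal sketch: image + text generation with a small Gemma4 checkpoint.
# Assumes the checkpoint follows the standard transformers image-text-to-text
# interface; class names, template keys, and the image URL are illustrative.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-E2B-it"  # small trimodal variant from the post

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chest_xray.png"},
            {"type": "text", "text": "Describe the main finding in this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```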
Key features of Gemma4:
- Effective parameter counts: `gemma-4-E2B-it` and `gemma-4-E4B-it` have effective parameter counts of only 2B and 4B respectively.
- Context length: the E2B and E4B variants support 128K tokens (`max_position_embeddings=131072`), while 31B and 26B-A4B extend to 256K tokens (`max_position_embeddings=262144`).
- All variants use the split-RoPE scheme introduced in Gemma3: θ = 10,000 (default RoPE) on sliding layers and θ = 1,000,000 with a 0.25 proportional partial-rotary factor on global layers.
- `gemma-4-E2B-it`: 35 layers, hidden 1536, FFN intermediate 6144 (uses a double-wide MLP variant), 8 Q heads / 1 KV head.
- `gemma-4-E4B-it`: 42 layers, hidden 2560, FFN intermediate 10240, 8 Q heads / 2 KV heads.
- `gemma-4-31B-it`: 60 layers, hidden 5376, FFN intermediate 21504, 32 Q heads with 16 KV heads on sliding layers and 4 KV heads on global layers.
- `gemma-4-26B-A4B-it`: 30 layers, hidden 2816, and a Mixture-of-Experts FFN with 128 experts, top-8 routing, expert intermediate size 704 (dense FFN intermediate 2112); 16 Q heads with 8 KV heads on sliding layers and 2 KV heads on global layers.
- KV sharing: `gemma-4-E2B-it` shares KV projections across 20 of its 35 layers and `gemma-4-E4B-it` across 18 of its 42 layers (`num_kv_shared_layers`), substantially reducing KV-cache memory for on-device inference. The 31B and 26B-A4B variants do not share KV across layers.
- Sliding-attention layers use `head_dim=256`, while global-attention layers use a wider `head_dim=512`, giving global layers more capacity to aggregate long-range information while keeping local layers cheap.
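Several of the fields above (context length, head dimensions, KV sharing, MoE routing) are plain entries in the Hugging Face model config, so they can be checked directly once the checkpoints are available. Here is a minimal sketch, assuming the attribute names match those quoted above; the exact keys, and whether they live on a nested `text_config`, may differ in the released configs.

```python
# Minimal sketch: inspect the architecture fields quoted above from the HF config.
# Attribute names are taken from the post and are assumptions, not confirmed;
# on multimodal checkpoints they may live under config.text_config instead.
from transformers import AutoConfig

for model_id in ("google/gemma-4-E2B-it", "google/gemma-4-31B-it"):
    config = AutoConfig.from_pretrained(model_id)
    text_cfg = getattr(config, "text_config", config)  # unwrap multimodal wrapper if present
    for key in (
        "max_position_embeddings",  # 131072 vs 262144 context length
        "num_hidden_layers",
        "hidden_size",
        "intermediate_size",
        "num_attention_heads",
        "num_key_value_heads",
        "head_dim",
        "num_kv_shared_layers",     # KV sharing on E2B/E4B only
    ):
        print(model_id, key, getattr(text_cfg, key, "n/a"))
```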
Finetuning Recipes:
We provide full fine-tuning as well as PEFT recipes for all Gemma4 variants:
- `gemma-4-E2B-it`: `gemma4_2b.yaml` for fine-tuning and `gemma4_2b_peft.yaml` for PEFT with LoRA.
- `gemma-4-E4B-it`: `gemma4_4b.yaml` for fine-tuning and `gemma4_4b_peft.yaml` for PEFT with LoRA.
- `gemma-4-31B-it`: `gemma4_31b.yaml` for fine-tuning and `gemma4_31b_peft.yaml` for PEFT with LoRA. Both recipes use FSDP2 with activation checkpointing.
- `gemma-4-26B-A4B-it`: `gemma4_26b_a4b_moe.yaml` for fine-tuning and `gemma4_26b_a4b_moe_peft.yaml` for PEFT with LoRA. Both recipes use FSDP2 with expert parallelism (EP=8, 16 experts per GPU).
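For readers less familiar with LoRA, the sketch below shows what the PEFT recipes do conceptually, expressed with the Hugging Face peft library. The Automodel recipes configure this through YAML rather than code, and the rank, alpha, and target modules here are illustrative choices, not the recipe defaults.

```python
# Conceptual sketch of LoRA fine-tuning using the Hugging Face peft library.
# The NeMo Automodel PEFT recipes express the same idea via YAML; the rank,
# alpha, target modules, and auto class below are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-4-E2B-it")

lora_cfg = LoraConfig(
    r=16,                          # low-rank adapter dimension (assumption)
    lora_alpha=32,                 # scaling factor (assumption)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```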
Data
We use the MedPix-VQA dataset as an example. MedPix-VQA is a medical visual question-answering dataset built from the MedPix radiology image archive, pairing clinical images with diagnostic Q&A.
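For experimentation outside the recipes, the dataset can be pulled and inspected with the `datasets` library. A minimal sketch; the hub ID below is a placeholder, so substitute whichever MedPix-VQA source the recipes point at, and the field names are assumptions about a typical VQA schema.

```python
# Minimal sketch: peek at a visual question-answering dataset with the
# huggingface `datasets` library. The hub id is a placeholder, not the one
# the recipes use; field names assume a typical image/question/answer schema.
from datasets import load_dataset

ds = load_dataset("your-org/MedPix-VQA", split="train")  # placeholder id

sample = ds[0]
print(sample.keys())            # expect image / question / answer style fields
print(sample.get("question"))
print(sample.get("answer"))
```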
Below are the loss curves obtained when fine-tuning on MedPix-VQA with these recipes:
While these are single-node recipes, to further scale the largest dense model in the family (`gemma-4-31B-it`), we also provide recipes with tensor and pipeline parallelism: `gemma4_31b_tp4.yaml`, `gemma4_31b_tp4_pp2.yaml`, and `gemma4_31b_tp4_pp4.yaml`.
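The suffixes encode the parallel layout: tp4 shards each layer across 4 GPUs with tensor parallelism, and pp2/pp4 additionally split the layer stack into 2 or 4 pipeline stages. Below is a minimal sketch of the device-mesh geometry implied by a tp4_pp2 layout, purely to illustrate the arrangement; the recipes construct their own mesh internally, and the dimension names and ordering here are assumptions.

```python
# Illustrative only: the device-mesh geometry implied by a tp4_pp2 layout on
# 8 GPUs. The Automodel recipes build their own mesh internally; dimension
# names and ordering here are assumptions for illustration.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")  # launched with torchrun across 8 ranks

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("pp", "tp"))

pp_mesh = mesh["pp"]  # 2 pipeline stages, each holding a slice of the layers
tp_mesh = mesh["tp"]  # 4-way tensor-parallel group within each stage
print(f"rank {dist.get_rank()}: pp={pp_mesh.get_local_rank()} tp={tp_mesh.get_local_rank()}")
```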
Many thanks to @HuiyingLi, @khazic, @sharonyu-115, and @akoumpa for all the contributions!!