Finetune Qwen3.6-27B in NeMo Automodel #1997
HuiyingLi started this conversation in Show and tell
`Qwen/Qwen3.6-27B` is Alibaba's latest open-source 27B dense vision-language model. It features a hybrid linear + full attention architecture for ultra-long context processing (`max_position_embeddings=262144`), extensible to ~1M tokens with YaRN scaling, and it carries `<think>...</think>` reasoning traces across turns. It shares the `Qwen3_5ForConditionalGeneration` architecture with Qwen3.5, with updated weights and post-training.
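As a quick sanity check outside of NeMo Automodel, the checkpoint can be loaded with plain `transformers`. This is a minimal sketch only: the `AutoModelForImageTextToText` class and the exact `rope_scaling` keys Qwen3.6 expects are assumptions here, so verify them against the model card.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Qwen/Qwen3.6-27B"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    # Extend the native 262,144-token context toward ~1M tokens with YaRN.
    # These rope_scaling keys are assumptions; check the model card.
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 262144,
    },
)
```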
## Parallel Setup

We provide a fine-tuning recipe for Qwen3.6-27B that runs on a single node (8× H100 GPUs) using FSDP2 with activation checkpointing. The configuration uses TP=1, PP=1, and EP=1, i.e., the model is fully data-parallel across 8 GPUs.
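For intuition, the FSDP2 + activation-checkpointing setup amounts to roughly the following. This is a hypothetical sketch, not the recipe's actual code; in particular, `model.language_model.layers` is an assumed module path.

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)


def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    # bf16 compute with fp32 gradient reduction, a common FSDP2 choice.
    mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16,
                              reduce_dtype=torch.float32)
    layers = model.language_model.layers  # assumed module path
    for i in range(len(layers)):
        layers[i] = checkpoint_wrapper(layers[i])  # activation checkpointing
        fully_shard(layers[i], mp_policy=mp)       # shard each block (FSDP2)
    fully_shard(model, mp_policy=mp)  # root wrap picks up remaining params
    return model
```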
The recipe freezes the audio tower and trains both the vision tower and the language model.
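The equivalent freezing policy in plain PyTorch looks like the sketch below; the `audio_tower` attribute name is an assumption, and the recipe itself expresses this through its configuration rather than user code.

```python
import torch


def configure_trainable_params(model: torch.nn.Module) -> None:
    # Train the vision tower and language model by default...
    for p in model.parameters():
        p.requires_grad = True
    # ...then freeze the audio tower (attribute name assumed).
    for p in model.audio_tower.parameters():
        p.requires_grad = False
```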
For scaled-up training, we also support a Tensor Parallel + Pipeline Parallel configuration based on the matching Qwen3.5-27B TP4+PP4 recipe, which runs with TP=4 and PP=4 across 2 nodes (16 GPUs). Swap the `pretrained_model_name_or_path` to `Qwen/Qwen3.6-27B` to apply the same setup to Qwen3.6-27B.
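Since only the checkpoint id changes, the swap can be scripted. The file names and the `model` key path below are illustrative assumptions, not NeMo Automodel API.

```python
import yaml

# Load the Qwen3.5-27B TP4+PP4 recipe (filename assumed)...
with open("qwen3_5_27b_tp4_pp4.yaml") as f:
    recipe = yaml.safe_load(f)

# ...repoint it at Qwen3.6-27B (key path assumed)...
recipe["model"]["pretrained_model_name_or_path"] = "Qwen/Qwen3.6-27B"

# ...and save it as a new recipe.
with open("qwen3_6_27b_tp4_pp4.yaml", "w") as f:
    yaml.safe_dump(recipe, f)
```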
## Data

We use the MedPix-VQA dataset as an example. MedPix-VQA is a medical visual question-answering dataset built from the MedPix radiology image archive, pairing clinical images with diagnostic Q&A.
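For reference, the dataset can be pulled straight from the Hugging Face Hub. The Hub id below is an assumption on our part; substitute whichever MedPix-VQA copy your data config actually points at.

```python
from datasets import load_dataset

# Hub id assumed; adjust to the copy your recipe references.
ds = load_dataset("mmoukouba/MedPix-VQA", split="train")
print(ds[0].keys())  # expect an image field plus question/answer text
```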
Below is the loss curve obtained when fine-tuning on MedPix-VQA with this recipe:

*(figure: training loss curve)*