Finetune Qwen3.6-27B in NeMo Automodel #1997
HuiyingLi started this conversation in Show and tell
`Qwen/Qwen3.6-27B` is Alibaba's latest open-source 27B dense vision-language model. It features a hybrid linear + full attention architecture for ultra-long context processing (`max_position_embeddings=262144`), extensible to ~1M tokens with YaRN scaling, and it carries `<think>...</think>` reasoning traces across turns. It shares the `Qwen3_5ForConditionalGeneration` architecture with Qwen3.5, with updated weights and post-training.
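As a quick sanity check outside of NeMo Automodel, the checkpoint can be loaded with plain `transformers`. This is a minimal sketch only: the `AutoModelForImageTextToText` class and the exact `rope_scaling` keys Qwen3.6 expects are assumptions here, so verify them against the model card.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Qwen/Qwen3.6-27B"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    # Extend the native 262,144-token context toward ~1M tokens with YaRN.
    # These rope_scaling keys are assumptions; check the model card.
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 262144,
    },
)
```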
## Parallel Setup

We provide a fine-tuning recipe for Qwen3.6-27B that runs on a single node (8× H100 GPUs) using FSDP2 with activation checkpointing. The configuration uses TP=1, PP=1, and EP=1, i.e., the model is fully data-parallel across 8 GPUs.
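For intuition, the FSDP2 + activation-checkpointing setup amounts to roughly the following. This is a hypothetical sketch, not the recipe's actual code; in particular, `model.language_model.layers` is an assumed module path.

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)


def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    # bf16 compute with fp32 gradient reduction, a common FSDP2 choice.
    mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16,
                              reduce_dtype=torch.float32)
    layers = model.language_model.layers  # assumed module path
    for i in range(len(layers)):
        layers[i] = checkpoint_wrapper(layers[i])  # activation checkpointing
        fully_shard(layers[i], mp_policy=mp)       # shard each block (FSDP2)
    fully_shard(model, mp_policy=mp)  # root wrap picks up remaining params
    return model
```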
The recipe freezes the audio tower and trains both the vision tower and the language model.
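The equivalent freezing policy in plain PyTorch looks like the sketch below; the `audio_tower` attribute name is an assumption, and the recipe itself expresses this through its configuration rather than user code.

```python
import torch


def configure_trainable_params(model: torch.nn.Module) -> None:
    # Train the vision tower and language model by default...
    for p in model.parameters():
        p.requires_grad = True
    # ...then freeze the audio tower (attribute name assumed).
    for p in model.audio_tower.parameters():
        p.requires_grad = False
```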
For scaled-up training, we also support a Tensor Parallel + Pipeline Parallel configuration based on the matching Qwen3.5-27B TP4+PP4 recipe, which runs with TP=4 and PP=4 across 2 nodes (16 GPUs). Swap the `pretrained_model_name_or_path` to `Qwen/Qwen3.6-27B` to apply the same setup to Qwen3.6-27B.
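Since only the checkpoint id changes, the swap can be scripted. The file names and the `model` key path below are illustrative assumptions, not NeMo Automodel API.

```python
import yaml

# Load the Qwen3.5-27B TP4+PP4 recipe (filename assumed)...
with open("qwen3_5_27b_tp4_pp4.yaml") as f:
    recipe = yaml.safe_load(f)

# ...repoint it at Qwen3.6-27B (key path assumed)...
recipe["model"]["pretrained_model_name_or_path"] = "Qwen/Qwen3.6-27B"

# ...and save it as a new recipe.
with open("qwen3_6_27b_tp4_pp4.yaml", "w") as f:
    yaml.safe_dump(recipe, f)
```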
## Data

We use the MedPix-VQA dataset as an example. MedPix-VQA is a medical visual question-answering dataset built from the MedPix radiology image archive, pairing clinical images with diagnostic Q&A.
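For reference, the dataset can be pulled straight from the Hugging Face Hub. The Hub id below is an assumption on our part; substitute whichever MedPix-VQA copy your data config actually points at.

```python
from datasets import load_dataset

# Hub id assumed; adjust to the copy your recipe references.
ds = load_dataset("mmoukouba/MedPix-VQA", split="train")
print(ds[0].keys())  # expect an image field plus question/answer text
```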
Below is the loss curve obtained when fine-tuning on MedPix-VQA with this recipe:

*(figure: training loss curve)*