Finetune DeepSeek V4 Flash with NeMo Automodel #2052

khazic · 2026-04-25T08:22:37Z

NeMo Automodel now supports DeepSeek V4 Flash (deepseek-ai/DeepSeek-V4-Flash) — DeepSeek's latest fine-grained MoE language model with a hybrid attention zoo and hash-routing first layers. The PR (#2039) adds the model definition, state-dict adapter, V4-aware pipeline-parallel forward, a checkpoint loader path for FP4 e2m1fn / FP8 e8m0fnu / FP8 e5m2 dtypes, and finetune recipes.

Key features of DeepSeek V4 Flash

All-MoE stack with 43 transformer layers: there are no dense MLP layers; every block uses an MoE FFN with 256 routed experts + 1 shared expert and top-6 routing.
MoE routing: sqrtsoftplus scoring with noaux_tc topk method and a clamped SwiGLU activation on routed experts (swiglu_limit=10.0) — both new branches in NeMo Automodel's shared Gate.forward and MoE activation dispatch.
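As a rough illustration of the routing path described above, the sketch below scores experts with sqrt(softplus(logit)), selects the top-k using a bias that affects selection only (a common reading of noaux_tc), and clamps the SwiGLU inputs at swiglu_limit. The function names, the exact clamping rule (gate capped from above, up projection clamped on both sides), and the weight renormalization are assumptions made for illustration; they are not taken from NeMo Automodel's Gate.forward.

```python
import math

def sqrtsoftplus(x):
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return math.sqrt(softplus)

def route_top_k(logits, selection_bias, k):
    """Pick k experts per token; returns (expert indices, normalized weights)."""
    scores = [sqrtsoftplus(z) for z in logits]
    # Assumption: the bias only influences which experts are chosen,
    # while the combine weights come from the unbiased scores.
    biased = [s + b for s, b in zip(scores, selection_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights

def silu(x):
    # Stable SiLU (x * sigmoid(x)) that avoids overflow for large |x|.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def clamped_swiglu(gate, up, limit=10.0):
    # Assumption: gate is capped from above, up is clamped to [-limit, limit].
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    return silu(gate) * up

experts, weights = route_top_k([0.0, 2.0, -1.0, 1.0], [0.0, 0.0, 0.0, 0.0], k=2)
print(experts, weights)
```

With zero bias the two highest-scoring experts are selected; a large bias on one expert changes which experts are chosen without inflating that expert's combine weight.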
Hybrid per-layer attention via compress_ratios:
- compress_ratio = 0 → pure Sliding-Window Attention (SWA) with a learned per-head attention sink.
- compress_ratio = 4 → Compressed Sparse Attention (CSA): a Compressor in overlap mode pools 2*ratio raw tokens per compressed token, an Indexer selects the top-k most relevant compressed positions per query, and an explicit additive [B, 1, S, P_total] mask enforces causal correctness.
- compress_ratio = 128 → Hierarchical Compressed Attention (HCA): Compressor only (no Indexer), non-overlap pooling, deterministic p < (q+1) // ratio causal mask.
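The HCA rule p < (q+1) // ratio can be read as "a compressed position becomes visible only once every raw token it pools is at or before the query". Below is a minimal sketch of the corresponding additive mask, assuming non-overlap pooling yields seq_len // ratio compressed positions. The helper names are hypothetical and the batch and head dimensions are omitted.

```python
def hca_visible(q, p, ratio):
    # Compressed position p pools raw tokens [p * ratio, (p + 1) * ratio - 1].
    # It is visible to query q only when that whole block is in the past.
    return p < (q + 1) // ratio

def hca_additive_mask(seq_len, ratio):
    """Return a [seq_len][num_compressed] mask: 0.0 = attend, -inf = blocked."""
    num_compressed = seq_len // ratio  # assumption: non-overlap pooling
    neg_inf = float("-inf")
    return [
        [0.0 if hca_visible(q, p, ratio) else neg_inf for p in range(num_compressed)]
        for q in range(seq_len)
    ]

for q, row in enumerate(hca_additive_mask(seq_len=8, ratio=4)):
    print(q, row)
```

Early queries (q < ratio - 1) see no compressed positions at all under this rule, so in practice these columns would need to be combined with other keys (for example a local window) before the softmax; that combination is not shown here.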
Dual RoPE bases: theta=10000 for compress_ratio==0 layers and theta=160000 (with YaRN scaling) for compress_ratio>0 layers; on the latter, the compress-rope is applied to both the main attention Q/KV and the Compressor sub-module. Rotations are encoded as interleaved pairs (view_as_complex style) to match the released checkpoint.
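To make the interleaved convention concrete: adjacent elements (x[0], x[1]), (x[2], x[3]), ... are each treated as one complex number and rotated, as opposed to the half-split layout that pairs x[i] with x[i + d/2]. The sketch below is a plain-Python illustration with a hypothetical function name. YaRN scaling is deliberately left out.

```python
import math

def rope_interleaved(x, pos, theta=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by pos * theta^(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        # Same as multiplying the complex number (a + bj) by exp(j * angle).
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (here 2).
print(dot(rope_interleaved(q, 5), rope_interleaved(k, 3)))
print(dot(rope_interleaved(q, 2), rope_interleaved(k, 0)))
```

Loading a checkpoint trained with one pairing convention into code that uses the other silently permutes the rotated dimensions, which is why the layout has to match the released weights.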
Attention shape: GQA with a single KV head (num_key_value_heads=1) broadcast to all 64 attention heads, with Q-LoRA (q_lora_rank=1024) and grouped O-LoRA (o_lora_rank=1024, o_groups=8); this is not MLA. A per-head, non-learnable rsqrt scaling is applied to Q after wq_b, matching the inference reference.
Hash-routing first layers: the first num_hash_layers (default 3) blocks use a DeepseekV4HashGate with a tid2eid lookup table for token→expert routing, instead of the score-based gate. input_ids is threaded through DeepseekV4Model and the V4-aware pipeline forward; under PP, hash layers live on stage 0 where input_ids is available.
Hyper-Connections (HC): every block maintains hc_mult=4 copies of the hidden state, mixed via a learned col-norm-first Sinkhorn router (hc_split_sinkhorn). The mixing terms are pre = sigmoid + eps, post = 2 * sigmoid (no +eps), and comb = softmax(dim=-1) + eps followed by Sinkhorn normalization, which yields a doubly-stochastic mixing matrix per block.
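The comb branch can be sketched as follows: a row-wise softmax plus eps gives a strictly positive matrix, and alternating column and row normalization (columns first) drives it toward a doubly-stochastic one. This is an illustrative pure-Python version, not hc_split_sinkhorn itself; the iteration count and the eps value are placeholder assumptions.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def sinkhorn_comb(logits, iters=50, eps=1e-6):
    """Map an n x n logit matrix to an (approximately) doubly-stochastic matrix."""
    n = len(logits)
    mat = [[v + eps for v in softmax(row)] for row in logits]
    for _ in range(iters):
        # Column normalization first ("col-norm-first").
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        # Then row normalization.
        row_sums = [sum(row) for row in mat]
        mat = [[mat[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    return mat

logits = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 0.5],
    [2.0, 1.0, 0.0, 1.0],
    [0.5, 2.0, 1.0, 0.0],
]
mix = sinkhorn_comb(logits)
print([round(sum(row), 6) for row in mix])                              # row sums
print([round(sum(mix[i][j] for i in range(4)), 6) for j in range(4)])  # column sums
```

Because rows and columns both sum to (approximately) one, mixing the hc_mult=4 hidden-state copies with such a matrix neither amplifies nor drops total signal across copies.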
Optional Multi-token prediction (MTP) layer(s) via num_nextn_predict_layers (disabled by default in the validate harness, configurable for full training).
Context window: max_position_embeddings = 1,048,576 (1M tokens).

Checkpoint format support

The released DSV4-Flash safetensors mix several quantization formats; the state-dict adapter handles all of them transparently:

Routed experts: FP4 e2m1fn packed two values per int8 byte, with per-row 32-col FP8 e8m0fnu scales — unpacked on load, re-emitted in matching packed placeholders on to_hf so DCP shape/dtype validation lines up with on-disk layout.
Shared experts + non-expert weights: standard FP8 e4m3fn 128×128 block scales.
Hash layers' gate has no bias on disk: the adapter reads num_hash_layers from the checkpoint's config.json and drops the corresponding bias keys before DCP load.
Indexer / Compressor key flattening: on disk the Indexer sits as a sibling of the Compressor with its own nested compressor (indexer.compressor.{ape,norm,wgate,wkv} + indexer.{wq_b,weights_proj}); the adapter renames these to land at our compressor.indexer.* flat layout.
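For the routed-expert format in the first item above, the following sketch shows what "unpacked on load" amounts to numerically. It uses the standard E2M1 value table (0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit) and E8M0 power-of-two scales (2^(e-127)). The nibble order (low nibble first) and the helper names are assumptions for illustration; the adapter's actual implementation may differ.

```python
# Magnitudes for the 3 low bits of an FP4 E2M1 code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble):
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]

def decode_e8m0(byte):
    # E8M0 stores only a biased exponent: value = 2^(byte - 127).
    # The NaN encoding (255) is not handled in this sketch.
    return 2.0 ** (byte - 127)

def unpack_row(packed, scales, block=32):
    """Dequantize one row: 2 FP4 values per byte, one E8M0 scale per 32 columns."""
    values = []
    for b in packed:
        values.append(decode_e2m1(b & 0xF))         # assumption: low nibble first
        values.append(decode_e2m1((b >> 4) & 0xF))  # then high nibble
    return [v * decode_e8m0(scales[i // block]) for i, v in enumerate(values)]

# 16 packed bytes -> 32 columns sharing one scale of 2^(128 - 127) = 2.
row = unpack_row(bytes([0x21] * 16), scales=[128])
print(row[:4])
```

The reverse direction (to_hf) only needs tensors with the packed shape and dtype so that DCP validation matches the on-disk layout, as described above.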

A new in-tree HuggingFaceStorageReader recognizes F8_E8M0 / F8_E5M2 dtypes (the upstream reader silently dropped them), restoring DCP metadata on every rank for these checkpoints.

Finetuning recipes

Two recipes in examples/llm_finetune/deepseek_v4/:

deepseek_v4_flash_validate.yaml — single-node 8×A100 infra validation on a 4-layer truncated harness exercising the full attention zoo (compress_ratios=[0, 0, 4, 128] → SWA / SWA / CSA / HCA), num_hash_layers=2, pp=2 ep=4.
deepseek_v4_flash_hellaswag.yaml — HellaSwag finetune recipe; the yaml header documents how to scale num_hidden_layers / ep_size for the full 43-layer multi-node run.

Layer-parity validation

The bringup was validated against the official DeepSeek inference reference (dsv4flash/inference/model.py) by per-tensor dump bisection. On the 4-layer parity harness (compress_ratios=[0,0,4,128], num_hash_layers=2, PP=1 EP=8):

Final-logits cosine similarity: 0.998 vs. the reference; the top-1 token matches.
Per-block cosine similarity: ≥ 0.987 for every block.
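A per-tensor comparison of this kind can be sketched as below: compute the cosine similarity between each dumped tensor and its reference counterpart, then report the first one that falls under a threshold. The function names, the flat-list tensor representation, and the dictionary layout are hypothetical; the actual dump format used during bring-up is not described in this post.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def first_divergent_tensor(ours, reference, threshold):
    """Return the first tensor name (in reference order) whose cosine
    similarity drops below threshold, or None if all tensors pass."""
    for name, ref_values in reference.items():
        if cosine_similarity(ours[name], ref_values) < threshold:
            return name
    return None

# Toy dumps: flattened activations keyed by a hypothetical tensor name.
reference = {"block0": [2.0, 0.0], "block1": [1.0, 1.0], "block2": [1.0, 0.0]}
ours = {"block0": [1.0, 0.0], "block1": [1.0, 1.0], "block2": [0.0, 1.0]}
print(first_divergent_tensor(ours, reference, threshold=0.987))
```

Walking the tensors in forward order narrows a mismatch down to the first layer or sub-module where the two implementations diverge.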

Data

We use HellaSwag for the end-to-end full finetune. Below is the loss curve from a 43-layer full-finetune run with the full attention zoo (SWA + CSA + HCA) live:

Many thanks to @HuiyingLi and @khazic for all their contributions!
