This guide demonstrates how to train a DLLM (Diffusion Language Model) using the Muon/Adam optimizers in the LMMs Engine framework.
A DLLM is a novel architecture that models sequences with an iterative mask-and-denoise (diffusion) process rather than left-to-right autoregression, aiming at improved training efficiency and performance.
Let:

- upper index $i$: the position of the token within a sequence;
- lower index $t \in [0, 1]$: the denoising step;
  - $t = 0$: the original real text $\boldsymbol{x}_{0}$;
  - $t = 1$: the fully masked sequence $\boldsymbol{x}_{1}$.
As a result, $x_{t}^{i}$ is the $i$-th token of the sequence at denoising step $t$. We refer to the transformation of a sequence from the fully-masked state $\boldsymbol{x}_{1}$ back to the original text $\boldsymbol{x}_{0}$ as the reverse (denoising) process.

We train with an objective that upper-bounds the negative log-likelihood of the model.
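A common form of this objective for mask-based diffusion language models, sketched here under the assumption of the standard uniform-masking formulation (the exact weighting used by the framework may differ), is:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\;\boldsymbol{x}_{0},\;\boldsymbol{x}_{t}}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\left[x_{t}^{i}=\text{[MASK]}\right]\,\log p_{\theta}\!\left(x_{0}^{i}\mid\boldsymbol{x}_{t}\right)\right],
$$

where $L$ is the sequence length, $t$ is drawn uniformly from $[0, 1]$, $\boldsymbol{x}_{t}$ is obtained by masking each token of $\boldsymbol{x}_{0}$ independently with probability $t$, and the indicator restricts the loss to the masked positions.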
In practice, we sample a denoising step $t$ for each training sequence, mask $\boldsymbol{x}_{0}$ accordingly to obtain $\boldsymbol{x}_{t}$, and compute the loss only on the masked positions.
Muon is an optimizer for large language models that improves convergence and training stability over AdamW. Our framework includes both single-GPU and multi-GPU (FSDP2) implementations.
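For intuition, below is a minimal sketch of the Muon update for a single 2-D weight matrix, assuming the usual formulation: SGD-style momentum whose matrix is approximately orthogonalized with a Newton-Schulz iteration before being applied. This is not the framework's implementation; the iteration coefficients and hyperparameters are commonly used defaults rather than values taken from this repo.

```python
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315      # commonly used iteration coefficients
    x = m / (m.norm() + eps)               # normalize so the iteration is stable
    transposed = x.size(0) > x.size(1)
    if transposed:                          # iterate on the "wide" orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(param: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One simplified Muon step for a 2-D weight; 1-D params are typically left to AdamW."""
    momentum_buf.mul_(beta).add_(param.grad)              # SGD-style momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)    # orthogonalized update direction
    param.add_(update, alpha=-lr)
```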
We provide three example configurations for training DLLMs on the FineWeb-Edu dataset with a model adapted from Qwen3 (which we name qwen3-dllm). To train a DLLM with a custom model or dataset, you can readily adapt the provided examples.
The training configuration is defined in YAML format. Reference configurations:

- `examples/vanila_dllm/dllm_train_adam_multi_gpu_deepspeed.yaml`
- `examples/vanila_dllm/dllm_train_muon_multi_gpu_fsdp2.yaml`
- `examples/vanila_dllm/dllm_train_muon_single_gpu.yaml`
Key configuration highlights:
```yaml
trainer_type: dllm_trainer  # Use DLLM-specific trainer

# Model Configuration
model_config:
  load_from_config:
    model_type: qwen3_dllm  # DLLM variant of Qwen3
    config:
      vocab_size: 151936
      hidden_size: 1024
      intermediate_size: 4096
      num_hidden_layers: 24
      use_cache: false

# Dataset Configuration
dataset_config:
  dataset_type: fineweb_edu
  dataset_format: hf_dataset
  dataset_path: HuggingFaceFW/fineweb-edu
  packing_length: 2048
  extra_kwargs:
    collator_type: dllm  # DLLM-specific data collator

# Muon Optimizer Settings
trainer_args:
  use_muon: true  # Enable Muon optimizer
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1.0e-8
  learning_rate: 0.001
  weight_decay: 0.01

  # Training Configuration
  per_device_train_batch_size: 32
  gradient_accumulation_steps: 16
  max_steps: 10000
  warmup_steps: 1000

  # Distributed Training
  fsdp2: true  # FSDP2 for efficient distributed training
  accelerator_config:
    split_batches: true
    # If true, rank 0 loads the dataset once, splits it into `world_size` shards,
    # and dispatches each shard to the corresponding rank.
    # Ensure `per_device_train_batch_size` is divisible by `world_size`.
    # When `split_batches: true`, the effective batch per device is:
    #   per_device_train_batch_size / world_size
```
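To make the batch-size arithmetic concrete, assume the 4-GPU launch script below: with `split_batches: true`, each device processes 32 / 4 = 8 packed sequences per forward pass, so one micro-step still covers 32 sequences globally; with `gradient_accumulation_steps: 16`, each optimizer step therefore sees 32 × 16 = 512 packed sequences of `packing_length` 2048 tokens.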
Use the provided script to launch multi-GPU training.
```bash
#!/usr/bin/env bash

# Configure GPUs
GPUS=0,1,2,3

export WANDB_PROJECT="your-project-name"
export HF_HUB_DOWNLOAD_TIMEOUT=200
export HF_HUB_ETAG_TIMEOUT=200

WORLD_SIZE=$(awk -F',' '{print NF}' <<<"$GPUS")

# Launch training
CUDA_VISIBLE_DEVICES=$GPUS accelerate launch \
    --multi_gpu \
    --num_processes $WORLD_SIZE \
    --num_machines 1 \
    --machine_rank 0 \
    --main_process_ip 127.0.0.1 \
    --main_process_port 29504 \
    --dynamo_backend=no \
    --mixed_precision=no \
    --module lmms_engine.launch.cli \
    --config /path/to/your/config.yaml \
    2>&1 | tee outputs/output.log
```

For single GPU:
```bash
#!/usr/bin/env bash

GPUS=0

export WANDB_PROJECT="your-project-name"
export HF_HUB_DOWNLOAD_TIMEOUT=200
export HF_HUB_ETAG_TIMEOUT=200

# For single GPU, simply run with python (no distributed launcher needed)
CUDA_VISIBLE_DEVICES=$GPUS python -m lmms_engine.launch.cli \
    --config /path/to/your/config.yaml \
    2>&1 | tee outputs/output_single_gpu.log
```
- DLLM Trainer: Specialized trainer (`dllm_trainer`) optimized for diffusion language modeling. See the code in `src/lmms_engine/train/hf/dllm_trainer.py`.
- DLLM Collator: Custom data collator (`collator_type: dllm`) for preparing batches. See the example in `src/lmms_engine/datasets/collator/text_dllm_collator.py`.
- DLLM Model Architecture: At its core, a DLLM behaves as a non-causal, mask-based language model. Thus, to repurpose a conventional AR model for DLLM training, simply change the attention mask from causal to full (see the sketch after the notes below). See the example in `src/lmms_engine/models/qwen3_dllm/modeling_qwen3_dllm.py`.
- FSDP2 is recommended.
- DeepSpeed is currently not supported with the Muon optimizer, but it works with the Adam optimizer. The default DeepSpeed config is provided at `examples/ds_config/default_config.json`; it sets the ZeRO stage to 2 by default, but you can change it to any stage you like.
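As a rough illustration of the causal-to-full attention change mentioned in the architecture note above, here is a minimal, self-contained sketch (not the actual code in `modeling_qwen3_dllm.py`): the only difference between the AR and DLLM attention patterns is the mask handed to the attention kernel.

```python
import torch
import torch.nn.functional as F

def build_attn_mask(seq_len: int, causal: bool, device=None) -> torch.Tensor:
    """Boolean attention mask: True means 'may attend'.

    causal=True  -> lower-triangular mask of a standard AR language model.
    causal=False -> all-True full mask, as used for DLLM training.
    """
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool, device=device)
    return mask.tril() if causal else mask

# Toy example: turning the AR attention pattern into the DLLM one is just a mask swap.
q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, seq, head_dim)
ar_out   = F.scaled_dot_product_attention(q, k, v, attn_mask=build_attn_mask(16, causal=True))
dllm_out = F.scaled_dot_product_attention(q, k, v, attn_mask=build_attn_mask(16, causal=False))
```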