To run training, prepare a YAML config. Below is an up-to-date example that you can use as a template:
```yaml
trainer_type: fsdp2_trainer

# Dataset configuration - includes the actual dataset definitions
dataset_config:
  dataset_type: vision
  dataset_format: yaml  # 'yaml' format supports both external files and inline definitions
  # Inline dataset definitions (no dataset_path needed)
  datasets:
    - path: data/open_thoughts_debug
      data_folder: ""
      data_type: arrow
  # Processor configuration
  processor_config:
    processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"
    processor_type: "qwen2_5_vl"
  # Packing configuration
  packing: true
  packing_strategy: first_fit
  packing_length: 16384

# Model configuration
model_config:
  load_from_pretrained_path: "Qwen/Qwen2.5-VL-7B-Instruct"
  attn_implementation: "flash_attention_2"

# Training arguments, mostly compatible with the HuggingFace Trainer
trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06  # write 1.0e-06 (not 1e-06) so YAML parses it as a float
  weight_decay: 0.0
  gradient_accumulation_steps: 1
  gradient_checkpointing: true
  num_train_epochs: 1
  save_steps: 100
  save_total_limit: 1
  report_to: "wandb"
  output_dir: "./output/debug"
  warmup_ratio: 0.0
  run_name: "qwen2_5_vl_config"
  eval_strategy: "no"
  logging_steps: 1
  group_by_length: true
  dataloader_num_workers: 8
  bf16: true
  lr_scheduler_type: "cosine"
  freeze_modules: ["visual"]
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"]
    reshard_after_forward: false
```

You can visit the `config.py` file under each subfolder to see which parameters are configurable.
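The float-notation caveat on `learning_rate` matters because PyYAML's YAML 1.1 resolver only treats scientific notation as a float when the mantissa contains a decimal point. A quick check (assuming PyYAML is installed):

```python
import yaml

# PyYAML resolves "1.0e-06" as a float, but "1e-06", with no decimal
# point in the mantissa, falls through to a plain string.
good = yaml.safe_load("learning_rate: 1.0e-06")
bad = yaml.safe_load("learning_rate: 1e-06")
print(type(good["learning_rate"]).__name__)  # float
print(type(bad["learning_rate"]).__name__)   # str
```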
- `trainer_type`: Use `hf_trainer` for the standard HF Trainer or `fsdp2_trainer` for PyTorch FSDP2.
- `dataset_config.dataset_format`: `yaml`. You can either set `dataset_path` to an external YAML file, or embed datasets inline via `datasets`.
- `datasets`: Each entry defines `path`, an optional `data_folder`, and `data_type` (e.g., `arrow`, `parquet`).
- `processor_config`: Set `processor_name` (e.g., a Hugging Face model ID) and `processor_type` (e.g., `qwen2_5_vl`).
- `packing`: Enable sequence packing with `packing: true`, and adjust `packing_strategy` and `packing_length`. Use `filter_overlong` to drop samples exceeding the length limit.
- Video options: `video_backend`, `video_sampling_strategy`, `video_max_pixels`, and `video_max_frames` control video preprocessing.
- `model_config`: Prefer `load_from_pretrained_path` and set `attn_implementation` (e.g., `flash_attention_2`).
- `freeze_modules`: List of submodules (e.g., `visual`) to freeze during training.
- `use_liger_kernel` / `use_rmpad`: Performance optimizations. Keep them enabled if supported on your stack.
- `fsdp2` / `fsdp_config`: Enable FSDP2 sharding and wrap transformer layer classes via `transformer_layer_cls_to_wrap`. Tune `reshard_after_forward` for memory/performance trade-offs.
- EMA (Exponential Moving Average): Enable EMA with `ema_enabled: true`. Configure `ema_decay` (default 0.9999), `ema_update_every`, `ema_start_step`, and optionally filter parameters via `ema_param_filter`. EMA checkpoints are saved alongside regular checkpoints and can be merged using `merge_fsdp.py` with `--state_dict_dirname pytorch_ema_model_fsdp_0`.
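To make the packing options concrete, here is a minimal sketch of what a first-fit strategy does: each sample goes into the first bin that still has room under `packing_length`. The function name and details are illustrative, not the trainer's actual implementation:

```python
# Illustrative first-fit packing: place each sample's token length into
# the first bin whose total stays within packing_length.
def first_fit_pack(lengths, packing_length):
    bins = []  # each bin is a list of sample lengths
    for n in lengths:
        for b in bins:
            if sum(b) + n <= packing_length:
                b.append(n)
                break
        else:
            bins.append([n])  # no bin had room: open a new one
    return bins

print(first_fit_pack([9000, 5000, 7000, 4000], packing_length=16384))
# [[9000, 5000], [7000, 4000]]
```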
Example launch command:
```bash
export NCCL_BLOCKING_WAIT=0
export TOKENIZERS_PARALLELISM=false

# Hugging Face setup (optional)
export HF_TOKEN="<YOUR HF_TOKEN>"
export HF_HOME="$HOME/.cache/huggingface"
export HF_HUB_ENABLE_HF_TRANSFER="1"

export NCCL_DEBUG=INFO

CONFIG=$1 # path to your YAML config

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli config_yaml=${CONFIG}
```

Instead of using a YAML config file, you can pass configuration directly via Hydra overrides on the command line. This is useful for quick experiments and parameter tuning.
Use the `key=value` format to override any configuration parameter. Hydra automatically creates the nested structure:
```bash
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli \
    trainer_type=fsdp2_trainer \
    dataset_config.dataset_path=/path/to/video_dataset.yaml \
    dataset_config.dataset_format=yaml \
    dataset_config.dataset_type=qwen3_vl_iterable \
    dataset_config.processor_config.processor_name="Qwen/Qwen3-VL-8B-Instruct" \
    dataset_config.processor_config.processor_type=qwen3_vl \
    model_config.load_from_pretrained_path="Qwen/Qwen3-VL-8B-Instruct" \
    model_config.attn_implementation=flash_attention_2 \
    trainer_args.per_device_train_batch_size=1 \
    trainer_args.learning_rate=2.0e-04 \
    trainer_args.num_train_epochs=1 \
    trainer_args.output_dir=./output/debug \
    trainer_args.bf16=true
```

Here are frequently used parameters you can override:
Dataset Configuration:
- `dataset_config.dataset_path`: Path to your YAML dataset config
- `dataset_config.dataset_format`: Format type (e.g., `yaml`, `json`)
- `dataset_config.dataset_type`: Dataset type (e.g., `vision`, `qwen3_vl_iterable`)
- `dataset_config.processor_config.processor_name`: Model name for the processor
- `dataset_config.processor_config.processor_type`: Processor type to use
- `dataset_config.packing`: Enable/disable sequence packing (e.g., `packing=true`)
- `dataset_config.packing_length`: Max sequence length for packing
- `dataset_config.video_backend`: Video processing backend (e.g., `qwen_vl_utils`)
- `dataset_config.video_sampling_strategy`: Video sampling method (e.g., `fps`)
- `dataset_config.video_max_frames`: Maximum frames per video
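As an illustration of how an `fps` sampling strategy and `video_max_frames` interact, the sketch below subsamples frame indices at a target rate and caps the count. The function and its logic are assumptions for explanation, not the engine's actual code:

```python
# Illustrative only: pick frame indices at roughly `target_fps`, then cap
# the result at `max_frames` (mirroring video_max_frames in the config).
def sample_frame_indices(total_frames, video_fps, target_fps, max_frames):
    step = video_fps / target_fps     # keep one frame every `step` frames
    count = int(total_frames / step)  # frames available at the target rate
    indices = [round(i * step) for i in range(count)]
    return indices[:max_frames]

# A 10-second clip at 30 fps, sampled at 2 fps, capped at 16 frames:
print(sample_frame_indices(300, video_fps=30, target_fps=2, max_frames=16))
```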
Model Configuration:
- `model_config.load_from_pretrained_path`: Path or HF model ID to load from
- `model_config.attn_implementation`: Attention implementation (e.g., `flash_attention_2`)
Training Arguments:
- `trainer_args.per_device_train_batch_size`: Batch size per device
- `trainer_args.learning_rate`: Learning rate (use float notation like `2.0e-04`)
- `trainer_args.num_train_epochs`: Number of training epochs
- `trainer_args.max_steps`: Maximum training steps
- `trainer_args.gradient_accumulation_steps`: Gradient accumulation steps
- `trainer_args.gradient_checkpointing`: Enable gradient checkpointing
- `trainer_args.output_dir`: Output directory for checkpoints
- `trainer_args.run_name`: Name for this training run
- `trainer_args.bf16`: Use bfloat16 precision
- `trainer_args.fsdp2`: Enable FSDP2 distributed training
- `trainer_args.use_liger_kernel`: Enable Liger kernel optimizations
- `trainer_args.use_rmpad`: Enable padding removal optimization
- `trainer_args.ema_enabled`: Enable EMA (default: `false`)
- `trainer_args.ema_decay`: EMA decay rate (default: `0.9999`)
- `trainer_args.ema_update_every`: Update EMA every N steps (default: `1`)
- `trainer_args.ema_start_step`: Start EMA from step N (default: `0`)
- `trainer_args.ema_requires_grad_only`: Only apply EMA to trainable parameters (default: `true`)
- `trainer_args.ema_param_filter`: Filter parameters by name (supports `mode`, `include`, `exclude`)
- `trainer_args.ema_resume_from_ema`: Resume training from EMA weights (default: `false`)
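The EMA options above follow the standard exponential-moving-average update; the sketch below shows the rule that `ema_decay` controls. This is a toy illustration, not the trainer's code:

```python
# ema <- decay * ema + (1 - decay) * param, applied every
# ema_update_every steps once ema_start_step is reached.
def ema_update(ema_params, model_params, decay=0.9999):
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Toy example with decay=0.5 so the smoothing effect is visible:
ema = {"w": 1.0}
for w in [2.0, 3.0]:
    ema = ema_update(ema, {"w": w}, decay=0.5)
print(ema["w"])  # 0.5 * 1.5 + 0.5 * 3.0 = 2.25
```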
See `examples/qwen3_vl/qwen3_vl_8b_train.sh` for a complete training script that uses Hydra overrides with comprehensive parameter configuration for multi-GPU training.
You can use a YAML config file as a base and override specific parameters via the CLI using Hydra's `--config-path` and `--config-name`:
```bash
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli \
    --config-path /path/to/config_yaml/directory \
    --config-name qwen2_5_vl_dp \
    trainer_args.max_steps=100
```

This loads all settings from `qwen2_5_vl_dp.yaml` in the specified directory and overrides only the specified parameters (CLI overrides take precedence).
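The precedence rule can be pictured as a deep merge in which CLI values win. This is only a conceptual sketch; Hydra's real composition is richer:

```python
# Conceptual sketch: overrides replace base values key by key, recursing
# into nested sections, so untouched base settings survive.
def deep_merge(base, override):
    merged = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(merged.get(k), dict):
            merged[k] = deep_merge(merged[k], v)
        else:
            merged[k] = v
    return merged

base = {"trainer_args": {"max_steps": 1000, "bf16": True}}
cli = {"trainer_args": {"max_steps": 100}}
print(deep_merge(base, cli))
# {'trainer_args': {'max_steps': 100, 'bf16': True}}
```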
- Use quotes for string values: `processor_name="Qwen/Qwen2.5-VL-7B-Instruct"`
- Use dot notation for nested configs: `trainer_args.learning_rate=1.0e-06`
- Boolean values: `packing=true` or `packing=false`
- For complex values (lists/arrays), use Hydra's syntax: `trainer_args.fsdp_config.transformer_layer_cls_to_wrap=["Qwen2_5_VLDecoderLayer"]`
- Add new parameters with `+`: `+dataset_config.extra_kwargs.image_max_pixels=4194304`
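The dot-notation rules above can be sketched as follows. Hydra (via OmegaConf) does this for real; the parser below is only a simplified illustration that leaves all values as strings:

```python
def apply_overrides(pairs):
    """Turn dotted key=value strings into a nested dict (simplified)."""
    cfg = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        node = cfg
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value  # real Hydra also converts types (bool, float, ...)
    return cfg

cfg = apply_overrides([
    "trainer_args.learning_rate=1.0e-06",
    "dataset_config.packing=true",
])
print(cfg)
# {'trainer_args': {'learning_rate': '1.0e-06'}, 'dataset_config': {'packing': 'true'}}
```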