This document contains MFU metrics for various model training configurations using LMMs-Engine, measured with FSDP distributed training across multiple GPUs. All experiments were conducted on 4 nodes × 8 A800 80GB SXM GPUs using the `fsdp2_trainer` with the `vision_iterable` dataset for on-the-fly stream packing.
Model FLOPs Utilization (MFU) measures the efficiency of GPU usage during training, representing the ratio of achieved FLOPs to theoretical peak FLOPs (a minimal calculation sketch follows the list below). All configurations use:
- Sequence Packing: First-fit bin packing strategy for optimal throughput
- Unpadding: Remove padding via `use_rmpad` to eliminate wasted computation
- Liger Kernel: Fused operations for memory efficiency
- Iterable Dataset: Streaming data loading for trillion-token pretraining
- FSDP: Fully Sharded Data Parallel v2 for distributed training
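For concreteness, here is a minimal sketch of how MFU can be estimated from token throughput, assuming the common 6 · N · D approximation for transformer training FLOPs (forward + backward). The function name and the example numbers (model size, throughput, and the ~312 TFLOPS BF16 peak usually quoted for A800 SXM) are illustrative assumptions, not values from this document.

```python
# Minimal MFU estimate, assuming the common 6 * params * tokens
# approximation for transformer training FLOPs (forward + backward).
def estimate_mfu(num_params: float, tokens_per_second: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved_flops = 6 * num_params * tokens_per_second  # FLOPs/s across the cluster
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical numbers: a 7B model on 32 A800s (~312 TFLOPS BF16 peak each).
print(estimate_mfu(7e9, 130_000, 32, 312e12))  # ~0.55
```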
Configuration:
- 4 nodes × 8 A800-SXM4 GPUs
- Packing length: 81,920
- Optimization: Remove padding + Liger kernel + Iterable dataset
- Training mode: FSDP distributed
Achieved MFU: 0.50-0.55 (50-55%)
Key settings:
```yaml
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 81920
trainer_args:
  use_rmpad: true
  use_liger_kernel: true
  fsdp2: true
```

Configuration:
- 4 nodes × 8 A800-SXM4 GPUs
- Packing length: 61,440
- Optimization: Remove padding + Liger kernel + Iterable dataset
- Training mode: FSDP distributed
Achieved MFU: 0.30-0.40 (30-40%)
Key settings:
```yaml
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 61440
trainer_args:
  use_rmpad: true
  use_liger_kernel: true
  fsdp2: true
```

Note: Reduced packing length compared to text models due to Vision Transformer (ViT) overhead. See Important Considerations below.
Configuration:
- 4 nodes × 8 A800-SXM4 GPUs
- Packing length: 51,200
- Optimization: Remove padding + Liger kernel + Iterable dataset
- Training mode: FSDP distributed
Achieved MFU: 0.25-0.35 (25-35%)
Key settings:
```yaml
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 51200
trainer_args:
  use_rmpad: true
  use_liger_kernel: true
  fsdp2: true
```

Configuration:
- 4 nodes × 8 A800-SXM4 GPUs
- Packing length: 51,200
- Sequence Parallel degree: 2
- Optimization: Remove padding + Liger kernel + Iterable dataset
- Training mode: FSDP distributed + Ulysses Sequence Parallel
Achieved MFU: 0.20-0.25 (20-25%)
Key settings:
```yaml
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 51200
trainer_args:
  use_rmpad: true
  use_liger_kernel: true
  fsdp2: true
  sp_ulysses_degree: 2  # Sequence parallel degree
```

Note: Sequence parallelism reduces MFU due to communication overhead, but it enables training with longer sequences and larger batch sizes that would not fit in a single GPU's memory.
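For intuition on the memory side: under Ulysses SP, each rank materializes activations for only `packing_length / sp_ulysses_degree` tokens outside of attention, trading that saving for all-to-all communication around the attention layers. A back-of-the-envelope sketch with the numbers above:

```python
# Back-of-the-envelope: activation tokens held per rank under Ulysses SP.
# The 51,200-token packed sequence is split across ranks, so each GPU
# materializes activations for half the tokens, at the cost of all-to-all
# communication around attention.
packing_length = 51_200
sp_ulysses_degree = 2
tokens_per_rank = packing_length // sp_ulysses_degree
print(tokens_per_rank)  # 25600 -> roughly halves activation memory per GPU
```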
Configuration:
- 4 nodes × 8 H100 GPUs
- Packing length: 10,240
- Optimization: Liger kernel + Iterable dataset
- Training mode: FSDP distributed + Ulysses Sequence Parallel
Achieved MFU: 0.25-0.30 (25-30%)
Key settings:
```yaml
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 10240
trainer_args:
  use_rmpad: false
  use_liger_kernel: true
  fsdp2: true
```

The reported MFU metrics do not include Vision Transformer (ViT) FLOPs computation. This is important because:
- ViT FLOPs are non-negligible: The actual total computational work includes ViT encoding of image/video tokens, but the MFU calculation counts only language-model FLOPs (see the sketch after this list)
- Memory overhead: ViT processing requires additional GPU memory for intermediate activations and attention computations
- Packing length reduction: For multimodal models, you may need to reduce packing length compared to text-only models to accommodate ViT memory requirements
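As a rough illustration of the first point, the sketch below compares LM-only FLOPs (what the reported MFU credits) against a crude ViT estimate. Both parameter counts are hypothetical, and 6 · params · tokens is only a standard rough approximation.

```python
# Illustrative only: the reported MFU credits language-model FLOPs, so ViT
# encoding is real work the metric does not count. Parameter counts are
# hypothetical.
lm_params, vit_params = 7e9, 0.6e9
tokens = 51_200  # one packed sequence

lm_flops = 6 * lm_params * tokens    # counted toward MFU
vit_flops = 6 * vit_params * tokens  # crude upper bound if every token were visual
print(f"uncounted ViT share: {vit_flops / lm_flops:.0%}")  # ~9%
```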
When transitioning between modalities, consider:
- Text-only models: Can use the longest packing length (81,920) for maximum throughput
- Image models: Reduce packing length to ~61,440 due to ViT visual token processing
- Video models: Further reduce to ~51,200 due to accumulated visual tokens from multiple frames
If you encounter out-of-memory errors, try reducing packing length in increments while monitoring MFU to find the optimal balance between throughput and memory usage.
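A sketch of that search, assuming a caller-supplied probe function (hypothetical here) that builds the dataloader at a given packing length and runs one forward/backward step:

```python
import torch

def find_max_packing_length(probe_step, start=81_920, step=10_240, floor=10_240):
    """Shrink packing length until one training step runs without OOM.

    probe_step: hypothetical callable that takes a packing length, builds
    the dataloader, and runs a single forward/backward pass.
    """
    length = start
    while length >= floor:
        try:
            probe_step(length)
            return length  # first length that fits in memory
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            length -= step
    raise RuntimeError("no packing length fit; reduce model or batch size")
```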
| Optimization | Memory Savings | Speed Improvement | Notes |
|---|---|---|---|
| Remove Padding (`use_rmpad`) | 20-30% | 2-3× on variable sequences | Requires Flash Attention |
| Liger Kernel | ~30% | 10-20% | Fused operations for common kernels |
| Sequence Packing | Varies | 35-40% MFU vs. 20-25% without packing | First-fit bin packing with variable lengths |
| Sequence Parallel (SP) | 2-3× (SP degree) | Reduced efficiency | Enables ultra-long contexts |
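For reference, a minimal sketch of the first-fit strategy named in the table: each sample goes into the first open bin with room, and a new bin opens when none fits. This is a generic illustration, not LMMs-Engine's implementation.

```python
def first_fit_pack(sample_lengths, packing_length=81_920):
    """Greedy first-fit: place each sample in the first bin with room."""
    bins = []  # each bin is [used_tokens, [sample indices]]
    for idx, length in enumerate(sample_lengths):
        for b in bins:
            if b[0] + length <= packing_length:
                b[0] += length
                b[1].append(idx)
                break
        else:
            bins.append([length, [idx]])  # no bin fits: open a new one
    return bins

# Example: three short samples share one 81,920-token bin.
print(first_fit_pack([30_000, 40_000, 10_000]))  # [[80000, [0, 1, 2]]]
```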