Authors: Nam V. Nguyen*, Thong T. Doan*, Luong Tran, Van Nguyen, Quang Pham
Mixture of experts (MoE) architectures have become a cornerstone of scaling large language models and are a key component of most recent models, including GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, which put large-scale studies out of reach for most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training-regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry, standardizing evaluation, and providing this analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations.
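As a concrete illustration of the routing quantities analyzed in (i) and (ii), the sketch below computes per-expert load and mean routing entropy from raw router logits. It assumes a generic top-k softmax router; the function name, shapes, and defaults are illustrative and are not part of LibMoE's analysis API.

```python
# Illustrative only: per-expert load and routing entropy for a generic
# top-k softmax router. Not LibMoE's analysis API.
import torch
import torch.nn.functional as F

def routing_stats(router_logits: torch.Tensor, top_k: int = 2):
    """router_logits: [num_tokens, num_experts] raw scores from the router."""
    probs = F.softmax(router_logits, dim=-1)          # routing distribution per token
    topk_idx = probs.topk(top_k, dim=-1).indices      # experts selected for each token
    num_experts = router_logits.size(-1)

    # Expert selection pattern: fraction of token slots dispatched to each expert.
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    load = load / topk_idx.numel()

    # Routing entropy: low values indicate confident, specialized routing;
    # values near log(num_experts) indicate diffuse routing across experts.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return load, entropy

load, entropy = routing_stats(torch.randn(1024, 8))   # 1024 tokens, 8 experts
print(load.tolist(), entropy.item())
```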
LibMoE follows a rolling release log, with the newest milestone listed first.
| Date | Release | Highlights |
|---|---|---|
| 2025-12-30 | LibMoE v2.0 | Added MoE analysis tools and loss tracking, and extended support for language-model pretraining workflows. |
| 2024-12-30 | LibMoE v1.1 | Reduced training time by approximately 70% (from ~30 h to ~9 h); added richer MoE diagnostics, including balancing loss, z-loss, per-step training time, FLOPs, language loss, total loss, auxiliary loss, and customizable metrics (a sketch of the balancing and z-losses follows this table); updated balance_loss_coef and router_z_loss_coef for improved performance. More details. |
| 2024-11-04 | MoE metric analysis | Introduced metric analysis utilities for MoE algorithms, aligned with the LibMoE paper. |
| 2024-11-01 | LibMoE v1.0 preprint | Released the LibMoE preprint, project webpage, and public checkpoints. Paper · Webpage |
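For reference, the balancing loss and router z-loss listed in the v1.1 row are typically defined as in the Switch Transformer line of work. The sketch below follows that common formulation under that assumption; the coefficient names mirror the table entry, but the code is not LibMoE's implementation.

```python
# Generic Switch-Transformer-style auxiliary losses (an assumption, not LibMoE's code).
import torch
import torch.nn.functional as F

def moe_aux_losses(router_logits: torch.Tensor, top_k: int = 2,
                   balance_loss_coef: float = 0.01,
                   router_z_loss_coef: float = 1e-3):
    """router_logits: [num_tokens, num_experts]."""
    probs = F.softmax(router_logits, dim=-1)
    num_experts = probs.size(-1)
    topk_idx = probs.topk(top_k, dim=-1).indices

    # f: fraction of token slots routed to each expert; p: mean router probability.
    f = torch.bincount(topk_idx.flatten(), minlength=num_experts).float() / topk_idx.numel()
    p = probs.mean(dim=0)
    balance_loss = num_experts * torch.sum(f * p)      # encourages uniform expert load

    # The z-loss keeps router logits small, which stabilizes training numerically.
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()

    return balance_loss_coef * balance_loss, router_z_loss_coef * z_loss
```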
```bash
git clone https://github.com/Fsoft-AIC/LibMoE.git
cd LibMoE
```

Create and activate an environment, either with venv:

```bash
python -m venv .venv
source .venv/bin/activate
```

or with conda:

```bash
conda create -n libmoe python=3.9 -y
conda activate libmoe
```

Then install the package:

```bash
pip install --upgrade pip
pip install -e .
pip install -e .[vlm,lm,eval]   # or: pip install -r requirements.txt
```

Need a lighter environment? Start with `pip install -e .` and then layer on only what you need:
- Vision-language stack: `pip install -e .[vlm,eval]`
- Language-model pretraining: `pip install -e .[lm]`
- Evaluation utilities only: `pip install -e .[eval]`
After installing all required libraries, follow the component-specific guides below:
🖼️ Vision-Language Stack — Sparse Upcycling
LibMoE provides a streamlined sparse-upcycling pipeline, converting existing VLM backbones (SigLIP/CLIP × Phi) into MoE-enhanced architectures without training from scratch. The pipeline supports pre-training, pre-fine-tuning, and visual instruction tuning.
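The core idea of sparse upcycling is to initialize every expert from the pretrained dense FFN and add a freshly initialized router, so the converted model starts from the backbone's weights rather than from scratch. Below is a minimal sketch of that idea; the class, its constructor arguments, and the routing loop are illustrative and do not mirror LibMoE's conversion pipeline.

```python
# Minimal sketch of sparse upcycling: copy a pretrained dense FFN into each expert
# and add a new router. Illustrative only; not LibMoE's conversion code.
import copy
import torch
import torch.nn as nn

class UpcycledMoELayer(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Every expert starts as an identical copy of the pretrained dense FFN.
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        self.router = nn.Linear(hidden_size, num_experts, bias=False)  # trained from scratch
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [num_tokens, hidden_size]
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)        # renormalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topw[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Example: upcycle a toy dense FFN.
dense_ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
moe = UpcycledMoELayer(dense_ffn, hidden_size=16)
y = moe(torch.randn(32, 16))
```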
🧠 Language Modeling Stack — MoE Pretraining from Scratch
The language modeling stack focuses on end-to-end MoE pretraining from scratch, featuring a modular Transformer design, flexible routing strategies, and a suite of MoE variants for comprehensive sparse LLM research.
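To illustrate what "flexible routing strategies" can look like in practice, the sketch below puts two common routers behind the same call signature so a Transformer block can swap between them. The interface and class names are hypothetical, not LibMoE's.

```python
# Hypothetical router interface illustrating swappable routing strategies.
# Names and signatures are ours, not LibMoE's.
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Tokens choose experts: classic top-k softmax routing."""
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: [num_tokens, hidden_size]
        probs = self.gate(x).softmax(dim=-1)
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        return weights / weights.sum(dim=-1, keepdim=True), expert_ids

class ExpertChoiceRouter(nn.Module):
    """Experts choose tokens: each expert takes its top-capacity tokens."""
    def __init__(self, hidden_size: int, num_experts: int, capacity: int = 64):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.capacity = capacity

    def forward(self, x):
        scores = self.gate(x).softmax(dim=0)                # experts compete over tokens
        k = min(self.capacity, x.size(0))
        weights, token_ids = scores.topk(k, dim=0)          # [k, num_experts] each
        return weights, token_ids

router = TopKRouter(hidden_size=16, num_experts=8)
weights, expert_ids = router(torch.randn(32, 16))
```

Because both routers return a (weights, indices) pair, the layer that consumes them does not need to know which strategy is active.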
```
LibMoEv2/
├── docs/
│   ├── pretrain_llm/
│   └── sparse_upcyling/
├── language_modeling/
│   ├── framework/
│   │   ├── data_structures/
│   │   ├── dataset/
│   │   ├── helpers/
│   │   ├── interfaces/
│   │   ├── layers/
│   │   ├── loader/
│   │   ├── optimizer/
│   │   ├── task/
│   │   └── utils/
│   ├── interfaces/
│   ├── layers/
│   │   └── transformer/
│   ├── models/
│   ├── paper/
│   │   ├── deepseek/
│   │   └── moe_universal/
│   ├── scripts/
│   ├── sweeps/
│   │   ├── 154M/
│   │   └── 660M/
│   └── tasks/
└── vision_language_model/
    ├── evaluate/
    │   ├── analysis/
    │   ├── docs/
    │   ├── lmms_eval/
    │   ├── miscs/
    │   ├── modules/
    │   ├── results/
    │   └── tools/
    ├── moe_model/
    │   ├── model/
    │   │   ├── language_model/
    │   │   ├── moe/
    │   │   ├── multimodal_encoder/
    │   │   └── multimodal_projector/
    │   ├── serve/
    │   │   └── examples/
    │   └── train/
    └── scripts/
        ├── eval/
        └── train/
```
If this repository supports your research, please cite:
```bibtex
@misc{nguyen2025libmoelibrarycomprehensivebenchmarking,
      title={LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models},
      author={Nam V. Nguyen and Thong T. Doan and Luong Tran and Van Nguyen and Quang Pham},
      year={2025},
      eprint={2411.00918},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.00918},
}
```
