LabVLA turns a Qwen3-VL-4B-Instruct vision–language backbone into a real-time robot controller through a DiT flow-matching action expert, trained with the π0.5 recipe: FAST action-token pre-training → flow-matching post-training with knowledge insulation → task fine-tuning. This README covers installation, training, and deployment — method details are in the paper.
✨ Features • 📋 TODO • 📦 Installation • 🚀 Quick Start • 🎓 Training • 🔧 Fine-tuning • 📡 Deployment • 📝 Citation
🎓 The recipe — every stage in one framework
| Mode | What it does |
|---|---|
| VLM pre-training | FAST action-token cross-entropy on the VLM backbone. |
| Flow-matching post-training | Trains the DiT action expert to generate 50-step continuous action chunks. |
| Knowledge Insulation (KI) | Stop-gradient between VLM and action expert. |
| Task fine-tuning | Fine-tuning for downstream tasks. |
| Multi-dataset & VQA co-training | π0-style mixture with homogeneous batches. |
| delta / abs action modes | Per-dimension delta_mask — arm joints delta, gripper absolute, in one vector. |
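
As a rough illustration of the per-dimension delta_mask idea in the last row, the sketch here converts an absolute action into a mixed delta/absolute target and back. The function names, the 7-dimensional layout, and the choice to take deltas relative to the current state are assumptions made for illustration; the actual LabVLA implementation may differ.

```python
from typing import List, Sequence


def to_training_target(
    action: Sequence[float],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> List[float]:
    """Encode an absolute action: masked dims become offsets from the state."""
    return [a - s if m else a for a, s, m in zip(action, state, delta_mask)]


def to_absolute_command(
    target: Sequence[float],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> List[float]:
    """Invert to_training_target: add the state back on masked dims."""
    return [t + s if m else t for t, s, m in zip(target, state, delta_mask)]


# Hypothetical layout: six arm joints (delta) plus one gripper (absolute).
DELTA_MASK = [True] * 6 + [False]

state = [0.10, -0.20, 0.30, 0.00, 0.50, -0.10, 1.0]
action = [0.15, -0.25, 0.30, 0.10, 0.45, -0.10, 0.0]

target = to_training_target(action, state, DELTA_MASK)
restored = to_absolute_command(target, state, DELTA_MASK)
```

Keeping both modes in one vector means the gripper command stays a direct open/close value while the joints are expressed as small offsets, and a single mask is enough to undo the encoding at deployment time.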
⚙️ Engineering
- 🚀 Efficiency: selective gradient checkpointing (only a subset of modules, e.g. the visual encoder or the language model, is checkpointed per stage), Liger-Kernel fused ops, DeepSpeed ZeRO-2, and EMA offload together keep the per-GPU batch size at 64 on 80 GB A100s for pre-training and post-training (48 for task fine-tuning) with minimal speed penalty.
| Stage | A100 80 GB GPUs (nodes × GPUs per node) | BS / GPU | Global BS | s / step |
|---|---|---|---|---|
| VLM Pre-training | 24 (3 × 8) | 64 | 1 536 | ≈ 7 |
| KI Post-training | 16 (2 × 8) | 64 | 1 024 | ≈ 5 |
| Task Fine-tuning | 4 | 48 | 192 | ≈ 3 |
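
The Global BS column is consistent with GPU count × per-GPU batch size, which suggests no gradient accumulation (an inference from the numbers, not something the table states). The sketch here reproduces that arithmetic and derives an approximate throughput from the step times; `StageConfig` is a hypothetical helper, not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    num_gpus: int
    per_gpu_batch: int
    seconds_per_step: float  # approximate value taken from the table

    @property
    def global_batch(self) -> int:
        # Assumes one optimizer step per forward/backward pass.
        return self.num_gpus * self.per_gpu_batch

    @property
    def samples_per_second(self) -> float:
        return self.global_batch / self.seconds_per_step


STAGES = [
    StageConfig("VLM Pre-training", 24, 64, 7.0),
    StageConfig("KI Post-training", 16, 64, 5.0),
    StageConfig("Task Fine-tuning", 4, 48, 3.0),
]

for stage in STAGES:
    print(
        f"{stage.name}: global batch {stage.global_batch}, "
        f"about {stage.samples_per_second:.0f} samples/s"
    )
```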
- Model weights on Hugging Face
- Inference & deployment code
- Training & fine-tuning code
- RoboGenesis & labembodied-data — coming soon
The full training, post-training, and fine-tuning code is now available — see Training. RoboGenesis and labembodied-data are being organized and will follow soon.
Python 3.10 · CUDA 12.6 · PyTorch 2.7.1 — pinned versions in requirements.txt.
conda create -n labvla python=3.10 -y && conda activate labvla
# 1. PyTorch (CUDA 12.6) → 2. FlashAttention (built against it) → 3. everything else
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install flash_attn==2.8.3 --no-build-isolation
pip install -r requirements.txt

1. Download the model from Hugging Face:
huggingface-cli download zjunlp/LabVLA-5B-Base --local-dir LabVLA-5B-Base

2. Deploy by starting a WebSocket inference server:
PRETRAINED_PATH=LabVLA-5B-Base bash deployment/deploy.sh

3. Evaluate by connecting your robot or simulator client to the server and running rollouts. See Deployment for configuration details.
One entrypoint (scripts/train.py) for all stages, launched through Accelerate + DeepSpeed ZeRO-2 (bf16). Edit the variables at the top of each launch script before running.
python -m data_process scan --root /path/to/dataset --out /tmp/report.json # detect bad episodes
python -m data_process clean --src /path/to/dataset --dst /path/to/clean \
--report /tmp/report.json # apply report (symlink copy)
python -m data_process stats --dataset /path/to/clean --schema robointer_droid # normalization stats

| Subcommand | Purpose |
|---|---|
| scan | Detect bad episodes (corrupt video, decode failures, missing files). |
| clean | Apply a scan report as a renumbered symlink copy; originals untouched. |
| stats | Compute normalization statistics → meta/stats.json. |
| validate | Cross-repo integrity checks. |
| preflight | Gate a launch on HIGH/CRIT issues before it starts. |

Each dataset is described by an auto-registered DatasetSchema (src/schema/). Add a module under schemas/ or pass --dataset_schema /abs/path.py.
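
To make the idea of an auto-registered schema concrete, here is a generic sketch of the pattern. It is not the LabVLA DatasetSchema API: `ToySchema`, `SCHEMA_REGISTRY`, and every field name are hypothetical stand-ins, so consult the existing modules in the repository for the real interface.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical registry; the real one in the LabVLA codebase may look different.
SCHEMA_REGISTRY: Dict[str, "ToySchema"] = {}


@dataclass(frozen=True)
class ToySchema:
    name: str
    state_keys: Tuple[str, ...]
    action_dim: int
    delta_mask: Tuple[bool, ...]

    def __post_init__(self) -> None:
        # Validate first, then register, so a bad schema never enters the registry.
        if len(self.delta_mask) != self.action_dim:
            raise ValueError("delta_mask must have one entry per action dimension")
        if self.name in SCHEMA_REGISTRY:
            raise ValueError(f"schema {self.name!r} is already registered")
        SCHEMA_REGISTRY[self.name] = self


# Defining an instance at module import time is what makes it "auto-registered".
MY_LAB_ARM = ToySchema(
    name="my_lab_arm",                   # illustrative dataset name
    state_keys=("observation.state",),   # illustrative key
    action_dim=7,
    delta_mask=(True,) * 6 + (False,),
)
```

With this pattern, importing a schema module is enough to make the dataset selectable by name, which is one plausible reading of how a single flag can point the pipeline at a new dataset.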
Train the Qwen3-VL backbone on FAST action-token cross-entropy, with robot state discretized into the prompt. Defaults to a joint mixture of robointer_droid_clean, oxe-auge_clean, RoboInter-VQA, and agibot_world (π0-style n^0.43 volume weighting).
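
The n^0.43 volume weighting can be sketched as follows: each dataset's sampling probability is proportional to its size raised to the power 0.43, which flattens the mixture so that small datasets are sampled more often than their raw share. The function name and the sample counts are placeholders, not values from the LabVLA mixture.

```python
from typing import Dict


def mixture_weights(sizes: Dict[str, int], exponent: float = 0.43) -> Dict[str, float]:
    """Return sampling probabilities proportional to size ** exponent."""
    powered = {name: n ** exponent for name, n in sizes.items()}
    total = sum(powered.values())
    return {name: p / total for name, p in powered.items()}


# Placeholder sample counts, not the real dataset sizes.
sizes = {"dataset_a": 1_000_000, "dataset_b": 100_000, "dataset_c": 10_000}
weights = mixture_weights(sizes)

for name, w in weights.items():
    raw_share = sizes[name] / sum(sizes.values())
    print(f"{name}: weight {w:.3f} (raw share {raw_share:.3f})")
```

An exponent of 1 would reproduce the raw shares and an exponent of 0 would sample all datasets uniformly; 0.43 sits between the two.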
bash launch/vlm_pretrain/train_vlm_pretrain.sh

Train the DiT action expert to generate 50-step continuous action chunks, with stop-gradient between the VLM and the DiT.
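
For intuition about how a flow-matching model produces an action chunk at inference time, here is a toy sketch: start from Gaussian noise and integrate a velocity field with Euler steps from t = 0 to t = 1. The trained DiT is replaced by an oracle velocity that points along the straight path to a known target chunk, and the 10 integration steps, 7 action dimensions, and time convention are all assumptions for illustration rather than LabVLA settings.

```python
import random
from typing import Callable, List

Chunk = List[List[float]]  # shape [horizon][action_dim]


def euler_sample(
    velocity: Callable[[Chunk, float], Chunk],
    noise: Chunk,
    num_steps: int = 10,
) -> Chunk:
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 (actions)."""
    x = [row[:] for row in noise]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


HORIZON, ACTION_DIM = 50, 7  # 50-step chunk; the action dimension is assumed
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(HORIZON)]
target = [[0.01 * step for _ in range(ACTION_DIM)] for step in range(HORIZON)]


def oracle_velocity(x: Chunk, t: float) -> Chunk:
    """Stand-in for the DiT: velocity of the path x_t = (1 - t) * noise + t * target."""
    return [
        [(ti - xi) / (1.0 - t) for xi, ti in zip(xr, tr)]
        for xr, tr in zip(x, target)
    ]


chunk = euler_sample(oracle_velocity, noise)
```

In a real model the velocity would come from the action expert conditioned on VLM features, and the whole 50-step chunk is denoised jointly rather than one timestep at a time.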
bash launch/ki_posttrain/train_ki_posttrain.sh

Note: LabVLA is not limited to the datasets listed above. Extend it to any new dataset by adding a DatasetSchema under schemas/; the full pipeline (VLM pre-training, post-training, and fine-tuning) then works out of the box.

Fine-tune a post-trained checkpoint for a downstream task; this does not require the data-preparation step above.
# Edit launch/finetune/train_labutopia.sh: set PretrainedCkpt, DataRoot,
# RepoIds, DatasetSchema, and ExternalStatsPath to point at your task data.
bash launch/finetune/train_labutopia.sh

Download LabVLA from Hugging Face, then deploy via the script:
bash deployment/deploy.sh

@article{ren2026labvla,
title = {LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories},
author = {Ren, Baochang and Liu, Xinjie and Chen, Xi and Liu, Yanshuo and
Li, Chenxi and Gao, Daqi and Su, Zeqin and Xing, Jintao and
Xue, Zirui and Li, Rui and Zhao, Xiangyu and Qiao, Shuofei and
Pan, Minting and Zuo, Wangmeng and Bai, Lei and Zhou, Dongzhan and
Zhang, Ningyu and Chen, Huajun},
journal = {arXiv preprint arXiv:2606.13578},
year = {2026}
}

Our codebase references LeRobot and Liger-Kernel. We sincerely thank their teams for the outstanding contributions to the open-source community.


