GitHub - zjunlp/LabVLA: LabVLA: Grounding Vision–Language–Action Models in Scientific Laboratories

The First Vision-Language-Action Foundation Model for Scientific Laboratories

LabVLA turns a Qwen3-VL-4B-Instruct vision–language backbone into a real-time robot controller through a DiT flow-matching action expert, trained with the π0.5 recipe: FAST action-token pre-training → flow-matching post-training with knowledge insulation → task fine-tuning. This README covers installation, training, and deployment — method details are in the paper.

✨ Features • 📋 TODO • 📦 Installation • 🚀 Quick Start • 🎓 Training • 🔧 Fine-tuning • 📡 Deployment • 📝 Citation

✨ Features

🎓 The recipe — every stage in one framework

Mode	What it does
VLM pre-training	FAST action-token cross-entropy on the VLM backbone.
Flow-matching post-training	Trains the DiT action expert to generate 50-step continuous action chunks.
Knowledge Insulation (KI)	Stop-gradient between the VLM and the action expert, so action-expert gradients do not update the backbone.
Task fine-tuning	Fine-tuning for downstream tasks.
Multi-dataset & VQA co-training	π0-style mixture with homogeneous batches.
delta / abs action modes	Per-dimension `delta_mask` — arm joints delta, gripper absolute, in one vector.
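
A minimal sketch of how a per-dimension `delta_mask` could mix delta and absolute targets in one action vector. The 7-dimensional layout (six arm joints plus one gripper) and the helper names `to_mixed_action` and `to_absolute_action` are assumptions for illustration, not taken from the LabVLA code:

```python
from typing import Sequence


def to_mixed_action(
    target: Sequence[float],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> list[float]:
    """Encode masked dimensions as (target - state); keep the rest absolute."""
    if not (len(target) == len(state) == len(delta_mask)):
        raise ValueError("target, state and delta_mask must have equal length")
    return [t - s if m else t for t, s, m in zip(target, state, delta_mask)]


def to_absolute_action(
    action: Sequence[float],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> list[float]:
    """Invert to_mixed_action: add the state back on the delta dimensions."""
    if not (len(action) == len(state) == len(delta_mask)):
        raise ValueError("action, state and delta_mask must have equal length")
    return [a + s if m else a for a, s, m in zip(action, state, delta_mask)]


# Hypothetical layout: six arm joints (delta) followed by one gripper (absolute).
delta_mask = [True] * 6 + [False]
state = [0.5, 1.0, -0.5, 0.0, 0.25, 2.0, 0.0]
target = [0.75, 1.0, -1.0, 0.5, 0.25, 1.5, 1.0]

mixed = to_mixed_action(target, state, delta_mask)
restored = to_absolute_action(mixed, state, delta_mask)
```

Here the first six entries of `mixed` are joint offsets relative to the current state, while the last entry is the commanded gripper value itself. `to_absolute_action` recovers the original targets at execution time.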

⚙️ Engineering

🚀 Efficiency: selective gradient checkpointing (only a subset of modules, e.g. the visual encoder or the language model, is checkpointed per stage), Liger-Kernel fused ops, DeepSpeed ZeRO-2, and EMA offload together sustain a per-GPU batch size of 64 for pre-training and post-training (48 for task fine-tuning) on 80 GB A100 GPUs with minimal speed penalty.

Stage	A100 80 GB GPUs	Batch size / GPU	Global batch size	Time / step (s)
VLM Pre-training	24 (3 × 8)	64	1 536	≈ 7
KI Post-training	16 (2 × 8)	64	1 024	≈ 5
Task Fine-tuning	4	48	192	≈ 3
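
The global batch size in the table above follows from the GPU count and the per-GPU batch size. A quick check, assuming plain data parallelism with no gradient accumulation (the numbers are consistent with that, but the README does not state it); `global_batch_size` is an illustrative helper, not part of the repository:

```python
def global_batch_size(num_gpus: int, per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Samples consumed per optimizer step under data parallelism."""
    return num_gpus * per_gpu_batch * grad_accum_steps


# (number of GPUs, per-GPU batch size) as listed in the table.
stages = {
    "VLM Pre-training": (24, 64),
    "KI Post-training": (16, 64),
    "Task Fine-tuning": (4, 48),
}

totals = {name: global_batch_size(gpus, bs) for name, (gpus, bs) in stages.items()}
for name, total in totals.items():
    print(f"{name}: {total}")
```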

📋 TODO

Model weights on Hugging Face
Inference & deployment code
Training & fine-tuning code
RoboGenesis & labembodied-data — coming soon

The full training, post-training, and fine-tuning code is now available — see Training. RoboGenesis and labembodied-data are being organized and will follow soon.

📦 Installation

Python 3.10 · CUDA 12.6 · PyTorch 2.7.1 — pinned versions in requirements.txt.

conda create -n labvla python=3.10 -y && conda activate labvla

# 1. PyTorch (CUDA 12.6)  →  2. FlashAttention (built against it)  →  3. everything else
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install flash_attn==2.8.3 --no-build-isolation
pip install -r requirements.txt

🚀 Quick Start

1. Download the model from Hugging Face:

huggingface-cli download zjunlp/LabVLA-5B-Base --local-dir LabVLA-5B-Base

2. Deploy — start a WebSocket inference server:

PRETRAINED_PATH=LabVLA-5B-Base bash deployment/deploy.sh

3. Evaluate — connect your robot or simulator client to the server and run rollouts. See Deployment for configuration details.

🎓 Training

One entrypoint (scripts/train.py) for all stages, launched through Accelerate + DeepSpeed ZeRO-2 (bf16). Edit the variables at the top of each launch script before running.

1 · Prepare data

python -m data_process scan  --root /path/to/dataset --out /tmp/report.json     # detect bad episodes
python -m data_process clean --src  /path/to/dataset --dst /path/to/clean \
                             --report /tmp/report.json                          # apply report (symlink copy)
python -m data_process stats --dataset /path/to/clean --schema robointer_droid  # normalization stats

Subcommand	Purpose
`scan`	Detect bad episodes (corrupt video, decode failures, missing files).
`clean`	Apply a scan report as a renumbered symlink copy; originals untouched.
`stats`	Compute normalization statistics → `meta/stats.json`.
`validate`	Cross-repo integrity checks.
`preflight`	Gate a launch on HIGH/CRIT issues before it starts.
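
To make the role of the normalization statistics concrete, here is a small sketch of per-dimension statistics and their use. The key names (`mean`, `std`, `min`, `max`) and both helper functions are assumptions for illustration. The actual layout of `meta/stats.json` is defined by the `stats` subcommand and may differ:

```python
import statistics


def compute_stats(rows):
    """Per-dimension statistics over a list of equal-length vectors."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],
        "min": [min(c) for c in columns],
        "max": [max(c) for c in columns],
    }


def normalize(vector, stats, eps=1e-8):
    """Z-score each dimension; eps guards against constant dimensions."""
    return [
        (x - m) / (s + eps)
        for x, m, s in zip(vector, stats["mean"], stats["std"])
    ]


# Toy data: three 2-dimensional action vectors; the second dimension is constant.
rows = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
stats = compute_stats(rows)
centered = normalize([2.0, 10.0], stats)
```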

Each dataset is described by a DatasetSchema that is registered automatically (src/schema/). To support a new dataset, add a module under schemas/ or pass --dataset_schema /abs/path.py.

2 · VLM Pre-training

Train the Qwen3-VL backbone on FAST action-token cross-entropy, with robot state discretized into the prompt. Defaults to a joint mixture of robointer_droid_clean, oxe-auge_clean, RoboInter-VQA, and agibot_world (π0-style n^0.43 volume weighting).

bash launch/vlm_pretrain/train_vlm_pretrain.sh
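
The n^0.43 volume weighting and the homogeneous batches can be sketched as follows. The dataset names and sample counts are placeholders, and `mixture_weights` and `sample_batch_sources` are hypothetical helpers, so treat this as an illustration of the idea and not as the repository's sampler:

```python
import random


def mixture_weights(sizes, alpha=0.43):
    """Sampling probability proportional to n**alpha for each dataset."""
    scaled = {name: n ** alpha for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: w / total for name, w in scaled.items()}


def sample_batch_sources(weights, num_batches, seed=0):
    """Pick one source dataset per batch, so every batch is homogeneous."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    return rng.choices(names, weights=probs, k=num_batches)


# Placeholder sample counts, for illustration only.
sizes = {"dataset_a": 1_000_000, "dataset_b": 100_000, "dataset_c": 10_000}
weights = mixture_weights(sizes)
sources = sample_batch_sources(weights, num_batches=8)

for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Because the exponent is below 1, large datasets are down-weighted relative to proportional sampling: a dataset ten times larger is drawn only about 2.7 times as often (10^0.43), which keeps small datasets from being drowned out.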

3 · Flow-Matching Post-training (Knowledge Insulation)

Train the DiT action expert to generate 50-step continuous action chunks, with stop-gradient between the VLM and the DiT.

bash launch/ki_posttrain/train_ki_posttrain.sh
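
For readers new to flow matching, the sketch below shows the training target and the sampling loop on a toy 50-step chunk. It assumes a straight-line path from noise (t = 0) to the action chunk (t = 1), a 7-dimensional action, and 10 Euler steps. These conventions, the function names, and the step count are illustrative assumptions; the repository's DiT expert and its time convention may differ. The stop-gradient of knowledge insulation is not shown.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Constant velocity of that path: the regression target for the expert."""
    return [a - n for n, a in zip(noise, action)]


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy chunk: 50 time steps x 7 action dimensions, flattened into one list.
rng = random.Random(0)
horizon, action_dim = 50, 7
action = [rng.uniform(-1.0, 1.0) for _ in range(horizon * action_dim)]
noise = [rng.gauss(0.0, 1.0) for _ in range(horizon * action_dim)]


def oracle(x, t):
    """Stand-in for a perfectly trained expert: returns the true velocity."""
    return target_velocity(noise, action)


sampled = euler_sample(oracle, noise, num_steps=10)
max_error = max(abs(s - a) for s, a in zip(sampled, action))
```

During training, the expert would see `interpolate(noise, action, t)` at a random `t` and be regressed onto `target_velocity(noise, action)`. At inference, a learned network takes the place of `oracle`.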

Note: LabVLA is not limited to the datasets listed above. Extend it to any new dataset by adding a DatasetSchema under schemas/ — the full pipeline (VLM pre-training, post-training, and fine-tuning) works out of the box.

🔧 Fine-tuning

Fine-tune a post-trained checkpoint for a downstream task; this does not require the data-preparation step above.

# Edit launch/finetune/train_labutopia.sh: set PretrainedCkpt, DataRoot,
# RepoIds, DatasetSchema, and ExternalStatsPath to point at your task data.
bash launch/finetune/train_labutopia.sh

📡 Deployment

Download LabVLA from Hugging Face (see Quick Start), then start the WebSocket inference server with the deployment script:

bash deployment/deploy.sh

📝 Citation

@article{ren2026labvla,
  title   = {LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories},
  author  = {Ren, Baochang and Liu, Xinjie and Chen, Xi and Liu, Yanshuo and
             Li, Chenxi and Gao, Daqi and Su, Zeqin and Xing, Jintao and
             Xue, Zirui and Li, Rui and Zhao, Xiangyu and Qiao, Shuofei and
             Pan, Minting and Zuo, Wangmeng and Bai, Lei and Zhou, Dongzhan and
             Zhang, Ningyu and Chen, Huajun},
  journal = {arXiv preprint arXiv:2606.13578},
  year    = {2026}
}

🙏 Acknowledgments

Our codebase draws on LeRobot and Liger-Kernel. We sincerely thank their teams for their outstanding contributions to the open-source community.
