This example uses ORTModule to fine-tune several popular HuggingFace models.
- Clone this repo and initialize the git submodules
git clone https://github.com/microsoft/onnxruntime-training-examples.git
cd onnxruntime-training-examples
git submodule update --init --recursive
git submodule foreach git pull origin main
- Make sure Python 3.8+ is installed
We recommend using conda to manage the Python environment. If you do not have conda installed, you can follow the instruction to install conda here. Once conda is installed, create a new Python environment with
conda create --name myenv python=3.8
- Install azureml-core
Activate the conda environment just created.
conda activate myenv
Install the azureml dependency for script submission.
pip install azureml-core
- An AzureML subscription is required to run this example. Either a config.json file (How to get config.json file from Azure Portal) or the subscription_id, resource_group, and workspace_name information needs to be passed in through parameters.
- The subscription should have a GPU cluster. This example was tested with a GPU cluster of SKU Standard_ND40rs_v2. See this document for creating a GPU cluster.
Download the config.json file from 2.1 to the huggingface/script directory. Alternatively, append the AzureML workspace information to the run command below, for example --workspace_name <your_workspace_name> --resource_group <resource_group> --subscription_id <your_subscription_id>.
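The second option above can be sketched in Python. This is a minimal illustration, not part of hf-ort.py: it assumes config.json holds the keys subscription_id, resource_group, and workspace_name, and the helper name workspace_flags is hypothetical.

```python
import json
from pathlib import Path


def workspace_flags(config_path):
    """Return the workspace flags for hf-ort.py, built from an AzureML config.json.

    Assumption: the file contains subscription_id, resource_group and
    workspace_name. This helper is hypothetical and only illustrates the mapping.
    """
    config = json.loads(Path(config_path).read_text())
    flags = []
    for key in ("workspace_name", "resource_group", "subscription_id"):
        if key not in config:
            raise KeyError(f"config.json is missing '{key}'")
        # Each key maps to a flag of the same name, e.g. --workspace_name <value>.
        flags += [f"--{key}", str(config[key])]
    return flags
```

The returned list can be appended to the argument list of the run command when config.json is not copied into huggingface/script.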
Here's an example that runs bert-large with ORTModule. hf-ort.py builds a docker image based on the dockerfile and submits the run script to AzureML according to the model and run configuration. The default docker image uses CUDA 11.1.
cd huggingface/script
python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model bert-large --run_config ort
To run different models with different configurations, check the tables below.
This table summarizes whether model changes are required.
| Model | Performance Comparison | Model Change |
|---|---|---|
| bart-large | See BART | No model change required |
| bert-large | See BERT | No model change required |
| deberta-v2-xxlarge | See DeBERTa | See this commit |
| distilbert-base | See DistilBERT | No model change required |
| gpt2 | See GPT2 | No model change required |
| roberta-large | See RoBERTa | See this commit |
| t5-large | See T5 | See this PR |
Here are the different configs, with descriptions, that the recipe script takes through the --run_config parameter.
| Config | Description |
|---|---|
| pt-fp16 | PyTorch mixed precision |
| ort | ORTModule mixed precision |
| ds_s1 | PyTorch + Deepspeed stage 1 |
| ds_s1_ort | ORTModule + Deepspeed stage 1 |
Other parameters. Please also see the parameters in script/hf-ort.py.
| Name | Description |
|---|---|
| --model_batchsize | Model batchsize per GPU |
| --max_steps | Max step that a model will run |
| --process_count | Total number of GPUs (not GPUs per node). Adjust this if the target cluster does not have 8 GPUs |
| --node_count | Node count |
| --skip_docker_build | Skip docker build (use last built docker saved in AzureML environment) |
| --use_cu102 | Use Cuda 10.2 dockerfile |
| --local_run | Run the model locally, azureml related parameters will be ignored |
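To show how these parameters fit together, here is a small sketch that assembles an hf-ort.py command line from the flags in the tables above. The helper build_command and its gpus_per_node argument are hypothetical; the one piece of logic it encodes is that --process_count is the total GPU count, that is, nodes times GPUs per node.

```python
import shlex

# Values copied from the --run_config table in this README.
RUN_CONFIGS = ("pt-fp16", "ort", "ds_s1", "ds_s1_ort")


def build_command(hf_model, run_config, node_count=1, gpus_per_node=8,
                  model_batchsize=None, max_steps=None, local_run=False):
    """Assemble an hf-ort.py command line from the documented flags.

    Hypothetical helper for illustration; gpus_per_node is not an hf-ort.py flag.
    """
    if run_config not in RUN_CONFIGS:
        raise ValueError(f"unknown run_config: {run_config}")
    # --process_count is the total number of GPUs across all nodes.
    process_count = node_count * gpus_per_node
    args = ["python", "hf-ort.py",
            "--hf_model", hf_model,
            "--run_config", run_config,
            "--node_count", str(node_count),
            "--process_count", str(process_count)]
    if model_batchsize is not None:
        args += ["--model_batchsize", str(model_batchsize)]
    if max_steps is not None:
        args += ["--max_steps", str(max_steps)]
    if local_run:
        args.append("--local_run")
    return shlex.join(args)
```

For example, two nodes with 8 GPUs each give --process_count 16, while a single-GPU local machine would use node_count=1, gpus_per_node=1, and local_run=True. For an AzureML submission, --gpu_cluster_name still has to be added to the command.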
- Benchmark methodology: We report samples/sec on ND40rs_v2 VMs (V100 32G x 8), CUDA 11, with stable release onnxruntime_training-1.8.0%2Bcu111-cp36-cp36m-manylinux2014_x86_64.whl. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check the dependency details in the Dockerfile. We look at the metric stable_train_samples_per_second in the log, which discards the first step because it includes setup time. Also note that since ORTModule takes some time to do its initial setup, a smaller --max_steps value may lead to a longer total run time for ORTModule compared to PyTorch. However, if you want finetuning to finish faster, adjust --max_steps to a smaller value. Lastly, we do not recommend running this recipe on [NC] series VMs, which use an old architecture (K80).
- Cost and VM availability: The finetuning job runs for ~1hr for the default 8000 steps on ND40rs_v2 VMs, which costs $22.03/hr per run. Additional costs are Azure Container Registry costs for docker image storage, as well as Azure Storage costs for run history storage. Please note that ND40rs_v2 is not publicly available by default. To get it, after the subscription is created, users need to create a support ticket here; the ND series will then be available.
- On first run, this script takes ~20 mins to submit the finetuning job because it builds a new docker image from the Dockerfile. The step that builds the docker image, hf_ort_env.register(ws).build(ws).wait_for_completion(), can be skipped by passing --skip_docker_build if not running for the first time.
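The two numeric ideas in these notes can be illustrated with a short sketch. It is not the code that produces stable_train_samples_per_second in the log; it only shows, under the assumption that per-step wall-clock times are available, why dropping the first step matters, and how the quoted hourly rate translates into a VM cost. Both function names are hypothetical.

```python
def stable_samples_per_second(step_seconds, samples_per_step):
    """Throughput over all steps except the first, which includes setup time.

    step_seconds: wall-clock duration of each training step, in order.
    samples_per_step: samples processed per step across all GPUs.
    """
    stable = step_seconds[1:]
    if not stable:
        raise ValueError("need at least two steps to discard the first one")
    return samples_per_step * len(stable) / sum(stable)


def estimated_vm_cost(run_hours, hourly_rate=22.03):
    """VM cost in dollars, defaulting to the ND40rs_v2 rate quoted above.

    Registry and storage costs are not included.
    """
    return run_hours * hourly_rate
```

With step times of 30 s, 0.5 s, and 0.5 s and 64 samples per step, the stable figure is 128 samples/sec, whereas including the slow first step would report roughly 6 samples/sec. This is also why short runs with a small --max_steps understate the benefit of ORTModule.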
- A machine with a GPU that you can access. This recipe was tested on a machine with 8 x 32G V100 GPUs.
- Know how many GPUs are available. This number needs to be passed to the --process_count parameter.
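If you are unsure how many GPUs the machine has, one way to find out is to count the lines printed by nvidia-smi -L, which lists one "GPU n: ..." line per device. The sketch below assumes the NVIDIA driver tools are installed; the function names are hypothetical and not part of this recipe.

```python
import shutil
import subprocess


def count_gpus(listing):
    """Count GPUs in the text printed by `nvidia-smi -L` (one 'GPU n: ...' line each)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def local_gpu_count():
    """Return the number of local NVIDIA GPUs, or 0 if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"],
                            capture_output=True, text=True, check=False)
    return count_gpus(result.stdout)
```

The value returned by local_gpu_count() is what you would pass to --process_count for a local run.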
Build docker image.
cd huggingface/docker
sudo docker build -t hf-recipe-local-docker -f Dockerfile .
Run built docker image
- Replace <onnxruntime-training-examples_path> with your local full path to onnxruntime-training-examples
  - Usually it's located at ~/onnxruntime-training-examples/
- -v /dev/shm:/dev/shm mounts /dev/shm to /dev/shm inside docker. Similarly, -v <onnxruntime-training-examples_path>:/onnxruntime-training-examples mounts <onnxruntime-training-examples_path> to /onnxruntime-training-examples/ inside docker
sudo docker run -it -v /dev/shm:/dev/shm -v <onnxruntime-training-examples_path>:/onnxruntime-training-examples --gpus all hf-recipe-local-docker
Run hf-ort.py script
- Remember to pass the number of GPUs available locally to the --process_count parameter
- Depending on the memory available on the local GPU, you might need to override the default batch size by passing in --model_batchsize
- --local_run runs the script locally
cd /onnxruntime-training-examples/huggingface/script/
python hf-ort.py --hf_model {hf_model} --run_config {run_config} --process_count <process_count> --local_run
If there's an Azure authentication issue, install Azure CLI here and run az login --use-device-code
The issue is most likely caused by hitting a hardware limitation on the target. It can be mitigated by using the following switches:
--model_batchsize - Change to smaller batchsize
--process_count - Change the number of GPUs to activate
python hf-ort.py --hf_model bart-large --run_config pt-fp16 --process_count 1 --local_run --model_batchsize 1 --max_steps 20
RoBERTa & DeBERTa are currently decommissioned from the hf-ort.py script because of unresolved issues.
RoBERTa currently requires ORT >= 1.12.0 according to this issue (#11268) which was resolved in ORT 1.12.0. However, running with ORT 1.12.0 with the PTCA Docker container and on the specified machine for benchmarking causes this issue (#12312).
DeBERTa has the following unresolved issues when using Optimum's ORTTrainer: #15 and #305