The CI is set up with GitHub Actions using the on-demand EC2 backend.
This setup currently uses a 4-GPU instance, p3.8xlarge, to test tp=2, pp=2.
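The instance size follows from the parallelism degrees: a run with tensor parallelism tp and pipeline parallelism pp needs at least tp * pp GPUs. A minimal sketch of that arithmetic, where the function name `min_gpus` and the optional data-parallel degree `dp` are illustrative assumptions and not part of the test suite:

```python
def min_gpus(tp: int, pp: int, dp: int = 1) -> int:
    """Smallest GPU count for a run with the given parallel degrees."""
    if min(tp, pp, dp) < 1:
        raise ValueError("parallel degrees must be >= 1")
    return tp * pp * dp


# tp=2, pp=2 needs 4 GPUs, which is what a p3.8xlarge provides.
print(min_gpus(tp=2, pp=2))
```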
The workflow file is at .github/workflows/main.yml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1
- name: Start EC2 runner
  id: start-ec2-runner
  uses: machulav/ec2-github-runner@v2
  with:
    mode: start
    github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
    ec2-image-id: ami-0dfaabfa78a779fbc
    ec2-instance-type: p3.8xlarge
    subnet-id: subnet-3502b45e
    security-group-id: sg-e8f46d9d
- ec2-image-id is the AMI, which has to be created in, or copied to, the region given by aws-region in the script.
- subnet-id comes from: https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:
- security-group-id comes from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#SecurityGroups:
The workflow was later made fault-tolerant: it tries to start the EC2 instance in 3 different sub-regions, to cope with situations where EC2 reports that it doesn't have the resources to start the desired instance.
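The fallback idea can be sketched as a loop that moves on to the next candidate whenever a start attempt fails for lack of capacity. This is only an illustration of the logic, not the mechanism the workflow file uses: `CapacityError`, `start_with_fallback`, `fake_start` and the subnet names are all hypothetical.

```python
from typing import Callable, Optional, Sequence


class CapacityError(Exception):
    """Hypothetical stand-in for EC2 reporting insufficient capacity."""


def start_with_fallback(
    subnet_ids: Sequence[str],
    start: Callable[[str], str],
) -> str:
    """Try each subnet in order; return the first instance id that starts."""
    last_error: Optional[Exception] = None
    for subnet_id in subnet_ids:
        try:
            return start(subnet_id)
        except CapacityError as err:
            last_error = err  # no capacity here, move on to the next subnet
    raise RuntimeError("no subnet could start the instance") from last_error


def fake_start(subnet_id: str) -> str:
    # Pretend that only the third subnet has capacity.
    if subnet_id != "subnet-c":
        raise CapacityError(subnet_id)
    return "i-started-in-" + subnet_id


print(start_with_fallback(["subnet-a", "subnet-b", "subnet-c"], fake_start))
```

Only capacity failures trigger the fallback here; any other exception propagates immediately, so a misconfiguration is not retried three times.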
To pre-install things, connect to the instance manually and install what's desired:
- choose and start an EC2 instance
- connect to it as ubuntu, then sudo su, as the runner runs as root. I couldn't find a way around it.
ssh -l ubuntu -i "~/.ssh/bigscience-aim.pem" ubuntu@ec2-3-14-127-35.us-east-2.compute.amazonaws.com
Once installed, stop the instance.
Then create a new AMI (see below) and update the script using the new AMI.
Steps used to set up the fixed software (it won't be installed at test time):
- install cuda:
  https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
  https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation
- torch 1.9.0/cu-11.1
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
- all kinds of prerequisites
pip install transformers
wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/requirements/requirements.txt -O requirements-ds.txt
pip install -r requirements-ds.txt
wget https://raw.githubusercontent.com/bigscience-workshop/Megatron-DeepSpeed/main/requirements.txt -O requirements-ms.txt
pip install -r requirements-ms.txt
- apex - needs a hack to deal with mismatching minor cuda versions (and it takes forever to build), so using this patch:
--- a/setup.py
+++ b/setup.py
@@ -99,6 +99,7 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")
     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
+        return
         raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                            "not match the version used to compile Pytorch binaries. " +
                            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
install it:
git clone https://github.com/NVIDIA/apex
cd code/apex
build.sh
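Note that the patch makes the check return on any mismatch, not only a minor one. Before building, it can help to confirm which case the instance is in. The sketch below classifies the difference between two version strings; the helper names are hypothetical, and in practice the system version would come from the nvcc output and the torch one from torch.version.cuda. The "11.4" value is an assumed system version used only for illustration.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Split a version such as "11.1" or "11.4.2" into (major, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def cuda_mismatch(system_cuda: str, torch_cuda: str) -> str:
    """Classify the difference as "none", "minor" or "major"."""
    sys_major, sys_minor = major_minor(system_cuda)
    torch_major, torch_minor = major_minor(torch_cuda)
    if sys_major != torch_major:
        return "major"
    if sys_minor != torch_minor:
        return "minor"
    return "none"


# torch 1.9.0+cu111 was built against CUDA 11.1; "11.4" is an assumed system version.
print(cuda_mismatch("11.4", "11.1"))
```

A "minor" result is the situation the patch is meant for; a "major" result would also pass silently with the patch applied, so it is worth checking by hand.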
Once the needed things are installed (and every time anything new is installed), a new AMI must be created (this is like an .iso image snapshot):
- go to https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
- choose the instance to create a new image from
- Actions -> Create Image
Make sure it's created in the correct region (the same as in the script); otherwise, copy it to the right region.
Finally, once created, the script needs to be updated to that new AMI id (key ec2-image-id)
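Updating the key by hand is easy to get wrong if the workflow contains more than one start step. A minimal sketch of doing it programmatically, assuming the key appears as a plain `ec2-image-id: ami-...` line; the function name `set_ami_id` and the replacement id `ami-0123456789abcdef0` are placeholders, not real values:

```python
import re

# One "ec2-image-id: ami-..." key per line, with optional indentation.
AMI_LINE = re.compile(
    r"^([ \t]*ec2-image-id:[ \t]*)ami-[0-9a-f]+[ \t]*$", re.MULTILINE
)


def set_ami_id(workflow_text: str, new_ami: str) -> str:
    """Return workflow_text with every ec2-image-id value set to new_ami."""
    if not re.fullmatch(r"ami-[0-9a-f]+", new_ami):
        raise ValueError("not an AMI id: " + new_ami)
    updated, count = AMI_LINE.subn(lambda m: m.group(1) + new_ami, workflow_text)
    if count == 0:
        raise ValueError("no ec2-image-id key found")
    return updated


sample = (
    "with:\n"
    "  mode: start\n"
    "  ec2-image-id: ami-0dfaabfa78a779fbc\n"
    "  ec2-instance-type: p3.8xlarge\n"
)
print(set_ami_id(sample, "ami-0123456789abcdef0"))
```

To apply it to the real file, read .github/workflows/main.yml, pass its text through the function, and write the result back.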
- Setup guide: https://github.com/machulav/ec2-github-runner
- Launching an EC2 instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html?icmpid=docs_ec2_console
- EC2 concepts: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
- All available instances: https://aws.amazon.com/ec2/instance-types/