CI setup

The CI is setup with github actions using the on-demand EC2 backend.

This setup currently uses a 4gpu instance p3.8xlarge - to test tp=2, pp=2.

The workflow file

The workflow file is at .github/workflows/main.yml

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0dfaabfa78a779fbc
          ec2-instance-type: p3.8xlarge
          subnet-id: subnet-3502b45e
          security-group-id: sg-e8f46d9d

ec2-image-id is the AMI, which has to be created, or copied to the corresponding aws-region region the script requests.
subnet-id comes from: https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:
security-group-id comes from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#SecurityGroups:

It was later updated to use a fault-tolerant solution by trying to start the EC2 on 3 different sub-regions to cope with situations where EC2 reports it doesn't have resources to start the desired instance.

Connect to instance

To pre-install things connect to the instance manually and install what's desired

choose and start an EC2 instance
connect to it as ubuntu, then sudo su as the runner runs as root. I couldn't find a way around it.

ssh -l ubuntu -i "~/.ssh/bigscience-aim.pem" ubuntu@ec2-3-14-127-35.us-east-2.compute.amazonaws.com

Once installed, stop the instance.

Then create a new AMI (see below) and update the script using the new AMI.

Prepare the machine

Steps used to setup fixed software (won't be installed at test time)

install cuda: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation

install fixed packages

torch 1.9.0/cu-11.1

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

all kinds of prerequisites

pip install transformers
wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/requirements/requirements.txt -O requirements-ds.txt
pip install -r requirements-ds.txt
wget https://raw.githubusercontent.com/bigscience-workshop/Megatron-DeepSpeed/main/requirements.txt -O requirements-ms.txt
pip install -r requirements-ms.txt

apex - needs a hack to deal with mismatching minor cuda versions (and it takes forever to build), so using this patch:

--- a/setup.py
+++ b/setup.py
@@ -99,6 +99,7 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
+        return
         raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                            "not match the version used to compile Pytorch binaries.  " +
                            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +

install it:

git clone https://github.com/NVIDIA/apex
cd code/apex
build.sh

make a new AMI image

Once the needed things got installed (and every time anything new is installed) a new AMI must created (this is like an .iso image snapshot)

go to https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
choose the image to create a new image from
Actions -> Create Image

Must ensure it's created in the correct region (same as in script) - or can copy it to the right region.

Finally, once created, the script needs to be updated to that new AMI id (key ec2-image-id)

Guides

Set up guide: https://github.com/machulav/ec2-github-runner

Launching an EC2 instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html?icmpid=docs_ec2_console

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

All available instances: https://aws.amazon.com/ec2/instance-types/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI setup

The workflow file

Connect to instance

Prepare the machine

install fixed packages

make a new AMI image

Guides

FilesExpand file tree

ci.md

Latest commit

History

ci.md

File metadata and controls

CI setup

The workflow file

Connect to instance

Prepare the machine

install fixed packages

make a new AMI image

Guides