Commit ee0919e

Mi355 (#455)

Mark Saroufim authored
* Add MI355X GPU support for AMD GitHub runner

  Add MI355X to GitHubGPU enum, GPU_TO_SM mapping, and github launcher runner routing with runner label mia1-p02-g29.

* Use amd-runner Docker container for MI355X workflow

  Add container image ghcr.io/gpu-mode/amd-runner:main with GPU device passthrough to amd_workflow.yml. Add numpy to AMD_REQUIREMENTS.

* Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps

  - Upgrade ROCm from 6.3.1 to 7.2
  - Upgrade PyTorch to nightly rocm7.2
  - Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
  - Remove UCX, OpenMPI, and rocSHMEM builds (no longer needed)

* Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index

* Fix container permissions: run as root for GitHub Actions compatibility

* Revert "Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index"

  This reverts commit bb5f2ee.

* Revert "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

  This reverts commit bdc4523.

* Simplify AMD workflow for MI355X: use container deps, skip requirements install

* Reapply "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

  This reverts commit e09a2cd.

* Update AMD Dockerfile to ROCm 7.1 stable, latest aiter, remove multi-GPU deps

  - Upgrade ROCm from 6.3.1 to 7.1 (stable, matches host ROCm 7.0.1)
  - Use stable torch 2.10.0+rocm7.1 instead of nightly
  - Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
  - Remove UCX, OpenMPI, and rocSHMEM builds

* Use mia1-p02-g29 runner to build AMD Docker image

* Add workspace cleanup step before checkout in AMD Docker build

  Fixes EACCES errors from root-owned files left by previous container runs.

* Remove workspace cleanup step from AMD Docker build

* Use GITHUB_TOKEN instead of PUBLISH_TOKEN for ghcr.io login

* Fix Dockerfile for Ubuntu 24.04 (Noble) base image

  - Replace python3.10 packages with python3 equivalents
  - Use noble ROCm package instead of jammy
  - Add --break-system-packages for pip on Noble
  - Remove git-core PPA (not needed on Noble)
  - Remove linux-headers install (not available during build)

* Remove pip upgrade step (incompatible with Noble system pip)

* Use amd-runner:mi355 Docker image with working aiter + ROCm

* Fix pip install: add --break-system-packages for container environment

* Update amd-docker.Dockerfile

* Set minimum GitHub timeout to DEFAULT_GITHUB_TIMEOUT_MINUTES

  Ensures the workflow timeout is at least 30 minutes to account for Docker image pulls and container initialization on new runners.
1 parent 87f3db8 commit ee0919e

5 files changed

Lines changed: 29 additions & 102 deletions

.github/workflows/amd_workflow.yml

Lines changed: 6 additions & 23 deletions
@@ -13,7 +13,7 @@ on:
       runner:
         description: 'AMD runner to run workflow on'
         required: true
-        default: "amdgpu-mi300-x86-64"
+        default: "mia1-p02-g29"
         type: string
       requirements:
         description: 'Contents for a requirements.txt file'
@@ -25,6 +25,9 @@ run-name: 'AMD Job - ${{ github.event.inputs.run_id }}'
 jobs:
   run:
     runs-on: ${{ github.event.inputs.runner }}
+    container:
+      image: ghcr.io/gpu-mode/amd-runner:mi355
+      options: --user root --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 64G
     strategy:
       fail-fast: false
     timeout-minutes: 20
@@ -42,34 +45,14 @@ jobs:
         # Now write to file (won't be logged since it's masked)
         echo "$PAYLOAD" > payload.json
 
-    - name: Set venv directory based on runner
-      run: |
-        if [[ "${{ github.event.inputs.runner }}" == "amdgpu-mi250-x86-64" ]]; then
-          echo "VENV_DIR=/groups/aig_sharks/pytorch_venv" >> $GITHUB_ENV
-        fi
-
-    - name: Setup Virtual Environment and Install Dependencies
+    - name: Install kernelbot
       shell: bash
       run: |
-        if [[ "${{ github.event.inputs.runner }}" == "amdgpu-mi250-x86-64" ]]; then
-          python -m venv ${VENV_DIR}
-          source ${VENV_DIR}/bin/activate
-        fi
-        pip install --upgrade pip
-        if [[ -n "${{ github.event.inputs.requirements }}" ]]; then
-          cat > "requirements.txt" <<'EOL'
-        ${{ github.event.inputs.requirements }}
-        EOL
-          pip install -r "requirements.txt"
-        fi
-        pip install -e .
+        pip install --break-system-packages -e .
 
     - name: Run script
       shell: bash
       run: |
-        if [[ "${{ github.event.inputs.runner }}" == "amdgpu-mi250-x86-64" ]]; then
-          source ${VENV_DIR}/bin/activate
-        fi
         python3 src/runners/github-runner.py
 
     - name: Upload training artifacts
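
The `container.options` added above pass `/dev/kfd` and `/dev/dri` into the job container. A minimal sketch of how a job could confirm that those device nodes are visible before running a submission follows. The helper names `check_gpu_devices` and `missing_devices` are hypothetical and do not appear in this commit; only the two device paths come from the workflow.

```python
import os
from typing import Callable, Dict, Iterable, List

# Device nodes passed through by the workflow's `container.options`.
EXPECTED_DEVICE_PATHS = ("/dev/kfd", "/dev/dri")


def check_gpu_devices(
    paths: Iterable[str] = EXPECTED_DEVICE_PATHS,
    exists: Callable[[str], bool] = os.path.exists,
) -> Dict[str, bool]:
    """Map each device path to whether it is visible in this environment.

    Hypothetical helper for illustration only. `exists` is injectable so
    the check can be exercised on a machine without an AMD GPU.
    """
    return {path: exists(path) for path in paths}


def missing_devices(status: Dict[str, bool]) -> List[str]:
    """Return the device paths that were not found, sorted by name."""
    return sorted(path for path, found in status.items() if not found)


if __name__ == "__main__":
    missing = missing_devices(check_gpu_devices())
    if missing:
        print("missing device nodes: " + ", ".join(missing))
    else:
        print("all expected device nodes are visible")
```

Running a check like this as an early workflow step would surface a missing `--device` flag as a clear message rather than as a later failure inside the submission.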

.github/workflows/publish_amd_docker.yml

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ env:
 
 jobs:
   build-and-push-image:
-    runs-on: amd-docker
+    runs-on: mia1-p02-g29
     # Sets the permissions granted to the `PUBLISH_TOKEN` for the actions in this job.
     permissions:
       contents: read
@@ -23,7 +23,7 @@ jobs:
         with:
           registry: ${{ env.REGISTRY }}
           username: ${{ github.actor }}
-          password: ${{ secrets.PUBLISH_TOKEN }}
+          password: ${{ secrets.GITHUB_TOKEN }}
       - name: Extract metadata (tags, labels) for Docker
         id: meta
         uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7

docker/amd-docker.Dockerfile

Lines changed: 14 additions & 75 deletions
@@ -1,14 +1,10 @@
 FROM ghcr.io/actions/actions-runner:latest
 
 ENV CXX=clang++
-ENV UCX_CXX=g++
-ENV UCX_CC=gcc
 
 RUN sudo apt-get update -y \
-    && sudo apt-get install -y software-properties-common \
-    && sudo add-apt-repository -y ppa:git-core/ppa \
-    && sudo apt-get update -y \
     && sudo apt-get install -y --no-install-recommends \
+        software-properties-common \
         curl \
         ca-certificates \
         git \
@@ -22,100 +18,43 @@ RUN sudo apt-get update -y \
         lld \
         wget \
         psmisc \
-        python3.10-venv \
+        python3-venv \
+        python3-pip \
+        python3-setuptools \
+        python3-wheel \
+        python3-dev \
     && sudo rm -rf /var/lib/apt/lists/*
 
-RUN sudo apt-get update && sudo apt-get install -y python3.10 python3-pip python-is-python3 python3-setuptools python3-wheel libpython3.10
-
 RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash && \
     sudo apt-get install git-lfs
 
 RUN sudo groupadd -g 109 render
 
 RUN sudo apt update -y \
-    && sudo apt install -y "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)" \
     && sudo usermod -a -G render,video runner \
-    && wget https://repo.radeon.com/amdgpu-install/6.3.1/ubuntu/jammy/amdgpu-install_6.3.60301-1_all.deb \
-    && sudo apt install -y ./amdgpu-install_6.3.60301-1_all.deb \
+    && wget https://repo.radeon.com/amdgpu-install/7.1/ubuntu/noble/amdgpu-install_7.1.70100-1_all.deb \
+    && sudo apt install -y ./amdgpu-install_7.1.70100-1_all.deb \
     && sudo apt update -y \
     && sudo apt install -y rocm
 
-RUN sudo pip install --upgrade pip
+ENV ROCM_PATH=/opt/rocm
 
-RUN sudo pip install --no-cache-dir torch==2.10.0.dev20250916+rocm6.3 pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.3
+RUN sudo pip install --break-system-packages --no-cache-dir torch==2.10.0+rocm7.1 --index-url https://download.pytorch.org/whl/rocm7.1
 
 RUN git clone --recursive https://github.com/ROCm/aiter.git \
     && cd aiter \
-    && git checkout 1d88633958236e942cba3c283864282f7af3ebc5 \
-    && sudo pip install -r requirements.txt \
+    && git checkout f3be04a12a0cfd6b5e2c7a94edc774f1bc24460d \
+    && sudo pip install --break-system-packages -r requirements.txt \
     && sudo python3 setup.py develop
 
 RUN sudo mkdir -p /home/runner/aiter/aiter/jit/build \
     && sudo chown -R runner:runner /home/runner/aiter/aiter/jit/build
 
-RUN sudo pip install \
+RUN sudo pip install --break-system-packages \
     ninja \
     numpy \
     packaging \
     wheel \
-    tinygrad
-
-RUN sudo pip install git+https://github.com/ROCm/iris.git
-
-RUN sudo apt-get update -y \
-    && sudo apt-get install -y --no-install-recommends \
-        autoconf \
-        automake \
-        libtool \
-        pkg-config \
-        build-essential \
-        gfortran \
-        flex \
-        bison \
-        libomp-dev \
-        libhwloc-dev \
-        libnuma-dev \
-    && sudo rm -rf /var/lib/apt/lists/*
-
-ENV UCX_INSTALL_DIR=/opt/ucx
-ENV OMPI_INSTALL_DIR=/opt/openmpi
-ENV ROCSHMEM_INSTALL_DIR=/opt/rocshmem
-ENV ROCM_PATH=/opt/rocm
-
-RUN cd /tmp \
-    && git clone https://github.com/openucx/ucx.git -b v1.17.x \
-    && cd ucx \
-    && ./autogen.sh \
-    && CC=gcc CXX=g++ ./configure --prefix=${UCX_INSTALL_DIR} --with-rocm=${ROCM_PATH} --enable-mt --disable-optimizations \
-    && make -j$(nproc) \
-    && sudo make install \
-    && cd / \
-    && sudo rm -rf /tmp/ucx
-
-RUN cd /tmp \
-    && git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x \
-    && cd ompi \
-    && ./autogen.pl \
-    && ./configure --prefix=${OMPI_INSTALL_DIR} --with-rocm=${ROCM_PATH} --with-ucx=${UCX_INSTALL_DIR} \
-    && make -j$(nproc) \
-    && sudo make install \
-    && cd / \
-    && sudo rm -rf /tmp/ompi
-
-ENV PATH="${OMPI_INSTALL_DIR}/bin:${PATH}"
-ENV LD_LIBRARY_PATH="${OMPI_INSTALL_DIR}/lib:${UCX_INSTALL_DIR}/lib:/opt/rocm/lib"
-
-
-RUN cd /tmp \
-    && git clone https://github.com/ROCm/rocSHMEM.git \
-    && cd rocSHMEM \
-    && mkdir build \
-    && cd build \
-    && MPI_ROOT=${OMPI_INSTALL_DIR} UCX_ROOT=${UCX_INSTALL_DIR} CMAKE_PREFIX_PATH="${ROCM_PATH}:$CMAKE_PREFIX_PATH" \
-    sudo ../scripts/build_configs/ipc_single -DCMAKE_INSTALL_PREFIX=/opt/rocshmem \
-    && cd / \
-    && sudo rm -rf /tmp/rocSHMEM
 
 
-ENV ROCSHMEM_INSTALL_DIR=${ROCSHMEM_INSTALL_DIR}
-ENV LD_LIBRARY_PATH="${ROCSHMEM_INSTALL_DIR}/lib:${LD_LIBRARY_PATH}"
+ENV LD_LIBRARY_PATH="/opt/rocm/lib"
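
The Dockerfile now pins `torch==2.10.0+rocm7.1`, where the part after `+` is the local version label naming the ROCm build. A small sketch of how that label could be checked against the expected ROCm release is given below. The function names `split_local_version` and `is_rocm_build` are hypothetical and are not part of this commit; the version strings are the ones in the diff.

```python
from typing import Optional, Tuple


def split_local_version(version: str) -> Tuple[str, Optional[str]]:
    """Split 'public+local' into its two parts.

    Returns (public, None) when the version has no '+local' suffix.
    Hypothetical helper for illustration only.
    """
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def is_rocm_build(version: str, rocm_release: str) -> bool:
    """Check that the local version label is exactly 'rocm<release>'."""
    _, local = split_local_version(version)
    return local == f"rocm{rocm_release}"


if __name__ == "__main__":
    # New stable pin versus the nightly pin it replaces in the Dockerfile.
    for pinned in ("2.10.0+rocm7.1", "2.10.0.dev20250916+rocm6.3"):
        print(pinned, is_rocm_build(pinned, "7.1"))
```

Inside the built image, the same check could presumably be applied to the string form of `torch.__version__`, assuming the installed wheel reports its local version label there.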

src/libkernelbot/consts.py

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,7 @@ class GitHubGPU(Enum):
     MI300 = "MI300"
     MI250 = "MI250"
     MI300x8 = "MI300x8"
+    MI355X = "MI355X"
 
 
 class ModalGPU(Enum):
@@ -121,6 +122,7 @@ class RankCriterion(Enum):
     "MI300": None,
     "MI300x8": None,
     "MI250": None,
+    "MI355X": None,
 }
 
 
@@ -153,6 +155,7 @@ class RankCriterion(Enum):
 AMD_REQUIREMENTS = """
 --index-url https://download.pytorch.org/whl/rocm6.2.4
 torch
+numpy
 """
 
 # A buffer for timeouts to account for github setup time
# A buffer for timeouts to account for github setup time

src/libkernelbot/launchers/github.py

Lines changed: 4 additions & 2 deletions
@@ -53,7 +53,8 @@ def get_timeout(config: dict) -> int:
         SubmissionMode.LEADERBOARD.value: config.get("ranked_timeout"),
     }
     seconds = sec_map.get(mode) or DEFAULT_GITHUB_TIMEOUT_MINUTES * 60
-    return math.ceil(seconds / 60)
+    minutes = math.ceil(seconds / 60)
+    return max(minutes, DEFAULT_GITHUB_TIMEOUT_MINUTES)
 
 
 class GitHubLauncher(Launcher):
@@ -93,12 +94,13 @@ async def run_submission( # noqa: C901
         self, config: dict, gpu_type: GPU, status: RunProgressReporter
     ) -> FullResult:
         gpu_vendor = None
-        if gpu_type.value in ["MI300", "MI250", "MI300x8"]:
+        if gpu_type.value in ["MI300", "MI250", "MI300x8", "MI355X"]:
             selected_workflow = "amd_workflow.yml"
             runner_name = {
                 "MI300": "amdgpu-mi300-x86-64",
                 "MI250": "amdgpu-mi250-x86-64",
                 "MI300x8": "amdgpu-mi300-8-x86-64",
+                "MI355X": "mia1-p02-g29",
             }[gpu_type.value]
             gpu_vendor = "AMD"
             requirements = AMD_REQUIREMENTS
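
The change to `get_timeout` turns the default into a floor: a configured timeout shorter than the default is raised to it. The following standalone sketch isolates that arithmetic. The function name `timeout_minutes` is hypothetical, and the value 30 for `DEFAULT_GITHUB_TIMEOUT_MINUTES` is an assumption taken from the commit message ("at least 30 minutes"), since the constant's definition is not part of this diff.

```python
import math
from typing import Optional

# Assumed value, based on the commit message; not shown in the diff.
DEFAULT_GITHUB_TIMEOUT_MINUTES = 30


def timeout_minutes(seconds: Optional[float]) -> int:
    """Convert a timeout in seconds to whole minutes, never below the default.

    Mirrors the last three lines of `get_timeout` after this commit:
    a missing or zero value falls back to the default, then the result
    is rounded up to whole minutes and clamped from below.
    """
    seconds = seconds or DEFAULT_GITHUB_TIMEOUT_MINUTES * 60
    minutes = math.ceil(seconds / 60)
    return max(minutes, DEFAULT_GITHUB_TIMEOUT_MINUTES)


if __name__ == "__main__":
    for value in (None, 600, 1800, 1801, 7200):
        print(value, "->", timeout_minutes(value))
```

Before this commit a 600-second setting would have produced a 10-minute workflow timeout; with the clamp it yields the default instead, which leaves room for the image pull and container start described in the commit message.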
