16 commits
079212d
ci(actions): add bootstrap-buildkit composite action
saturley-hall May 18, 2026
7e44cf0
ci(actions): add release-metadata composite action
saturley-hall May 18, 2026
51e1ce2
ci: add pr.yaml workflow for pull-request/* mirror branches
saturley-hall May 18, 2026
9572126
ci: add post-merge.yaml workflow for main and release/* branches
saturley-hall May 18, 2026
7c73593
ci(nightly): refactor to single-buildx multi-platform; drop registry …
saturley-hall May 18, 2026
8cf5c92
ci: add rc.yaml workflow (promote-only, manual dispatch)
saturley-hall May 18, 2026
25fe3ca
chore: remove stale ATTRIBUTIONS-Python.md and ATTRIBUTIONS-container.md
saturley-hall May 18, 2026
daa9ae4
docs(contributing): document the four release pipelines
saturley-hall May 18, 2026
cdabbd8
ci(fix): stop exporting container_image/container_ref as job outputs
saturley-hall May 18, 2026
88f1498
Merge branch 'main' into harrison/improve-ci-from-release-automation
saturley-hall May 18, 2026
8ece9df
feat(pyproject): canonical [project.urls] (Homepage / Repository / Is…
saturley-hall May 18, 2026
3b5683e
ci: extract commit-SHA embed to tools/embed_commit_sha.py
saturley-hall May 18, 2026
40ba3b4
ci(pr,post-merge): split into build + stage-wheel; jf rt upload
saturley-hall May 18, 2026
455e482
ci(actions): bootstrap-buildkit K8s fallback uses purpose=build
saturley-hall May 18, 2026
63dde60
ci(actions): bootstrap-buildkit K8s fallback tolerates buildkit-worker
saturley-hall May 18, 2026
b1fb143
ci: mark tools/embed_commit_sha.py executable
saturley-hall May 18, 2026
179 changes: 179 additions & 0 deletions .github/actions/bootstrap-buildkit/action.yml
@@ -0,0 +1,179 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
name: 'Bootstrap BuildKit'
description: 'Bootstrap a multi-arch buildx builder using the centralized aiperf-builder pods (remote driver) with Kubernetes driver fallback'

# Connects to the aiperf-builder StatefulSet pods at index 0 for each
# requested architecture via headless service DNS. Falls back to the
# Kubernetes driver if the pods are unreachable.
#
# DNS pattern: buildkit-{arch}-0.buildkit-{arch}-headless.buildkit.svc.cluster.local
#
# Usage:
#   - uses: ./.github/actions/bootstrap-buildkit
#     with:
#       builder_name: my-builder
#       arch: 'linux/amd64,linux/arm64'
#
# Cleanup usage (skip_bootstrap: true re-registers the builder so it can be removed):
#   - uses: ./.github/actions/bootstrap-buildkit
#     with:
#       builder_name: my-builder
#       skip_bootstrap: 'true'

inputs:
  builder_name:
    description: 'Name for the buildx builder'
    required: true
  arch:
    description: 'Comma-separated Docker platform(s): linux/amd64, linux/arm64, or linux/amd64,linux/arm64'
    required: false
    default: 'linux/amd64,linux/arm64'
Comment on lines +28 to +31
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize arch tokens and require full resolution before taking the remote path.

The input docs allow linux/amd64, linux/arm64, but the parser never trims per-token whitespace, so the second entry keeps its leading space (" arm64") and generates invalid DNS/platform strings. Also, worker_addresses != '' accepts a partially resolved set, so a requested multi-arch build can proceed with only one worker instead of falling back. Trim each token before use and only emit worker_addresses when every requested arch resolves.

Suggested fix
         WORKER_ADDRS=""
+        RESOLVED_COUNT=0
         IFS=',' read -ra ARCHS <<< "$ARCH"
         for arch in "${ARCHS[@]}"; do
+          arch="${arch//[[:space:]]/}"
           DNS="buildkit-${arch}-0.buildkit-${arch}-headless.${NAMESPACE}.svc.cluster.local"
           if nslookup "$DNS" >/dev/null 2>&1; then
             WORKER_ADDRS="${WORKER_ADDRS:+${WORKER_ADDRS},}tcp://${DNS}:${PORT}"
+            RESOLVED_COUNT=$((RESOLVED_COUNT + 1))
             echo "Resolved ${arch} worker: ${DNS}"
           else
             echo "No DNS for ${arch} worker (${DNS})"
           fi
         done
+
+        if [ "${RESOLVED_COUNT}" -ne "${#ARCHS[@]}" ]; then
+          WORKER_ADDRS=""
+        fi
         FIRST=true
         IFS=',' read -ra ARCHS <<< "$ARCH"
         for arch in "${ARCHS[@]}"; do
+          arch="${arch//[[:space:]]/}"
           # Comma-containing values (nodeselector, tolerations) must be wrapped in

Also applies to: 84-100, 112-127, 134-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/bootstrap-buildkit/action.yml around lines 28 - 31,
Normalize and trim each comma-separated token from the arch input (the 'arch'
input value) before generating platform/DNS strings and when mapping to workers;
ensure you call .trim() (or equivalent) on each token and normalize forms like
"arm64" to "linux/arm64" consistently. When computing worker_addresses, only
emit/set the output (worker_addresses) if every requested arch token
successfully resolves to a worker (i.e., require full resolution for all
requested arches), otherwise leave worker_addresses empty or fall back; apply
the same trimming+full-resolution guard to the other blocks that build
platforms/worker lists mentioned (the repeated arch-to-platform/worker mapping
sections).
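
The two points raised in this comment (per-token trimming and all-or-nothing resolution) can be sketched outside the action as follows. This is a minimal illustration, not the PR's code: `resolve_one` and `KNOWN_HOSTS` are hypothetical stand-ins for the `nslookup` check so the logic runs without cluster DNS, and `resolve_workers` is an invented wrapper name.

```shell
#!/usr/bin/env bash
# Sketch only. resolve_one and KNOWN_HOSTS are hypothetical stand-ins for the
# nslookup check in the action; they are not part of the PR.

NAMESPACE="buildkit"
PORT=1234

# Hypothetical resolver: succeeds only for names listed in KNOWN_HOSTS.
resolve_one() {
  local dns="$1"
  [[ " ${KNOWN_HOSTS:-} " == *" ${dns} "* ]]
}

# Print a comma-separated worker list, or nothing unless every arch resolves.
resolve_workers() {
  local input="$1" addrs="" resolved=0 arch dns
  local -a archs
  IFS=',' read -ra archs <<< "$input"
  for arch in "${archs[@]}"; do
    arch="${arch//[[:space:]]/}"   # drop whitespace left around each token
    arch="${arch#linux/}"          # strip the platform prefix
    dns="buildkit-${arch}-0.buildkit-${arch}-headless.${NAMESPACE}.svc.cluster.local"
    if resolve_one "$dns"; then
      addrs="${addrs:+${addrs},}tcp://${dns}:${PORT}"
      resolved=$((resolved + 1))
    fi
  done
  # All-or-nothing: a partial set is reported as empty so the caller falls back.
  if [ "$resolved" -eq "${#archs[@]}" ]; then
    printf '%s\n' "$addrs"
  fi
}
```

With both hosts present in `KNOWN_HOSTS`, `resolve_workers 'linux/amd64, linux/arm64'` prints both `tcp://` addresses despite the space after the comma. With only the amd64 host present it prints nothing, which in the action would select the Kubernetes fallback step.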

  skip_bootstrap:
    description: 'Skip the bootstrap step — only re-registers the builder context. Use in cleanup jobs.'
    required: false
    default: 'false'
  fresh_builder:
    description: 'Force removal and recreation of the builder before bootstrapping. Use in pre-warm jobs.'
    required: false
    default: 'false'

  # Kubernetes fallback driver options
  namespace:
    description: 'Kubernetes namespace for fallback buildkit pods'
    required: false
    default: 'buildkit'
  replicas:
    description: 'Number of fallback buildkit replicas'
    required: false
    default: '1'
  requests_cpu:
    description: 'CPU requests for fallback buildkit pods'
    required: false
    default: '12'
  requests_memory:
    description: 'Memory requests for fallback buildkit pods'
    required: false
    default: '26Gi'
  limits_memory:
    description: 'Memory limits for fallback buildkit pods'
    required: false
    default: '29Gi'
  tolerations:
    description: 'Tolerations for fallback buildkit pods'
    required: false
    default: 'key=buildkit-worker,value=true,operator=Equal,effect=NoSchedule'

runs:
  using: "composite"
  steps:
    - name: Ensure nslookup is available
      shell: bash
      run: |
        if ! command -v nslookup >/dev/null 2>&1; then
          sudo apt-get update -qq && sudo apt-get install -y -qq dnsutils
        fi

    - name: Resolve worker addresses
      id: resolve
      shell: bash
      run: |
        NAMESPACE="buildkit"
        PORT=1234

        # Strip linux/ prefix to get bare arch names
        ARCH="${{ inputs.arch }}"
        ARCH="${ARCH//linux\//}"

        WORKER_ADDRS=""
        IFS=',' read -ra ARCHS <<< "$ARCH"
        for arch in "${ARCHS[@]}"; do
          DNS="buildkit-${arch}-0.buildkit-${arch}-headless.${NAMESPACE}.svc.cluster.local"
          if nslookup "$DNS" >/dev/null 2>&1; then
            WORKER_ADDRS="${WORKER_ADDRS:+${WORKER_ADDRS},}tcp://${DNS}:${PORT}"
            echo "Resolved ${arch} worker: ${DNS}"
          else
            echo "No DNS for ${arch} worker (${DNS})"
          fi
        done

        echo "worker_addresses=${WORKER_ADDRS}" >> "$GITHUB_OUTPUT"

    - name: Handle fresh builder
      if: inputs.fresh_builder == 'true'
      shell: bash
      run: |
        if docker buildx inspect "${{ inputs.builder_name }}" > /dev/null 2>&1; then
          echo "Forcing fresh builder: removing existing '${{ inputs.builder_name }}'."
          docker buildx rm "${{ inputs.builder_name }}" || true
        fi

    - name: Create builder (remote driver)
      if: steps.resolve.outputs.worker_addresses != ''
      shell: bash
      run: |
        IFS=',' read -ra ADDRS <<< "${{ steps.resolve.outputs.worker_addresses }}"
        FIRST=true
        for addr in "${ADDRS[@]}"; do
          if $FIRST; then
            docker buildx create --use --name "${{ inputs.builder_name }}" --driver remote "$addr"
            FIRST=false
          else
            docker buildx create --append --name "${{ inputs.builder_name }}" --driver remote "$addr"
          fi
        done

    - name: Create builder (Kubernetes fallback)
      if: steps.resolve.outputs.worker_addresses == ''
      shell: bash
      run: |
        echo "::warning::Remote aiperf-builder pods unreachable — falling back to Kubernetes driver"
        echo "## Fallback Build Warning" >> $GITHUB_STEP_SUMMARY
        echo "aiperf-builder pods unavailable. Running on fallback Kubernetes pod. Please alert the ops team." >> $GITHUB_STEP_SUMMARY

        # Strip linux/ prefix
        ARCH="${{ inputs.arch }}"
        ARCH="${ARCH//linux\//}"

        FIRST=true
        IFS=',' read -ra ARCHS <<< "$ARCH"
        for arch in "${ARCHS[@]}"; do
          # Comma-containing values (nodeselector, tolerations) must be wrapped in
          # inner double quotes so buildx's CSV parser treats them as a single field
          # rather than splitting on commas into separate driver options.
          OPTS=(
            '--driver-opt=namespace=${{ inputs.namespace }}'
            '--driver-opt=loadbalance=sticky'
            '--driver-opt=replicas=${{ inputs.replicas }}'
            '--driver-opt=requests.cpu=${{ inputs.requests_cpu }}'
            '--driver-opt=requests.memory=${{ inputs.requests_memory }}'
            '--driver-opt=limits.memory=${{ inputs.limits_memory }}'
            "--driver-opt=\"nodeselector=kubernetes.io/arch=${arch},purpose=build\""
            '--driver-opt="tolerations=${{ inputs.tolerations }}"'
          )
          if $FIRST; then
            docker buildx create --use --name "${{ inputs.builder_name }}" \
              --driver kubernetes --platform "linux/${arch}" "${OPTS[@]}"
            FIRST=false
          else
            docker buildx create --append --name "${{ inputs.builder_name }}" \
              --driver kubernetes --platform "linux/${arch}" "${OPTS[@]}"
          fi
        done

    - name: Bootstrap builder
      if: inputs.skip_bootstrap != 'true'
      shell: bash
      run: |
        for i in 1 2 3; do
          if docker buildx inspect "${{ inputs.builder_name }}" --bootstrap; then
            echo "Bootstrap succeeded on attempt $i"
            break
          fi
          if [ "$i" -eq 3 ]; then
            echo "::error::Bootstrap failed after 3 attempts"
            exit 1
          fi
          echo "::warning::Bootstrap attempt $i failed, retrying in 10s..."
          sleep 10
        done
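
The quoting note in the Kubernetes fallback step above (comma-containing `nodeselector` and `tolerations` values wrapped in inner double quotes) can be checked without Docker by printing the arguments the shell would pass. This is a rough sketch under assumptions: `build_driver_opts` is a hypothetical helper that is not in the PR, and the option list is shortened to three entries.

```shell
#!/usr/bin/env bash
# Sketch only. build_driver_opts is a hypothetical helper, not part of the PR.
# It prints one argument per line so the surviving inner quotes are visible.

build_driver_opts() {
  local arch="$1" tolerations="$2"
  local -a opts=(
    "--driver-opt=namespace=buildkit"
    # The escaped quotes stay inside the argument, so the whole
    # comma-containing value reaches buildx as a single quoted field.
    "--driver-opt=\"nodeselector=kubernetes.io/arch=${arch},purpose=build\""
    "--driver-opt=\"tolerations=${tolerations}\""
  )
  printf '%s\n' "${opts[@]}"
}

build_driver_opts arm64 'key=buildkit-worker,value=true,operator=Equal,effect=NoSchedule'
```

Each printed line is one argument. The second and third keep their double quotes around the full `key=value,...` string, which is the form the step's comment says is needed to avoid splitting on commas.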
131 changes: 0 additions & 131 deletions .github/actions/create-multiarch-manifest/action.yml

This file was deleted.
