Skip to content

ci: PR/post-merge/RC pipelines + buildx worker pattern across all build workflows#951

Open
saturley-hall wants to merge 16 commits into
mainfrom
harrison/improve-ci-from-release-automation
Open

ci: PR/post-merge/RC pipelines + buildx worker pattern across all build workflows#951
saturley-hall wants to merge 16 commits into
mainfrom
harrison/improve-ci-from-release-automation

Conversation

@saturley-hall
Copy link
Copy Markdown
Member

@saturley-hall saturley-hall commented May 18, 2026

Summary

Introduces three new build pipelines (pr, post-merge, rc) and refactors nightly to use the same dynamo-style multi-arch buildx pattern. Brings the K8s-pod-attached buildx pattern from PR #820 (and ultimately from ai-dynamo/dynamo) into main, and gives every code path — PR, post-merge to main, post-merge to release branches, nightly, and release candidate — a defined artifact destination.

PR #820's attribution work is already on main (Dockerfile's python-licenses stage); the hand-maintained ATTRIBUTIONS-*.md files at the repo root are deleted as part of this change.

What each workflow does

Workflow Trigger Wheel destination Container destination
pr.yaml push to pull-request/* Artifactory pr/pr<N>/ ECR aiperf:pr-<N>-<short_sha>
post-merge.yaml push to main or release/* Artifactory post-merge/<run_id>/ ECR aiperf:main-<short_sha> or aiperf:release-<X.Y.Z>-<short_sha>
nightly.yml schedule Mon–Fri 09:00 UTC or workflow_dispatch Artifactory nightly/ ECR aiperf:nightly-<YYYYMMDD>-<short_sha> → NGC staging
rc.yaml workflow_dispatch (manual) Artifactory rc/<X.Y.Z>/rc<N>/ ECR (source) → NGC staging aiperf:<X.Y.Z>-rc<N>

Wheel versioning rules

  • PR: <base>.dev0+pr<N>.g<short_sha> — PEP 440 local-version label.
  • Post-merge on main: <base>a<run_number>+g<short_sha> — alpha pre-release.
  • Post-merge on release/<X.Y.Z>: literal version = from pyproject.toml, untouched. Release branches own their version via the project's shipping process — the release captain commits 1.2.3, 1.2.3rc0, 1.2.3.dev0, etc., and the pipeline never rewrites it. The post-merge artifactory subpath disambiguates between commits.
  • Nightly: <base>.dev<YYYYMMDD> (unchanged).
  • RC: same as the post-merge wheel it's promoting (no rebuild). RC-ness is encoded in the artifactory path, not the filename.

RC is promote-only

rc.yaml does not rebuild. It takes a commit_sha input, discovers the post-merge.yaml run that built that commit via gh api, then:

  • crane copy the multi-arch image from ECR aiperf:release-<X.Y.Z>-<short_sha> → NGC aiperf:<X.Y.Z>-rc<N>.
  • Copy the wheel within Artifactory: post-merge/<run_id>/<filename>rc/<X.Y.Z>/rc<N>/<filename>.
  • Trigger the GitLab security scan (same payload shape as nightly).

That means the wheel inside the container and on Artifactory is byte-identical to what post-merge produced; RC is purely a re-tag + re-publish. If you need a different wheel version for RC, commit it on the release branch first, push (post-merge builds it), then promote.

Environment gates

Environment Used by Purpose
automated-release nightly.yml staging, post-merge.yaml stage-wheel, rc.yaml staging Existing env. Holds NGC + Artifactory + GitLab secrets; gates artifact promotion to public destinations.
manual-release-approval rc.yaml approve job New env, must be created in Settings > Environments before the first RC dispatch. Single human gate that unblocks the RC's promotion jobs.

The RC pipeline structure is: approve (manual-release-approval) → three staging jobs (automated-release) → summary. If automated-release has required reviewers configured, each staging job will independently prompt — that's a GitHub Actions UX wart, not a workflow issue.

Dynamo buildx pattern

New composite action .github/actions/bootstrap-buildkit/ (ported from ai-dynamo/dynamo's same-named action) attaches docker buildx to per-arch BuildKit pods reachable via headless K8s DNS (buildkit-{arch}-0.buildkit-{arch}-headless.buildkit.svc.cluster.local:1234), with a Kubernetes-driver fallback. One docker buildx build --platform linux/amd64,linux/arm64 --push against that builder produces the OCI index natively — no more per-arch jobs + manifest stitching.

Used by pr.yaml, post-merge.yaml, and the refactored nightly.yml. RC doesn't build, so it doesn't use this.

All callers pass fresh_builder: 'true' and a github.run_id-keyed builder name to eliminate stale-context collisions across runs.

Cache strategy per workflow

  • pr.yaml: --cache-to writes to aiperf:cache-pr-<N> only (PRs never write to other flavors' caches; concurrent PRs cannot blow each other's manifests away). --cache-from reads cache-pr-<N> first, then falls back to cache-post-merge so a brand-new PR's first build gets warm layers from main.
  • post-merge.yaml: shared aiperf:cache-post-merge (read + write).
  • nightly.yml: no registry cache. Nightly is the rigorous build that revalidates the Dockerfile from scratch against the BuildKit pod's persistent layer cache only. Drift in pinned base images, system packages, or transitive deps surfaces here rather than being masked by a stale registry cache.
  • rc.yaml: doesn't build; N/A.

Files deleted

  • ATTRIBUTIONS-Python.md, ATTRIBUTIONS-container.md — superseded by Dockerfile's python-licenses stage; auto-generated and shipped inside the container at /licenses/.
  • .github/actions/create-multiarch-manifest/ — buildx now produces the manifest natively, no consumers left.

Before merging

  1. Create the manual-release-approval GitHub environment under Settings > Environments. Configure the release reviewer list. This must exist before any RC dispatch can succeed.
  2. Verify the automated-release env has the secrets RC needs: NGC_PUBLISH_TOKEN, NGC_PUBLISH_USERNAME, NGC_STAGING_IMAGE_BASE, ARTIFACTORY_URL, ARTIFACTORY_USER, ARTIFACTORY_TOKEN, ARTIFACTORY_PYPI_URL, GITLAB_TRIGGER_TOKEN, GITLAB_PIPELINE_URL. These already exist for nightly; RC reuses them.
  3. Consider an ECR lifecycle policy for aiperf:cache-pr-* images so closed PRs' caches expire after N days. Not blocking for merge.

Test plan

  • Push a no-op commit to a pull-request/0 branch; confirm pr.yaml produces pr/pr0/aiperf-...whl in artifactory and aiperf:pr-0-<sha> (multi-arch manifest) in ECR.
  • Merge a small change to main; confirm post-merge.yaml produces post-merge/<run_id>/aiperf-<base>a<run>+g<sha>-py3-none-any.whl and aiperf:main-<sha>; verify the automated-release gate pauses for approval.
  • Create a throwaway release/0.9.0 branch with version = "0.9.0" committed to pyproject.toml; push a small change; confirm the wheel filename is exactly aiperf-0.9.0-py3-none-any.whl (no alpha/rc suffix, no local-version label) — the workflow did not rewrite pyproject.toml.
  • Dispatch nightly.yml; confirm a single build job replaces the prior two-job + manifest-stitch topology, and docker buildx imagetools inspect aiperf:nightly-<date>-<sha> shows both amd64 and arm64 manifests in the OCI index.
  • After creating manual-release-approval, dispatch rc.yaml with a known commit SHA from a release branch; confirm the workflow finds the post-merge run via gh api, pauses on approve, then after approval copies ECR → NGC and Artifactory post-merge/...rc/<X.Y.Z>/rc<N>/....

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added CI workflows for post-merge, PR, nightly, and release-candidate pipelines with unified multi-arch container builds.
    • New actions for bootstrapping multi-arch BuildKit and for computing release metadata.
    • Added a build-time utility to embed commit SHAs into artifacts.
    • Package metadata updated with homepage/repo/issues URLs.
  • Documentation

    • Contributing guide updated with release pipeline docs and local validation steps.
  • Removed

    • Removed container attribution document and previous multi-arch manifest action.

Review Change Stack

saturley-hall and others added 8 commits May 17, 2026 21:22
Bootstraps a multi-arch buildx builder by attaching to per-arch BuildKit
pods in the buildkit namespace via headless K8s DNS, with a Kubernetes
driver fallback. Lifted from ai-dynamo/dynamo's same-named action; renamed
metadata and usage examples from the prior 'init-aiperf-builder' name.

This action lets the build workflows produce multi-arch images with a
single 'docker buildx build --platform linux/amd64,linux/arm64 --push'
instead of running separate per-arch jobs and stitching a manifest after.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Single source of truth for build-flavor version math. Given a flavor
(pr, post-merge-main, post-merge-release, nightly) and a commit SHA,
emits wheel_version, wheel_filename, container_tag, container_image,
container_ref, short_sha, artifactory_subpath, and skip_version_mutation.

post-merge-release is the one flavor that does not mutate pyproject.toml:
release branches own their own version via the project's shipping process,
so the action returns skip_version_mutation=true and the workflow short-
circuits update-pyproject-version.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Fires on pushes to pull-request/<N> branches (the copy-pr-bot mirror
convention already used by fern-docs.yml). Publishes:

  wheel  -> artifactory: pr/pr<N>/aiperf-<base>.dev0+pr<N>.g<sha>-py3-none-any.whl
  image  -> ECR:         aiperf:pr-<N>-<short_sha>  (multi-arch manifest)

Cache strategy: --cache-to writes only to a per-PR cache tag
(aiperf:cache-pr-<N>) so concurrent PRs don't blow each other's cache
manifests away. --cache-from reads first from cache-pr-<N>, then falls
back to cache-post-merge so a brand-new PR's first build still gets
warm layers from main.

Builder name uses github.run_id and fresh_builder: true so a canceled
prior run never leaves a stale buildx context that collides with the
next push.

No environment gate. PR artifacts are scratch space.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Fires on pushes to main and release/X.Y.Z. Flavor-conditional behavior:

  main:        wheel = <base>a<run_number>+g<short_sha>   (alpha pre-release)
               image = main-<short_sha>
  release/*:   wheel = <pyproject.toml version, untouched>
               image = release-<X.Y.Z>-<short_sha>

Release branches own their version via the project's shipping process;
the workflow skips update-pyproject-version when release_metadata's
skip_version_mutation output is true.

Wheel goes S3 (build) -> Artifactory post-merge/<run_id>/<filename>
(stage). The stage-wheel job is gated by the existing automated-release
GitHub environment, matching nightly's pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
…cache

Replaces the per-arch matrix + create-multiarch-manifest-ecr two-step
choreography with one job that runs:

  docker buildx build --platform linux/amd64,linux/arm64 --target runtime --push

against a multi-arch builder bootstrapped via bootstrap-buildkit. The
buildx invocation produces and pushes the OCI index natively, so
create-multiarch-manifest is deleted (no consumers remain).

stage-nightly-container is simplified to a single 'crane copy' of the
multi-arch index, instead of copying per-arch manifests and stitching
them with 'crane index append' on NGC.

Nightly intentionally runs WITHOUT --cache-from/--cache-to. Nightly is
the rigorous build that validates the Dockerfile from scratch (against
the BuildKit pod's persistent layer cache only). Drift in pinned base
images, system packages, or transitive Python deps now surfaces here
rather than being masked by a registry cache layer reused from a stale
PR or post-merge build.

bootstrap-buildkit is invoked with fresh_builder: true so a prior
canceled run or stale builder context cannot collide with a new run.

The 'needs:' chain and pipeline-summary job are updated to drop the
removed create-multiarch-manifest-ecr step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Manually-dispatched release-candidate pipeline. Does NOT rebuild.
Promotes an already-built post-merge artifact (from a release/X.Y.Z
branch) to RC locations on NGC staging and Artifactory.

Inputs: release_version (X.Y.Z), rc_number, commit_sha (40-char), and
skip_security_scan. The 'prepare' job validates inputs and uses gh api
to discover the post-merge.yaml run ID for the commit so the workflow
can locate the source wheel at Artifactory post-merge/<run_id>/.

Promotion topology:

  approve            (manual-release-approval env) - single human gate
  stage-container    (automated-release env)       - crane copy
                       ECR aiperf:release-<X.Y.Z>-<sha>
                       NGC aiperf:<X.Y.Z>-rc<N>
  stage-wheel        (automated-release env)       - artifactory copy
                       post-merge/<run_id>/<wheel>
                       rc/<X.Y.Z>/rc<N>/<wheel>
  trigger-gitlab-scan (automated-release env)      - mirrors nightly

Wheel keeps the post-merge filename (aiperf-<X.Y.Z>-py3-none-any.whl) -
the rc-ness is encoded in the artifactory path, consistent with the rule
that release branches own their version.

Requires the GitHub environment 'manual-release-approval' to be created
in Settings > Environments with the release reviewer list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
These files are now produced fresh on every container build by the
Dockerfile's python-licenses stage (pip-licenses + cyclonedx-py +
generate_python_attributions.py + generate_dpkg_attributions.py) and
ship inside the container at /licenses/python/ and /licenses/dpkg/.
The hand-maintained copies at the repo root were drifting from the
auto-generated truth and serve no purpose now that every release
flavor emits them as build artifacts.

ATTRIBUTIONS.md (the project's own license attribution file) is
unrelated and stays in place.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Adds a Release Pipelines section to CONTRIBUTING.md covering pr,
post-merge, nightly, and rc workflows: triggers, wheel destinations,
container destinations, wheel-versioning rules per flavor, and the two
GitHub environment gates (automated-release for nightly/post-merge
staging, manual-release-approval for rc).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b1fb143aad77ff93b5db0566b654d7d177520348

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b1fb143aad77ff93b5db0566b654d7d177520348

Last updated for commit: b1fb143Browse code

@github-actions github-actions Bot added the ci label May 18, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Warning

Rate limit exceeded

@saturley-hall has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 52 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a2ad1c81-f452-40a4-8315-a3cade30e310

📥 Commits

Reviewing files that changed from the base of the PR and between 40ba3b4 and b1fb143.

📒 Files selected for processing (2)
  • .github/actions/bootstrap-buildkit/action.yml
  • tools/embed_commit_sha.py

Walkthrough

Adds composite actions and workflows to centralize multi-arch buildx container and wheel builds, compute release metadata by flavor, implement nightly/pr/post-merge/RC pipelines, update documentation, and add a commit-SHA embedding helper.

Changes

Release and Build Pipeline Workflows

Layer / File(s) Summary
Bootstrap BuildKit composite action
.github/actions/bootstrap-buildkit/action.yml
New composite action that configures multi-arch docker buildx builder by resolving DNS-based remote worker addresses and falling back to Kubernetes driver, with fresh builder cleanup and retry-based bootstrap logic.
Release metadata computation action
.github/actions/release-metadata/action.yml
New composite action that validates commit SHA and computes flavor-specific (pr, post-merge-main, post-merge-release, nightly) wheel versions, container tags, Artifactory paths, and version mutation flags from pyproject.toml.
Nightly workflow multi-arch refactor
.github/workflows/nightly.yml
Refactored nightly pipeline to replace per-architecture matrix builds and separate manifest creation with single unified buildx multi-arch build; removes dependency on deleted create-multiarch-manifest action and updates staging logic to assume manifest already exists from buildx push.
PR build workflow
.github/workflows/pr.yaml
New workflow triggered on pull-request/* pushes that parses PR number, computes metadata, builds amd64-only wheel and multi-arch (amd64/arm64) container, and uploads artifacts to ECR and Artifactory (wheel staging gated).
Post-merge workflow for main and release branches
.github/workflows/post-merge.yaml
New workflow triggered on pushes to main and release/* that computes release flavor, conditionally patches version, builds wheel and multi-arch container, validates versions, and uploads wheel to Artifactory via gated environment.
Release candidate promotion workflow
.github/workflows/rc.yaml
New manual workflow_dispatch workflow for RC promotion that validates inputs, discovers prior post-merge artifacts, copies container from ECR to NGC, stages wheel to RC Artifactory path, and optionally triggers GitLab security scans.
Documentation and misc
CONTRIBUTING.md, pyproject.toml, tools/embed_commit_sha.py
Added "Release Pipelines" docs, added [project.urls] metadata, and added a CI helper to embed commit SHA into src/aiperf/_build_info.py.

🎯 3 (Moderate) | ⏱️ ~25 minutes


🐇 I hopped through builds and tags tonight,
Poked DNS for workers, set builders right,
Flavors aligned—pr, nightly, release in sight,
Wheels and indices pushed on multi-arch flight,
My tiny paws stamped metadata just right. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: introduction of new CI workflows (PR/post-merge/RC pipelines) and adoption of a shared buildx worker pattern across build workflows.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/actions/bootstrap-buildkit/action.yml:
- Around line 28-31: Normalize and trim each comma-separated token from the arch
input (the 'arch' input value) before generating platform/DNS strings and when
mapping to workers; ensure you call .trim() (or equivalent) on each token and
normalize forms like "arm64" to "linux/arm64" consistently. When computing
worker_addresses, only emit/set the output (worker_addresses) if every requested
arch token successfully resolves to a worker (i.e., require full resolution for
all requested arches), otherwise leave worker_addresses empty or fall back;
apply the same trimming+full-resolution guard to the other blocks that build
platforms/worker lists mentioned (the repeated arch-to-platform/worker mapping
sections).

In @.github/actions/release-metadata/action.yml:
- Around line 101-105: The BASE_VERSION extraction currently restricts to a bare
X.Y.Z and fails when pyproject.toml contains suffixes (e.g. 1.2.3rc1); change
the parsing in the block that sets BASE_VERSION to capture the entire quoted
version string (including any suffix) instead of only digits-dots (i.e., grab
the value between the quotes using the existing grep/sed pipeline), then allow
the downstream release-flavor logic to trim or interpret suffixes as needed;
make the same change in the analogous parsing block around the later section
(the block currently at lines referenced in the review) so both places read the
full quoted version into BASE_VERSION for flavor-specific handling.

In @.github/workflows/pr.yaml:
- Around line 21-24: The workflow currently triggers on branches matching
'pull-request/*' and checks out and runs untrusted PR code on the privileged
self-hosted runner (prod-aiperf-default-v1) and later publishes with
AWS/Artifactory credentials; change this by splitting the pipeline into two
workflows or adding an explicit approval gate: move the checkout/build steps for
PR branches into an unprivileged PR workflow that uses GitHub-hosted runners and
never uses production credentials, and create a separate trusted
promotion/publish workflow that only runs on protected branches or after a
manual approval event and runs on prod-aiperf-default-v1 with AWS/Artifactory
credentials; ensure references to the 'pull-request/*' trigger, any checkout
actions, and the publish steps that use AWS/Artifactory credentials are updated
accordingly.
- Around line 47-48: The workflow uses mutable tags for external actions
(actions/checkout@v4 and astral-sh/setup-uv@v5) which allows retagged code to
run with repository secrets; replace each mutable tag occurrence (including the
second actions/checkout usage and the astral-sh/setup-uv usage) with the
corresponding immutable commit SHA for the exact release you trust, updating the
uses entries to the full "owner/repo@sha" form so the workflow runs the pinned
action commits only.

In @.github/workflows/rc.yaml:
- Around line 118-125: The workflow lookup uses RESPONSE (gh api call) and
filters only by head_sha (${COMMIT_SHA}), which can match commits across
branches; update the gh api call that builds RESPONSE to also pass the release
branch filter (add -f "branch=${RELEASE_BRANCH}" or the repo’s release branch
variable) so the returned workflow_runs are constrained to the intended release
branch before extracting RUN_ID; ensure the branch variable you add is
defined/propagated in the workflow context.

In `@CONTRIBUTING.md`:
- Line 201: Replace the incorrect CLI invocation "gh act -W
.github/workflows/pr.yaml" with the standalone act command; locate the
occurrence in CONTRIBUTING.md and change it to use "act -W
.github/workflows/pr.yaml" (or simply "act" as shown later) so the documentation
references the nektos/act binary rather than a non-existent gh subcommand.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3dfbf66a-03c8-431e-bd4a-1f8553d1c39b

📥 Commits

Reviewing files that changed from the base of the PR and between bac953d and daa9ae4.

📒 Files selected for processing (10)
  • .github/actions/bootstrap-buildkit/action.yml
  • .github/actions/create-multiarch-manifest/action.yml
  • .github/actions/release-metadata/action.yml
  • .github/workflows/nightly.yml
  • .github/workflows/post-merge.yaml
  • .github/workflows/pr.yaml
  • .github/workflows/rc.yaml
  • ATTRIBUTIONS-Python.md
  • ATTRIBUTIONS-container.md
  • CONTRIBUTING.md
💤 Files with no reviewable changes (2)
  • ATTRIBUTIONS-container.md
  • .github/actions/create-multiarch-manifest/action.yml

Comment on lines +28 to +31
arch:
description: 'Comma-separated Docker platform(s): linux/amd64, linux/arm64, or linux/amd64,linux/arm64'
required: false
default: 'linux/amd64,linux/arm64'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize arch tokens and require full resolution before taking the remote path.

The input docs allow linux/amd64, linux/arm64, but the parser never trims per-token whitespace, so the second entry becomes arm64 and generates invalid DNS/platform strings. Also, worker_addresses != '' accepts a partially resolved set, so a requested multi-arch build can proceed with only one worker instead of falling back. Trim each token before use and only emit worker_addresses when every requested arch resolves.

Suggested fix
         WORKER_ADDRS=""
+        RESOLVED_COUNT=0
         IFS=',' read -ra ARCHS <<< "$ARCH"
         for arch in "${ARCHS[@]}"; do
+          arch="${arch//[[:space:]]/}"
           DNS="buildkit-${arch}-0.buildkit-${arch}-headless.${NAMESPACE}.svc.cluster.local"
           if nslookup "$DNS" >/dev/null 2>&1; then
             WORKER_ADDRS="${WORKER_ADDRS:+${WORKER_ADDRS},}tcp://${DNS}:${PORT}"
+            RESOLVED_COUNT=$((RESOLVED_COUNT + 1))
             echo "Resolved ${arch} worker: ${DNS}"
           else
             echo "No DNS for ${arch} worker (${DNS})"
           fi
         done
+
+        if [ "${RESOLVED_COUNT}" -ne "${`#ARCHS`[@]}" ]; then
+          WORKER_ADDRS=""
+        fi
         FIRST=true
         IFS=',' read -ra ARCHS <<< "$ARCH"
         for arch in "${ARCHS[@]}"; do
+          arch="${arch//[[:space:]]/}"
           # Comma-containing values (nodeselector, tolerations) must be wrapped in

Also applies to: 84-100, 112-127, 134-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/bootstrap-buildkit/action.yml around lines 28 - 31,
Normalize and trim each comma-separated token from the arch input (the 'arch'
input value) before generating platform/DNS strings and when mapping to workers;
ensure you call .trim() (or equivalent) on each token and normalize forms like
"arm64" to "linux/arm64" consistently. When computing worker_addresses, only
emit/set the output (worker_addresses) if every requested arch token
successfully resolves to a worker (i.e., require full resolution for all
requested arches), otherwise leave worker_addresses empty or fall back; apply
the same trimming+full-resolution guard to the other blocks that build
platforms/worker lists mentioned (the repeated arch-to-platform/worker mapping
sections).

Comment on lines +101 to +105
BASE_VERSION=$(grep -m1 -E '^version = "[0-9]+\.[0-9]+\.[0-9]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')
if [[ -z "${BASE_VERSION}" ]]; then
echo "::error::Could not parse base version from pyproject.toml"
exit 1
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The base-version parser rejects suffixes that release branches are supposed to preserve.

BASE_VERSION only matches bare X.Y.Z, but the post-merge-release path explicitly says release branches may keep rc/post/dev suffixes from pyproject.toml. A branch carrying 1.2.3rc1 will fail at Line 101 before it ever reaches the release-flavor logic. Parse the full quoted version here, then let each flavor decide how much of it to use.

Suggested fix
-        BASE_VERSION=$(grep -m1 -E '^version = "[0-9]+\.[0-9]+\.[0-9]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')
+        BASE_VERSION=$(grep -m1 -E '^version = "[^"]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')

Also applies to: 132-145

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/release-metadata/action.yml around lines 101 - 105, The
BASE_VERSION extraction currently restricts to a bare X.Y.Z and fails when
pyproject.toml contains suffixes (e.g. 1.2.3rc1); change the parsing in the
block that sets BASE_VERSION to capture the entire quoted version string
(including any suffix) instead of only digits-dots (i.e., grab the value between
the quotes using the existing grep/sed pipeline), then allow the downstream
release-flavor logic to trim or interpret suffixes as needed; make the same
change in the analogous parsing block around the later section (the block
currently at lines referenced in the review) so both places read the full quoted
version into BASE_VERSION for flavor-specific handling.

Comment thread .github/workflows/pr.yaml
Comment thread .github/workflows/pr.yaml
Comment on lines +47 to +48
- name: Checkout
uses: actions/checkout@v4
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/pr.yaml

Repository: ai-dynamo/aiperf

Length of output: 11991


Pin external actions to immutable SHAs to prevent privilege escalation.

This workflow has access to publishing credentials (AWS ECR and Artifactory tokens), but external actions actions/checkout@v4 (lines 48, 78) and astral-sh/setup-uv@v5 (line 81) use mutable version tags. If these actions are retagged or compromised upstream, they would inherit access to these secrets. Replace with full commit SHAs to ensure only the intended action code runs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/pr.yaml around lines 47 - 48, The workflow uses mutable
tags for external actions (actions/checkout@v4 and astral-sh/setup-uv@v5) which
allows retagged code to run with repository secrets; replace each mutable tag
occurrence (including the second actions/checkout usage and the
astral-sh/setup-uv usage) with the corresponding immutable commit SHA for the
exact release you trust, updating the uses entries to the full "owner/repo@sha"
form so the workflow runs the pinned action commits only.

Comment thread .github/workflows/rc.yaml
Comment on lines +118 to +125
RESPONSE=$(gh api -X GET \
"/repos/${GITHUB_REPOSITORY}/actions/workflows/post-merge.yaml/runs" \
-f "head_sha=${COMMIT_SHA}" \
-f "status=success" \
-f "per_page=10")

RUN_ID=$(echo "${RESPONSE}" | jq -r '.workflow_runs | sort_by(.created_at) | reverse | .[0].id // empty')
if [ -z "${RUN_ID}" ]; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

GitHub REST API (workflow runs): Which query parameter should be used to filter runs by branch (branch/head_branch) when calling /actions/workflows/{workflow_id}/runs?

💡 Result:

To filter workflow runs by branch when calling the GitHub REST API endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs, you should use the branch query parameter [1]. The branch parameter accepts the name of the branch as a string [1]. When provided, it returns only the workflow runs associated with that specific branch [1]. Example usage: GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs?branch=main Note that while some GitHub API contexts may refer to head branches or references, the specific query parameter for this endpoint is consistently documented as branch [2][3][1].

Citations:


🌐 Web query:

GitHub REST API workflow runs endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs parameters documentation

💡 Result:

The GitHub REST API endpoint to list workflow runs for a specific workflow is: GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs This endpoint allows you to retrieve a list of all runs for a particular workflow [1]. The {workflow_id} path parameter can be either the unique integer ID of the workflow or its file name (e.g., main.yaml) [1][2]. Common query parameters supported by this endpoint include: - actor (string): Filters results by the user who triggered the workflow run [1]. - branch (string): Filters results by the branch name [1]. - event (string): Filters results by the event that triggered the workflow (e.g., push, pull_request) [1]. - status (string): Filters results by the status or conclusion of the run (e.g., success, failure, in_progress) [3][4]. - per_page (integer): The number of results to return per page (max 100) [5]. - page (integer): The page number of the results to fetch [5]. - created (string): Filters runs created within a specific date-time range [3][4]. - exclude_pull_requests (boolean): If true, omits pull request-related runs from the response [3][4]. - check_suite_id (integer): Filters runs associated with a specific check suite [3][4]. - head_sha (string): Filters runs associated with a specific commit SHA [3][4]. For authentication, if the repository is private, you must use an access token with the repo scope [1]. Anyone with read access to the repository can use this endpoint [1].

Citations:


Constrain post-merge run discovery to the intended release branch.

Run lookup filters only by head_sha, so the same commit hash on multiple branches (e.g., main and release/) can cause the wrong artifact set to be selected for RC promotion.

Add the branch parameter to filter by the release branch:

Proposed fix
       - name: Discover post-merge run for commit
         id: discover
         env:
           GH_TOKEN:   ${{ github.token }}
           COMMIT_SHA: ${{ inputs.commit_sha }}
+          VERSION:    ${{ inputs.release_version }}
         run: |
           set -euo pipefail
           RESPONSE=$(gh api -X GET \
             "/repos/${GITHUB_REPOSITORY}/actions/workflows/post-merge.yaml/runs" \
             -f "head_sha=${COMMIT_SHA}" \
+            -f "branch=release/${VERSION}" \
             -f "status=success" \
             -f "per_page=10")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/rc.yaml around lines 118 - 125, The workflow lookup uses
RESPONSE (gh api call) and filters only by head_sha (${COMMIT_SHA}), which can
match commits across branches; update the gh api call that builds RESPONSE to
also pass the release branch filter (add -f "branch=${RELEASE_BRANCH}" or the
repo’s release branch variable) so the returned workflow_runs are constrained to
the intended release branch before extracting RUN_ID; ensure the branch variable
you add is defined/propagated in the workflow context.

Comment thread CONTRIBUTING.md
yamllint .github/workflows/

# Smoke-test the bootstrap-buildkit action against a local docker daemon:
gh act -W .github/workflows/pr.yaml
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify act is standalone, not a gh subcommand

# Check if act is available as standalone command
if command -v act &> /dev/null; then
    echo "✓ act is available as standalone command"
    act --version
else
    echo "✗ act not found"
fi

# Check if gh has act subcommand
if gh act --help &> /dev/null 2>&1; then
    echo "✓ gh act subcommand exists"
else
    echo "✗ gh act subcommand does not exist (expected)"
fi

Repository: ai-dynamo/aiperf

Length of output: 122


🏁 Script executed:

cat -n CONTRIBUTING.md | sed -n '190,215p'

Repository: ai-dynamo/aiperf

Length of output: 1136


🏁 Script executed:

cat CONTRIBUTING.md | grep -E '^\s*[|+\-*]|^\s*[^a-zA-Z0-9#\s\`]' | head -30

Repository: ai-dynamo/aiperf

Length of output: 84


🏁 Script executed:

# Search for ASCII art diagrams (non-table lines with repeated special chars)
rg -n '^\s*[\+\-]{2,}|^\s*[|]{1,}\s*[|\-\s]{5,}' CONTRIBUTING.md

Repository: ai-dynamo/aiperf

Length of output: 135


🏁 Script executed:

# Check for actual ASCII art (box drawings, flow diagrams)
rg -n '^\s*\+[-=]+\+|^\s*[\|/\\][^\w]|^\s*[╔═╗║╚╝]' CONTRIBUTING.md | head -20

Repository: ai-dynamo/aiperf

Length of output: 1256


🏁 Script executed:

# Check the full context around potential ASCII diagrams - look for lines with unusual patterns
cat CONTRIBUTING.md | grep -v '^|' | grep -v '^#' | grep -v '^\s*$' | grep -v '^\s*-' | head -40

Repository: ai-dynamo/aiperf

Length of output: 2539


🏁 Script executed:

# Check entire file for actual ASCII art diagrams (not tables)
# Look for box-drawing patterns or flow diagrams
wc -l CONTRIBUTING.md && echo "---" && \
rg -i 'diagram|ascii|flow|chart|architecture' CONTRIBUTING.md

Repository: ai-dynamo/aiperf

Length of output: 879


Incorrect command: gh act is not valid.

The GitHub CLI (gh) does not have an act subcommand. The act tool is a standalone binary from nektos/act. Line 209 below correctly shows usage as just act without the gh prefix.

Proposed fix
-gh act -W .github/workflows/pr.yaml
+act -W .github/workflows/pr.yaml
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
gh act -W .github/workflows/pr.yaml
act -W .github/workflows/pr.yaml
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CONTRIBUTING.md` at line 201, Replace the incorrect CLI invocation "gh act -W
.github/workflows/pr.yaml" with the standalone act command; locate the
occurrence in CONTRIBUTING.md and change it to use "act -W
.github/workflows/pr.yaml" (or simply "act" as shown later) so the documentation
references the nektos/act binary rather than a non-existent gh subcommand.

saturley-hall and others added 2 commits May 17, 2026 21:58
GitHub Actions silently strips a job output whose value contains a
registered secret, to prevent leaking the secret to downstream jobs.
The release-metadata action's container_image and container_ref both
embed the ECR registry hostname (built from secrets.AWS_ACCOUNT_ID +
AWS_DEFAULT_REGION), so when pr.yaml and post-merge.yaml promoted them
to the prepare job's top-level outputs, GitHub dropped them. The
downstream build job then ran 'docker buildx build --tag ""' and failed
with 'invalid tag "": repository name must have at least one component'.

Observed in https://github.com/ai-dynamo/aiperf/actions/runs/26008818007
with annotations 'Skip output container_ref/container_image since it
may contain secret.'

Fix mirrors nightly.yml's pattern: only promote container_tag across
job boundaries (no secret), and reconstruct the full ECR ref locally
in each step that needs it from the AWS secret env vars the step
already has access to.

Changes:
- release-metadata action: remove container_image and container_ref
  outputs; drop the now-unused container_registry/container_repository
  inputs; document the constraint in a comment near the GITHUB_OUTPUT
  block.
- pr.yaml, post-merge.yaml: drop the two outputs from prepare's job
  outputs; pass container_tag instead and compute CONTAINER_REF locally
  in the build step ('CONTAINER_REF="${REGISTRY}/aiperf:${CONTAINER_TAG}"').
  Summary steps now show container_tag (which is what you'd grep for
  in ECR anyway).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
@saturley-hall saturley-hall changed the title ci: PR/post-merge/RC pipelines + dynamo buildx pattern across all build workflows ci: PR/post-merge/RC pipelines + buildx worker pattern across all build workflows May 18, 2026
saturley-hall and others added 3 commits May 18, 2026 03:04
…sues)

Repository in [project.urls] is meant to be the canonical source repo
URL per PEP 621 convention (matches numpy, pandas, requests, scipy, …
— all point at the repo root, not a specific commit). Until now the
nightly/PR/post-merge pipelines mutated this field per build to a
commit-specific URL, which is misuse: SBOM tools, pip show, and the
Artifactory UI all expect 'where do I clone this from?', not 'what
revision is this exact wheel?'.

This commit puts the canonical URLs statically in pyproject.toml so
they ship in every wheel's METADATA without per-commit rewriting:

  Homepage   = "https://github.com/ai-dynamo/aiperf"
  Repository = "https://github.com/ai-dynamo/aiperf"
  Issues     = "https://github.com/ai-dynamo/aiperf/issues"

Per-commit traceability still lives where it belongs:
  - the wheel's version string carries the short SHA via PEP 440
    local-version (e.g. 0.8.0.dev0+pr951.g88f1498)
  - aiperf.__commit_sha__ carries the full SHA at runtime (via
    src/aiperf/_build_info.py, which CI continues to generate)

A follow-up commit removes the [project.urls] mutation from the
embed-commit-sha CI script so this field never gets rewritten.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Replaces the inline Python heredoc in the nightly build-container job
(soon also in pr.yaml and post-merge.yaml — separate commit) with a
single script call:

    python3 tools/embed_commit_sha.py "\${{ github.sha }}"

The script writes src/aiperf/_build_info.py so aiperf.__commit_sha__
returns the full SHA at runtime. It does NOT mutate pyproject.toml —
the canonical Repository URL is already static there (per the
preceding commit). The script lives in tools/ so it is a build-time
utility, not packaged inside the wheel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Restructures pr.yaml and post-merge.yaml so wheel publication to
Artifactory runs in its own job, gated by the existing automated-release
GitHub environment, and uses the JFrog CLI instead of curl -T.

WHY
  PR wheel filenames contain a '+' (PEP 440 local-version label, e.g.
  aiperf-0.8.0.dev0+pr951.g88f1498-py3-none-any.whl). Artifactory's
  generic-repo deploy handler rejected the curl PUT URL with HTTP 405,
  observed in run 26009595857 on PR #951. Same latent issue applies to
  post-merge-on-main wheels (which also use '+' via the alpha local
  label). 'jf rt upload' URL-encodes wheel filenames correctly and uses
  Artifactory's documented deploy endpoint.

  Putting the upload in its own job lets only that step run under the
  automated-release env (the build job stays ungated for fast iteration)
  and re-uses the post-merge two-job pattern that already worked.

HOW
  build (no env, prod-aiperf-default-v1):
    Pushes the multi-arch runtime image to ECR. Populates cache-pr-<N>
    (or cache-post-merge) as a side effect of mode=max.

  stage-wheel (env: automated-release, prod-aiperf-default-v1):
    1. Re-applies the same pyproject.toml + _build_info.py mutations
       build performed, so the wheel-builder cache key matches.
    2. Attaches to the same builder_name as build, so it lands on the
       same K8s BuildKit pods (StatefulSet, shared PV cache).
    3. Runs `buildx --target wheel-artifact --output type=local`
       with --cache-from only (no --cache-to). Every layer is hot
       from the build job, so only the scratch-export step actually
       runs; the wheel materializes locally as a near-instant op.
    4. Validates the wheel (twine check + venv install + version
       check + aiperf --help).
    5. Sets up the JFrog CLI with JF_URL derived by stripping
       '/artifactory/<repo>' from ARTIFACTORY_URL (keeps the existing
       secret format intact; nightly's curl path stays compatible).
    6. jf rt upload --flat to '<repo>/<artifactory_subpath>'.

  post-merge.yaml drops the prior S3 intermediate entirely — the
  buildx-cache extraction makes it redundant.

REQUIRED SECRET (must exist on the repo before this can run)
  ARTIFACTORY_REPO_NAME = sw-dynamo-aiperf-pypi-local
  ARTIFACTORY_URL stays as-is (no change needed); workflows derive
  the JFrog platform URL by stripping the repo suffix in shell.

NOT CHANGED
  nightly.yml and rc.yaml still use curl -T. Nightly wheels
  (.dev<YYYYMMDD>) and RC wheels (literal pyproject version) never
  contain '+', so they keep working without code changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly.yml (1)

401-416: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Dead code: matrix.arch references in a non-matrix job will always skip these steps.

The build-container job was refactored from a matrix build to a single multi-platform buildx job, but several steps still reference matrix.arch == 'amd64'. Since there's no matrix defined, matrix.arch is undefined and these conditions evaluate to false, causing the steps to be silently skipped:

  • Line 402: Generate aiperf-nightly wheel variant
  • Line 424: Validate wheels
  • Line 472: Run unit tests against installed nightly wheel
  • Line 511: Upload wheels to S3

This means nightly wheels are never validated, tested, or uploaded.

🐛 Proposed fix: Remove the obsolete matrix conditionals
       - name: Generate aiperf-nightly wheel variant (amd64 only)
-        if: matrix.arch == 'amd64'
         run: |

Apply the same removal to lines 424, 472, and 511 (or rename the conditions to document they always run on the single amd64 wheel build).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly.yml around lines 401 - 416, The steps inside the
build-container job that are gated by the non-existent matrix variable are dead;
remove the obsolete "if: matrix.arch == 'amd64'" conditionals from the steps
named "Generate aiperf-nightly wheel variant", "Validate wheels", "Run unit
tests against installed nightly wheel" and "Upload wheels to S3" so they always
run in the single multi-platform buildx job (or replace the condition with an
appropriate runtime check if you intended to gate to amd64 specifically).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/embed_commit_sha.py`:
- Line 1: The script starting with the shebang line "#!/usr/bin/env python3" is
not executable which breaks the pre-commit hook; mark the file executable by
setting the executable bit (e.g., run chmod +x on the script or use git add
--chmod=+x when committing) so CI/pre-commit passes. Ensure the change is staged
and committed.
- Around line 27-36: The embed function currently calls
Path("src/aiperf/_build_info.py").write_text(...) without handling filesystem
errors; update embed to ensure the parent directory exists (create with
mkdir(parents=True, exist_ok=True)), then wrap the write_text call in a
try/except that catches OSError (and/or Exception), and on failure log or print
a clear error message including the exception details and the target path (use
build_info and sha), and exit/raise a controlled error rather than letting a raw
exception propagate; reference the embed function, the build_info Path variable,
and the write_text call when making the changes.
- Line 27: The embed(sha: str) function must validate its sha parameter before
use; add a guard that checks sha is a non-empty string and matches a reasonable
git SHA pattern (for example hex characters, 7–40 chars) and raise a clear
ValueError if it fails; update embed to perform this validation at the top
(validate non-empty and regex like ^[0-9a-fA-F]{7,40}$) so callers get an
explicit error instead of downstream failures.

---

Outside diff comments:
In @.github/workflows/nightly.yml:
- Around line 401-416: The steps inside the build-container job that are gated
by the non-existent matrix variable are dead; remove the obsolete "if:
matrix.arch == 'amd64'" conditionals from the steps named "Generate
aiperf-nightly wheel variant", "Validate wheels", "Run unit tests against
installed nightly wheel" and "Upload wheels to S3" so they always run in the
single multi-platform buildx job (or replace the condition with an appropriate
runtime check if you intended to gate to amd64 specifically).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98303f33-6afa-440e-8f8b-8bda853dcd8e

📥 Commits

Reviewing files that changed from the base of the PR and between daa9ae4 and 40ba3b4.

📒 Files selected for processing (6)
  • .github/actions/release-metadata/action.yml
  • .github/workflows/nightly.yml
  • .github/workflows/post-merge.yaml
  • .github/workflows/pr.yaml
  • pyproject.toml
  • tools/embed_commit_sha.py
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml

Comment thread tools/embed_commit_sha.py
@@ -0,0 +1,48 @@
#!/usr/bin/env python3
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

The file has a shebang but is not marked executable, blocking CI.

The pre-commit hook failure indicates the file needs the executable bit set.

Fix by running:

chmod +x tools/embed_commit_sha.py

Or on Windows:

git add --chmod=+x tools/embed_commit_sha.py
🧰 Tools
🪛 GitHub Actions: Pre-commit / 0_pre-commit.txt

[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed (exit code 1): file has a shebang but is not marked executable. If intended to be executable, run 'chmod +x tools/embed_commit_sha.py' (or on Windows use 'git add --chmod=+x ...').

🪛 GitHub Actions: Pre-commit / pre-commit

[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed: file has a shebang but is not marked executable. Try: chmod +x tools/embed_commit_sha.py (or git add --chmod=+x on Windows).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/embed_commit_sha.py` at line 1, The script starting with the shebang
line "#!/usr/bin/env python3" is not executable which breaks the pre-commit
hook; mark the file executable by setting the executable bit (e.g., run chmod +x
on the script or use git add --chmod=+x when committing) so CI/pre-commit
passes. Ensure the change is staged and committed.

Comment thread tools/embed_commit_sha.py
Comment on lines +27 to +36
def embed(sha: str) -> None:
build_info = Path("src/aiperf/_build_info.py")
build_info.write_text(
"# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n"
"# SPDX-License-Identifier: Apache-2.0\n"
'"""Build-time metadata. Generated by CI; do not edit."""\n'
"\n"
f'COMMIT_SHA = "{sha}"\n'
)
print(f"Wrote {build_info} with COMMIT_SHA = {sha}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add error handling for filesystem operations.

The write_text() call can fail (permission denied, disk full, parent directory missing) and will raise an uncaught exception. This produces cryptic build failures instead of clear error messages.

🛡️ Proposed fix with error handling
 def embed(sha: str) -> None:
     build_info = Path("src/aiperf/_build_info.py")
-    build_info.write_text(
-        "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n"
-        "# SPDX-License-Identifier: Apache-2.0\n"
-        '"""Build-time metadata. Generated by CI; do not edit."""\n'
-        "\n"
-        f'COMMIT_SHA = "{sha}"\n'
-    )
-    print(f"Wrote {build_info} with COMMIT_SHA = {sha}")
+    try:
+        build_info.parent.mkdir(parents=True, exist_ok=True)
+        build_info.write_text(
+            "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n"
+            "# SPDX-License-Identifier: Apache-2.0\n"
+            '"""Build-time metadata. Generated by CI; do not edit."""\n'
+            "\n"
+            f'COMMIT_SHA = "{sha}"\n'
+        )
+        print(f"Wrote {build_info} with COMMIT_SHA = {sha}")
+    except OSError as e:
+        print(f"Failed to write {build_info}: {e}", file=sys.stderr)
+        raise SystemExit(1) from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/embed_commit_sha.py` around lines 27 - 36, The embed function currently
calls Path("src/aiperf/_build_info.py").write_text(...) without handling
filesystem errors; update embed to ensure the parent directory exists (create
with mkdir(parents=True, exist_ok=True)), then wrap the write_text call in a
try/except that catches OSError (and/or Exception), and on failure log or print
a clear error message including the exception details and the target path (use
build_info and sha), and exit/raise a controlled error rather than letting a raw
exception propagate; reference the embed function, the build_info Path variable,
and the write_text call when making the changes.

Comment thread tools/embed_commit_sha.py
from pathlib import Path


def embed(sha: str) -> None:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate the SHA parameter.

The function does not validate that sha is non-empty or has a reasonable format. While CI workflows are expected to pass valid commit SHAs, a basic check improves robustness and produces clearer errors if called incorrectly.

✅ Proposed validation check
 def embed(sha: str) -> None:
+    if not sha or not sha.strip():
+        raise ValueError("commit SHA cannot be empty")
     build_info = Path("src/aiperf/_build_info.py")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/embed_commit_sha.py` at line 27, The embed(sha: str) function must
validate its sha parameter before use; add a guard that checks sha is a
non-empty string and matches a reasonable git SHA pattern (for example hex
characters, 7–40 chars) and raise a clear ValueError if it fails; update embed
to perform this validation at the top (validate non-empty and regex like
^[0-9a-fA-F]{7,40}$) so callers get an explicit error instead of downstream
failures.

The K8s-driver fallback in bootstrap-buildkit was using
'nodeselector=...,purpose=aiperf-build' — a label inherited from the
dynamo-cluster bootstrap pattern. aiperf's cluster labels its builder
pool with 'purpose=build' instead, so the fallback path had nowhere
to schedule (confirmed via FailedScheduling events:
'purpose In [aiperf-build] not in purpose In [build]').

This is dormant under normal operation (the remote-driver path
attaches to the healthy buildkit-{amd64,arm64} StatefulSet pods and
the fallback never runs). It only matters when the StatefulSet pods
are temporarily unreachable — at which point the fallback should
actually be able to land. Observed in nightly run 26018678419 where
the StatefulSet was rolling and the fallback couldn't recover.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
@copy-pr-bot copy-pr-bot Bot had a problem deploying to automated-release May 18, 2026 21:29 Failure
saturley-hall and others added 2 commits May 18, 2026 17:31
Pairs with the preceding nodeselector fix. All 18 purpose=build nodes
in aiperf's cluster carry the taint buildkit-worker=true:NoSchedule
(verified via 'kubectl get nodes -l purpose=build -o custom-columns=
...,TAINTS:.spec.taints[*]'). The action's previous default tolerated
buildkit-fallback-worker — a key inherited from dynamo's two-pool
convention — which doesn't exist in aiperf's cluster.

With this change, the K8s-driver fallback can actually land on the
same nodes the buildkit StatefulSet runs on when the StatefulSet
itself is rolling or unreachable. The nodes are otherwise idle in
that condition (the StatefulSet pods are restarting / not consuming
resources), so co-tenancy on those nodes during outages is the
intended use of the fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
The script has a shebang line (#!/usr/bin/env python3) but lacked the
executable bit, which trips pre-commit's check-shebang-scripts-are-
executable hook. Workflows invoke it as 'python3 tools/...' so the
mode bit isn't strictly required for CI, but the hook's invariant
holds: if a file has a shebang, declare it executable. chmod +x +
git update-index --chmod=+x persists the mode change in the tree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
@copy-pr-bot copy-pr-bot Bot had a problem deploying to automated-release May 18, 2026 21:36 Failure
@copy-pr-bot copy-pr-bot Bot deployed to automated-release May 18, 2026 21:47 Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant