ci: PR/post-merge/RC pipelines + buildx worker pattern across all build workflows by saturley-hall · Pull Request #951 · ai-dynamo/aiperf

saturley-hall · 2026-05-18T01:32:42Z

Summary

Introduces three new build pipelines (pr, post-merge, rc) and refactors nightly to use the same dynamo-style multi-arch buildx pattern. Brings the K8s-pod-attached buildx pattern from PR #820 (and ultimately from ai-dynamo/dynamo) into main, and gives every code path — PR, post-merge to main, post-merge to release branches, nightly, and release candidate — a defined artifact destination.

PR #820's attribution work is already on main (Dockerfile's python-licenses stage); the hand-maintained ATTRIBUTIONS-*.md files at the repo root are deleted as part of this change.

What each workflow does

Workflow	Trigger	Wheel destination	Container destination
`pr.yaml`	push to `pull-request/*`	Artifactory `pr/pr<N>/`	ECR `aiperf:pr-<N>-<short_sha>`
`post-merge.yaml`	push to `main` or `release/*`	Artifactory `post-merge/<run_id>/`	ECR `aiperf:main-<short_sha>` or `aiperf:release-<X.Y.Z>-<short_sha>`
`nightly.yml`	schedule Mon–Fri 09:00 UTC or `workflow_dispatch`	Artifactory `nightly/`	ECR `aiperf:nightly-<YYYYMMDD>-<short_sha>` → NGC staging
`rc.yaml`	`workflow_dispatch` (manual)	Artifactory `rc/<X.Y.Z>/rc<N>/`	ECR (source) → NGC staging `aiperf:<X.Y.Z>-rc<N>`

Wheel versioning rules

PR: <base>.dev0+pr<N>.g<short_sha> — PEP 440 local-version label.
Post-merge on main: <base>a<run_number>+g<short_sha> — alpha pre-release.
Post-merge on release/<X.Y.Z>: literal version = from pyproject.toml, untouched. Release branches own their version via the project's shipping process — the release captain commits 1.2.3, 1.2.3rc0, 1.2.3.dev0, etc., and the pipeline never rewrites it. The post-merge artifactory subpath disambiguates between commits.
Nightly: <base>.dev<YYYYMMDD> (unchanged).
RC: same as the post-merge wheel it's promoting (no rebuild). RC-ness is encoded in the artifactory path, not the filename.

RC is promote-only

rc.yaml does not rebuild. It takes a commit_sha input, discovers the post-merge.yaml run that built that commit via gh api, then:

crane copy the multi-arch image from ECR aiperf:release-<X.Y.Z>-<short_sha> → NGC aiperf:<X.Y.Z>-rc<N>.
Copy the wheel within Artifactory: post-merge/<run_id>/<filename> → rc/<X.Y.Z>/rc<N>/<filename>.
Trigger the GitLab security scan (same payload shape as nightly).

That means the wheel inside the container and on Artifactory is byte-identical to what post-merge produced; RC is purely a re-tag + re-publish. If you need a different wheel version for RC, commit it on the release branch first, push (post-merge builds it), then promote.

Environment gates

Environment	Used by	Purpose
`automated-release`	`nightly.yml` staging, `post-merge.yaml` stage-wheel, `rc.yaml` staging	Existing env. Holds NGC + Artifactory + GitLab secrets; gates artifact promotion to public destinations.
`manual-release-approval`	`rc.yaml` `approve` job	New env, must be created in Settings > Environments before the first RC dispatch. Single human gate that unblocks the RC's promotion jobs.

The RC pipeline structure is: approve (manual-release-approval) → three staging jobs (automated-release) → summary. If automated-release has required reviewers configured, each staging job will independently prompt — that's a GitHub Actions UX wart, not a workflow issue.

Dynamo buildx pattern

New composite action .github/actions/bootstrap-buildkit/ (ported from ai-dynamo/dynamo's same-named action) attaches docker buildx to per-arch BuildKit pods reachable via headless K8s DNS (buildkit-{arch}-0.buildkit-{arch}-headless.buildkit.svc.cluster.local:1234), with a Kubernetes-driver fallback. One docker buildx build --platform linux/amd64,linux/arm64 --push against that builder produces the OCI index natively — no more per-arch jobs + manifest stitching.

Used by pr.yaml, post-merge.yaml, and the refactored nightly.yml. RC doesn't build, so it doesn't use this.

All callers pass fresh_builder: 'true' and a github.run_id-keyed builder name to eliminate stale-context collisions across runs.

Cache strategy per workflow

pr.yaml: --cache-to writes to aiperf:cache-pr-<N> only (PRs never write to other flavors' caches; concurrent PRs cannot blow each other's manifests away). --cache-from reads cache-pr-<N> first, then falls back to cache-post-merge so a brand-new PR's first build gets warm layers from main.
post-merge.yaml: shared aiperf:cache-post-merge (read + write).
nightly.yml: no registry cache. Nightly is the rigorous build that revalidates the Dockerfile from scratch against the BuildKit pod's persistent layer cache only. Drift in pinned base images, system packages, or transitive deps surfaces here rather than being masked by a stale registry cache.
rc.yaml: doesn't build; N/A.

Files deleted

ATTRIBUTIONS-Python.md, ATTRIBUTIONS-container.md — superseded by Dockerfile's python-licenses stage; auto-generated and shipped inside the container at /licenses/.
.github/actions/create-multiarch-manifest/ — buildx now produces the manifest natively, no consumers left.

Before merging

Create the manual-release-approval GitHub environment under Settings > Environments. Configure the release reviewer list. This must exist before any RC dispatch can succeed.
Verify the automated-release env has the secrets RC needs: NGC_PUBLISH_TOKEN, NGC_PUBLISH_USERNAME, NGC_STAGING_IMAGE_BASE, ARTIFACTORY_URL, ARTIFACTORY_USER, ARTIFACTORY_TOKEN, ARTIFACTORY_PYPI_URL, GITLAB_TRIGGER_TOKEN, GITLAB_PIPELINE_URL. These already exist for nightly; RC reuses them.
Consider an ECR lifecycle policy for aiperf:cache-pr-* images so closed PRs' caches expire after N days. Not blocking for merge.

Test plan

Push a no-op commit to a pull-request/0 branch; confirm pr.yaml produces pr/pr0/aiperf-...whl in artifactory and aiperf:pr-0-<sha> (multi-arch manifest) in ECR.
Merge a small change to main; confirm post-merge.yaml produces post-merge/<run_id>/aiperf-<base>a<run>+g<sha>-py3-none-any.whl and aiperf:main-<sha>; verify the automated-release gate pauses for approval.
Create a throwaway release/0.9.0 branch with version = "0.9.0" committed to pyproject.toml; push a small change; confirm the wheel filename is exactly aiperf-0.9.0-py3-none-any.whl (no alpha/rc suffix, no local-version label) — the workflow did not rewrite pyproject.toml.
Dispatch nightly.yml; confirm a single build job replaces the prior two-job + manifest-stitch topology, and docker buildx imagetools inspect aiperf:nightly-<date>-<sha> shows both amd64 and arm64 manifests in the OCI index.
After creating manual-release-approval, dispatch rc.yaml with a known commit SHA from a release branch; confirm the workflow finds the post-merge run via gh api, pauses on approve, then after approval copies ECR → NGC and Artifactory post-merge/... → rc/<X.Y.Z>/rc<N>/....

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added CI workflows for post-merge, PR, nightly, and release-candidate pipelines with unified multi-arch container builds.
- New actions for bootstrapping multi-arch BuildKit and for computing release metadata.
- Added a build-time utility to embed commit SHAs into artifacts.
- Package metadata updated with homepage/repo/issues URLs.
Documentation
- Contributing guide updated with release pipeline docs and local validation steps.
Removed
- Removed container attribution document and previous multi-arch manifest action.

Bootstraps a multi-arch buildx builder by attaching to per-arch BuildKit pods in the buildkit namespace via headless K8s DNS, with a Kubernetes driver fallback. Lifted from ai-dynamo/dynamo's same-named action; renamed metadata and usage examples from the prior 'init-aiperf-builder' name. This action lets the build workflows produce multi-arch images with a single 'docker buildx build --platform linux/amd64,linux/arm64 --push' instead of running separate per-arch jobs and stitching a manifest after. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Single source of truth for build-flavor version math. Given a flavor (pr, post-merge-main, post-merge-release, nightly) and a commit SHA, emits wheel_version, wheel_filename, container_tag, container_image, container_ref, short_sha, artifactory_subpath, and skip_version_mutation. post-merge-release is the one flavor that does not mutate pyproject.toml: release branches own their own version via the project's shipping process, so the action returns skip_version_mutation=true and the workflow short- circuits update-pyproject-version. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Fires on pushes to pull-request/<N> branches (the copy-pr-bot mirror convention already used by fern-docs.yml). Publishes: wheel -> artifactory: pr/pr<N>/aiperf-<base>.dev0+pr<N>.g<sha>-py3-none-any.whl image -> ECR: aiperf:pr-<N>-<short_sha> (multi-arch manifest) Cache strategy: --cache-to writes only to a per-PR cache tag (aiperf:cache-pr-<N>) so concurrent PRs don't blow each other's cache manifests away. --cache-from reads first from cache-pr-<N>, then falls back to cache-post-merge so a brand-new PR's first build still gets warm layers from main. Builder name uses github.run_id and fresh_builder: true so a canceled prior run never leaves a stale buildx context that collides with the next push. No environment gate. PR artifacts are scratch space. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Fires on pushes to main and release/X.Y.Z. Flavor-conditional behavior: main: wheel = <base>a<run_number>+g<short_sha> (alpha pre-release) image = main-<short_sha> release/*: wheel = <pyproject.toml version, untouched> image = release-<X.Y.Z>-<short_sha> Release branches own their version via the project's shipping process; the workflow skips update-pyproject-version when release_metadata's skip_version_mutation output is true. Wheel goes S3 (build) -> Artifactory post-merge/<run_id>/<filename> (stage). The stage-wheel job is gated by the existing automated-release GitHub environment, matching nightly's pattern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

…cache Replaces the per-arch matrix + create-multiarch-manifest-ecr two-step choreography with one job that runs: docker buildx build --platform linux/amd64,linux/arm64 --target runtime --push against a multi-arch builder bootstrapped via bootstrap-buildkit. The buildx invocation produces and pushes the OCI index natively, so create-multiarch-manifest is deleted (no consumers remain). stage-nightly-container is simplified to a single 'crane copy' of the multi-arch index, instead of copying per-arch manifests and stitching them with 'crane index append' on NGC. Nightly intentionally runs WITHOUT --cache-from/--cache-to. Nightly is the rigorous build that validates the Dockerfile from scratch (against the BuildKit pod's persistent layer cache only). Drift in pinned base images, system packages, or transitive Python deps now surfaces here rather than being masked by a registry cache layer reused from a stale PR or post-merge build. bootstrap-buildkit is invoked with fresh_builder: true so a prior canceled run or stale builder context cannot collide with a new run. The 'needs:' chain and pipeline-summary job are updated to drop the removed create-multiarch-manifest-ecr step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Manually-dispatched release-candidate pipeline. Does NOT rebuild. Promotes an already-built post-merge artifact (from a release/X.Y.Z branch) to RC locations on NGC staging and Artifactory. Inputs: release_version (X.Y.Z), rc_number, commit_sha (40-char), and skip_security_scan. The 'prepare' job validates inputs and uses gh api to discover the post-merge.yaml run ID for the commit so the workflow can locate the source wheel at Artifactory post-merge/<run_id>/. Promotion topology: approve (manual-release-approval env) - single human gate stage-container (automated-release env) - crane copy ECR aiperf:release-<X.Y.Z>-<sha> NGC aiperf:<X.Y.Z>-rc<N> stage-wheel (automated-release env) - artifactory copy post-merge/<run_id>/<wheel> rc/<X.Y.Z>/rc<N>/<wheel> trigger-gitlab-scan (automated-release env) - mirrors nightly Wheel keeps the post-merge filename (aiperf-<X.Y.Z>-py3-none-any.whl) - the rc-ness is encoded in the artifactory path, consistent with the rule that release branches own their version. Requires the GitHub environment 'manual-release-approval' to be created in Settings > Environments with the release reviewer list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

These files are now produced fresh on every container build by the Dockerfile's python-licenses stage (pip-licenses + cyclonedx-py + generate_python_attributions.py + generate_dpkg_attributions.py) and ship inside the container at /licenses/python/ and /licenses/dpkg/. The hand-maintained copies at the repo root were drifting from the auto-generated truth and serve no purpose now that every release flavor emits them as build artifacts. ATTRIBUTIONS.md (the project's own license attribution file) is unrelated and stays in place. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Adds a Release Pipelines section to CONTRIBUTING.md covering pr, post-merge, nightly, and rc workflows: triggers, wheel destinations, container destinations, wheel-versioning rules per flavor, and the two GitHub environment gates (automated-release for nightly/post-merge staging, manual-release-approval for rc). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

github-actions · 2026-05-18T01:32:51Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b1fb143aad77ff93b5db0566b654d7d177520348

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b1fb143aad77ff93b5db0566b654d7d177520348

Last updated for commit: b1fb143 • Browse code

coderabbitai · 2026-05-18T01:37:59Z

Warning

Rate limit exceeded

@saturley-hall has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 52 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a2ad1c81-f452-40a4-8315-a3cade30e310

📥 Commits

Reviewing files that changed from the base of the PR and between 40ba3b4 and b1fb143.

📒 Files selected for processing (2)

.github/actions/bootstrap-buildkit/action.yml
tools/embed_commit_sha.py

Walkthrough

Adds composite actions and workflows to centralize multi-arch buildx container and wheel builds, compute release metadata by flavor, implement nightly/pr/post-merge/RC pipelines, update documentation, and add a commit-SHA embedding helper.

Changes

Release and Build Pipeline Workflows

Layer / File(s)	Summary
Bootstrap BuildKit composite action `.github/actions/bootstrap-buildkit/action.yml`	New composite action that configures multi-arch docker buildx builder by resolving DNS-based remote worker addresses and falling back to Kubernetes driver, with fresh builder cleanup and retry-based bootstrap logic.
Release metadata computation action `.github/actions/release-metadata/action.yml`	New composite action that validates commit SHA and computes flavor-specific (pr, post-merge-main, post-merge-release, nightly) wheel versions, container tags, Artifactory paths, and version mutation flags from `pyproject.toml`.
Nightly workflow multi-arch refactor `.github/workflows/nightly.yml`	Refactored nightly pipeline to replace per-architecture matrix builds and separate manifest creation with single unified buildx multi-arch build; removes dependency on deleted `create-multiarch-manifest` action and updates staging logic to assume manifest already exists from buildx push.
PR build workflow `.github/workflows/pr.yaml`	New workflow triggered on `pull-request/*` pushes that parses PR number, computes metadata, builds amd64-only wheel and multi-arch (amd64/arm64) container, and uploads artifacts to ECR and Artifactory (wheel staging gated).
Post-merge workflow for main and release branches `.github/workflows/post-merge.yaml`	New workflow triggered on pushes to `main` and `release/*` that computes release flavor, conditionally patches version, builds wheel and multi-arch container, validates versions, and uploads wheel to Artifactory via gated environment.
Release candidate promotion workflow `.github/workflows/rc.yaml`	New manual `workflow_dispatch` workflow for RC promotion that validates inputs, discovers prior post-merge artifacts, copies container from ECR to NGC, stages wheel to RC Artifactory path, and optionally triggers GitLab security scans.
Documentation and misc `CONTRIBUTING.md`, `pyproject.toml`, `tools/embed_commit_sha.py`	Added "Release Pipelines" docs, added `[project.urls]` metadata, and added a CI helper to embed commit SHA into `src/aiperf/_build_info.py`.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐇 I hopped through builds and tags tonight,
Poked DNS for workers, set builders right,
Flavors aligned—pr, nightly, release in sight,
Wheels and indices pushed on multi-arch flight,
My tiny paws stamped metadata just right. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: introduction of new CI workflows (PR/post-merge/RC pipelines) and adoption of a shared buildx worker pattern across build workflows.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/actions/bootstrap-buildkit/action.yml:
- Around line 28-31: Normalize and trim each comma-separated token from the arch
input (the 'arch' input value) before generating platform/DNS strings and when
mapping to workers; ensure you call .trim() (or equivalent) on each token and
normalize forms like "arm64" to "linux/arm64" consistently. When computing
worker_addresses, only emit/set the output (worker_addresses) if every requested
arch token successfully resolves to a worker (i.e., require full resolution for
all requested arches), otherwise leave worker_addresses empty or fall back;
apply the same trimming+full-resolution guard to the other blocks that build
platforms/worker lists mentioned (the repeated arch-to-platform/worker mapping
sections).

In @.github/actions/release-metadata/action.yml:
- Around line 101-105: The BASE_VERSION extraction currently restricts to a bare
X.Y.Z and fails when pyproject.toml contains suffixes (e.g. 1.2.3rc1); change
the parsing in the block that sets BASE_VERSION to capture the entire quoted
version string (including any suffix) instead of only digits-dots (i.e., grab
the value between the quotes using the existing grep/sed pipeline), then allow
the downstream release-flavor logic to trim or interpret suffixes as needed;
make the same change in the analogous parsing block around the later section
(the block currently at lines referenced in the review) so both places read the
full quoted version into BASE_VERSION for flavor-specific handling.

In @.github/workflows/pr.yaml:
- Around line 21-24: The workflow currently triggers on branches matching
'pull-request/*' and checks out and runs untrusted PR code on the privileged
self-hosted runner (prod-aiperf-default-v1) and later publishes with
AWS/Artifactory credentials; change this by splitting the pipeline into two
workflows or adding an explicit approval gate: move the checkout/build steps for
PR branches into an unprivileged PR workflow that uses GitHub-hosted runners and
never uses production credentials, and create a separate trusted
promotion/publish workflow that only runs on protected branches or after a
manual approval event and runs on prod-aiperf-default-v1 with AWS/Artifactory
credentials; ensure references to the 'pull-request/*' trigger, any checkout
actions, and the publish steps that use AWS/Artifactory credentials are updated
accordingly.
- Around line 47-48: The workflow uses mutable tags for external actions
(actions/checkout@v4 and astral-sh/setup-uv@v5) which allows retagged code to
run with repository secrets; replace each mutable tag occurrence (including the
second actions/checkout usage and the astral-sh/setup-uv usage) with the
corresponding immutable commit SHA for the exact release you trust, updating the
uses entries to the full "owner/repo@sha" form so the workflow runs the pinned
action commits only.

In @.github/workflows/rc.yaml:
- Around line 118-125: The workflow lookup uses RESPONSE (gh api call) and
filters only by head_sha (${COMMIT_SHA}), which can match commits across
branches; update the gh api call that builds RESPONSE to also pass the release
branch filter (add -f "branch=${RELEASE_BRANCH}" or the repo’s release branch
variable) so the returned workflow_runs are constrained to the intended release
branch before extracting RUN_ID; ensure the branch variable you add is
defined/propagated in the workflow context.

In `@CONTRIBUTING.md`:
- Line 201: Replace the incorrect CLI invocation "gh act -W
.github/workflows/pr.yaml" with the standalone act command; locate the
occurrence in CONTRIBUTING.md and change it to use "act -W
.github/workflows/pr.yaml" (or simply "act" as shown later) so the documentation
references the nektos/act binary rather than a non-existent gh subcommand.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3dfbf66a-03c8-431e-bd4a-1f8553d1c39b

📥 Commits

Reviewing files that changed from the base of the PR and between bac953d and daa9ae4.

📒 Files selected for processing (10)

.github/actions/bootstrap-buildkit/action.yml
.github/actions/create-multiarch-manifest/action.yml
.github/actions/release-metadata/action.yml
.github/workflows/nightly.yml
.github/workflows/post-merge.yaml
.github/workflows/pr.yaml
.github/workflows/rc.yaml
ATTRIBUTIONS-Python.md
ATTRIBUTIONS-container.md
CONTRIBUTING.md

💤 Files with no reviewable changes (2)

ATTRIBUTIONS-container.md
.github/actions/create-multiarch-manifest/action.yml

coderabbitai · 2026-05-18T01:38:02Z

+  arch:
+    description: 'Comma-separated Docker platform(s): linux/amd64, linux/arm64, or linux/amd64,linux/arm64'
+    required: false
+    default: 'linux/amd64,linux/arm64'


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize arch tokens and require full resolution before taking the remote path.

The input docs allow linux/amd64, linux/arm64, but the parser never trims per-token whitespace, so the second entry becomes arm64 and generates invalid DNS/platform strings. Also, worker_addresses != '' accepts a partially resolved set, so a requested multi-arch build can proceed with only one worker instead of falling back. Trim each token before use and only emit worker_addresses when every requested arch resolves.

Suggested fix

WORKER_ADDRS="" + RESOLVED_COUNT=0 IFS=',' read -ra ARCHS <<< "$ARCH" for arch in "${ARCHS[@]}"; do + arch="${arch//[[:space:]]/}" DNS="buildkit-${arch}-0.buildkit-${arch}-headless.${NAMESPACE}.svc.cluster.local" if nslookup "$DNS" >/dev/null 2>&1; then WORKER_ADDRS="${WORKER_ADDRS:+${WORKER_ADDRS},}tcp://${DNS}:${PORT}" + RESOLVED_COUNT=$((RESOLVED_COUNT + 1)) echo "Resolved ${arch} worker: ${DNS}" else echo "No DNS for ${arch} worker (${DNS})" fi done + + if [ "${RESOLVED_COUNT}" -ne "${`#ARCHS`[@]}" ]; then + WORKER_ADDRS="" + fi

FIRST=true IFS=',' read -ra ARCHS <<< "$ARCH" for arch in "${ARCHS[@]}"; do + arch="${arch//[[:space:]]/}" # Comma-containing values (nodeselector, tolerations) must be wrapped in

Also applies to: 84-100, 112-127, 134-140

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/actions/bootstrap-buildkit/action.yml around lines 28 - 31, Normalize and trim each comma-separated token from the arch input (the 'arch' input value) before generating platform/DNS strings and when mapping to workers; ensure you call .trim() (or equivalent) on each token and normalize forms like "arm64" to "linux/arm64" consistently. When computing worker_addresses, only emit/set the output (worker_addresses) if every requested arch token successfully resolves to a worker (i.e., require full resolution for all requested arches), otherwise leave worker_addresses empty or fall back; apply the same trimming+full-resolution guard to the other blocks that build platforms/worker lists mentioned (the repeated arch-to-platform/worker mapping sections).

coderabbitai · 2026-05-18T01:38:02Z

+        BASE_VERSION=$(grep -m1 -E '^version = "[0-9]+\.[0-9]+\.[0-9]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')
+        if [[ -z "${BASE_VERSION}" ]]; then
+          echo "::error::Could not parse base version from pyproject.toml"
+          exit 1
+        fi


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The base-version parser rejects suffixes that release branches are supposed to preserve.

BASE_VERSION only matches bare X.Y.Z, but the post-merge-release path explicitly says release branches may keep rc/post/dev suffixes from pyproject.toml. A branch carrying 1.2.3rc1 will fail at Line 101 before it ever reaches the release-flavor logic. Parse the full quoted version here, then let each flavor decide how much of it to use.

Suggested fix

- BASE_VERSION=$(grep -m1 -E '^version = "[0-9]+\.[0-9]+\.[0-9]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/') + BASE_VERSION=$(grep -m1 -E '^version = "[^"]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')

Also applies to: 132-145

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/actions/release-metadata/action.yml around lines 101 - 105, The BASE_VERSION extraction currently restricts to a bare X.Y.Z and fails when pyproject.toml contains suffixes (e.g. 1.2.3rc1); change the parsing in the block that sets BASE_VERSION to capture the entire quoted version string (including any suffix) instead of only digits-dots (i.e., grab the value between the quotes using the existing grep/sed pipeline), then allow the downstream release-flavor logic to trim or interpret suffixes as needed; make the same change in the analogous parsing block around the later section (the block currently at lines referenced in the review) so both places read the full quoted version into BASE_VERSION for flavor-specific handling.

coderabbitai · 2026-05-18T01:38:02Z

+      - name: Checkout
+        uses: actions/checkout@v4


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/pr.yaml

Repository: ai-dynamo/aiperf

Length of output: 11991

Pin external actions to immutable SHAs to prevent privilege escalation.

This workflow has access to publishing credentials (AWS ECR and Artifactory tokens), but external actions actions/checkout@v4 (lines 48, 78) and astral-sh/setup-uv@v5 (line 81) use mutable version tags. If these actions are retagged or compromised upstream, they would inherit access to these secrets. Replace with full commit SHAs to ensure only the intended action code runs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/pr.yaml around lines 47 - 48, The workflow uses mutable tags for external actions (actions/checkout@v4 and astral-sh/setup-uv@v5) which allows retagged code to run with repository secrets; replace each mutable tag occurrence (including the second actions/checkout usage and the astral-sh/setup-uv usage) with the corresponding immutable commit SHA for the exact release you trust, updating the uses entries to the full "owner/repo@sha" form so the workflow runs the pinned action commits only.

coderabbitai · 2026-05-18T01:38:02Z

+          RESPONSE=$(gh api -X GET \
+            "/repos/${GITHUB_REPOSITORY}/actions/workflows/post-merge.yaml/runs" \
+            -f "head_sha=${COMMIT_SHA}" \
+            -f "status=success" \
+            -f "per_page=10")
+
+          RUN_ID=$(echo "${RESPONSE}" | jq -r '.workflow_runs | sort_by(.created_at) | reverse | .[0].id // empty')
+          if [ -z "${RUN_ID}" ]; then


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

GitHub REST API (workflow runs): Which query parameter should be used to filter runs by branch (branch/head_branch) when calling /actions/workflows/{workflow_id}/runs?

💡 Result:

To filter workflow runs by branch when calling the GitHub REST API endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs, you should use the branch query parameter [1]. The branch parameter accepts the name of the branch as a string [1]. When provided, it returns only the workflow runs associated with that specific branch [1]. Example usage: GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs?branch=main Note that while some GitHub API contexts may refer to head branches or references, the specific query parameter for this endpoint is consistently documented as branch [2][3][1].

Citations:

1: https://www.withone.ai/knowledge/github/conn_mod_def%3A%3AGJ3YfVeYEEY%3A%3A8cau0s5EQ8me4n-NqNU9iQ

2: https://docs.github.com/v3/actions/workflow-runs

3: https://docs.github.com/en/rest/actions/workflow-runs

🌐 Web query:

GitHub REST API workflow runs endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs parameters documentation

💡 Result:

The GitHub REST API endpoint to list workflow runs for a specific workflow is: GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs This endpoint allows you to retrieve a list of all runs for a particular workflow [1]. The {workflow_id} path parameter can be either the unique integer ID of the workflow or its file name (e.g., main.yaml) [1][2]. Common query parameters supported by this endpoint include: - actor (string): Filters results by the user who triggered the workflow run [1]. - branch (string): Filters results by the branch name [1]. - event (string): Filters results by the event that triggered the workflow (e.g., push, pull_request) [1]. - status (string): Filters results by the status or conclusion of the run (e.g., success, failure, in_progress) [3][4]. - per_page (integer): The number of results to return per page (max 100) [5]. - page (integer): The page number of the results to fetch [5]. - created (string): Filters runs created within a specific date-time range [3][4]. - exclude_pull_requests (boolean): If true, omits pull request-related runs from the response [3][4]. - check_suite_id (integer): Filters runs associated with a specific check suite [3][4]. - head_sha (string): Filters runs associated with a specific commit SHA [3][4]. For authentication, if the repository is private, you must use an access token with the repo scope [1]. Anyone with read access to the repository can use this endpoint [1].

Citations:

1: https://github.apidog.io/api-3489171

2: https://docs.github.com/rest/actions/workflows

3: https://docs.github.com/en/rest/actions/workflow-runs

4: https://docs.github.com/rest/actions/workflow-runs

5: https://docs.github.com/rest/actions/workflow-jobs

Constrain post-merge run discovery to the intended release branch.

Run lookup filters only by head_sha, so the same commit hash on multiple branches (e.g., main and release/) can cause the wrong artifact set to be selected for RC promotion.

Add the branch parameter to filter by the release branch:

Proposed fix

- name: Discover post-merge run for commit id: discover env: GH_TOKEN: ${{ github.token }} COMMIT_SHA: ${{ inputs.commit_sha }} + VERSION: ${{ inputs.release_version }} run: | set -euo pipefail RESPONSE=$(gh api -X GET \ "/repos/${GITHUB_REPOSITORY}/actions/workflows/post-merge.yaml/runs" \ -f "head_sha=${COMMIT_SHA}" \ + -f "branch=release/${VERSION}" \ -f "status=success" \ -f "per_page=10")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/rc.yaml around lines 118 - 125, The workflow lookup uses RESPONSE (gh api call) and filters only by head_sha (${COMMIT_SHA}), which can match commits across branches; update the gh api call that builds RESPONSE to also pass the release branch filter (add -f "branch=${RELEASE_BRANCH}" or the repo’s release branch variable) so the returned workflow_runs are constrained to the intended release branch before extracting RUN_ID; ensure the branch variable you add is defined/propagated in the workflow context.

coderabbitai · 2026-05-18T01:38:02Z

+yamllint .github/workflows/
+
+# Smoke-test the bootstrap-buildkit action against a local docker daemon:
+gh act -W .github/workflows/pr.yaml


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Verify act is standalone, not a gh subcommand # Check if act is available as standalone command if command -v act &> /dev/null; then echo "✓ act is available as standalone command" act --version else echo "✗ act not found" fi # Check if gh has act subcommand if gh act --help &> /dev/null 2>&1; then echo "✓ gh act subcommand exists" else echo "✗ gh act subcommand does not exist (expected)" fi

Repository: ai-dynamo/aiperf

Length of output: 122

🏁 Script executed:

cat -n CONTRIBUTING.md | sed -n '190,215p'

Repository: ai-dynamo/aiperf

Length of output: 1136

🏁 Script executed:

cat CONTRIBUTING.md | grep -E '^\s*[|+\-*]|^\s*[^a-zA-Z0-9#\s\`]' | head -30

Repository: ai-dynamo/aiperf

Length of output: 84

🏁 Script executed:

# Search for ASCII art diagrams (non-table lines with repeated special chars) rg -n '^\s*[\+\-]{2,}|^\s*[|]{1,}\s*[|\-\s]{5,}' CONTRIBUTING.md

Repository: ai-dynamo/aiperf

Length of output: 135

🏁 Script executed:

# Check for actual ASCII art (box drawings, flow diagrams) rg -n '^\s*\+[-=]+\+|^\s*[\|/\\][^\w]|^\s*[╔═╗║╚╝]' CONTRIBUTING.md | head -20

Repository: ai-dynamo/aiperf

Length of output: 1256

🏁 Script executed:

# Check the full context around potential ASCII diagrams - look for lines with unusual patterns cat CONTRIBUTING.md | grep -v '^|' | grep -v '^#' | grep -v '^\s*$' | grep -v '^\s*-' | head -40

Repository: ai-dynamo/aiperf

Length of output: 2539

🏁 Script executed:

# Check entire file for actual ASCII art diagrams (not tables) # Look for box-drawing patterns or flow diagrams wc -l CONTRIBUTING.md && echo "---" && \ rg -i 'diagram|ascii|flow|chart|architecture' CONTRIBUTING.md

Repository: ai-dynamo/aiperf

Length of output: 879

Incorrect command: gh act is not valid.

The GitHub CLI (gh) does not have an act subcommand. The act tool is a standalone binary from nektos/act. Line 209 below correctly shows usage as just act without the gh prefix.

Proposed fix

-gh act -W .github/workflows/pr.yaml +act -W .github/workflows/pr.yaml

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

gh act -W .github/workflows/pr.yaml

act -W .github/workflows/pr.yaml

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CONTRIBUTING.md` at line 201, Replace the incorrect CLI invocation "gh act -W .github/workflows/pr.yaml" with the standalone act command; locate the occurrence in CONTRIBUTING.md and change it to use "act -W .github/workflows/pr.yaml" (or simply "act" as shown later) so the documentation references the nektos/act binary rather than a non-existent gh subcommand.

GitHub Actions silently strips a job output whose value contains a registered secret, to prevent leaking the secret to downstream jobs. The release-metadata action's container_image and container_ref both embed the ECR registry hostname (built from secrets.AWS_ACCOUNT_ID + AWS_DEFAULT_REGION), so when pr.yaml and post-merge.yaml promoted them to the prepare job's top-level outputs, GitHub dropped them. The downstream build job then ran 'docker buildx build --tag ""' and failed with 'invalid tag "": repository name must have at least one component'. Observed in https://github.com/ai-dynamo/aiperf/actions/runs/26008818007 with annotations 'Skip output container_ref/container_image since it may contain secret.' Fix mirrors nightly.yml's pattern: only promote container_tag across job boundaries (no secret), and reconstruct the full ECR ref locally in each step that needs it from the AWS secret env vars the step already has access to. Changes: - release-metadata action: remove container_image and container_ref outputs; drop the now-unused container_registry/container_repository inputs; document the constraint in a comment near the GITHUB_OUTPUT block. - pr.yaml, post-merge.yaml: drop the two outputs from prepare's job outputs; pass container_tag instead and compute CONTAINER_REF locally in the build step ('CONTAINER_REF="${REGISTRY}/aiperf:${CONTAINER_TAG}"'). Summary steps now show container_tag (which is what you'd grep for in ECR anyway). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

…sues) Repository in [project.urls] is meant to be the canonical source repo URL per PEP 621 convention (matches numpy, pandas, requests, scipy, … — all point at the repo root, not a specific commit). Until now the nightly/PR/post-merge pipelines mutated this field per build to a commit-specific URL, which is misuse: SBOM tools, pip show, and the Artifactory UI all expect 'where do I clone this from?', not 'what revision is this exact wheel?'. This commit puts the canonical URLs statically in pyproject.toml so they ship in every wheel's METADATA without per-commit rewriting: Homepage = "https://github.com/ai-dynamo/aiperf" Repository = "https://github.com/ai-dynamo/aiperf" Issues = "https://github.com/ai-dynamo/aiperf/issues" Per-commit traceability still lives where it belongs: - the wheel's version string carries the short SHA via PEP 440 local-version (e.g. 0.8.0.dev0+pr951.g88f1498) - aiperf.__commit_sha__ carries the full SHA at runtime (via src/aiperf/_build_info.py, which CI continues to generate) A follow-up commit removes the [project.urls] mutation from the embed-commit-sha CI script so this field never gets rewritten. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Replaces the inline Python heredoc in the nightly build-container job (soon also in pr.yaml and post-merge.yaml — separate commit) with a single script call: python3 tools/embed_commit_sha.py "\${{ github.sha }}" The script writes src/aiperf/_build_info.py so aiperf.__commit_sha__ returns the full SHA at runtime. It does NOT mutate pyproject.toml — the canonical Repository URL is already static there (per the preceding commit). The script lives in tools/ so it is a build-time utility, not packaged inside the wheel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Restructures pr.yaml and post-merge.yaml so wheel publication to Artifactory runs in its own job, gated by the existing automated-release GitHub environment, and uses the JFrog CLI instead of curl -T. WHY PR wheel filenames contain a '+' (PEP 440 local-version label, e.g. aiperf-0.8.0.dev0+pr951.g88f1498-py3-none-any.whl). Artifactory's generic-repo deploy handler rejected the curl PUT URL with HTTP 405, observed in run 26009595857 on PR #951. Same latent issue applies to post-merge-on-main wheels (which also use '+' via the alpha local label). 'jf rt upload' URL-encodes wheel filenames correctly and uses Artifactory's documented deploy endpoint. Putting the upload in its own job lets only that step run under the automated-release env (the build job stays ungated for fast iteration) and re-uses the post-merge two-job pattern that already worked. HOW build (no env, prod-aiperf-default-v1): Pushes the multi-arch runtime image to ECR. Populates cache-pr-<N> (or cache-post-merge) as a side effect of mode=max. stage-wheel (env: automated-release, prod-aiperf-default-v1): 1. Re-applies the same pyproject.toml + _build_info.py mutations build performed, so the wheel-builder cache key matches. 2. Attaches to the same builder_name as build, so it lands on the same K8s BuildKit pods (StatefulSet, shared PV cache). 3. Runs `buildx --target wheel-artifact --output type=local` with --cache-from only (no --cache-to). Every layer is hot from the build job, so only the scratch-export step actually runs; the wheel materializes locally as a near-instant op. 4. Validates the wheel (twine check + venv install + version check + aiperf --help). 5. Sets up the JFrog CLI with JF_URL derived by stripping '/artifactory/<repo>' from ARTIFACTORY_URL (keeps the existing secret format intact; nightly's curl path stays compatible). 6. jf rt upload --flat to '<repo>/<artifactory_subpath>'. post-merge.yaml drops the prior S3 intermediate entirely — the buildx-cache extraction makes it redundant. REQUIRED SECRET (must exist on the repo before this can run) ARTIFACTORY_REPO_NAME = sw-dynamo-aiperf-pypi-local ARTIFACTORY_URL stays as-is (no change needed); workflows derive the JFrog platform URL by stripping the repo suffix in shell. NOT CHANGED nightly.yml and rc.yaml still use curl -T. Nightly wheels (.dev<YYYYMMDD>) and RC wheels (literal pyproject version) never contain '+', so they keep working without code changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/nightly.yml (1)
401-416: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Dead code: matrix.arch references in a non-matrix job will always skip these steps.

The build-container job was refactored from a matrix build to a single multi-platform buildx job, but several steps still reference matrix.arch == 'amd64'. Since there's no matrix defined, matrix.arch is undefined and these conditions evaluate to false, causing the steps to be silently skipped:

Line 402: Generate aiperf-nightly wheel variant

Line 424: Validate wheels

Line 472: Run unit tests against installed nightly wheel

Line 511: Upload wheels to S3

This means nightly wheels are never validated, tested, or uploaded.
🐛 Proposed fix: Remove the obsolete matrix conditionals
       - name: Generate aiperf-nightly wheel variant (amd64 only)
-        if: matrix.arch == 'amd64'
         run: |
Apply the same removal to lines 424, 472, and 511 (or rename the conditions to document they always run on the single amd64 wheel build).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly.yml around lines 401 - 416, The steps inside the
build-container job that are gated by the non-existent matrix variable are dead;
remove the obsolete "if: matrix.arch == 'amd64'" conditionals from the steps
named "Generate aiperf-nightly wheel variant", "Validate wheels", "Run unit
tests against installed nightly wheel" and "Upload wheels to S3" so they always
run in the single multi-platform buildx job (or replace the condition with an
appropriate runtime check if you intended to gate to amd64 specifically).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/embed_commit_sha.py`:
- Line 1: The script starting with the shebang line "#!/usr/bin/env python3" is
not executable which breaks the pre-commit hook; mark the file executable by
setting the executable bit (e.g., run chmod +x on the script or use git add
--chmod=+x when committing) so CI/pre-commit passes. Ensure the change is staged
and committed.
- Around line 27-36: The embed function currently calls
Path("src/aiperf/_build_info.py").write_text(...) without handling filesystem
errors; update embed to ensure the parent directory exists (create with
mkdir(parents=True, exist_ok=True)), then wrap the write_text call in a
try/except that catches OSError (and/or Exception), and on failure log or print
a clear error message including the exception details and the target path (use
build_info and sha), and exit/raise a controlled error rather than letting a raw
exception propagate; reference the embed function, the build_info Path variable,
and the write_text call when making the changes.
- Line 27: The embed(sha: str) function must validate its sha parameter before
use; add a guard that checks sha is a non-empty string and matches a reasonable
git SHA pattern (for example hex characters, 7–40 chars) and raise a clear
ValueError if it fails; update embed to perform this validation at the top
(validate non-empty and regex like ^[0-9a-fA-F]{7,40}$) so callers get an
explicit error instead of downstream failures.

---

Outside diff comments:
In @.github/workflows/nightly.yml:
- Around line 401-416: The steps inside the build-container job that are gated
by the non-existent matrix variable are dead; remove the obsolete "if:
matrix.arch == 'amd64'" conditionals from the steps named "Generate
aiperf-nightly wheel variant", "Validate wheels", "Run unit tests against
installed nightly wheel" and "Upload wheels to S3" so they always run in the
single multi-platform buildx job (or replace the condition with an appropriate
runtime check if you intended to gate to amd64 specifically).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98303f33-6afa-440e-8f8b-8bda853dcd8e

📥 Commits

Reviewing files that changed from the base of the PR and between daa9ae4 and 40ba3b4.

📒 Files selected for processing (6)

.github/actions/release-metadata/action.yml
.github/workflows/nightly.yml
.github/workflows/post-merge.yaml
.github/workflows/pr.yaml
pyproject.toml
tools/embed_commit_sha.py

✅ Files skipped from review due to trivial changes (1)

pyproject.toml

coderabbitai · 2026-05-18T07:08:17Z

@@ -0,0 +1,48 @@
+#!/usr/bin/env python3


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

The file has a shebang but is not marked executable, blocking CI.

The pre-commit hook failure indicates the file needs the executable bit set.

Fix by running:

chmod +x tools/embed_commit_sha.py

Or on Windows:

git add --chmod=+x tools/embed_commit_sha.py

🧰 Tools

🪛 GitHub Actions: Pre-commit / 0_pre-commit.txt

[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed (exit code 1): file has a shebang but is not marked executable. If intended to be executable, run 'chmod +x tools/embed_commit_sha.py' (or on Windows use 'git add --chmod=+x ...').

🪛 GitHub Actions: Pre-commit / pre-commit

[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed: file has a shebang but is not marked executable. Try: chmod +x tools/embed_commit_sha.py (or git add --chmod=+x on Windows).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/embed_commit_sha.py` at line 1, The script starting with the shebang line "#!/usr/bin/env python3" is not executable which breaks the pre-commit hook; mark the file executable by setting the executable bit (e.g., run chmod +x on the script or use git add --chmod=+x when committing) so CI/pre-commit passes. Ensure the change is staged and committed.

coderabbitai · 2026-05-18T07:08:17Z

+def embed(sha: str) -> None:
+    build_info = Path("src/aiperf/_build_info.py")
+    build_info.write_text(
+        "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n"
+        "# SPDX-License-Identifier: Apache-2.0\n"
+        '"""Build-time metadata. Generated by CI; do not edit."""\n'
+        "\n"
+        f'COMMIT_SHA = "{sha}"\n'
+    )
+    print(f"Wrote {build_info} with COMMIT_SHA = {sha}")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add error handling for filesystem operations.

The write_text() call can fail (permission denied, disk full, parent directory missing) and will raise an uncaught exception. This produces cryptic build failures instead of clear error messages.

🛡️ Proposed fix with error handling

def embed(sha: str) -> None: build_info = Path("src/aiperf/_build_info.py") - build_info.write_text( - "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n" - "# SPDX-License-Identifier: Apache-2.0\n" - '"""Build-time metadata. Generated by CI; do not edit."""\n' - "\n" - f'COMMIT_SHA = "{sha}"\n' - ) - print(f"Wrote {build_info} with COMMIT_SHA = {sha}") + try: + build_info.parent.mkdir(parents=True, exist_ok=True) + build_info.write_text( + "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n" + "# SPDX-License-Identifier: Apache-2.0\n" + '"""Build-time metadata. Generated by CI; do not edit."""\n' + "\n" + f'COMMIT_SHA = "{sha}"\n' + ) + print(f"Wrote {build_info} with COMMIT_SHA = {sha}") + except OSError as e: + print(f"Failed to write {build_info}: {e}", file=sys.stderr) + raise SystemExit(1) from e

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/embed_commit_sha.py` around lines 27 - 36, The embed function currently calls Path("src/aiperf/_build_info.py").write_text(...) without handling filesystem errors; update embed to ensure the parent directory exists (create with mkdir(parents=True, exist_ok=True)), then wrap the write_text call in a try/except that catches OSError (and/or Exception), and on failure log or print a clear error message including the exception details and the target path (use build_info and sha), and exit/raise a controlled error rather than letting a raw exception propagate; reference the embed function, the build_info Path variable, and the write_text call when making the changes.

coderabbitai · 2026-05-18T07:08:17Z

+from pathlib import Path
+
+
+def embed(sha: str) -> None:


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate the SHA parameter.

The function does not validate that sha is non-empty or has a reasonable format. While CI workflows are expected to pass valid commit SHAs, a basic check improves robustness and produces clearer errors if called incorrectly.

✅ Proposed validation check

def embed(sha: str) -> None: + if not sha or not sha.strip(): + raise ValueError("commit SHA cannot be empty") build_info = Path("src/aiperf/_build_info.py")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/embed_commit_sha.py` at line 27, The embed(sha: str) function must validate its sha parameter before use; add a guard that checks sha is a non-empty string and matches a reasonable git SHA pattern (for example hex characters, 7–40 chars) and raise a clear ValueError if it fails; update embed to perform this validation at the top (validate non-empty and regex like ^[0-9a-fA-F]{7,40}$) so callers get an explicit error instead of downstream failures.

The K8s-driver fallback in bootstrap-buildkit was using 'nodeselector=...,purpose=aiperf-build' — a label inherited from the dynamo-cluster bootstrap pattern. aiperf's cluster labels its builder pool with 'purpose=build' instead, so the fallback path had nowhere to schedule (confirmed via FailedScheduling events: 'purpose In [aiperf-build] not in purpose In [build]'). This is dormant under normal operation (the remote-driver path attaches to the healthy buildkit-{amd64,arm64} StatefulSet pods and the fallback never runs). It only matters when the StatefulSet pods are temporarily unreachable — at which point the fallback should actually be able to land. Observed in nightly run 26018678419 where the StatefulSet was rolling and the fallback couldn't recover. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

Pairs with the preceding nodeselector fix. All 18 purpose=build nodes in aiperf's cluster carry the taint buildkit-worker=true:NoSchedule (verified via 'kubectl get nodes -l purpose=build -o custom-columns= ...,TAINTS:.spec.taints[*]'). The action's previous default tolerated buildkit-fallback-worker — a key inherited from dynamo's two-pool convention — which doesn't exist in aiperf's cluster. With this change, the K8s-driver fallback can actually land on the same nodes the buildkit StatefulSet runs on when the StatefulSet itself is rolling or unreachable. The nodes are otherwise idle in that condition (the StatefulSet pods are restarting / not consuming resources), so co-tenancy on those nodes during outages is the intended use of the fallback. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

The script has a shebang line (#!/usr/bin/env python3) but lacked the executable bit, which trips pre-commit's check-shebang-scripts-are- executable hook. Workflows invoke it as 'python3 tools/...' so the mode bit isn't strictly required for CI, but the hook's invariant holds: if a file has a shebang, declare it executable. chmod +x + git update-index --chmod=+x persists the mode change in the tree. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

saturley-hall and others added 8 commits May 17, 2026 21:22

github-actions Bot added the ci label May 18, 2026

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

saturley-hall and others added 2 commits May 17, 2026 21:58

Merge branch 'main' into harrison/improve-ci-from-release-automation

88f1498

Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

saturley-hall changed the title ~~ci: PR/post-merge/RC pipelines + dynamo buildx pattern across all build workflows~~ ci: PR/post-merge/RC pipelines + buildx worker pattern across all build workflows May 18, 2026

saturley-hall and others added 3 commits May 18, 2026 03:04

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

copy-pr-bot Bot had a problem deploying to automated-release May 18, 2026 21:29 Failure

saturley-hall and others added 2 commits May 18, 2026 17:31

copy-pr-bot Bot had a problem deploying to automated-release May 18, 2026 21:36 Failure

copy-pr-bot Bot deployed to automated-release May 18, 2026 21:47 Active

	gh act -W .github/workflows/pr.yaml
	act -W .github/workflows/pr.yaml

Conversation

saturley-hall commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What each workflow does

Wheel versioning rules

RC is promote-only

Environment gates

Dynamo buildx pattern

Cache strategy per workflow

Files deleted

Before merging

Test plan

Summary by CodeRabbit

Uh oh!

github-actions Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

saturley-hall commented May 18, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 18, 2026 •

edited

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading