ci: PR/post-merge/RC pipelines + buildx worker pattern across all build workflows #951
saturley-hall wants to merge 16 commits
Bootstraps a multi-arch buildx builder by attaching to per-arch BuildKit pods in the buildkit namespace via headless K8s DNS, with a Kubernetes driver fallback. Lifted from ai-dynamo/dynamo's same-named action; renamed metadata and usage examples from the prior 'init-aiperf-builder' name.
This action lets the build workflows produce multi-arch images with a single 'docker buildx build --platform linux/amd64,linux/arm64 --push' instead of running separate per-arch jobs and stitching a manifest afterwards.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Single source of truth for build-flavor version math. Given a flavor (pr, post-merge-main, post-merge-release, nightly) and a commit SHA, emits wheel_version, wheel_filename, container_tag, container_image, container_ref, short_sha, artifactory_subpath, and skip_version_mutation.
post-merge-release is the one flavor that does not mutate pyproject.toml: release branches own their own version via the project's shipping process, so the action returns skip_version_mutation=true and the workflow short-circuits update-pyproject-version.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Fires on pushes to pull-request/<N> branches (the copy-pr-bot mirror convention already used by fern-docs.yml).
Publishes:
wheel -> artifactory: pr/pr<N>/aiperf-<base>.dev0+pr<N>.g<sha>-py3-none-any.whl
image -> ECR: aiperf:pr-<N>-<short_sha> (multi-arch manifest)
Cache strategy: --cache-to writes only to a per-PR cache tag (aiperf:cache-pr-<N>) so concurrent PRs don't overwrite each other's cache manifests. --cache-from reads first from cache-pr-<N>, then falls back to cache-post-merge so a brand-new PR's first build still gets warm layers from main.
Builder name uses github.run_id and fresh_builder: true so a canceled prior run never leaves a stale buildx context that collides with the next push.
No environment gate. PR artifacts are scratch space.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
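The cache strategy in this commit message can be sketched as a small argument builder. This is only an illustration, not the workflow's actual shell: the helper name pr_cache_args, the registry parameter, and the choice of the registry cache backend (type=registry) are assumptions; the cache tag names and mode=max come from the commit messages in this PR.

```python
def pr_cache_args(registry: str, pr_number: int) -> list[str]:
    """Build buildx cache flags for a PR build (hypothetical helper)."""
    pr_cache = f"{registry}/aiperf:cache-pr-{pr_number}"
    main_cache = f"{registry}/aiperf:cache-post-merge"
    return [
        # Read order: this PR's own cache first, then the post-merge cache
        # so a brand-new PR still starts from warm layers.
        "--cache-from", f"type=registry,ref={pr_cache}",
        "--cache-from", f"type=registry,ref={main_cache}",
        # Write only to the per-PR tag so concurrent PRs never overwrite
        # each other's cache manifests.
        "--cache-to", f"type=registry,ref={pr_cache},mode=max",
    ]


if __name__ == "__main__":
    # registry.example.com is a placeholder, not the real ECR hostname.
    print(" ".join(pr_cache_args("registry.example.com", 951)))
```

The point of the asymmetry is that reads can fan out to shared tags while writes stay confined to a tag owned by one PR.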
Fires on pushes to main and release/X.Y.Z. Flavor-conditional behavior:
main: wheel = <base>a<run_number>+g<short_sha> (alpha pre-release)
image = main-<short_sha>
release/*: wheel = <pyproject.toml version, untouched>
image = release-<X.Y.Z>-<short_sha>
Release branches own their version via the project's shipping process;
the workflow skips update-pyproject-version when release_metadata's
skip_version_mutation output is true.
Wheel goes S3 (build) -> Artifactory post-merge/<run_id>/<filename>
(stage). The stage-wheel job is gated by the existing automated-release
GitHub environment, matching nightly's pattern.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
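The per-flavor version rules from the release-metadata, pr, and post-merge commit messages can be summarized in a short sketch. It is not the action's implementation (which is shell in action.yml): the function name, its parameters, and the 7-character short SHA are assumptions, and the nightly flavor is left out because its exact formats are not spelled out here.

```python
def release_metadata(flavor, base_version, sha, pr_number=None,
                     run_number=None, release_version=None):
    """Return wheel/container naming for one build flavor (illustrative)."""
    short_sha = sha[:7]  # assumption: 7-character short SHA
    skip_version_mutation = False
    if flavor == "pr":
        wheel_version = f"{base_version}.dev0+pr{pr_number}.g{short_sha}"
        container_tag = f"pr-{pr_number}-{short_sha}"
    elif flavor == "post-merge-main":
        # Alpha pre-release numbered by the workflow run.
        wheel_version = f"{base_version}a{run_number}+g{short_sha}"
        container_tag = f"main-{short_sha}"
    elif flavor == "post-merge-release":
        # Release branches own their version: pyproject.toml is untouched.
        wheel_version = base_version
        container_tag = f"release-{release_version}-{short_sha}"
        skip_version_mutation = True
    else:
        raise ValueError(f"unsupported flavor in this sketch: {flavor}")
    return {
        "wheel_version": wheel_version,
        "wheel_filename": f"aiperf-{wheel_version}-py3-none-any.whl",
        "container_tag": container_tag,
        "short_sha": short_sha,
        "skip_version_mutation": skip_version_mutation,
    }
```

For example, flavor "pr" with base version 0.8.0, PR 951, and a SHA starting with 88f1498 yields the wheel version 0.8.0.dev0+pr951.g88f1498 quoted elsewhere in this PR.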
…cache
Replaces the per-arch matrix + create-multiarch-manifest-ecr two-step choreography with one job that runs:
docker buildx build --platform linux/amd64,linux/arm64 --target runtime --push
against a multi-arch builder bootstrapped via bootstrap-buildkit. The buildx invocation produces and pushes the OCI index natively, so create-multiarch-manifest is deleted (no consumers remain).
stage-nightly-container is simplified to a single 'crane copy' of the multi-arch index, instead of copying per-arch manifests and stitching them with 'crane index append' on NGC.
Nightly intentionally runs WITHOUT --cache-from/--cache-to. Nightly is the rigorous build that validates the Dockerfile from scratch (against the BuildKit pod's persistent layer cache only). Drift in pinned base images, system packages, or transitive Python deps now surfaces here rather than being masked by a registry cache layer reused from a stale PR or post-merge build.
bootstrap-buildkit is invoked with fresh_builder: true so a prior canceled run or stale builder context cannot collide with a new run.
The 'needs:' chain and pipeline-summary job are updated to drop the removed create-multiarch-manifest-ecr step.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Manually-dispatched release-candidate pipeline. Does NOT rebuild.
Promotes an already-built post-merge artifact (from a release/X.Y.Z
branch) to RC locations on NGC staging and Artifactory.
Inputs: release_version (X.Y.Z), rc_number, commit_sha (40-char), and
skip_security_scan. The 'prepare' job validates inputs and uses gh api
to discover the post-merge.yaml run ID for the commit so the workflow
can locate the source wheel at Artifactory post-merge/<run_id>/.
Promotion topology:
approve (manual-release-approval env) - single human gate
stage-container (automated-release env) - crane copy
ECR aiperf:release-<X.Y.Z>-<sha>
NGC aiperf:<X.Y.Z>-rc<N>
stage-wheel (automated-release env) - artifactory copy
post-merge/<run_id>/<wheel>
rc/<X.Y.Z>/rc<N>/<wheel>
trigger-gitlab-scan (automated-release env) - mirrors nightly
Wheel keeps the post-merge filename (aiperf-<X.Y.Z>-py3-none-any.whl) -
the rc-ness is encoded in the artifactory path, consistent with the rule
that release branches own their version.
Requires the GitHub environment 'manual-release-approval' to be created
in Settings > Environments with the release reviewer list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
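The promotion topology in this commit message maps four inputs to fixed source and destination names. A minimal sketch of that mapping follows; the function name, the returned keys, and the 7-character short SHA in the ECR tag are assumptions, and the real 'prepare' job may validate inputs differently.

```python
import re


def rc_promotion_plan(release_version, rc_number, commit_sha, run_id,
                      wheel_filename):
    """Compute RC source/destination names (illustrative only)."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", release_version):
        raise ValueError("release_version must look like X.Y.Z")
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("commit_sha must be a full 40-character SHA")
    short_sha = commit_sha[:7]  # assumption: 7-character short SHA
    return {
        # crane copy: ECR release image -> NGC RC tag
        "container_source": f"aiperf:release-{release_version}-{short_sha}",
        "container_dest": f"aiperf:{release_version}-rc{rc_number}",
        # artifactory copy: the filename is unchanged, only the path
        # encodes that this is a release candidate.
        "wheel_source": f"post-merge/{run_id}/{wheel_filename}",
        "wheel_dest": f"rc/{release_version}/rc{rc_number}/{wheel_filename}",
    }
```

Nothing is rebuilt in this plan: every destination is a copy of an artifact that a post-merge run already produced.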
These files are now produced fresh on every container build by the Dockerfile's python-licenses stage (pip-licenses + cyclonedx-py + generate_python_attributions.py + generate_dpkg_attributions.py) and ship inside the container at /licenses/python/ and /licenses/dpkg/. The hand-maintained copies at the repo root were drifting from the auto-generated truth and serve no purpose now that every release flavor emits them as build artifacts.
ATTRIBUTIONS.md (the project's own license attribution file) is unrelated and stays in place.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Adds a Release Pipelines section to CONTRIBUTING.md covering pr, post-merge, nightly, and rc workflows: triggers, wheel destinations, container destinations, wheel-versioning rules per flavor, and the two GitHub environment gates (automated-release for nightly/post-merge staging, manual-release-approval for rc).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Try out this PR
Quick install:
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b1fb143aad77ff93b5db0566b654d7d177520348
Recommended with virtual environment (using uv):
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b1fb143aad77ff93b5db0566b654d7d177520348
Walkthrough
Adds composite actions and workflows to centralize multi-arch buildx container and wheel builds, compute release metadata by flavor, implement nightly/pr/post-merge/RC pipelines, update documentation, and add a commit-SHA embedding helper.
Changes
Release and Build Pipeline Workflows
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/actions/bootstrap-buildkit/action.yml:
- Around line 28-31: Normalize and trim each comma-separated token from the arch
input (the 'arch' input value) before generating platform/DNS strings and when
mapping to workers; ensure you call .trim() (or equivalent) on each token and
normalize forms like "arm64" to "linux/arm64" consistently. When computing
worker_addresses, only emit/set the output (worker_addresses) if every requested
arch token successfully resolves to a worker (i.e., require full resolution for
all requested arches), otherwise leave worker_addresses empty or fall back;
apply the same trimming+full-resolution guard to the other blocks that build
platforms/worker lists mentioned (the repeated arch-to-platform/worker mapping
sections).
In @.github/actions/release-metadata/action.yml:
- Around line 101-105: The BASE_VERSION extraction currently restricts to a bare
X.Y.Z and fails when pyproject.toml contains suffixes (e.g. 1.2.3rc1); change
the parsing in the block that sets BASE_VERSION to capture the entire quoted
version string (including any suffix) instead of only digits-dots (i.e., grab
the value between the quotes using the existing grep/sed pipeline), then allow
the downstream release-flavor logic to trim or interpret suffixes as needed;
make the same change in the analogous parsing block around the later section
(the block currently at lines referenced in the review) so both places read the
full quoted version into BASE_VERSION for flavor-specific handling.
In @.github/workflows/pr.yaml:
- Around line 21-24: The workflow currently triggers on branches matching
'pull-request/*' and checks out and runs untrusted PR code on the privileged
self-hosted runner (prod-aiperf-default-v1) and later publishes with
AWS/Artifactory credentials; change this by splitting the pipeline into two
workflows or adding an explicit approval gate: move the checkout/build steps for
PR branches into an unprivileged PR workflow that uses GitHub-hosted runners and
never uses production credentials, and create a separate trusted
promotion/publish workflow that only runs on protected branches or after a
manual approval event and runs on prod-aiperf-default-v1 with AWS/Artifactory
credentials; ensure references to the 'pull-request/*' trigger, any checkout
actions, and the publish steps that use AWS/Artifactory credentials are updated
accordingly.
- Around line 47-48: The workflow uses mutable tags for external actions
(actions/checkout@v4 and astral-sh/setup-uv@v5) which allows retagged code to
run with repository secrets; replace each mutable tag occurrence (including the
second actions/checkout usage and the astral-sh/setup-uv usage) with the
corresponding immutable commit SHA for the exact release you trust, updating the
uses entries to the full "owner/repo@sha" form so the workflow runs the pinned
action commits only.
In @.github/workflows/rc.yaml:
- Around line 118-125: The workflow lookup uses RESPONSE (gh api call) and
filters only by head_sha (${COMMIT_SHA}), which can match commits across
branches; update the gh api call that builds RESPONSE to also pass the release
branch filter (add -f "branch=${RELEASE_BRANCH}" or the repo’s release branch
variable) so the returned workflow_runs are constrained to the intended release
branch before extracting RUN_ID; ensure the branch variable you add is
defined/propagated in the workflow context.
In `@CONTRIBUTING.md`:
- Line 201: Replace the incorrect CLI invocation "gh act -W
.github/workflows/pr.yaml" with the standalone act command; locate the
occurrence in CONTRIBUTING.md and change it to use "act -W
.github/workflows/pr.yaml" (or simply "act" as shown later) so the documentation
references the nektos/act binary rather than a non-existent gh subcommand.
📒 Files selected for processing (10)
- .github/actions/bootstrap-buildkit/action.yml
- .github/actions/create-multiarch-manifest/action.yml
- .github/actions/release-metadata/action.yml
- .github/workflows/nightly.yml
- .github/workflows/post-merge.yaml
- .github/workflows/pr.yaml
- .github/workflows/rc.yaml
- ATTRIBUTIONS-Python.md
- ATTRIBUTIONS-container.md
- CONTRIBUTING.md
💤 Files with no reviewable changes (2)
- ATTRIBUTIONS-container.md
- .github/actions/create-multiarch-manifest/action.yml
arch:
  description: 'Comma-separated Docker platform(s): linux/amd64, linux/arm64, or linux/amd64,linux/arm64'
  required: false
  default: 'linux/amd64,linux/arm64'
Normalize arch tokens and require full resolution before taking the remote path.
The input docs allow "linux/amd64, linux/arm64", but the parser never trims per-token whitespace, so the second entry becomes " arm64" (with a leading space) and generates invalid DNS/platform strings. Also, the worker_addresses != '' check accepts a partially resolved set, so a requested multi-arch build can proceed with only one worker instead of falling back. Trim each token before use and only emit worker_addresses when every requested arch resolves.
Suggested fix
   WORKER_ADDRS=""
+  RESOLVED_COUNT=0
   IFS=',' read -ra ARCHS <<< "$ARCH"
   for arch in "${ARCHS[@]}"; do
+    arch="${arch//[[:space:]]/}"
     DNS="buildkit-${arch}-0.buildkit-${arch}-headless.${NAMESPACE}.svc.cluster.local"
     if nslookup "$DNS" >/dev/null 2>&1; then
       WORKER_ADDRS="${WORKER_ADDRS:+${WORKER_ADDRS},}tcp://${DNS}:${PORT}"
+      RESOLVED_COUNT=$((RESOLVED_COUNT + 1))
       echo "Resolved ${arch} worker: ${DNS}"
     else
       echo "No DNS for ${arch} worker (${DNS})"
     fi
   done
+
+  if [ "${RESOLVED_COUNT}" -ne "${#ARCHS[@]}" ]; then
+    WORKER_ADDRS=""
+  fi

   FIRST=true
   IFS=',' read -ra ARCHS <<< "$ARCH"
   for arch in "${ARCHS[@]}"; do
+    arch="${arch//[[:space:]]/}"
     # Comma-containing values (nodeselector, tolerations) must be wrapped in
Also applies to: 84-100, 112-127, 134-140
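The trim-then-require-full-resolution rule from this finding can be restated outside of shell as a sketch. The function names, the injectable can_resolve callback, and the stripping of an optional "linux/" prefix are assumptions for illustration; only the DNS name pattern and the tcp://host:port form are taken from the suggested fix.

```python
import socket


def dns_resolves(host: str) -> bool:
    """Default resolver check, playing the role of the nslookup call."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def resolve_workers(arch_input, namespace, port, can_resolve=dns_resolves):
    """Return comma-joined worker addresses, or "" unless ALL arches resolve."""
    # Trim every token; the "linux/" prefix handling is an assumption.
    archs = [t.strip().removeprefix("linux/")
             for t in arch_input.split(",") if t.strip()]
    addrs = []
    for arch in archs:
        dns = (f"buildkit-{arch}-0.buildkit-{arch}-headless."
               f"{namespace}.svc.cluster.local")
        if can_resolve(dns):
            addrs.append(f"tcp://{dns}:{port}")
    # A partially resolved set must not take the remote-worker path.
    if len(addrs) != len(archs):
        return ""
    return ",".join(addrs)
```

Returning an empty string on partial resolution leaves the existing worker_addresses != '' check meaningful, so the Kubernetes driver fallback is used instead of a silent single-arch build.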
BASE_VERSION=$(grep -m1 -E '^version = "[0-9]+\.[0-9]+\.[0-9]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')
if [[ -z "${BASE_VERSION}" ]]; then
  echo "::error::Could not parse base version from pyproject.toml"
  exit 1
fi
The base-version parser rejects suffixes that release branches are supposed to preserve.
BASE_VERSION only matches bare X.Y.Z, but the post-merge-release path explicitly says release branches may keep rc/post/dev suffixes from pyproject.toml. A branch carrying 1.2.3rc1 will fail at Line 101 before it ever reaches the release-flavor logic. Parse the full quoted version here, then let each flavor decide how much of it to use.
Suggested fix
- BASE_VERSION=$(grep -m1 -E '^version = "[0-9]+\.[0-9]+\.[0-9]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')
+ BASE_VERSION=$(grep -m1 -E '^version = "[^"]+"' pyproject.toml | sed -E 's/^version = "([^"]+)".*/\1/')
Also applies to: 132-145
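The difference between the two patterns is easier to see in a small sketch that mirrors the grep/sed pipeline with a regular expression. The function names are hypothetical, and release_triplet is just one way a flavor could later trim a suffix; the action itself stays in shell.

```python
import re

# Same idea as the suggested grep: capture everything between the quotes.
_VERSION_LINE = re.compile(r'^version = "([^"]+)"', re.MULTILINE)


def parse_base_version(pyproject_text: str) -> str:
    """Return the full quoted version from the first 'version = ' line."""
    match = _VERSION_LINE.search(pyproject_text)
    if match is None:
        raise ValueError("Could not parse base version from pyproject.toml")
    return match.group(1)


def release_triplet(version: str):
    """Leading X.Y.Z of a version, or None; flavor logic can use this."""
    match = re.match(r"\d+\.\d+\.\d+", version)
    return match.group(0) if match else None
```

With this split, a release branch carrying 1.2.3rc1 parses successfully, and only the flavors that need a bare X.Y.Z call the trimming step.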
- name: Checkout
  uses: actions/checkout@v4
Pin external actions to immutable SHAs to prevent privilege escalation.
This workflow has access to publishing credentials (AWS ECR and Artifactory tokens), but external actions actions/checkout@v4 (lines 48, 78) and astral-sh/setup-uv@v5 (line 81) use mutable version tags. If these actions are retagged or compromised upstream, they would inherit access to these secrets. Replace with full commit SHAs to ensure only the intended action code runs.
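A check for this finding could be automated with a short scan over workflow text. The sketch below is an assumption-laden illustration, not an existing tool: the function name and the regular expressions are made up here, and it treats only a full 40-character hex ref as pinned while skipping local (./) and docker:// references.

```python
import re

# Matches "uses: owner/repo@ref", with or without a leading list dash.
_USES = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)", re.MULTILINE)
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """List external action references that are not pinned to a full SHA."""
    unpinned = []
    for raw in _USES.findall(workflow_text):
        ref = raw.strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and images are out of scope here
        _, _, version = ref.partition("@")
        if not _FULL_SHA.fullmatch(version):
            unpinned.append(ref)
    return unpinned
```

Run against pr.yaml as described in this comment, such a scan would be expected to flag actions/checkout@v4 and astral-sh/setup-uv@v5.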
RESPONSE=$(gh api -X GET \
  "/repos/${GITHUB_REPOSITORY}/actions/workflows/post-merge.yaml/runs" \
  -f "head_sha=${COMMIT_SHA}" \
  -f "status=success" \
  -f "per_page=10")

RUN_ID=$(echo "${RESPONSE}" | jq -r '.workflow_runs | sort_by(.created_at) | reverse | .[0].id // empty')
if [ -z "${RUN_ID}" ]; then
🧩 Analysis chain
🌐 Web query:
GitHub REST API (workflow runs): Which query parameter should be used to filter runs by branch (branch/head_branch) when calling /actions/workflows/{workflow_id}/runs?
💡 Result:
To filter workflow runs by branch when calling the GitHub REST API endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs, you should use the branch query parameter [1]. The branch parameter accepts the name of the branch as a string [1]. When provided, it returns only the workflow runs associated with that specific branch [1]. Example usage: GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs?branch=main Note that while some GitHub API contexts may refer to head branches or references, the specific query parameter for this endpoint is consistently documented as branch [2][3][1].
Citations:
- 1: https://www.withone.ai/knowledge/github/conn_mod_def%3A%3AGJ3YfVeYEEY%3A%3A8cau0s5EQ8me4n-NqNU9iQ
- 2: https://docs.github.com/v3/actions/workflow-runs
- 3: https://docs.github.com/en/rest/actions/workflow-runs
🌐 Web query:
GitHub REST API workflow runs endpoint /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs parameters documentation
💡 Result:
The GitHub REST API endpoint to list workflow runs for a specific workflow is: GET /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs
This endpoint allows you to retrieve a list of all runs for a particular workflow [1]. The {workflow_id} path parameter can be either the unique integer ID of the workflow or its file name (e.g., main.yaml) [1][2].
Common query parameters supported by this endpoint include:
- actor (string): Filters results by the user who triggered the workflow run [1].
- branch (string): Filters results by the branch name [1].
- event (string): Filters results by the event that triggered the workflow (e.g., push, pull_request) [1].
- status (string): Filters results by the status or conclusion of the run (e.g., success, failure, in_progress) [3][4].
- per_page (integer): The number of results to return per page (max 100) [5].
- page (integer): The page number of the results to fetch [5].
- created (string): Filters runs created within a specific date-time range [3][4].
- exclude_pull_requests (boolean): If true, omits pull request-related runs from the response [3][4].
- check_suite_id (integer): Filters runs associated with a specific check suite [3][4].
- head_sha (string): Filters runs associated with a specific commit SHA [3][4].
For authentication, if the repository is private, you must use an access token with the repo scope [1]. Anyone with read access to the repository can use this endpoint [1].
Citations:
- 1: https://github.apidog.io/api-3489171
- 2: https://docs.github.com/rest/actions/workflows
- 3: https://docs.github.com/en/rest/actions/workflow-runs
- 4: https://docs.github.com/rest/actions/workflow-runs
- 5: https://docs.github.com/rest/actions/workflow-jobs
Constrain post-merge run discovery to the intended release branch.
Run lookup filters only by head_sha, so the same commit hash on multiple branches (e.g., main and release/) can cause the wrong artifact set to be selected for RC promotion.
Add the branch parameter to filter by the release branch:
Proposed fix
       - name: Discover post-merge run for commit
         id: discover
         env:
           GH_TOKEN: ${{ github.token }}
           COMMIT_SHA: ${{ inputs.commit_sha }}
+          VERSION: ${{ inputs.release_version }}
         run: |
           set -euo pipefail
           RESPONSE=$(gh api -X GET \
             "/repos/${GITHUB_REPOSITORY}/actions/workflows/post-merge.yaml/runs" \
             -f "head_sha=${COMMIT_SHA}" \
+            -f "branch=release/${VERSION}" \
             -f "status=success" \
             -f "per_page=10")
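The selection logic behind this finding can also be expressed as a client-side filter over the API response, which is useful as a second guard even when the branch query parameter is passed. The sketch below mirrors the jq expression (newest run by created_at, empty when nothing matches); the function name is hypothetical, and it relies on the head_branch field that workflow run objects carry.

```python
def pick_post_merge_run(response: dict, release_branch: str):
    """Return the newest run ID on release_branch, or None if none match."""
    runs = [
        run for run in response.get("workflow_runs", [])
        if run.get("head_branch") == release_branch
    ]
    if not runs:
        return None
    # created_at is an ISO 8601 UTC timestamp, so string order is time order.
    newest = max(runs, key=lambda run: run["created_at"])
    return newest["id"]
```

If the same commit was built on both main and release/1.2.3, only the release-branch run is eligible, so the RC promotion cannot pick up the wrong artifact set.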
yamllint .github/workflows/

# Smoke-test the bootstrap-buildkit action against a local docker daemon:
gh act -W .github/workflows/pr.yaml
Incorrect command: gh act is not valid.
The GitHub CLI (gh) does not have an act subcommand. The act tool is a standalone binary from nektos/act. Line 209 below correctly shows usage as just act without the gh prefix.
Proposed fix
-gh act -W .github/workflows/pr.yaml
+act -W .github/workflows/pr.yaml
GitHub Actions silently strips a job output whose value contains a registered secret, to prevent leaking the secret to downstream jobs. The release-metadata action's container_image and container_ref both embed the ECR registry hostname (built from secrets.AWS_ACCOUNT_ID + AWS_DEFAULT_REGION), so when pr.yaml and post-merge.yaml promoted them to the prepare job's top-level outputs, GitHub dropped them. The downstream build job then ran 'docker buildx build --tag ""' and failed with 'invalid tag "": repository name must have at least one component'.
Observed in https://github.com/ai-dynamo/aiperf/actions/runs/26008818007 with annotations 'Skip output container_ref/container_image since it may contain secret.'
Fix mirrors nightly.yml's pattern: only promote container_tag across job boundaries (no secret), and reconstruct the full ECR ref locally in each step that needs it from the AWS secret env vars the step already has access to.
Changes:
- release-metadata action: remove container_image and container_ref outputs; drop the now-unused container_registry/container_repository inputs; document the constraint in a comment near the GITHUB_OUTPUT block.
- pr.yaml, post-merge.yaml: drop the two outputs from prepare's job outputs; pass container_tag instead and compute CONTAINER_REF locally in the build step ('CONTAINER_REF="${REGISTRY}/aiperf:${CONTAINER_TAG}"'). Summary steps now show container_tag (which is what you'd grep for in ECR anyway).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
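The "pass only the tag, rebuild the ref locally" pattern can be illustrated with a few lines. This is a sketch rather than the workflow's shell: the function names and the env parameter are assumptions, while the hostname shape follows the standard ECR form <account>.dkr.ecr.<region>.amazonaws.com and the aiperf repository name comes from the commit message.

```python
import os


def ecr_registry(account_id: str, region: str) -> str:
    """Standard ECR registry hostname for an account and region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def container_ref(container_tag: str, env=os.environ) -> str:
    """Rebuild the full image ref inside the step that holds the secrets."""
    if not container_tag:
        # An empty tag is what produced 'invalid tag ""' downstream.
        raise ValueError("container_tag is empty; was the job output stripped?")
    registry = ecr_registry(env["AWS_ACCOUNT_ID"], env["AWS_DEFAULT_REGION"])
    return f"{registry}/aiperf:{container_tag}"
```

Because container_tag contains no secret material, it survives as a job output, and the secret-bearing hostname never crosses a job boundary.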
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
…sues)
Repository in [project.urls] is meant to be the canonical source repo URL per PEP 621 convention (matches numpy, pandas, requests, scipy, …; all point at the repo root, not a specific commit). Until now the nightly/PR/post-merge pipelines mutated this field per build to a commit-specific URL, which is misuse: SBOM tools, pip show, and the Artifactory UI all expect 'where do I clone this from?', not 'what revision is this exact wheel?'.
This commit puts the canonical URLs statically in pyproject.toml so they ship in every wheel's METADATA without per-commit rewriting:
Homepage = "https://github.com/ai-dynamo/aiperf"
Repository = "https://github.com/ai-dynamo/aiperf"
Issues = "https://github.com/ai-dynamo/aiperf/issues"
Per-commit traceability still lives where it belongs:
- the wheel's version string carries the short SHA via PEP 440 local-version (e.g. 0.8.0.dev0+pr951.g88f1498)
- aiperf.__commit_sha__ carries the full SHA at runtime (via src/aiperf/_build_info.py, which CI continues to generate)
A follow-up commit removes the [project.urls] mutation from the embed-commit-sha CI script so this field never gets rewritten.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Replaces the inline Python heredoc in the nightly build-container job
(soon also in pr.yaml and post-merge.yaml — separate commit) with a
single script call:
python3 tools/embed_commit_sha.py "${{ github.sha }}"
The script writes src/aiperf/_build_info.py so aiperf.__commit_sha__
returns the full SHA at runtime. It does NOT mutate pyproject.toml —
the canonical Repository URL is already static there (per the
preceding commit). The script lives in tools/ so it is a build-time
utility, not packaged inside the wheel.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
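For readers who have not opened tools/embed_commit_sha.py, the behavior this commit message describes could look roughly like the sketch below. It is a guess at the shape, not the script's actual contents: the function name, the repo_root parameter, the 40-character validation, and the exact line written to _build_info.py are all assumptions.

```python
import re
import sys
from pathlib import Path


def embed_commit_sha(sha: str, repo_root: str = ".") -> Path:
    """Write src/aiperf/_build_info.py containing the full commit SHA."""
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError("expected a full 40-character commit SHA")
    target = Path(repo_root) / "src" / "aiperf" / "_build_info.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumed file format: a single module-level constant.
    target.write_text(f'__commit_sha__ = "{sha}"\n', encoding="utf-8")
    return target


if __name__ == "__main__" and len(sys.argv) == 2:
    print(embed_commit_sha(sys.argv[1]))
```

Note that nothing here touches pyproject.toml, which matches the stated intent that the Repository URL stays static.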
Restructures pr.yaml and post-merge.yaml so wheel publication to Artifactory runs in its own job, gated by the existing automated-release GitHub environment, and uses the JFrog CLI instead of curl -T.

WHY

PR wheel filenames contain a '+' (PEP 440 local-version label, e.g. aiperf-0.8.0.dev0+pr951.g88f1498-py3-none-any.whl). Artifactory's generic-repo deploy handler rejected the curl PUT URL with HTTP 405, observed in run 26009595857 on PR #951. The same latent issue applies to post-merge-on-main wheels (which also use '+' via the alpha local label).

'jf rt upload' URL-encodes wheel filenames correctly and uses Artifactory's documented deploy endpoint. Putting the upload in its own job lets only that step run under the automated-release env (the build job stays ungated for fast iteration) and re-uses the post-merge two-job pattern that already worked.

HOW

build (no env, prod-aiperf-default-v1):
  Pushes the multi-arch runtime image to ECR. Populates cache-pr-<N> (or cache-post-merge) as a side effect of mode=max.

stage-wheel (env: automated-release, prod-aiperf-default-v1):
  1. Re-applies the same pyproject.toml + _build_info.py mutations build performed, so the wheel-builder cache key matches.
  2. Attaches to the same builder_name as build, so it lands on the same K8s BuildKit pods (StatefulSet, shared PV cache).
  3. Runs `buildx --target wheel-artifact --output type=local` with --cache-from only (no --cache-to). Every layer is hot from the build job, so only the scratch-export step actually runs; the wheel materializes locally as a near-instant op.
  4. Validates the wheel (twine check + venv install + version check + aiperf --help).
  5. Sets up the JFrog CLI with JF_URL derived by stripping '/artifactory/<repo>' from ARTIFACTORY_URL (keeps the existing secret format intact; nightly's curl path stays compatible).
  6. jf rt upload --flat to '<repo>/<artifactory_subpath>'.

post-merge.yaml drops the prior S3 intermediate entirely; the buildx-cache extraction makes it redundant.

REQUIRED SECRET (must exist on the repo before this can run)

ARTIFACTORY_REPO_NAME = sw-dynamo-aiperf-pypi-local

ARTIFACTORY_URL stays as-is (no change needed); workflows derive the JFrog platform URL by stripping the repo suffix in shell.

NOT CHANGED

nightly.yml and rc.yaml still use curl -T. Nightly wheels (.dev<YYYYMMDD>) and RC wheels (literal pyproject version) never contain '+', so they keep working without code changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
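The commit above describes two small string operations: stripping '/artifactory/<repo>' from ARTIFACTORY_URL to get the JFrog platform URL, and percent-encoding the '+' in a wheel filename. The workflows do the first in shell; the following is only an illustrative Python sketch of the same idea. The function names and the artifactory.example.com host are hypothetical, and the encoding shown is standard percent-encoding, not a claim about what the JFrog CLI does internally.

```python
from urllib.parse import quote


def derive_jf_url(artifactory_url: str, repo_name: str) -> str:
    """Strip a trailing '/artifactory/<repo>' to get the platform base URL.

    Hypothetical helper: the real workflows do this in shell.
    """
    suffix = f"/artifactory/{repo_name}"
    trimmed = artifactory_url.rstrip("/")
    if not trimmed.endswith(suffix):
        raise ValueError(f"{artifactory_url!r} does not end with {suffix!r}")
    return trimmed[: -len(suffix)]


def encode_wheel_filename(filename: str) -> str:
    """Percent-encode a wheel filename for use as a URL path segment.

    quote() always leaves letters, digits and '_.-~' untouched, so the only
    character that changes in a PEP 440 local-version wheel name is '+'.
    """
    return quote(filename, safe="")


if __name__ == "__main__":
    # The host below is a placeholder, not the project's real Artifactory URL.
    url = "https://artifactory.example.com/artifactory/sw-dynamo-aiperf-pypi-local"
    print(derive_jf_url(url, "sw-dynamo-aiperf-pypi-local"))
    print(encode_wheel_filename("aiperf-0.8.0.dev0+pr951.g88f1498-py3-none-any.whl"))
```

Raising on an unexpected URL shape, rather than returning the input unchanged, is a design choice for the sketch: a silently wrong JF_URL would only surface later as an upload failure.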
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/nightly.yml (1)
401-416: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Dead code: `matrix.arch` references in a non-matrix job will always skip these steps.
The `build-container` job was refactored from a matrix build to a single multi-platform buildx job, but several steps still reference `matrix.arch == 'amd64'`. Since there's no matrix defined, `matrix.arch` is undefined and these conditions evaluate to false, causing the steps to be silently skipped:
- Line 402: Generate aiperf-nightly wheel variant
- Line 424: Validate wheels
- Line 472: Run unit tests against installed nightly wheel
- Line 511: Upload wheels to S3
This means nightly wheels are never validated, tested, or uploaded.
🐛 Proposed fix: Remove the obsolete matrix conditionals
  - name: Generate aiperf-nightly wheel variant (amd64 only)
-   if: matrix.arch == 'amd64'
    run: |
Apply the same removal to lines 424, 472, and 511 (or rename the conditions to document they always run on the single amd64 wheel build).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/nightly.yml around lines 401 - 416, The steps inside the build-container job that are gated by the non-existent matrix variable are dead; remove the obsolete "if: matrix.arch == 'amd64'" conditionals from the steps named "Generate aiperf-nightly wheel variant", "Validate wheels", "Run unit tests against installed nightly wheel" and "Upload wheels to S3" so they always run in the single multi-platform buildx job (or replace the condition with an appropriate runtime check if you intended to gate to amd64 specifically).
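The finding above could in principle be caught mechanically. The following is a minimal sketch, assuming the workflow file has already been parsed into a plain dict (for example by a YAML parser, which is not shown); the function name and the sample data are hypothetical and only mirror the situation the review describes.

```python
import re
from typing import Any, Dict, List, Tuple

# Matches expressions such as "matrix.arch" inside a step's `if:` condition.
MATRIX_REF = re.compile(r"\bmatrix\.\w+")


def find_dead_matrix_conditions(workflow: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (job_id, step_name) pairs whose `if:` uses matrix.* in a job
    that defines no strategy.matrix."""
    findings: List[Tuple[str, str]] = []
    for job_id, job in (workflow.get("jobs") or {}).items():
        strategy = job.get("strategy") or {}
        if "matrix" in strategy:
            continue  # matrix.* is defined for this job, nothing to flag
        for step in job.get("steps") or []:
            condition = str(step.get("if", ""))
            if MATRIX_REF.search(condition):
                findings.append((job_id, step.get("name", "<unnamed>")))
    return findings


if __name__ == "__main__":
    # Hand-written sample mirroring the review finding, not the real workflow.
    sample = {
        "jobs": {
            "build-container": {
                "steps": [
                    {"name": "Validate wheels", "if": "matrix.arch == 'amd64'"},
                    {"name": "Build image"},
                ]
            }
        }
    }
    for job_id, step_name in find_dead_matrix_conditions(sample):
        print(f"{job_id}: step {step_name!r} is gated on an undefined matrix value")
```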
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 98303f33-6afa-440e-8f8b-8bda853dcd8e
📒 Files selected for processing (6)
- .github/actions/release-metadata/action.yml
- .github/workflows/nightly.yml
- .github/workflows/post-merge.yaml
- .github/workflows/pr.yaml
- pyproject.toml
- tools/embed_commit_sha.py
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
| @@ -0,0 +1,48 @@ | |||
| #!/usr/bin/env python3 | |||
The file has a shebang but is not marked executable, blocking CI.
The pre-commit hook failure indicates the file needs the executable bit set.
Fix by running:
chmod +x tools/embed_commit_sha.py
Or on Windows:
git add --chmod=+x tools/embed_commit_sha.py
🧰 Tools
🪛 GitHub Actions: Pre-commit / 0_pre-commit.txt
[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed (exit code 1): file has a shebang but is not marked executable. If intended to be executable, run 'chmod +x tools/embed_commit_sha.py' (or on Windows use 'git add --chmod=+x ...').
🪛 GitHub Actions: Pre-commit / pre-commit
[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed: file has a shebang but is not marked executable. Try: chmod +x tools/embed_commit_sha.py (or git add --chmod=+x on Windows).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/embed_commit_sha.py` at line 1, The script starting with the shebang
line "#!/usr/bin/env python3" is not executable which breaks the pre-commit
hook; mark the file executable by setting the executable bit (e.g., run chmod +x
on the script or use git add --chmod=+x when committing) so CI/pre-commit
passes. Ensure the change is staged and committed.
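To reproduce this kind of failure locally, a rough approximation of the check is sketched below. It inspects the filesystem mode on a POSIX system; the real pre-commit hook may consult the mode recorded in git instead, so treat this as an illustration of the invariant (shebang implies executable), not as the hook's implementation. All function names are hypothetical.

```python
import stat
from pathlib import Path


def has_shebang(path: Path) -> bool:
    """True if the file starts with the two bytes '#!'."""
    with path.open("rb") as handle:
        return handle.read(2) == b"#!"


def is_executable(path: Path) -> bool:
    """True if any of the user/group/other execute bits is set (POSIX)."""
    mode = path.stat().st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def shebang_without_exec_bit(path: Path) -> bool:
    """True for files that would violate 'shebang implies executable'."""
    return has_shebang(path) and not is_executable(path)


if __name__ == "__main__":
    target = Path("tools/embed_commit_sha.py")
    if target.exists() and shebang_without_exec_bit(target):
        print(f"{target} has a shebang but is not executable")
```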
| def embed(sha: str) -> None: | ||
| build_info = Path("src/aiperf/_build_info.py") | ||
| build_info.write_text( | ||
| "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n" | ||
| "# SPDX-License-Identifier: Apache-2.0\n" | ||
| '"""Build-time metadata. Generated by CI; do not edit."""\n' | ||
| "\n" | ||
| f'COMMIT_SHA = "{sha}"\n' | ||
| ) | ||
| print(f"Wrote {build_info} with COMMIT_SHA = {sha}") |
Add error handling for filesystem operations.
The write_text() call can fail (permission denied, disk full, parent directory missing) and will raise an uncaught exception. This produces cryptic build failures instead of clear error messages.
🛡️ Proposed fix with error handling
 def embed(sha: str) -> None:
     build_info = Path("src/aiperf/_build_info.py")
-    build_info.write_text(
-        "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n"
-        "# SPDX-License-Identifier: Apache-2.0\n"
-        '"""Build-time metadata. Generated by CI; do not edit."""\n'
-        "\n"
-        f'COMMIT_SHA = "{sha}"\n'
-    )
-    print(f"Wrote {build_info} with COMMIT_SHA = {sha}")
+    try:
+        build_info.parent.mkdir(parents=True, exist_ok=True)
+        build_info.write_text(
+            "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n"
+            "# SPDX-License-Identifier: Apache-2.0\n"
+            '"""Build-time metadata. Generated by CI; do not edit."""\n'
+            "\n"
+            f'COMMIT_SHA = "{sha}"\n'
+        )
+        print(f"Wrote {build_info} with COMMIT_SHA = {sha}")
+    except OSError as e:
+        print(f"Failed to write {build_info}: {e}", file=sys.stderr)
+        raise SystemExit(1) from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/embed_commit_sha.py` around lines 27 - 36, The embed function currently
calls Path("src/aiperf/_build_info.py").write_text(...) without handling
filesystem errors; update embed to ensure the parent directory exists (create
with mkdir(parents=True, exist_ok=True)), then wrap the write_text call in a
try/except that catches OSError (and/or Exception), and on failure log or print
a clear error message including the exception details and the target path (use
build_info and sha), and exit/raise a controlled error rather than letting a raw
exception propagate; reference the embed function, the build_info Path variable,
and the write_text call when making the changes.
| from pathlib import Path | ||
| def embed(sha: str) -> None: |
Validate the SHA parameter.
The function does not validate that sha is non-empty or has a reasonable format. While CI workflows are expected to pass valid commit SHAs, a basic check improves robustness and produces clearer errors if called incorrectly.
✅ Proposed validation check
 def embed(sha: str) -> None:
+    if not sha or not sha.strip():
+        raise ValueError("commit SHA cannot be empty")
     build_info = Path("src/aiperf/_build_info.py")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/embed_commit_sha.py` at line 27, The embed(sha: str) function must
validate its sha parameter before use; add a guard that checks sha is a
non-empty string and matches a reasonable git SHA pattern (for example hex
characters, 7–40 chars) and raise a clear ValueError if it fails; update embed
to perform this validation at the top (validate non-empty and regex like
^[0-9a-fA-F]{7,40}$) so callers get an explicit error instead of downstream
failures.
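A minimal sketch of how the stricter regex check from this prompt might sit alongside the filesystem error handling suggested in the earlier comment. This is an assumption-laden illustration, not the script's actual code: the `validate_sha` helper and the `build_info` parameter are hypothetical additions (the parameter exists so the function can be exercised against a temporary directory), whitespace stripping is a choice made here, and the SPDX header lines are omitted for brevity.

```python
import re
import sys
from pathlib import Path

# Abbreviated or full hex commit SHA, as suggested in the review prompt.
SHA_PATTERN = re.compile(r"^[0-9a-fA-F]{7,40}$")


def validate_sha(sha: str) -> str:
    """Return the stripped SHA, or raise ValueError if it is not plausible."""
    candidate = sha.strip() if isinstance(sha, str) else ""
    if not SHA_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a plausible git commit SHA: {sha!r}")
    return candidate


def embed(sha: str, build_info: Path = Path("src/aiperf/_build_info.py")) -> None:
    """Write COMMIT_SHA to build_info, failing with a clear message on I/O errors."""
    sha = validate_sha(sha)
    try:
        build_info.parent.mkdir(parents=True, exist_ok=True)
        # Header lines from the real script are left out of this sketch.
        build_info.write_text(f'COMMIT_SHA = "{sha}"\n')
    except OSError as exc:
        print(f"Failed to write {build_info}: {exc}", file=sys.stderr)
        raise SystemExit(1) from exc
    print(f"Wrote {build_info} with COMMIT_SHA = {sha}")
```

Validating before touching the filesystem means a bad argument never leaves a half-written or misleading `_build_info.py` behind.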
The K8s-driver fallback in bootstrap-buildkit was using
'nodeselector=...,purpose=aiperf-build' — a label inherited from the
dynamo-cluster bootstrap pattern. aiperf's cluster labels its builder
pool with 'purpose=build' instead, so the fallback path had nowhere
to schedule (confirmed via FailedScheduling events:
'purpose In [aiperf-build] not in purpose In [build]').
This is dormant under normal operation (the remote-driver path
attaches to the healthy buildkit-{amd64,arm64} StatefulSet pods and
the fallback never runs). It only matters when the StatefulSet pods
are temporarily unreachable — at which point the fallback should
actually be able to land. Observed in nightly run 26018678419 where
the StatefulSet was rolling and the fallback couldn't recover.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
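The scheduling failure described in this commit follows from how a Kubernetes nodeSelector is evaluated: every key/value pair in the selector must appear on the node's labels. A minimal sketch of that rule is below; the function name is hypothetical and the label values are the ones quoted in the commit message.

```python
from typing import Dict


def node_matches_selector(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True if every selector key is present on the node with the same value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


if __name__ == "__main__":
    builder_node = {"purpose": "build"}
    # Label inherited from the dynamo bootstrap pattern: no node matches.
    print(node_matches_selector(builder_node, {"purpose": "aiperf-build"}))
    # Label used by aiperf's builder pool: the fallback pod can be placed.
    print(node_matches_selector(builder_node, {"purpose": "build"}))
```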
Pairs with the preceding nodeselector fix. All 18 purpose=build nodes in aiperf's cluster carry the taint buildkit-worker=true:NoSchedule (verified via 'kubectl get nodes -l purpose=build -o custom-columns= ...,TAINTS:.spec.taints[*]'). The action's previous default tolerated buildkit-fallback-worker, a key inherited from dynamo's two-pool convention, which doesn't exist in aiperf's cluster.

With this change, the K8s-driver fallback can actually land on the same nodes the buildkit StatefulSet runs on when the StatefulSet itself is rolling or unreachable. The nodes are otherwise idle in that condition (the StatefulSet pods are restarting / not consuming resources), so co-tenancy on those nodes during outages is the intended use of the fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
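Why the old toleration key could never help can be seen from the standard Kubernetes matching rule between a toleration and a taint. The sketch below models that rule with hypothetical dataclasses. The commit message gives only the taint (buildkit-worker=true:NoSchedule) and the old toleration key; the operator and effect the action actually sets are not stated, so the values used in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Apply the Kubernetes toleration/taint matching rule."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key == "":
        # An empty key is only meaningful with Exists, and then matches any key.
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value


if __name__ == "__main__":
    node_taint = Taint("buildkit-worker", "true", "NoSchedule")
    # Operator "Exists" is an assumption made for this example.
    old = Toleration("buildkit-fallback-worker", operator="Exists")
    new = Toleration("buildkit-worker", operator="Exists")
    print(tolerates(old, node_taint), tolerates(new, node_taint))
```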
The script has a shebang line (#!/usr/bin/env python3) but lacked the executable bit, which trips pre-commit's check-shebang-scripts-are-executable hook. Workflows invoke it as 'python3 tools/...' so the mode bit isn't strictly required for CI, but the hook's invariant holds: if a file has a shebang, declare it executable.

chmod +x + git update-index --chmod=+x persists the mode change in the tree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Summary
Introduces three new build pipelines (`pr`, `post-merge`, `rc`) and refactors `nightly` to use the same dynamo-style multi-arch buildx pattern. Brings the K8s-pod-attached buildx pattern from PR #820 (and ultimately from `ai-dynamo/dynamo`) into `main`, and gives every code path (PR, post-merge to main, post-merge to release branches, nightly, and release candidate) a defined artifact destination.

PR #820's attribution work is already on `main` (Dockerfile's `python-licenses` stage); the hand-maintained `ATTRIBUTIONS-*.md` files at the repo root are deleted as part of this change.

What each workflow does
| Workflow | Trigger | Artifactory subpath | Image |
|---|---|---|---|
| `pr.yaml` | `pull-request/*` | `pr/pr<N>/` | `aiperf:pr-<N>-<short_sha>` |
| `post-merge.yaml` | `main` or `release/*` | `post-merge/<run_id>/` | `aiperf:main-<short_sha>` or `aiperf:release-<X.Y.Z>-<short_sha>` |
| `nightly.yml` | `workflow_dispatch` | `nightly/` | `aiperf:nightly-<YYYYMMDD>-<short_sha>` → NGC staging |
| `rc.yaml` | `workflow_dispatch` (manual) | `rc/<X.Y.Z>/rc<N>/` | `aiperf:<X.Y.Z>-rc<N>` |

Wheel versioning rules
- PR: `<base>.dev0+pr<N>.g<short_sha>` (PEP 440 local-version label).
- `main`: `<base>a<run_number>+g<short_sha>` (alpha pre-release).
- `release/<X.Y.Z>`: literal `version =` from `pyproject.toml`, untouched. Release branches own their version via the project's shipping process: the release captain commits `1.2.3`, `1.2.3rc0`, `1.2.3.dev0`, etc., and the pipeline never rewrites it. The post-merge artifactory subpath disambiguates between commits.
- Nightly: `<base>.dev<YYYYMMDD>` (unchanged).

RC is promote-only
`rc.yaml` does not rebuild. It takes a `commit_sha` input, discovers the post-merge.yaml run that built that commit via `gh api`, then:
1. `crane copy` the multi-arch image from ECR `aiperf:release-<X.Y.Z>-<short_sha>` → NGC `aiperf:<X.Y.Z>-rc<N>`.
2. Copy the wheel on Artifactory: `post-merge/<run_id>/<filename>` → `rc/<X.Y.Z>/rc<N>/<filename>`.

That means the wheel inside the container and on Artifactory is byte-identical to what post-merge produced; RC is purely a re-tag + re-publish. If you need a different wheel version for RC, commit it on the release branch first, push (post-merge builds it), then promote.
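Since RC inherits whatever version post-merge produced, it may help to see the wheel versioning rules listed earlier as code. The sketch below is a Python illustration of those rules, not the release-metadata action itself. The flavor names come from that action's description; the function name and keyword parameters are hypothetical, and returning `None` for release branches stands in for "leave `pyproject.toml` untouched".

```python
from datetime import date
from typing import Optional


def wheel_version(
    flavor: str,
    base: str,
    short_sha: str,
    *,
    pr_number: Optional[int] = None,
    run_number: Optional[int] = None,
    build_date: Optional[date] = None,
) -> Optional[str]:
    """Return the version to write into pyproject.toml, or None to leave it alone."""
    if flavor == "pr":
        if pr_number is None:
            raise ValueError("pr flavor needs pr_number")
        return f"{base}.dev0+pr{pr_number}.g{short_sha}"
    if flavor == "post-merge-main":
        if run_number is None:
            raise ValueError("post-merge-main flavor needs run_number")
        return f"{base}a{run_number}+g{short_sha}"
    if flavor == "post-merge-release":
        # Release branches own their version; the pipeline never rewrites it.
        return None
    if flavor == "nightly":
        if build_date is None:
            raise ValueError("nightly flavor needs build_date")
        return f"{base}.dev{build_date:%Y%m%d}"
    raise ValueError(f"unknown flavor: {flavor}")


if __name__ == "__main__":
    print(wheel_version("pr", "0.8.0", "88f1498", pr_number=951))
    print(wheel_version("post-merge-release", "0.9.0", "88f1498"))
```

The first call reproduces the version embedded in the wheel filename quoted in the JFrog commit message (`0.8.0.dev0+pr951.g88f1498`).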
Environment gates
| Environment | Used by |
|---|---|
| `automated-release` | `nightly.yml` `staging`, `post-merge.yaml` `stage-wheel`, `rc.yaml` `staging` |
| `manual-release-approval` | `rc.yaml` `approve` job |

The RC pipeline structure is: `approve` (manual-release-approval) → three staging jobs (automated-release) → summary. If `automated-release` has required reviewers configured, each staging job will independently prompt; that's a GitHub Actions UX wart, not a workflow issue.

Dynamo buildx pattern
New composite action `.github/actions/bootstrap-buildkit/` (ported from `ai-dynamo/dynamo`'s same-named action) attaches docker buildx to per-arch BuildKit pods reachable via headless K8s DNS (`buildkit-{arch}-0.buildkit-{arch}-headless.buildkit.svc.cluster.local:1234`), with a Kubernetes-driver fallback. One `docker buildx build --platform linux/amd64,linux/arm64 --push` against that builder produces the OCI index natively, with no more per-arch jobs + manifest stitching.

Used by `pr.yaml`, `post-merge.yaml`, and the refactored `nightly.yml`. RC doesn't build, so it doesn't use this.

All callers pass `fresh_builder: 'true'` and a `github.run_id`-keyed builder name to eliminate stale-context collisions across runs.

Cache strategy per workflow
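The per-workflow cache rules in this section can be summarized as a small flag builder. This is a sketch under assumptions: the function name, the registry.example.com host, and the use of `type=registry` cache references are illustrative guesses based on the cache tags named in this PR, not the workflows' actual shell.

```python
from typing import List, Optional


def cache_flags(workflow: str, registry: str, pr_number: Optional[int] = None) -> List[str]:
    """Build docker buildx cache arguments for one workflow flavor."""

    def registry_ref(tag: str) -> str:
        return f"type=registry,ref={registry}/aiperf:{tag}"

    if workflow == "pr":
        if pr_number is None:
            raise ValueError("pr builds need a PR number")
        pr_cache = registry_ref(f"cache-pr-{pr_number}")
        return [
            # Read the PR's own cache first, then fall back to main's cache.
            "--cache-from", pr_cache,
            "--cache-from", registry_ref("cache-post-merge"),
            # Write only to the per-PR tag so concurrent PRs never collide.
            "--cache-to", f"{pr_cache},mode=max",
        ]
    if workflow == "post-merge":
        shared = registry_ref("cache-post-merge")
        return ["--cache-from", shared, "--cache-to", f"{shared},mode=max"]
    if workflow in ("nightly", "rc"):
        # Nightly uses no registry cache; RC does not build at all.
        return []
    raise ValueError(f"unknown workflow: {workflow}")


if __name__ == "__main__":
    # registry.example.com is a placeholder for the real ECR host.
    print(" ".join(cache_flags("pr", "registry.example.com", pr_number=951)))
```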
- `pr.yaml`: `--cache-to` writes to `aiperf:cache-pr-<N>` only (PRs never write to other flavors' caches; concurrent PRs cannot blow each other's manifests away). `--cache-from` reads `cache-pr-<N>` first, then falls back to `cache-post-merge` so a brand-new PR's first build gets warm layers from main.
- `post-merge.yaml`: shared `aiperf:cache-post-merge` (read + write).
- `nightly.yml`: no registry cache. Nightly is the rigorous build that revalidates the Dockerfile from scratch against the BuildKit pod's persistent layer cache only. Drift in pinned base images, system packages, or transitive deps surfaces here rather than being masked by a stale registry cache.
- `rc.yaml`: doesn't build; N/A.

Files deleted
- `ATTRIBUTIONS-Python.md`, `ATTRIBUTIONS-container.md`: superseded by Dockerfile's `python-licenses` stage; auto-generated and shipped inside the container at `/licenses/`.
- `.github/actions/create-multiarch-manifest/`: buildx now produces the manifest natively, no consumers left.

Before merging
- Create the `manual-release-approval` GitHub environment under Settings > Environments. Configure the release reviewer list. This must exist before any RC dispatch can succeed.
- Verify the `automated-release` env has the secrets RC needs: `NGC_PUBLISH_TOKEN`, `NGC_PUBLISH_USERNAME`, `NGC_STAGING_IMAGE_BASE`, `ARTIFACTORY_URL`, `ARTIFACTORY_USER`, `ARTIFACTORY_TOKEN`, `ARTIFACTORY_PYPI_URL`, `GITLAB_TRIGGER_TOKEN`, `GITLAB_PIPELINE_URL`. These already exist for nightly; RC reuses them.
- Add a lifecycle policy for `aiperf:cache-pr-*` images so closed PRs' caches expire after N days. Not blocking for merge.

Test plan
- Push to a `pull-request/0` branch; confirm `pr.yaml` produces `pr/pr0/aiperf-...whl` in artifactory and `aiperf:pr-0-<sha>` (multi-arch manifest) in ECR.
- Push to `main`; confirm `post-merge.yaml` produces `post-merge/<run_id>/aiperf-<base>a<run>+g<sha>-py3-none-any.whl` and `aiperf:main-<sha>`; verify the `automated-release` gate pauses for approval.
- Create a `release/0.9.0` branch with `version = "0.9.0"` committed to `pyproject.toml`; push a small change; confirm the wheel filename is exactly `aiperf-0.9.0-py3-none-any.whl` (no alpha/rc suffix, no local-version label), which shows the workflow did not rewrite `pyproject.toml`.
- Dispatch `nightly.yml`; confirm a single build job replaces the prior two-job + manifest-stitch topology, and `docker buildx imagetools inspect aiperf:nightly-<date>-<sha>` shows both `amd64` and `arm64` manifests in the OCI index.
- After creating `manual-release-approval`, dispatch `rc.yaml` with a known commit SHA from a release branch; confirm the workflow finds the post-merge run via `gh api`, pauses on `approve`, then after approval copies ECR → NGC and Artifactory `post-merge/...` → `rc/<X.Y.Z>/rc<N>/...`.

🤖 Generated with Claude Code