Skip to content

Harden production promotion workflow#760

Merged
justin808 merged 3 commits into
masterfrom
jg-codex/harden-production-promotion-health
Jun 3, 2026
Merged

Harden production promotion workflow#760
justin808 merged 3 commits into
masterfrom
jg-codex/harden-production-promotion-health

Conversation

@justin808

@justin808 justin808 commented Jun 3, 2026

Copy link
Copy Markdown
Member

Summary

The live production promotion already succeeded in https://github.com/shakacode/react-webpack-rails-tutorial/actions/runs/26854234803; this PR carries the hardened generated workflow/docs forward for future runs.

Validation

  • git diff --check
  • bin/conductor-exec ruby -e 'require "yaml"; Dir[".github/workflows/*.yml"].sort.each { |path| YAML.load_file(path); puts path }'
  • bin/conductor-exec bundle check
  • bin/conductor-exec bundle exec rubocop
  • bin/conductor-exec bin/test-cpflow-github-flow ruby /private/tmp/control-plane-flow-docs.XP4Efz/bin/cpflow
  • bin/conductor-exec bundle exec ruby -e 'require "cpflow/version"; puts Cpflow::VERSION'

bin/conductor-exec bundle exec rspec was attempted, but local RSpec cannot start without a Postgres server listening at /tmp/.s.PGSQL.5432 in this environment. GitHub RSpec CI covers the branch.


Note

High Risk
Changes production promotion (env parity, image copy, readiness gates) and pins all deploy workflows to a specific upstream commit; misconfiguration could block or mis-run promotions.

Overview
This PR hardens production promotion and aligns generated cpflow automation with upstream control-plane-flow hardening before it ships as a release tag.

Workflow pins: All generated cpflow-* reusable workflows and the production promotion checkout move from v5.0.4 to upstream commit 2d8225572edd6f54c83ba9c51bd2983546989e93. The bundled cpflow gem is bumped to 5.1.0.

Promotion workflow (cpflow-promote-staging-to-production):

  • Env parity now compares staging vs production at both GVC and each configured workload’s container env names (not secret values).
  • Image copy uses Docker Buildx imagetools (registry login, collision check, safer tag/commit parsing) instead of pull/tag/push.
  • Health and rollback require status.ready and status.readyLatest before treating workloads as ready.

Docs: Control Plane guides add cpflow apply-template for production after bootstrap/template changes, clarify temporary commit-SHA pinning vs v5.1.0, and generalize gh secret examples to OWNER/REPO.

Reviewed by Cursor Bugbot for commit 0115b68. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • Documentation

    • Updated Control Plane workflow guides for production bootstrap, promotion, and testing procedures.
    • Enhanced troubleshooting and CI automation documentation with improved version pinning guidance.
  • Chores

    • Updated cpflow gem to v5.1.0.
    • Updated GitHub workflows to reference newer upstream versions.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

🚀 Quick Review App Commands

Welcome! Here are the commands you can use in this PR:
They require the repository to have cpflow review apps configured, including the CPLN_TOKEN_STAGING secret.

+review-app-deploy

Deploy your PR branch for testing.

+review-app-delete

Remove the review app when done.

+review-app-help

Show detailed instructions, environment setup, and configuration options.

Comment +review-app-help for full setup details.

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

Documentation and GitHub Actions are updated to pin generated wrappers/workflows to a specific upstream commit SHA for promotion-hardening tests (leave CPFLOW_VERSION unset while testing), many uses: refs repointed to the pinned commit, the promote workflow adds env-var parity checks and Buildx-based image publishing with retries, readiness requires both status.ready and status.readyLatest, and cpflow is bumped to 5.1.0.

Changes

Control Plane Flow Version Pinning and Production Promotion Hardening

Layer / File(s) Summary
Version locking and commit SHA pinning documentation
.controlplane/docs/testing-cpflow-github-actions.md, .controlplane/readme.md, .controlplane/shakacode-team.md, .github/cpflow-help.md
Docs updated to require leaving CPFLOW_VERSION unset while testing an immutable upstream commit SHA and to repin wrappers to a release tag only after the upstream hardening PR is released.
Persistent app bootstrap and promotion docs
.controlplane/readme.md, .controlplane/shakacode-team.md
Adds cpflow setup-app vs cpflow apply-template guidance, a production cpflow apply-template example, and documents env-var name parity checks and readiness gating.
Workflow reusable workflow reference repins
.github/workflows/cpflow-cleanup-stale-review-apps.yml, .github/workflows/cpflow-delete-review-app.yml, .github/workflows/cpflow-deploy-review-app.yml, .github/workflows/cpflow-deploy-staging.yml, .github/workflows/cpflow-help-command.yml, .github/workflows/cpflow-review-app-help.yml
All uses: references to shakacode/control-plane-flow reusable workflows are repinned from the prior tag to a specific upstream commit SHA.
Production promotion workflow logic enhancements
.github/workflows/cpflow-promote-staging-to-production.yml
Promotion workflow: checkout/setup refs pinned; improved malformed org-name display; computes and compares GVC and workload env var names (warn/fail behavior); adds Docker Buildx setup and Buildx imagetools-based image publish with retries; notes copy+deploy coupling; readiness polling requires both .status.ready and .status.readyLatest.
Cpflow gem version update
Gemfile
Bumps gem "cpflow" from 5.0.4 to 5.1.0.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 I tuck a SHA beneath my paw,
Docs hum the steps I learned to draw,
Buildx hops and images soar,
Two ready flags, then open door,
Cpflow 5.1 — a tidy chore.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'Harden production promotion workflow' directly reflects the main objective: updating and hardening the production promotion workflow with upstream control-plane-flow changes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jg-codex/harden-production-promotion-health

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude Bot commented Jun 3, 2026

Copy link
Copy Markdown

Code Review — PR #760: Harden production promotion workflow

Overview

This PR adds meaningful hardening to the staging→production promotion path: workload-level env parity checks, safer image copy via Docker Buildx imagetools (no layer pull), explicit registry login outside the retry loop, a collision guard on the production image tag, and dual status.ready + status.readyLatest rollback gating. The approach is sound and the live promotion success cited in the PR description is encouraging.

The temporary SHA pin (9ef104c246670d6c1ea4132dfd22be68ef930a70) is an acceptable short-term state as long as there's a clear plan to repin to a release tag once upstream PR #356 ships. The docs are upfront about this.


Issues

🐛 Redundant staging_image assignment (cpflow-promote-staging-to-production.yml ~line 470)

if [[ "${STAGING_IMAGE}" == *@* ]]; then
  staging_image="${STAGING_IMAGE}"
else
  staging_image="${STAGING_IMAGE%%@*}"
fi

Both branches assign the same value. When *@* matches, staging_image gets the full string. When it doesn't match, %%@* strips nothing (no @ present) and also returns the full string. This simplifies to staging_image="${STAGING_IMAGE}". If the intent was to strip a digest suffix only when no @ is present (e.g. foo:tag@sha256:…foo:tag), the logic is inverted — the branches need to be swapped.

🔒 Removed security guidance in cpflow-help.md

Two important security-oriented paragraphs were dropped from the Standard Setup section:

  1. Fork PR token isolation — the guidance that public repos should use a staging token that cannot reach production resources, and that fork PR heads are skipped because Docker builds use repository secrets. Without this, operators of public forks may grant overly-broad tokens or be surprised when fork PRs don't get review apps.

  2. Review app secret exposure — the warning that any value mounted via cpln://secret/… is readable by PR code once the workload starts, and that review-app secret dictionaries should be limited to disposable/review-only credentials. This is particularly important for public repos where untrusted contributors can open PRs.

Both should be restored (or the removal justified). Security guidance that prevents misconfiguration is worth keeping even if it's verbose.

🔒 SSH key description loses security guidance (cpflow-help.md line 146)

Before: Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key.
After: Private SSH key for Docker builds that fetch private dependencies.

Removing "read-only, revocable" and the explicit "Do not use a personal SSH key" warning weakens the operator guidance for a credential that ends up in CI secrets.

🔧 Minor: check_required_vars side-effects env_check_failed via global assignment

The helper function modifies env_check_failed from the outer scope without any declare -g or other signal — it relies on bash's implicit global mutation. This works correctly today but is a footgun if the function is ever called from a subshell (e.g. inside $(…)), where the assignment would be silently lost and exit "${env_check_failed}" would always exit 0 even on failure. A comment noting the global-mutation pattern would reduce future confusion.


What's good

  • docker buildx imagetools create avoids pulling image layers — much more efficient than pull/tag/push for large images.
  • Login and collision checks moved outside the retry loop — correct; these don't benefit from retrying on transient errors.
  • --max 0 on cpln image query avoids pagination truncation.
  • The readyLatest addition to rollback gating is a meaningful correctness improvement.
  • printf '%s' for token piping into docker login avoids shell quoting pitfalls.
  • comm + process substitution for env parity checks is clean and handles empty variable lists safely.

Comment on lines +470 to +474
if [[ "${STAGING_IMAGE}" == *@* ]]; then
staging_image="${STAGING_IMAGE}"
else
staging_image="${STAGING_IMAGE%%@*}"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both branches of this if/else produce the same result.

  • When *@* matches: staging_image="${STAGING_IMAGE}" — the full string.
  • When it doesn't match: ${STAGING_IMAGE%%@*} removes the longest suffix starting at @, but since there is no @, the full string is returned unchanged.

This collapses to staging_image="${STAGING_IMAGE}" — the conditional is dead code.

If the intent is to strip a digest from a reference like app:tag@sha256:… and keep only app:tag, the logic needs to be inverted — strip the digest when @ is present:

Suggested change
if [[ "${STAGING_IMAGE}" == *@* ]]; then
staging_image="${STAGING_IMAGE}"
else
staging_image="${STAGING_IMAGE%%@*}"
fi
if [[ "${STAGING_IMAGE}" == *@* ]]; then
staging_image="${STAGING_IMAGE%%@*}"
else
staging_image="${STAGING_IMAGE}"
fi

Comment thread .github/cpflow-help.md
No repository variables are required for the standard review-app path when
`.controlplane/controlplane.yml` has exactly one review app entry with
`match_if_app_name_starts_with: true`. cpflow infers the review-app prefix and
staging org from that config.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two security-relevant paragraphs were removed from this section. Consider restoring them:

Fork PR token isolation (was between the token table and "No repository variables are required…"):

For public repositories, use a staging/review token that cannot access production Control Plane resources. Generated review-app deploys skip fork PR heads because Docker builds use repository secrets. If a forked change needs a review app, first move the reviewed change to a trusted branch in this repository.

Review app secret exposure (was before "Optional overrides exist…"):

Review apps run pull request code. Any value mounted through cpln://secret/... can be read by that code after the workload starts, so keep review-app secret dictionaries limited to disposable databases, review-only renderer credentials, and license values that are acceptable for review-app exposure.

Both are actionable operator guidance — especially for public repos where untrusted contributors can open PRs. Removing them increases the chance of a misconfigured deployment that exposes production-grade secrets to review app code.

Comment thread .github/cpflow-help.md Outdated
| --- | --- |
| `DOCKER_BUILD_EXTRA_ARGS` | Newline-delimited extra Docker build tokens. |
| `DOCKER_BUILD_SSH_KEY` | Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key. |
| `DOCKER_BUILD_SSH_KEY` | Private SSH key for Docker builds that fetch private dependencies. |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous description was: Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key.

The new text drops "read-only, revocable" and the explicit warning against personal SSH keys. These are meaningful security constraints — a personal SSH key has much broader access than a deploy key, and the "revocable" property is important for incident response. Suggest restoring the original guidance:

Suggested change
| `DOCKER_BUILD_SSH_KEY` | Private SSH key for Docker builds that fetch private dependencies. |
| `DOCKER_BUILD_SSH_KEY` | Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key. |

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Copy link
Copy Markdown

Greptile Summary

This PR regenerates the cpflow GitHub Actions wrappers from upstream commit 9ef104c246670d6c1ea4132dfd22be68ef930a70 and bumps the local cpflow gem from 5.0.4 to 5.1.0, with the stated intent of hardening the production promotion workflow until the upstream changes land in a release tag.

  • Promotion workflow hardening: env-parity checks now cover both GVC-level and per-workload container env vars; Docker Buildx imagetools create replaces the pull/tag/push chain; Docker logins and a pre-flight duplicate-image guard are hoisted before the retry loop; rollback readiness now gates on both status.ready and status.readyLatest.
  • Workflow ref pinning: all five reusable uses: references and the direct actions/checkout ref in the promotion job are updated from v5.0.4 to the full 40-character commit SHA.
  • Docs updated: readme.md, shakacode-team.md, and testing-cpflow-github-actions.md document the temporary pin, GVC/workload env-name parity requirement, and the cpflow apply-template re-application command.

Confidence Score: 4/5

Safe to merge; the promotion workflow already ran successfully in production and the two code-level concerns are non-blocking.

The bulk of the changes are mechanical ref-pin updates and documentation. The promotion workflow logic improvements (Buildx copy, workload env checks, duplicate guard, readyLatest gating) are solid additions. Two issues are worth follow-up: the >&2 on validate_control_plane_org's ::error:: call will silently lose the GitHub Actions annotation, and the staging-image if/else produces the same value in both branches, which may pass a digest string to cpln image get unexpectedly. Neither blocks merging given that the production run already succeeded.

.github/workflows/cpflow-promote-staging-to-production.yml for the stderr redirect and staging-image extraction logic; .github/cpflow-help.md for removed security guidance.

Important Files Changed

Filename Overview
.github/workflows/cpflow-promote-staging-to-production.yml Major hardening: adds workload-level env-parity checks, Docker Buildx for registry-to-registry copy, pre-flight duplicate-image guard, and status.readyLatest health gating; contains a redundant if/else for staging image extraction and a misplaced >&2 on an error annotation.
.github/cpflow-help.md CI automation guide link updated to commit SHA and secret-management commands genericized; two security-warning paragraphs and deploy-key best-practice guidance removed.
.github/workflows/cpflow-cleanup-stale-review-apps.yml Single-line ref pin change from v5.0.4 to the upstream commit SHA; no logic changes.
.github/workflows/cpflow-delete-review-app.yml Single-line ref pin change from v5.0.4 to the upstream commit SHA; no logic changes.
.github/workflows/cpflow-deploy-review-app.yml Single-line ref pin change from v5.0.4 to the upstream commit SHA; no logic changes.
.github/workflows/cpflow-deploy-staging.yml Single-line ref pin change from v5.0.4 to the upstream commit SHA; no logic changes.
Gemfile Bumps cpflow gem pin from 5.0.4 to 5.1.0; Gemfile.lock updated consistently.
.controlplane/readme.md Documents new GVC+workload env-parity check, adds cpflow apply-template bootstrap snippet, and updates the temporary commit-SHA pinning rationale.
.controlplane/shakacode-team.md Adds cpflow apply-template re-application snippet and updates the pinning notes from release tag to temporary commit SHA.

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Actions
    participant CPLN_S as Control Plane Staging
    participant CPLN_P as Control Plane Production
    participant REG_S as Staging Registry
    participant REG_P as Production Registry

    GH->>GH: Validate confirm_promotion input
    GH->>CPLN_P: Validate CPLN_TOKEN_PRODUCTION present
    GH->>GH: Resolve workload names from controlplane.yml
    GH->>CPLN_S: list_gvc_env_names
    GH->>CPLN_P: list_gvc_env_names
    GH->>GH: check_required_vars GVC parity
    loop Each workload
        GH->>CPLN_S: list_workload_env_names
        GH->>CPLN_P: list_workload_env_names
        GH->>GH: check_required_vars workload parity
    end
    GH->>CPLN_P: Capture current production image rollback state
    GH->>CPLN_S: Capture deployed staging image
    GH->>CPLN_S: cpln image get validate staging image
    GH->>CPLN_P: cpln image query find next image number
    GH->>REG_P: imagetools inspect pre-flight duplicate check
    GH->>REG_S: docker login
    GH->>REG_P: docker login
    loop Copy retry loop
        GH->>REG_S: imagetools inspect source
        GH->>REG_P: imagetools create registry-to-registry copy
    end
    GH->>CPLN_P: cpflow deploy-image
    GH->>CPLN_P: Wait for status.ready and status.readyLatest
    alt Deployment healthy
        GH->>GH: Create GitHub release
    else Deployment failed
        GH->>CPLN_P: Rollback workloads to captured images
        GH->>CPLN_P: Poll rollback readiness parallel per workload
    end
Loading

Reviews (1): Last reviewed commit: "Harden production promotion workflow" | Re-trigger Greptile

local display_value
display_value="$(printf '%q' "${value}")"
echo "::error::${label} (${display_value}) must be a valid Control Plane org name; use lowercase alphanumeric characters and hyphens only, with no leading or trailing hyphen."
echo "::error::${label} (${display_value}) must be a valid Control Plane org name; use lowercase alphanumeric characters and hyphens only, with no leading or trailing hyphen." >&2

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 ::error:: annotation lost to stderr

validate_control_plane_org is called directly — not inside a command substitution — so its stdout is not captured by any parent process. Redirecting the ::error:: message to stderr with >&2 means the GitHub Actions runner will not process it as a workflow annotation; the job still fails via exit 1, but no red error badge appears in the PR checks UI. The neighboring sanitize_control_plane_name correctly uses >&2 because it is called inside $(...), where capturing stdout is necessary; the same redirect here has the opposite effect.

Comment on lines 470 to 478
if [[ "${STAGING_IMAGE}" == *@* ]]; then
staging_image="${STAGING_IMAGE}"
else
staging_image="${STAGING_IMAGE%%@*}"
fi
if [[ -z "${staging_image}" ]]; then
echo "::error::Staging image '${STAGING_IMAGE}' did not contain a usable image reference."
exit 1
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant if/else — both branches produce the same value

When STAGING_IMAGE contains @ the true-branch assigns staging_image="${STAGING_IMAGE}". When it does not contain @, the false-branch assigns staging_image="${STAGING_IMAGE%%@*}", which removes the longest suffix matching @*; since there is no @, the pattern cannot match and the full string is returned unchanged. In both cases staging_image ends up equal to STAGING_IMAGE.

If the intent was to strip @sha256:... for the cpln image get call that follows, the branches should be swapped — %%@* should be applied in the true-branch (when @ is present). As written, cpln image get receives a digest reference when the image was pulled by digest, which may or may not be accepted by the cpln API.

Comment thread .github/cpflow-help.md
exposure.

Optional overrides exist for forks, clones, and unusual apps:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Security guidance removed from user-facing documentation

This update removes two security-relevant paragraphs that were previously present:

  1. The fork-PR warning — that for public repositories the staging token must not be able to access production resources, and that forked PR code should be moved to a trusted branch before a review app is deployed — is now gone.
  2. The review-app secrets warning — that any secret mounted via cpln://secret/... is readable by the running workload, and review-app secret dictionaries should be limited to disposable or low-sensitivity credentials — is also gone.

Additionally, the DOCKER_BUILD_SSH_KEY description was shortened from "Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key." to "Private SSH key for Docker builds that fetch private dependencies." removing the recommendation to use a deploy key over a personal key.

If these paragraphs were removed intentionally as part of the upstream generator update, consider preserving the security guidance as local additions to this file so future consumers of this template are not left without it.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

# production promotions cannot derive and publish the same next tag.
latest_number="$(
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json |
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verify --max 0 semantics before merging.

If --max means "maximum results to return" and 0 is interpreted as "no results" (rather than "unlimited"), then .items will always be empty, jq will return max // 0 = 0, and every promotion will derive image number 1. The collision guard at line 537 would then abort every second promotion with "already exists", effectively breaking all future runs after the first.

Please confirm in the cpln CLI docs (or by a quick cpln image query --help) that --max 0 means "no limit / return all" rather than "return 0 items". If the semantics are uncertain, --max 9999 or omitting the flag entirely is safer until confirmed.

Comment on lines +490 to +509
staging_commit=""
if [[ "${staging_tag}" == *_* ]]; then
staging_commit="${staging_tag##*_}"
fi

# The workflow-level concurrency group serializes this sequence so two
# production promotions cannot derive and publish the same next tag.
latest_number="$(
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json |
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json |
jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \
'[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0'
)"
production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))_${staging_commit}"
if ! [[ "${latest_number}" =~ ^[0-9]+$ ]]; then
echo "::error::Could not determine the next production image number for app '${PRODUCTION_APP_NAME}' in org '${CPLN_ORG_PRODUCTION}'."
exit 1
fi

production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))"
if [[ -n "${staging_commit}" ]]; then
production_image="${production_image}_${staging_commit}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent commit-suffix fallback reduces production image traceability.

The old code hard-failed if STAGING_IMAGE lacked a _<commit> suffix; the new code silently produces a production image with no commit hash (e.g. react-webpack-rails-tutorial-production:42 instead of ...:42_abc1234). That makes it harder to correlate a production image back to the source commit after the fact.

If the intent is to allow digest-referenced images (image@sha256:...), the fallback is understandable — but consider at least emitting a ::warning:: when staging_commit is empty so it's visible in the run log:

Suggested change
staging_commit=""
if [[ "${staging_tag}" == *_* ]]; then
staging_commit="${staging_tag##*_}"
fi
# The workflow-level concurrency group serializes this sequence so two
# production promotions cannot derive and publish the same next tag.
latest_number="$(
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json |
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json |
jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \
'[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0'
)"
production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))_${staging_commit}"
if ! [[ "${latest_number}" =~ ^[0-9]+$ ]]; then
echo "::error::Could not determine the next production image number for app '${PRODUCTION_APP_NAME}' in org '${CPLN_ORG_PRODUCTION}'."
exit 1
fi
production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))"
if [[ -n "${staging_commit}" ]]; then
production_image="${production_image}_${staging_commit}"
staging_commit=""
if [[ "${staging_tag}" == *_* ]]; then
staging_commit="${staging_tag##*_}"
fi
if [[ -z "${staging_commit}" ]]; then
echo "::warning::Staging image '${staging_image}' has no '_<commit>' suffix; production image will not include a commit hash."
fi

Comment on lines +537 to +540
if docker buildx imagetools inspect "${production_image_ref}" >/dev/null 2>&1; then
echo "::error::Production image '${production_image}' already exists in org '${CPLN_ORG_PRODUCTION}'; aborting to avoid overwriting it."
exit 1
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU note — relies on concurrency group being configured correctly.

There is a TOCTOU window between this inspect check and the imagetools create call at line 545: a second concurrent promotion could pass this guard and then both attempt to create the same tag. The inline comment at line 495 documents this dependency on the workflow-level concurrency group, which is the right mitigation.

One additional hardening option for the future: pass --tag with an explicit digest on imagetools create so that if the tag was already written the push fails at the registry level rather than silently succeeding or producing non-deterministic state. For now, the concurrency group is sufficient — just make sure it is set in the calling workflow.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13ef0716bd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +273 to +275
CPLN_TOKEN="${token}" cpln workload get "${workload}" --gvc "${app}" --org "${org}" -o json |
jq -r '.spec.containers // [] | .[] | (.env // [])[]? | .name // empty' |
sort -u

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compare workload envs per container

For workloads with multiple containers, this flattens every container's env names into one workload-wide set, so production can pass the parity check even when the env is missing from the corresponding container (for example, staging web has DATABASE_URL, production web lacks it, but a sidecar has DATABASE_URL). That lets promotion proceed with a container that still differs from staging; include the container name in the compared key or compare each container separately.

Useful? React with 👍 / 👎.

@claude

claude Bot commented Jun 3, 2026

Copy link
Copy Markdown

Code Review: Harden production promotion workflow

Overview

This PR upgrades all control-plane-flow GitHub Actions reusable-workflow refs from the v5.0.4 release tag to commit SHA 01dd1d231ce3d8849bcb7ed36b9fd9d184eb3350, bumps the cpflow gem to 5.1.0, and hardens the production promotion workflow in three main areas:

  1. Env-name parity checks — extended from GVC-level only to per-workload container level, with deferred error accumulation so all parity failures surface in a single run.
  2. Image copy — replaced pull/tag/push with docker buildx imagetools create (manifest-only copy, no layer download), with explicit registry login, an empty-tag guard, and a collision check.
  3. Readiness gating — rollback now waits for both status.ready and status.readyLatest before probing the public endpoint.

Issues

PR description references the wrong commit SHA

The first bullet says:

Pin the wrappers to merged upstream control-plane-flow commit 9ef104c246670d6c1ea4132dfd22be68ef930a70

The code, readme, and the "Overview" section of the same PR description all use 01dd1d231ce3d8849bcb7ed36b9fd9d184eb3350. The bullet should be corrected to avoid confusion about which upstream commit is actually pinned.

--max 0 on cpln image query (see inline comment)

This is the highest-risk change to verify. If --max 0 means "return 0 items" rather than "return all", the jq expression always produces 0, every promotion derives image tag :1, and the collision guard blocks the second promotion permanently. Please confirm the flag semantics before merging.

Silent staging_commit fallback (see inline comment)

The old code failed hard if the staging image had no _<commit> suffix; the new code silently omits the commit from the production image name. This reduces traceability. At minimum, emit a ::warning:: so the absence is visible in the run log.


Positive observations

  • Bash correctness: local missing_vars / local production_only_vars are declared on their own lines before the $(...) assignments — correctly avoids the bash gotcha where local var="$(failing_cmd)" masks the exit code.
  • env_check_failed mutation pattern is sound: bash functions run in the same process, so the comment at line 310 ("keep calls outside subshells") is accurate and the while read < <(...) process-substitution pattern keeps the loop body in the current shell.
  • echo "::error::..." >&2 fix in validate_org_name is a good catch — previously the error message went to stdout, which GitHub Actions may not annotate correctly from all contexts.
  • Registry login moved outside the retry loop — this is the right change; retrying a credential operation that just succeeded on the previous attempt was wasteful and could mask a real auth failure vs. a transient network issue.
  • --prefer-index=false on imagetools create preserves the original manifest type (OCI vs Docker), avoiding a silent schema downgrade during copy.
  • Collision guard before the copy loop prevents overwriting an existing production tag, which was previously unguarded.
  • status.readyLatest added to rollback check closes a real gap: status.ready alone can be true for the old revision while the new one is still rolling out.
  • Gemfile + Gemfile.lock updated together and the psych transitive bump is consistent.

Minor notes

  • OWNER/REPO in .github/cpflow-help.md: The gh secret examples were changed from the actual repo name to a generic placeholder. This is better for reusability when forking, but operationally the original concrete commands were easier to copy-paste. Consider keeping the concrete repo name in a comment alongside the generic form, or document it in the preceding prose.
  • Concurrency group dependency: The collision guard and the sequential number derivation both rely on the workflow-level concurrency group being correctly configured in the caller. It would be helpful to add a comment linking to or naming the expected concurrency group key so a future maintainer knows what to look for.

@claude

claude Bot commented Jun 3, 2026

Copy link
Copy Markdown

Code Review: Harden production promotion workflow

Overview

This PR hardens the production promotion workflow with solid improvements:

  • Env-var parity checks expanded from GVC-level to include each workload's container env
  • docker pull/tag/push replaced with docker buildx imagetools create (registry-level copy, no local pull)
  • Collision detection added before writing the production image tag
  • Readiness gate now requires both status.ready and status.readyLatest
  • All workflow uses: refs temporarily pinned to an unreleased upstream commit SHA while the hardening PRs clear the release process

The live promotion already succeeded, and the docs clearly explain the temporary SHA pinning. The overall structure is clean.

What's Good

  • Fetching workload status once and parsing both .status.ready and .status.readyLatest fields avoids a second API round-trip
  • The collision check before imagetools create prevents silent tag overwrites
  • env_check_failed accumulates failures across all workloads before a single exit — cleaner than early-exiting per workload
  • printf '%s' | docker login --password-stdin correctly avoids token exposure in /proc/<pid>/cmdline
  • set -euo pipefail throughout; >&2 added to the error-message echo
  • GVC + per-workload env parity is meaningfully stronger than GVC-only

Issues Worth Addressing

Medium — Verify --max 0 does not mean "zero results"

cpln image query ... --max 0 -o json (line 501 of the workflow file)

Some CLIs interpret --max 0 as "return zero items" rather than "no limit." If cpln does that here, the jq query always returns max // 0 = 0, every promotion generates tag :1_<commit>, and the collision check immediately blocks all future promotions. Please confirm via cpln image query --help or cpln docs that 0 means "unlimited" in this context.

Low — Docker auth not refreshed inside the retry loop

The docker login calls were moved before the retry loop (good for reducing redundant calls), but if COPY_IMAGE_RETRY_INTERVAL × COPY_IMAGE_RETRIES is large enough that a token expires mid-retry, the copy will fail with an auth error rather than a retriable copy error. For the current defaults this is benign, but a small note near the retry-interval config would help future maintainers.

Low — --prefer-index=false silently drops multi-platform manifests

docker buildx imagetools create --prefer-index=false copies the resolved single-platform manifest rather than the image index when the source is a multi-arch image. This is harmless for a single-arch tutorial app, but if the build pipeline ever goes multi-arch the production image will silently lose the other platforms. Worth a comment explaining why this flag is appropriate for this repo.

Info — Gem version vs. workflow SHA mismatch requires CPFLOW_VERSION to stay unset

Gemfile pins cpflow = 5.1.0 while all workflow uses: refs point to the pre-release commit 2d82255.... Any developer who also sets the CPFLOW_VERSION repository variable to 5.1.0 will trigger the version-mismatch guard in the setup action and break promotions silently. The docs cover this, but a short # CPFLOW_VERSION must remain unset — see cpflow-help.md comment near the cpflow_version: input in the promotion workflow would make it harder to miss.

# See the top-level concurrency group: cpflow-promote-staging-to-production.
latest_number="$(
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json |
cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verify --max 0 semantics before shipping.

Some CLIs treat --max 0 as "return zero results." If cpln does that here, jq always evaluates max // 00, every promotion generates tag :1_<commit>, and the collision check on line ~541 immediately aborts all future runs.

Please confirm via cpln image query --help that 0 means "unlimited" (no pagination cap). If the flag is meant to disable pagination and the correct value is something else (e.g. no flag, or a large number), update accordingly.

docker tag "${source_image_ref}" "${production_image_ref}" &&
docker push "${production_image_ref}"; then
if docker buildx imagetools inspect "${source_image_ref}" >/dev/null &&
docker buildx imagetools create --prefer-index=false --tag "${production_image_ref}" "${source_image_ref}"; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--prefer-index=false silently drops multi-platform manifests.

This flag resolves an image index to a single-platform manifest before copying. For a single-arch tutorial image this is fine, but if the build pipeline ever produces a multi-arch image, the production copy will silently lose the non-primary platforms.

Consider either removing the flag (let imagetools create preserve the original index) or adding an inline comment explaining why single-manifest semantics are intentional for this repo:

Suggested change
docker buildx imagetools create --prefer-index=false --tag "${production_image_ref}" "${source_image_ref}"; then
docker buildx imagetools create --tag "${production_image_ref}" "${source_image_ref}"; then

If --prefer-index=false is needed to work around a specific cpln registry quirk, document that reason here.

Comment on lines 525 to +529

export DOCKER_CONFIG="${docker_config_dir}"

if ! printf '%s' "${CPLN_TOKEN_STAGING}" |
docker login "${staging_registry}" -u '<token>' --password-stdin >/dev/null; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auth tokens not refreshed inside retry loop.

Moving both docker login calls before the retry loop is correct for the common case (avoids redundant auth on every attempt). However, if COPY_IMAGE_RETRY_INTERVAL × COPY_IMAGE_RETRIES is large enough that a Control Plane token expires mid-sequence, the retry will fail with a 401 rather than a retriable copy error, and the error message won't mention auth.

For the current default intervals this is benign. A short comment here noting the assumption that tokens remain valid for the full retry window would help future maintainers who raise the retry interval:

Suggested change
export DOCKER_CONFIG="${docker_config_dir}"
if ! printf '%s' "${CPLN_TOKEN_STAGING}" |
docker login "${staging_registry}" -u '<token>' --password-stdin >/dev/null; then
# Authenticate once before the retry loop. Tokens are assumed valid for
# the full retry window (COPY_IMAGE_RETRY_INTERVAL × COPY_IMAGE_RETRIES).
# If token expiry becomes an issue, move logins inside the retry loop.
if ! printf '%s' "${CPLN_TOKEN_STAGING}" |

missing_vars="$(comm -23 <(printf '%s\n' "${staging_vars}") <(printf '%s\n' "${production_vars}"))"
production_only_vars="$(comm -13 <(printf '%s\n' "${staging_vars}") <(printf '%s\n' "${production_vars}"))"
list_workload_env_names() {
local token="$1"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a warning comment that CPFLOW_VERSION must stay unset while on the SHA pin.

Gemfile now pins cpflow = 5.1.0, but all workflow uses: refs point to the pre-release commit 2d8225572.... If someone sets the CPFLOW_VERSION repository variable to 5.1.0, the setup action's version-mismatch guard will abort the promotion.

A short inline note here would make the constraint visible at the call site:

Suggested change
local token="$1"
cpflow_version: ${{ vars.CPFLOW_VERSION }}
# NOTE: CPFLOW_VERSION must be unset (not '5.1.0') while wrappers are
# pinned to the pre-release SHA. See .github/cpflow-help.md § Version Locking.
# The setup action validates this value against control_plane_flow_ref below.
control_plane_flow_ref: shakacode/control-plane-flow/.github/workflows/cpflow-promote-staging-to-production.yml@2d8225572edd6f54c83ba9c51bd2983546989e93

@justin808 justin808 merged commit 0c7afed into master Jun 3, 2026
15 checks passed
@justin808 justin808 deleted the jg-codex/harden-production-promotion-health branch June 3, 2026 06:15
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

✅ Review App Deleted

Review app for PR #760 is deleted

🎮 Control Plane Console
📋 View Workflow Logs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant