Harden production promotion workflow#760
Conversation
🚀 Quick Review App CommandsWelcome! Here are the commands you can use in this PR:
|
WalkthroughDocumentation and GitHub Actions are updated to pin generated wrappers/workflows to a specific upstream commit SHA for promotion-hardening tests (leave ChangesControl Plane Flow Version Pinning and Production Promotion Hardening
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review — PR #760: Harden production promotion workflowOverviewThis PR adds meaningful hardening to the staging→production promotion path: workload-level env parity checks, safer image copy via Docker Buildx The temporary SHA pin ( Issues🐛 Redundant
|
| if [[ "${STAGING_IMAGE}" == *@* ]]; then | ||
| staging_image="${STAGING_IMAGE}" | ||
| else | ||
| staging_image="${STAGING_IMAGE%%@*}" | ||
| fi |
There was a problem hiding this comment.
Both branches of this if/else produce the same result.
- When
*@*matches:staging_image="${STAGING_IMAGE}"— the full string. - When it doesn't match:
${STAGING_IMAGE%%@*}removes the longest suffix starting at@, but since there is no@, the full string is returned unchanged.
This collapses to staging_image="${STAGING_IMAGE}" — the conditional is dead code.
If the intent is to strip a digest from a reference like app:tag@sha256:… and keep only app:tag, the logic needs to be inverted — strip the digest when @ is present:
| if [[ "${STAGING_IMAGE}" == *@* ]]; then | |
| staging_image="${STAGING_IMAGE}" | |
| else | |
| staging_image="${STAGING_IMAGE%%@*}" | |
| fi | |
| if [[ "${STAGING_IMAGE}" == *@* ]]; then | |
| staging_image="${STAGING_IMAGE%%@*}" | |
| else | |
| staging_image="${STAGING_IMAGE}" | |
| fi |
| No repository variables are required for the standard review-app path when | ||
| `.controlplane/controlplane.yml` has exactly one review app entry with | ||
| `match_if_app_name_starts_with: true`. cpflow infers the review-app prefix and | ||
| staging org from that config. |
There was a problem hiding this comment.
Two security-relevant paragraphs were removed from this section. Consider restoring them:
Fork PR token isolation (was between the token table and "No repository variables are required…"):
For public repositories, use a staging/review token that cannot access production Control Plane resources. Generated review-app deploys skip fork PR heads because Docker builds use repository secrets. If a forked change needs a review app, first move the reviewed change to a trusted branch in this repository.
Review app secret exposure (was before "Optional overrides exist…"):
Review apps run pull request code. Any value mounted through
cpln://secret/...can be read by that code after the workload starts, so keep review-app secret dictionaries limited to disposable databases, review-only renderer credentials, and license values that are acceptable for review-app exposure.
Both are actionable operator guidance — especially for public repos where untrusted contributors can open PRs. Removing them increases the chance of a misconfigured deployment that exposes production-grade secrets to review app code.
| | --- | --- | | ||
| | `DOCKER_BUILD_EXTRA_ARGS` | Newline-delimited extra Docker build tokens. | | ||
| | `DOCKER_BUILD_SSH_KEY` | Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key. | | ||
| | `DOCKER_BUILD_SSH_KEY` | Private SSH key for Docker builds that fetch private dependencies. | |
There was a problem hiding this comment.
The previous description was: Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key.
The new text drops "read-only, revocable" and the explicit warning against personal SSH keys. These are meaningful security constraints — a personal SSH key has much broader access than a deploy key, and the "revocable" property is important for incident response. Suggest restoring the original guidance:
| | `DOCKER_BUILD_SSH_KEY` | Private SSH key for Docker builds that fetch private dependencies. | | |
| | `DOCKER_BUILD_SSH_KEY` | Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key. | |
Greptile SummaryThis PR regenerates the cpflow GitHub Actions wrappers from upstream commit
Confidence Score: 4/5Safe to merge; the promotion workflow already ran successfully in production and the two code-level concerns are non-blocking. The bulk of the changes are mechanical ref-pin updates and documentation. The promotion workflow logic improvements (Buildx copy, workload env checks, duplicate guard, readyLatest gating) are solid additions. Two issues are worth follow-up: the >&2 on validate_control_plane_org's ::error:: call will silently lose the GitHub Actions annotation, and the staging-image if/else produces the same value in both branches, which may pass a digest string to cpln image get unexpectedly. Neither blocks merging given that the production run already succeeded.
Important Files Changed
Sequence DiagramsequenceDiagram
participant GH as GitHub Actions
participant CPLN_S as Control Plane Staging
participant CPLN_P as Control Plane Production
participant REG_S as Staging Registry
participant REG_P as Production Registry
GH->>GH: Validate confirm_promotion input
GH->>CPLN_P: Validate CPLN_TOKEN_PRODUCTION present
GH->>GH: Resolve workload names from controlplane.yml
GH->>CPLN_S: list_gvc_env_names
GH->>CPLN_P: list_gvc_env_names
GH->>GH: check_required_vars GVC parity
loop Each workload
GH->>CPLN_S: list_workload_env_names
GH->>CPLN_P: list_workload_env_names
GH->>GH: check_required_vars workload parity
end
GH->>CPLN_P: Capture current production image rollback state
GH->>CPLN_S: Capture deployed staging image
GH->>CPLN_S: cpln image get validate staging image
GH->>CPLN_P: cpln image query find next image number
GH->>REG_P: imagetools inspect pre-flight duplicate check
GH->>REG_S: docker login
GH->>REG_P: docker login
loop Copy retry loop
GH->>REG_S: imagetools inspect source
GH->>REG_P: imagetools create registry-to-registry copy
end
GH->>CPLN_P: cpflow deploy-image
GH->>CPLN_P: Wait for status.ready and status.readyLatest
alt Deployment healthy
GH->>GH: Create GitHub release
else Deployment failed
GH->>CPLN_P: Rollback workloads to captured images
GH->>CPLN_P: Poll rollback readiness parallel per workload
end
Reviews (1): Last reviewed commit: "Harden production promotion workflow" | Re-trigger Greptile |
| local display_value | ||
| display_value="$(printf '%q' "${value}")" | ||
| echo "::error::${label} (${display_value}) must be a valid Control Plane org name; use lowercase alphanumeric characters and hyphens only, with no leading or trailing hyphen." | ||
| echo "::error::${label} (${display_value}) must be a valid Control Plane org name; use lowercase alphanumeric characters and hyphens only, with no leading or trailing hyphen." >&2 |
There was a problem hiding this comment.
::error:: annotation lost to stderr
validate_control_plane_org is called directly — not inside a command substitution — so its stdout is not captured by any parent process. Redirecting the ::error:: message to stderr with >&2 means the GitHub Actions runner will not process it as a workflow annotation; the job still fails via exit 1, but no red error badge appears in the PR checks UI. The neighboring sanitize_control_plane_name correctly uses >&2 because it is called inside $(...), where capturing stdout is necessary; the same redirect here has the opposite effect.
| if [[ "${STAGING_IMAGE}" == *@* ]]; then | ||
| staging_image="${STAGING_IMAGE}" | ||
| else | ||
| staging_image="${STAGING_IMAGE%%@*}" | ||
| fi | ||
| if [[ -z "${staging_image}" ]]; then | ||
| echo "::error::Staging image '${STAGING_IMAGE}' did not contain a usable image reference." | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Redundant if/else — both branches produce the same value
When STAGING_IMAGE contains @ the true-branch assigns staging_image="${STAGING_IMAGE}". When it does not contain @, the false-branch assigns staging_image="${STAGING_IMAGE%%@*}", which removes the longest suffix matching @*; since there is no @, the pattern cannot match and the full string is returned unchanged. In both cases staging_image ends up equal to STAGING_IMAGE.
If the intent was to strip @sha256:... for the cpln image get call that follows, the branches should be swapped — %%@* should be applied in the true-branch (when @ is present). As written, cpln image get receives a digest reference when the image was pulled by digest, which may or may not be accepted by the cpln API.
| exposure. | ||
|
|
||
| Optional overrides exist for forks, clones, and unusual apps: | ||
|
|
There was a problem hiding this comment.
Security guidance removed from user-facing documentation
This update removes two security-relevant paragraphs that were previously present:
- The fork-PR warning — that for public repositories the staging token must not be able to access production resources, and that forked PR code should be moved to a trusted branch before a review app is deployed — is now gone.
- The review-app secrets warning — that any secret mounted via
cpln://secret/...is readable by the running workload, and review-app secret dictionaries should be limited to disposable or low-sensitivity credentials — is also gone.
Additionally, the DOCKER_BUILD_SSH_KEY description was shortened from "Read-only, revocable deploy key for Docker builds that fetch private dependencies. Do not use a personal SSH key." to "Private SSH key for Docker builds that fetch private dependencies." removing the recommendation to use a deploy key over a personal key.
If these paragraphs were removed intentionally as part of the upstream generator update, consider preserving the security guidance as local additions to this file so future consumers of this template are not left without it.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| # production promotions cannot derive and publish the same next tag. | ||
| latest_number="$( | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json | |
There was a problem hiding this comment.
Verify --max 0 semantics before merging.
If --max means "maximum results to return" and 0 is interpreted as "no results" (rather than "unlimited"), then .items will always be empty, jq will return max // 0 = 0, and every promotion will derive image number 1. The collision guard at line 537 would then abort every second promotion with "already exists", effectively breaking all future runs after the first.
Please confirm in the cpln CLI docs (or by a quick cpln image query --help) that --max 0 means "no limit / return all" rather than "return 0 items". If the semantics are uncertain, --max 9999 or omitting the flag entirely is safer until confirmed.
| staging_commit="" | ||
| if [[ "${staging_tag}" == *_* ]]; then | ||
| staging_commit="${staging_tag##*_}" | ||
| fi | ||
|
|
||
| # The workflow-level concurrency group serializes this sequence so two | ||
| # production promotions cannot derive and publish the same next tag. | ||
| latest_number="$( | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json | | ||
| jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \ | ||
| '[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0' | ||
| )" | ||
| production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))_${staging_commit}" | ||
| if ! [[ "${latest_number}" =~ ^[0-9]+$ ]]; then | ||
| echo "::error::Could not determine the next production image number for app '${PRODUCTION_APP_NAME}' in org '${CPLN_ORG_PRODUCTION}'." | ||
| exit 1 | ||
| fi | ||
|
|
||
| production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))" | ||
| if [[ -n "${staging_commit}" ]]; then | ||
| production_image="${production_image}_${staging_commit}" |
There was a problem hiding this comment.
Silent commit-suffix fallback reduces production image traceability.
The old code hard-failed if STAGING_IMAGE lacked a _<commit> suffix; the new code silently produces a production image with no commit hash (e.g. react-webpack-rails-tutorial-production:42 instead of ...:42_abc1234). That makes it harder to correlate a production image back to the source commit after the fact.
If the intent is to allow digest-referenced images (image@sha256:...), the fallback is understandable — but consider at least emitting a ::warning:: when staging_commit is empty so it's visible in the run log:
| staging_commit="" | |
| if [[ "${staging_tag}" == *_* ]]; then | |
| staging_commit="${staging_tag##*_}" | |
| fi | |
| # The workflow-level concurrency group serializes this sequence so two | |
| # production promotions cannot derive and publish the same next tag. | |
| latest_number="$( | |
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | |
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json | | |
| jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \ | |
| '[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0' | |
| )" | |
| production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))_${staging_commit}" | |
| if ! [[ "${latest_number}" =~ ^[0-9]+$ ]]; then | |
| echo "::error::Could not determine the next production image number for app '${PRODUCTION_APP_NAME}' in org '${CPLN_ORG_PRODUCTION}'." | |
| exit 1 | |
| fi | |
| production_image="${PRODUCTION_APP_NAME}:$((latest_number + 1))" | |
| if [[ -n "${staging_commit}" ]]; then | |
| production_image="${production_image}_${staging_commit}" | |
| staging_commit="" | |
| if [[ "${staging_tag}" == *_* ]]; then | |
| staging_commit="${staging_tag##*_}" | |
| fi | |
| if [[ -z "${staging_commit}" ]]; then | |
| echo "::warning::Staging image '${staging_image}' has no '_<commit>' suffix; production image will not include a commit hash." | |
| fi |
| if docker buildx imagetools inspect "${production_image_ref}" >/dev/null 2>&1; then | ||
| echo "::error::Production image '${production_image}' already exists in org '${CPLN_ORG_PRODUCTION}'; aborting to avoid overwriting it." | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
TOCTOU note — relies on concurrency group being configured correctly.
There is a TOCTOU window between this inspect check and the imagetools create call at line 545: a second concurrent promotion could pass this guard and then both attempt to create the same tag. The inline comment at line 495 documents this dependency on the workflow-level concurrency group, which is the right mitigation.
One additional hardening option for the future: pass --tag with an explicit digest on imagetools create so that if the tag was already written the push fails at the registry level rather than silently succeeding or producing non-deterministic state. For now, the concurrency group is sufficient — just make sure it is set in the calling workflow.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13ef0716bd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| CPLN_TOKEN="${token}" cpln workload get "${workload}" --gvc "${app}" --org "${org}" -o json | | ||
| jq -r '.spec.containers // [] | .[] | (.env // [])[]? | .name // empty' | | ||
| sort -u |
There was a problem hiding this comment.
Compare workload envs per container
For workloads with multiple containers, this flattens every container's env names into one workload-wide set, so production can pass the parity check even when the env is missing from the corresponding container (for example, staging web has DATABASE_URL, production web lacks it, but a sidecar has DATABASE_URL). That lets promotion proceed with a container that still differs from staging; include the container name in the compared key or compare each container separately.
Useful? React with 👍 / 👎.
Code Review: Harden production promotion workflowOverviewThis PR upgrades all
IssuesPR description references the wrong commit SHAThe first bullet says:
The code, readme, and the "Overview" section of the same PR description all use
|
Code Review: Harden production promotion workflowOverviewThis PR hardens the production promotion workflow with solid improvements:
The live promotion already succeeded, and the docs clearly explain the temporary SHA pinning. The overall structure is clean. What's Good
Issues Worth AddressingMedium — Verify
|
| # See the top-level concurrency group: cpflow-promote-staging-to-production. | ||
| latest_number="$( | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" --max 0 -o json | |
There was a problem hiding this comment.
Verify --max 0 semantics before shipping.
Some CLIs treat --max 0 as "return zero results." If cpln does that here, jq always evaluates max // 0 → 0, every promotion generates tag :1_<commit>, and the collision check on line ~541 immediately aborts all future runs.
Please confirm via cpln image query --help that 0 means "unlimited" (no pagination cap). If the flag is meant to disable pagination and the correct value is something else (e.g. no flag, or a large number), update accordingly.
| docker tag "${source_image_ref}" "${production_image_ref}" && | ||
| docker push "${production_image_ref}"; then | ||
| if docker buildx imagetools inspect "${source_image_ref}" >/dev/null && | ||
| docker buildx imagetools create --prefer-index=false --tag "${production_image_ref}" "${source_image_ref}"; then |
There was a problem hiding this comment.
--prefer-index=false silently drops multi-platform manifests.
This flag resolves an image index to a single-platform manifest before copying. For a single-arch tutorial image this is fine, but if the build pipeline ever produces a multi-arch image, the production copy will silently lose the non-primary platforms.
Consider either removing the flag (let imagetools create preserve the original index) or adding an inline comment explaining why single-manifest semantics are intentional for this repo:
| docker buildx imagetools create --prefer-index=false --tag "${production_image_ref}" "${source_image_ref}"; then | |
| docker buildx imagetools create --tag "${production_image_ref}" "${source_image_ref}"; then |
If --prefer-index=false is needed to work around a specific cpln registry quirk, document that reason here.
|
|
||
| export DOCKER_CONFIG="${docker_config_dir}" | ||
|
|
||
| if ! printf '%s' "${CPLN_TOKEN_STAGING}" | | ||
| docker login "${staging_registry}" -u '<token>' --password-stdin >/dev/null; then |
There was a problem hiding this comment.
Auth tokens not refreshed inside retry loop.
Moving both docker login calls before the retry loop is correct for the common case (avoids redundant auth on every attempt). However, if COPY_IMAGE_RETRY_INTERVAL × COPY_IMAGE_RETRIES is large enough that a Control Plane token expires mid-sequence, the retry will fail with a 401 rather than a retriable copy error, and the error message won't mention auth.
For the current default intervals this is benign. A short comment here noting the assumption that tokens remain valid for the full retry window would help future maintainers who raise the retry interval:
| export DOCKER_CONFIG="${docker_config_dir}" | |
| if ! printf '%s' "${CPLN_TOKEN_STAGING}" | | |
| docker login "${staging_registry}" -u '<token>' --password-stdin >/dev/null; then | |
| # Authenticate once before the retry loop. Tokens are assumed valid for | |
| # the full retry window (COPY_IMAGE_RETRY_INTERVAL × COPY_IMAGE_RETRIES). | |
| # If token expiry becomes an issue, move logins inside the retry loop. | |
| if ! printf '%s' "${CPLN_TOKEN_STAGING}" | |
| missing_vars="$(comm -23 <(printf '%s\n' "${staging_vars}") <(printf '%s\n' "${production_vars}"))" | ||
| production_only_vars="$(comm -13 <(printf '%s\n' "${staging_vars}") <(printf '%s\n' "${production_vars}"))" | ||
| list_workload_env_names() { | ||
| local token="$1" |
There was a problem hiding this comment.
Add a warning comment that CPFLOW_VERSION must stay unset while on the SHA pin.
Gemfile now pins cpflow = 5.1.0, but all workflow uses: refs point to the pre-release commit 2d8225572.... If someone sets the CPFLOW_VERSION repository variable to 5.1.0, the setup action's version-mismatch guard will abort the promotion.
A short inline note here would make the constraint visible at the call site:
| local token="$1" | |
| cpflow_version: ${{ vars.CPFLOW_VERSION }} | |
| # NOTE: CPFLOW_VERSION must be unset (not '5.1.0') while wrappers are | |
| # pinned to the pre-release SHA. See .github/cpflow-help.md § Version Locking. | |
| # The setup action validates this value against control_plane_flow_ref below. | |
| control_plane_flow_ref: shakacode/control-plane-flow/.github/workflows/cpflow-promote-staging-to-production.yml@2d8225572edd6f54c83ba9c51bd2983546989e93 |
✅ Review App DeletedReview app for PR #760 is deleted |
Summary
control-plane-flowcommit2d8225572edd6f54c83ba9c51bd2983546989e93, which includes PRs Harden production promotion image copy and readiness control-plane-flow#356, Address production promotion review follow-ups control-plane-flow#359, and Warn when staging image lacks commit suffix control-plane-flow#360.status.readyLatesthealth gating, restored review-app security guidance, missing commit-suffix warnings, and temporary commit-SHA pinning.cpflowgem pin to5.1.0.The live production promotion already succeeded in https://github.com/shakacode/react-webpack-rails-tutorial/actions/runs/26854234803; this PR carries the hardened generated workflow/docs forward for future runs.
Validation
git diff --checkbin/conductor-exec ruby -e 'require "yaml"; Dir[".github/workflows/*.yml"].sort.each { |path| YAML.load_file(path); puts path }'bin/conductor-exec bundle checkbin/conductor-exec bundle exec rubocopbin/conductor-exec bin/test-cpflow-github-flow ruby /private/tmp/control-plane-flow-docs.XP4Efz/bin/cpflowbin/conductor-exec bundle exec ruby -e 'require "cpflow/version"; puts Cpflow::VERSION'bin/conductor-exec bundle exec rspecwas attempted, but local RSpec cannot start without a Postgres server listening at/tmp/.s.PGSQL.5432in this environment. GitHub RSpec CI covers the branch.Note
High Risk
Changes production promotion (env parity, image copy, readiness gates) and pins all deploy workflows to a specific upstream commit; misconfiguration could block or mis-run promotions.
Overview
This PR hardens production promotion and aligns generated cpflow automation with upstream control-plane-flow hardening before it ships as a release tag.
Workflow pins: All generated
cpflow-*reusable workflows and the production promotion checkout move fromv5.0.4to upstream commit2d8225572edd6f54c83ba9c51bd2983546989e93. The bundledcpflowgem is bumped to5.1.0.Promotion workflow (
cpflow-promote-staging-to-production):imagetools(registry login, collision check, safer tag/commit parsing) instead of pull/tag/push.status.readyandstatus.readyLatestbefore treating workloads as ready.Docs: Control Plane guides add
cpflow apply-templatefor production after bootstrap/template changes, clarify temporary commit-SHA pinning vsv5.1.0, and generalizegh secretexamples toOWNER/REPO.Reviewed by Cursor Bugbot for commit 0115b68. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
Documentation
Chores