Use native Control Plane image copy in production promotion#756
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughThis PR refactors the production promotion workflow: the "Copy image from staging" step now validates staging tags, computes an incremented production image tag, authenticates with a temporary cpln upstream profile, copies the manifest into production with retries, and outputs the computed production image for reporting. ChangesProduction Image Promotion Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🚀 Quick Review App CommandsWelcome! Here are the commands you can use in this PR:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52f535e099
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fi | ||
|
|
||
| latest_number="$( | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | |
There was a problem hiding this comment.
Query all images before assigning the next tag
When the production org has more than the CLI's default result window of matching images, this query can miss the current highest numbered tag and compute a duplicate or stale production_image. I checked the Control Plane CLI common options, and query commands inherit --max with a default of 50; since production image retention is not configured in this repo, long-lived production orgs can exceed that and make promotions fail during cpln image copy (or deploy an unexpected older sequence). Add an explicit unbounded/large max when deriving latest_number.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/cpflow-promote-staging-to-production.yml:
- Line 390: Remove the argv-based secret by dropping the --token
"${CPLN_TOKEN_STAGING}" argument from the cpln profile create invocation and
instead set the CPLN_TOKEN environment variable to CPLN_TOKEN_STAGING for that
step (e.g., export or prefix the command with CPLN_TOKEN=${CPLN_TOKEN_STAGING})
so cpln profile create reads the token from CPLN_TOKEN rather than the command
line.
- Line 588: The workflow currently sets DEPLOYED_IMAGE to
steps.copy-image.outputs.image unconditionally, which misreports deployed state
after a failed promotion/rollback; change the workflow so DEPLOYED_IMAGE is set
only on the successful promotion path (e.g., when the promote step/job indicates
success) and on failure set DEPLOYED_IMAGE to the restored PREVIOUS_IMAGE (or
omit the deployed-image output) instead; update the assignment that references
DEPLOYED_IMAGE and steps.copy-image.outputs.image so it is gated by the
promotion success condition or overwritten with PREVIOUS_IMAGE in the
rollback/failure branch (refer to DEPLOYED_IMAGE,
steps.copy-image.outputs.image, and PREVIOUS_IMAGE to locate and adjust the
logic).
- Around line 391-392: The manifest probe using
`CPLN_PROFILE="${upstream_profile}" docker manifest inspect
"${source_image_ref}"` is outside the retry loop so transient registry/auth
errors can abort promotion before the copy retries run; move or duplicate the
`docker manifest inspect` step into the same retry loop used for the copy (use
the same `CPLN_PROFILE="${upstream_profile}"` context and exit-code handling) so
the manifest check is retried alongside the preserved-copy operations, ensuring
transient failures are recovered by the existing retry logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1c888c9b-69c8-4ade-ae36-4e38152a6b64
📒 Files selected for processing (1)
.github/workflows/cpflow-promote-staging-to-production.yml
Greptile SummaryThis PR replaces
Confidence Score: 4/5Safe to merge; the cross-org credential separation is correctly structured and credential cleanup is handled by an EXIT trap. Two narrow edge cases with the --cleanup retry interaction and underscore-in-app-name extraction are unlikely to surface in normal operation. The credential model is sound: staging token goes through a temporary profile, production uses the default profile set up earlier, and the trap ensures cleanup regardless of outcome. The
Important Files Changed
Sequence DiagramsequenceDiagram
participant GHA as GitHub Actions
participant CPLN_S as cpln (staging profile)
participant DockerS as Docker (staging registry)
participant CPLN_P as cpln (production/default profile)
GHA->>CPLN_S: cpln image get staging image (verify exists)
GHA->>CPLN_P: cpln image query production (get latest number)
GHA->>CPLN_S: cpln profile create upstream-run-attempt
GHA->>CPLN_S: cpln image docker-login --org staging
CPLN_S-->>DockerS: writes Docker credentials
GHA->>DockerS: docker manifest inspect source_image_ref (pre-flight)
DockerS-->>GHA: manifest OK
loop Up to copy_image_attempts
GHA->>CPLN_S: cpln image copy STAGING_IMAGE --profile upstream --org staging
CPLN_S->>CPLN_P: copy image cross-org to production --to-name production_image
CPLN_P-->>GHA: success / failure
end
GHA->>CPLN_S: cpln profile delete upstream-run-attempt (trap EXIT)
GHA->>GHA: "echo image=production_image >> GITHUB_OUTPUT"
|
| staging_commit="${STAGING_IMAGE##*_}" | ||
| if [[ "${staging_commit}" == "${STAGING_IMAGE}" || -z "${staging_commit}" ]]; then | ||
| echo "::error::Staging image '${STAGING_IMAGE}' does not include the expected '_<commit>' suffix." | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
staging_commit extraction breaks for app names containing underscores
${STAGING_IMAGE##*_} strips the longest *_-prefixed match, meaning it operates on the entire string including the app-name portion before the colon. If the staging app name contains underscores (e.g., my_staging_app:3_abc1234), the extraction produces abc1234 (correct here), but for my_staging_app:latest it would produce app:latest — a value that passes the guard checks ("app:latest" != "my_staging_app:latest" and it is non-empty) but is an incorrect commit suffix embedded in production_image. CPLN app names conventionally use hyphens rather than underscores, so in practice this edge case is unlikely, but the extraction logic does not enforce this constraint.
| if cpln image copy "${STAGING_IMAGE}" \ | ||
| --profile "${upstream_profile}" \ | ||
| --org "${CPLN_ORG_STAGING}" \ | ||
| --to-profile default \ | ||
| --to-org "${CPLN_ORG_PRODUCTION}" \ | ||
| --to-name "${production_image}" \ | ||
| --cleanup; then |
There was a problem hiding this comment.
Risk: --cleanup in a retry loop can destroy the source before retries complete.
If the first copy attempt starts but fails mid-transfer, --cleanup may delete the source staging image. Every subsequent retry will then fail with a "source not found" error rather than a transient network error, defeating the retry logic entirely.
Consider removing --cleanup from the retry invocation and adding an explicit cleanup after a confirmed success:
| if cpln image copy "${STAGING_IMAGE}" \ | |
| --profile "${upstream_profile}" \ | |
| --org "${CPLN_ORG_STAGING}" \ | |
| --to-profile default \ | |
| --to-org "${CPLN_ORG_PRODUCTION}" \ | |
| --to-name "${production_image}" \ | |
| --cleanup; then | |
| if cpln image copy "${STAGING_IMAGE}" \ | |
| --profile "${upstream_profile}" \ | |
| --org "${CPLN_ORG_STAGING}" \ | |
| --to-profile default \ | |
| --to-org "${CPLN_ORG_PRODUCTION}" \ | |
| --to-name "${production_image}"; then | |
| copy_status=0 | |
| # Clean up source image only after confirmed success | |
| cpln image delete "${STAGING_IMAGE}" --org "${CPLN_ORG_STAGING}" --profile "${upstream_profile}" >/dev/null 2>&1 || true | |
| break |
If --cleanup means something other than deleting the source (e.g., removing temp layers only), a brief comment explaining its semantics would help reviewers.
| } | ||
| trap cleanup_upstream_profile EXIT | ||
|
|
||
| cpln profile create "${upstream_profile}" --token "${CPLN_TOKEN_STAGING}" >/dev/null |
There was a problem hiding this comment.
Potential failure if a stale profile exists from a crashed prior attempt.
cpln profile create will fail if a profile with this name already exists. If a runner is killed before the trap cleanup_upstream_profile EXIT fires (e.g., a SIGKILL, timeout, or infrastructure failure), the next re-run of the same attempt will hit RUN_ID + RUN_ATTEMPT collision.
Add a pre-create delete to make this idempotent:
| cpln profile create "${upstream_profile}" --token "${CPLN_TOKEN_STAGING}" >/dev/null | |
| cpln profile delete "${upstream_profile}" >/dev/null 2>&1 || true | |
| cpln profile create "${upstream_profile}" --token "${CPLN_TOKEN_STAGING}" >/dev/null |
| fi | ||
|
|
||
| latest_number="$( | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | |
There was a problem hiding this comment.
PRODUCTION_APP_NAME is interpolated directly into a regex filter.
The name~ operator likely interprets its value as a regular expression. If PRODUCTION_APP_NAME contains regex metacharacters (., +, [, etc.) the filter could match unintended images or cause a query error. Since you already have $prefix in the jq side for the exact startswith check, the server-side filter is just an optimistic pre-filter — consider using a literal prefix operator if the API supports one (e.g., name=^ or name~^${PRODUCTION_APP_NAME}:), or note that PRODUCTION_APP_NAME is expected to be alphanumeric-only.
Code ReviewOverviewThis PR replaces Positives
Issues (see inline comments for details)
Minor Notes
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 316555de33
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| cleanup_upstream_profile | ||
| CPLN_TOKEN="${CPLN_TOKEN_STAGING}" cpln profile create "${upstream_profile}" >/dev/null | ||
| CPLN_PROFILE="${upstream_profile}" cpln image docker-login --org "${CPLN_ORG_STAGING}" >/dev/null |
There was a problem hiding this comment.
Override CPLN_TOKEN when using the staging profile
In the normal promotion job, the earlier setup action leaves the production secret in CPLN_TOKEN for later steps, and Control Plane's auth docs say token precedence is --token, then CPLN_TOKEN, then the profile token (https://docs.controlplane.com/cli-reference/get-started/authentication#token-precedence). With separate staging/production credentials, setting only CPLN_PROFILE/--profile here still authenticates staging registry operations with the production token, so docker-login/image copy fails against CPLN_ORG_STAGING; unset CPLN_TOKEN or set it to CPLN_TOKEN_STAGING for the source-profile commands.
Useful? React with 👍 / 👎.
Code ReviewOverviewThis PR replaces the opaque Positives
IssuesMedium —
|
| for attempt in $(seq 1 "${copy_image_attempts}"); do | ||
| if cpflow copy-image-from-upstream -a "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}" --image "${STAGING_IMAGE}"; then | ||
| if CPLN_PROFILE="${upstream_profile}" docker manifest inspect "${source_image_ref}" >/dev/null && | ||
| cpln image copy "${STAGING_IMAGE}" \ |
There was a problem hiding this comment.
CPLN_PROFILE is not read by Docker — it's a cpln-specific env var. Docker uses the credential store already populated by cpln image docker-login above. The prefix is a no-op here and should be removed to avoid confusion.
| cpln image copy "${STAGING_IMAGE}" \ | |
| if docker manifest inspect "${source_image_ref}" >/dev/null && |
| for attempt in $(seq 1 "${copy_image_attempts}"); do | ||
| if cpflow copy-image-from-upstream -a "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}" --image "${STAGING_IMAGE}"; then | ||
| if CPLN_PROFILE="${upstream_profile}" docker manifest inspect "${source_image_ref}" >/dev/null && | ||
| cpln image copy "${STAGING_IMAGE}" \ |
There was a problem hiding this comment.
The docker manifest inspect check runs on every retry attempt, but it only validates Docker auth — not something a retry would fix. If Docker auth is broken, every attempt fails on the inspect before the actual cpln image copy even runs, burning all retries for an error that cannot be recovered by retrying.
Consider moving the auth check (and the docker-login) outside the loop so it fails fast on auth errors while retries are preserved for transient copy failures:
| for attempt in $(seq 1 "${copy_image_attempts}"); do | |
| if cpflow copy-image-from-upstream -a "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}" --image "${STAGING_IMAGE}"; then | |
| if CPLN_PROFILE="${upstream_profile}" docker manifest inspect "${source_image_ref}" >/dev/null && | |
| cpln image copy "${STAGING_IMAGE}" \ | |
| copy_status=1 | |
| if ! docker manifest inspect "${source_image_ref}" >/dev/null; then | |
| echo "::error::Cannot reach staging image '${source_image_ref}' via Docker; check Docker login credentials." | |
| exit 1 | |
| fi | |
| for attempt in $(seq 1 "${copy_image_attempts}"); do | |
| if cpln image copy "${STAGING_IMAGE}" \ |
| latest_number="$( | ||
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | ||
| jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \ | ||
| '[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0' | ||
| )" |
There was a problem hiding this comment.
If cpln image query fails (API outage, rate limit, unexpected response format), set -e aborts here but the runner log will only show the raw cpln/jq error — no GitHub Actions annotation and no context about what was being attempted. Consider wrapping with an explicit error message:
| latest_number="$( | |
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | |
| jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \ | |
| '[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0' | |
| )" | |
| if ! latest_number="$( | |
| cpln image query --org "${CPLN_ORG_PRODUCTION}" --prop "name~${PRODUCTION_APP_NAME}:" -o json | | |
| jq -r --arg prefix "${PRODUCTION_APP_NAME}:" \ | |
| '[.items[].name | select(startswith($prefix)) | (try capture("^[^:]+:(?<number>[0-9]+)") catch empty) | .number | tonumber] | max // 0' | |
| )"; then | |
| echo "::error::Failed to query production image list to compute next tag number." | |
| exit 1 | |
| fi |
| echo "✅ Status: deployment successful" | ||
| deployed_image="${COPIED_IMAGE}" | ||
| else | ||
| echo "❌ Status: deployment failed" |
There was a problem hiding this comment.
If the copy-image step exits before emitting its output (e.g., fails during the cpln image query), COPIED_IMAGE will be empty. In that scenario HEALTHY is also not "true", so deployed_image would be set to PREVIOUS_IMAGE via the else branch — which is correct. However, if there's ever a path where HEALTHY == "true" but the output wasn't set, the summary would show Deployed image: \`` (empty backticks). A defensive fallback is cheap:
| echo "❌ Status: deployment failed" | |
| deployed_image="${COPIED_IMAGE:-unknown}" |
✅ Review App DeletedReview app for PR #756 is deleted |
Summary
cpln image copyCPLN_TOKEN_STAGING, authenticate Docker to the staging registry, and inspect the source manifest before copyingWhy
Fresh runs after #754/#755 prove that
CPLN_TOKEN_PRODUCTIONis now visible and rollback works, butcpflow copy-image-from-upstreamstill fails opaquely during the staging registry pull. Control Plane docs recommendcpln image copywith--to-profilefor cross-org copies with different credentials, which matches this promotion flow and gives Actions a clearer auth/copy boundary.Observed failed runs:
Verification
git diff --check HEAD~1..HEADbin/conductor-exec bin/test-cpflow-github-flowNote
Medium Risk
Changes only the production promotion workflow but alters cross-org registry auth, image naming, and what operators see as the deployed tag; a mis-copy or tag collision could block or mislabel production releases.
Overview
The Copy image from staging step no longer calls
cpflow copy-image-from-upstream. It now builds a new production image name (next numeric tag underPRODUCTION_APP_NAME:plus the commit suffix from the staging tag), uses a throwaway stagingcplnprofile anddocker manifest inspecton the staging registry ref, then copies withcpln image copyinto the production org (--to-profile default). Retries and failure handling stay the same; the step exposescopy-image.outputs.imagefor downstream reporting.The Promotion summary reports the copied production tag when the health check passes, and the previous production image when promotion fails—instead of echoing the staging image name as “deployed.”
Reviewed by Cursor Bugbot for commit 316555d. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit