Skip to content

Fix production promotion environment secret access#754

Merged
justin808 merged 4 commits into
masterfrom
jg-codex/pass-production-env-secret
Jun 2, 2026
Merged

Fix production promotion environment secret access#754
justin808 merged 4 commits into
masterfrom
jg-codex/pass-production-env-secret

Conversation

@justin808

@justin808 justin808 commented Jun 2, 2026

Copy link
Copy Markdown
Member

Summary

  • replace the production promotion reusable workflow call with a caller-owned job declaring environment: production
  • check out pinned shakacode/control-plane-flow@v5.0.4 actions into .cpflow so the job can still reuse shared promotion logic
  • clarify docs and local helpers so CPLN_TOKEN_PRODUCTION stays environment-scoped and production ref pinning stays consistent

Related upstream generator PR: shakacode/control-plane-flow#353

Verification

  • git diff --check -- .github/workflows/cpflow-promote-staging-to-production.yml .github/cpflow-help.md .controlplane/shakacode-team.md .controlplane/readme.md bin/test-cpflow-github-flow bin/pin-cpflow-github-ref
  • bin/conductor-exec bin/test-cpflow-github-flow
  • temp pin-helper exercise rewrote both reusable workflow refs and the production .cpflow checkout/setup refs

Note

High Risk
Changes the live production deploy path, token scoping, and rollback behavior; misconfiguration could block promotion or leave production in a bad state during failed deploys.

Overview
Production promotion no longer delegates to the cross-repo control-plane-flow reusable workflow. The cpflow-promote-staging-to-production workflow now runs a caller-owned job with environment: production, checks out pinned v5.0.4 actions into .cpflow, and inlines the full promote pipeline (validation, staging→production image copy, deploy, health checks, multi-workload rollback, and a follow-up GitHub release job).

Docs (.controlplane/readme.md, shakacode-team.md, .github/cpflow-help.md) explain why CPLN_TOKEN_PRODUCTION must stay on the protected environment only (not repo/org secrets), warn against moving promotion back behind a reusable workflow, and add gh secret list troubleshooting. bin/pin-cpflow-github-ref and bin/test-cpflow-github-flow now pin and assert the production checkout/setup refs and forbid the old reusable-workflow promotion pattern.

Reviewed by Cursor Bugbot for commit 393b9c3. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • Documentation

    • Clarified production-promotion behavior and protection for the production environment token; expanded troubleshooting and secret-listing guidance; improved version-locking and wrapper guidance.
  • Refactor

    • Replaced delegated promotion with an integrated promote-to-production pipeline that validates preconditions, captures images, copies staging to production, deploys with health checks, attempts rollbacks on failure, and writes a promotion summary; enforces single concurrent promotion.
  • Tests / Chores

    • Added stricter workflow validation to ensure promotions run in the production environment and added tooling to pin promotion workflow references.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 00bdd472-85af-4d26-9d83-2bd1432e2ceb

📥 Commits

Reviewing files that changed from the base of the PR and between 68a4a6c and 393b9c3.

📒 Files selected for processing (1)
  • bin/test-cpflow-github-flow
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/test-cpflow-github-flow

Walkthrough

This PR inlines the promote-to-production job gated by environment: production (so CPLN_TOKEN_PRODUCTION is approval-gated), adds health checks and rollback, captures/copies/deploys images, and updates docs and pin/validation scripts to enforce and document the new flow.

Changes

Production Promotion Environment and Workflow

Layer / File(s) Summary
Documentation: Production environment and token guidance
.controlplane/readme.md, .controlplane/shakacode-team.md, .github/cpflow-help.md
Clarifies that CPLN_TOKEN_PRODUCTION must be stored exclusively in the protected production GitHub Environment (not repo/org secrets), documents approval-gated availability via environment: production, expands troubleshooting (including verifying the job declares environment: production), updates gh secret example commands, and documents version-locking/pinning guidance for control-plane-flow checkout refs.
Production promotion workflow: inline implementation with health checks and rollback
.github/workflows/cpflow-promote-staging-to-production.yml
Replaces reusable-workflow delegation with an inline promote-to-production job: adds dispatch/concurrency/config, sets read-only default permissions, validates tokens and required vars, captures staging/production context and images, copies staging image to production, deploys (optional release-phase), polls deployment health, performs best-effort rollback on failure with parallel readiness polling, writes a promotion summary, and creates a release on success.
Helper scripts: workflow pinning and production promotion validation
bin/pin-cpflow-github-ref, bin/test-cpflow-github-flow
Extends pin-cpflow-github-ref to pin the production workflow's control_plane_flow_ref and the checkout step ref; updates test-cpflow-github-flow to reject calling the promote workflow via uses, require environment: production, require the "Checkout control-plane-flow actions" step with pinned with.ref and with.repository, and record refs for consistency checks.

Sequence Diagram(s)

sequenceDiagram
  participant CallerJob
  participant SetupProductionEnvironment
  participant ContainerRegistry
  participant ProductionDeployment
  participant HealthCheckComposite
  participant CreateGitHubRelease

  CallerJob->>SetupProductionEnvironment: validate CPLN_TOKEN_PRODUCTION & required secrets
  SetupProductionEnvironment->>ContainerRegistry: capture production/staging images metadata
  CallerJob->>ContainerRegistry: copy staging image -> production
  CallerJob->>ProductionDeployment: deploy copied image (optional release-phase)
  ProductionDeployment->>HealthCheckComposite: request health status polls
  HealthCheckComposite-->>CallerJob: report success/failure
  CallerJob->>CreateGitHubRelease: create release when promotion succeeds
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰 A rabbit hops through workflows neat,

Tokens guarded where approvals meet;
Health checks hum and rollbacks tread light,
Pinned refs steady the pipeline's flight;
From staging to prod, the hop is right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix production promotion environment secret access' directly and clearly summarizes the main change: replacing cross-repo workflow calls with a caller-owned job to enable environment-scoped secret access.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jg-codex/pass-production-env-secret

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.controlplane/readme.md:
- Around line 96-107: The troubleshooting docs mention ruling out a same-named
repository or organization secret for CPLN_TOKEN_PRODUCTION but the example
commands only check the environment and repository scopes; update the
.controlplane/readme.md section referencing CPLN_TOKEN_PRODUCTION to add an
organization-level check (use the GitHub CLI org secret listing for the org)
alongside the existing environment and repository gh secret commands so
operators can detect an org secret that would mask a missing environment secret.

In @.github/workflows/cpflow-promote-staging-to-production.yml:
- Around line 14-35: The four hard-coded env vars (HEALTH_CHECK_RETRIES,
HEALTH_CHECK_INTERVAL, ROLLBACK_READINESS_RETRIES, ROLLBACK_READINESS_INTERVAL)
need to be wired to repository variables so overrides work; update each to read
from vars with a fallback default (e.g., change HEALTH_CHECK_RETRIES to use ${{
vars.HEALTH_CHECK_RETRIES || '24' }}, HEALTH_CHECK_INTERVAL to ${{
vars.HEALTH_CHECK_INTERVAL || '15' }}, and similarly for
ROLLBACK_READINESS_RETRIES and ROLLBACK_READINESS_INTERVAL) while keeping the
same numeric defaults and names so the workflow honors repo-level overrides.

In `@bin/test-cpflow-github-flow`:
- Around line 93-100: The current check only ensures a non-empty checkout ref
but doesn't verify the step actually uses actions/checkout or that it checks out
shakacode/control-plane-flow; update the validation around
checkout_step/checkout_ref to also read checkout_step["uses"] and
checkout_step.fetch("with", {})["repo"] and abort unless checkout_step["uses"]
includes "actions/checkout" and the repo equals "shakacode/control-plane-flow"
(or explicitly fail if repo is missing/incorrect), leaving refs[checkout_ref]
update only after those checks pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cefcc2db-590b-4374-a39c-fd81c48c2a12

📥 Commits

Reviewing files that changed from the base of the PR and between 80c8b0d and b7cdc73.

📒 Files selected for processing (6)
  • .controlplane/readme.md
  • .controlplane/shakacode-team.md
  • .github/cpflow-help.md
  • .github/workflows/cpflow-promote-staging-to-production.yml
  • bin/pin-cpflow-github-ref
  • bin/test-cpflow-github-flow

Comment thread .controlplane/readme.md
Comment thread .github/workflows/cpflow-promote-staging-to-production.yml Outdated
Comment thread bin/test-cpflow-github-flow
@greptile-apps

greptile-apps Bot commented Jun 2, 2026

Copy link
Copy Markdown

Greptile Summary

This PR solves a GitHub Actions limitation — environment secrets are not forwarded to called cross-repo reusable workflows — by replacing the single uses: delegation in the production promotion workflow with a fully inlined caller-owned job that declares environment: production directly. To preserve shared logic from the upstream control-plane-flow library, the job checks out that repo at the pinned tag (v5.0.4) into .cpflow and calls its composite actions locally.

  • The promotion job is substantially expanded inline: env parity check, rollback state capture, image copy, deployment, health-check wait, multi-workload rollback, and rollback-readiness polling. A separate create-github-release job carries the contents: write permission so the promotion job can stay at contents: read.
  • bin/pin-cpflow-github-ref gains two new gsub patterns to keep the production workflow's repository:/ref: checkout block and the control_plane_flow_ref setup input in sync when the upstream ref is bumped.
  • bin/test-cpflow-github-flow is updated to enforce the new structural invariants: no uses: key on the promotion job, environment: production declared, and a non-empty pinned ref on the Checkout control-plane-flow actions step.

Confidence Score: 4/5

The core architectural change is sound — environment secrets are now properly gated by the environment: production declaration on the caller job. The rollback and health-check logic is thorough and the permission scoping is correct.

The promotion workflow itself is well-structured: the environment gate is in the right place, rollback captures state before deployment, and the create-github-release job carries its own elevated permission. The two helper-script findings are non-blocking but worth tracking: the multi-line regex in pin-cpflow-github-ref can silently skip updating the production checkout ref if YAML formatting changes, and test-cpflow-github-flow will throw an opaque Ruby exception rather than a clean abort if the promotion job is ever renamed.

bin/pin-cpflow-github-ref (fragile multi-line regex) and bin/test-cpflow-github-flow (bare Hash#fetch on the job key) deserve a second look before merging.

Important Files Changed

Filename Overview
.github/workflows/cpflow-promote-staging-to-production.yml Major expansion: replaces reusable workflow call with an inline caller-owned job declaring environment: production. Adds rollback, health check, env parity check, and a separate create-github-release job with scoped contents: write permission.
bin/pin-cpflow-github-ref Adds two new gsub patterns for updating the production workflow checkout ref and setup input; the multi-line repository:…ref: regex silently skips the file if those YAML keys are not on consecutive lines.
bin/test-cpflow-github-flow Adds validation for the new inline job pattern; uses bare Hash#fetch on the job key which raises an opaque Ruby exception rather than a clean abort if the job is renamed.
.controlplane/readme.md Documentation updated to reflect the caller-owned job pattern and adds guidance on keeping CPLN_TOKEN_PRODUCTION absent from repo/org secrets.
.controlplane/shakacode-team.md Documentation updated consistently with readme.md.
.github/cpflow-help.md Help docs updated to explain why production promotion must stay as a caller-owned job and clarifies the version-locking strategy.

Reviews (1): Last reviewed commit: "Fix production promotion environment sec..." | Re-trigger Greptile

Comment thread bin/pin-cpflow-github-ref Outdated
text = File.read(path)
updated = text
.gsub(%r{(uses:\s+shakacode/control-plane-flow/\.github/workflows/[^@\s]+@)[^\s]+}, "\\1#{ref}")
.gsub(%r{(repository:\s+shakacode/control-plane-flow\n\s+ref:\s+)[^\s]+}, "\\1#{ref}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fragile multi-line regex — silent miss if lines are not adjacent

The pattern (repository:\s+shakacode/control-plane-flow\n\s+ref:\s+)[^\s]+ requires ref: to appear on the very next line after repository: shakacode/control-plane-flow with no intervening comment or blank line. If anyone inserts a comment (e.g. # pinned) between those two YAML keys, the regex will not match, the file will be silently skipped, and pin-cpflow-github-ref will report success without updating the production checkout ref. The only detection is running test-cpflow-github-flow afterward and noticing the refs-diverged abort.

Comment thread bin/test-cpflow-github-flow Outdated

promote_path = ".github/workflows/cpflow-promote-staging-to-production.yml"
promote_doc = YAML.load_file(promote_path, aliases: true)
promote_job = promote_doc.fetch("jobs", {}).fetch("promote-to-production")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Bare Hash#fetch raises an opaque Ruby exception if the job is renamed

promote_doc.fetch("jobs", {}).fetch("promote-to-production") (without a default or block) raises KeyError: key not found: promote-to-production instead of a clean abort message. If the job is ever renamed the test will still fail, but the error gives no hint that the script needs to be updated to match the new job name. Using fetch("promote-to-production") { abort "..." } or || {} with a follow-up check would produce a developer-friendly message consistent with the rest of the script.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +370 to +371
- name: Roll back on failure
if: failure() && steps.capture-current.outputs.rollback_state != '' && steps.capture-current.outputs.rollback_state != '{}'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 "Wait for rollback readiness" runs even when rollback itself failed

Both if: conditions are identical — failure() && rollback_state != '' && rollback_state != '{}' — so "Wait for rollback readiness" runs whenever any earlier step failed, including when "Roll back on failure" itself exits 1. Because the rollback step has no id, there is no way to gate the readiness poll on rollback success. The step will poll all workloads regardless of whether their images were actually reverted, and may emit misleading "ready" status for workloads that still carry the broken image. Adding an id: rollback to the step and including steps.rollback.outcome == 'success' in the condition would make the intent explicit.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

🚀 Quick Review App Commands

Welcome! Here are the commands you can use in this PR:
They require the repository to have cpflow review apps configured, including the CPLN_TOKEN_STAGING secret.

+review-app-deploy

Deploy your PR branch for testing.

+review-app-delete

Remove the review app when done.

+review-app-help

Show detailed instructions, environment setup, and configuration options.

Comment +review-app-help for full setup details.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review: Fix production promotion environment secret access

Overview

This PR correctly addresses a fundamental GitHub Actions limitation: environment secrets from a caller's protected environment are not passed through to cross-repo reusable workflows. The solution — inlining the production promotion job with environment: production declared directly — is the right architectural fix. The concurrency: cancel-in-progress: false guard against mid-promotion cancellations is also well considered.

Security ✅

  • actions/checkout pinned by commit SHA (de0fac2e...) — good supply-chain hygiene
  • persist-credentials: false on both checkouts — correct
  • Secrets injected via env vars rather than CLI flags (no /proc/<pid>/cmdline leakage)
  • CPLN_TOKEN_PRODUCTION environment secret is validated before proceeding — good defense-in-depth

Issues Found

Bug — null image captured into rollback state (line 248 of workflow)

workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image')"

No null guard here. If a workload has no containers (e.g., a job-only workload), jq outputs the literal string "null" and the rollback state records it as the image. The staging capture step (line 302) already uses // empty — this should match.

Suspicious control_plane_flow_ref value (workflow line 132)
The value passed is a full workflow path: shakacode/control-plane-flow/.github/workflows/cpflow-promote-staging-to-production.yml@v5.0.4. This looks like a copy-carry from the deleted uses: ref. If cpflow-setup-environment expects a simple tag or owner/repo@ref format, this could be silently wrong or produce confusing error messages. Worth verifying the action's expected input format.

Misleading summary when deploy fails before health-check (workflow line 513)
If deploy-image fails, steps.health-check.outputs.healthy is empty (the step was skipped). The summary then falls through to the else branch and prints ❌ Status: deployment failed — accidentally correct, but only by coincidence. A skipped health-check and an explicit false are indistinguishable here.

Redundant env var in create-github-release (workflow line 535)
GITHUB_RUN_ID is already injected by GitHub Actions as a default environment variable in every step. Explicitly re-declaring it via env: is harmless but adds noise.

Fragile regex in bin/pin-cpflow-github-ref (line 62)
The regex (repository:\s+shakacode/control-plane-flow\n\s+ref:\s+)[^\s]+ assumes ref: is the line immediately after repository:. If the with: block fields are ever reordered (e.g., path: above ref:), the pattern silently fails and leaves the ref unpinned with no warning.

Minor Nits

  • The Roll back on failure step intentionally omits set -e to allow aggregate-and-continue semantics, but only says set -uo pipefail. Future maintainers may add -e back inadvertently — a short comment would help.
  • The release tag uses the runner's local time. date -u (UTC) would make timestamps unambiguous across runner locales.

Summary

The core approach is architecturally sound and the secret-scoping fix is correct. Two items worth addressing before merge: the missing // empty null guard on the production rollback image capture (inconsistent with the staging step), and verification of the control_plane_flow_ref expected format in cpflow-setup-environment.

[[ -n "${workload_name}" ]] || continue

workload_json="$(cpln workload get "${workload_name}" --gvc "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}" -o json)"
workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image')"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing null guard — unlike the staging capture step (line 302) which uses // empty, this will output the literal string "null" if a workload has no containers. That string then propagates into the rollback state and the subsequent [[ -z "${selected_image}" ]] check won't catch it, leaving a stale "null" image as the rollback target.

Suggested change
workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image')"
workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image // empty')"

working_directory: .cpflow
cpln_cli_version: ${{ vars.CPLN_CLI_VERSION }}
cpflow_version: ${{ vars.CPFLOW_VERSION }}
control_plane_flow_ref: shakacode/control-plane-flow/.github/workflows/cpflow-promote-staging-to-production.yml@v5.0.4

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a carry-over from the deleted uses: ref. The value is a full workflow path (…/cpflow-promote-staging-to-production.yml@v5.0.4) rather than a simple tag or owner/repo@ref. If cpflow-setup-environment expects either format for version validation/logging, this should be verified against that action's action.yml — passing an unexpected format could be silently ignored or cause a confusing error if the action tries to parse it.

{
echo "## Promotion Summary"
echo
if [[ "${HEALTHY}" == "true" ]]; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When deploy-image fails before the health-check step runs, steps.health-check.outputs.healthy is empty (the step was skipped). The else branch fires and prints ❌ Status: deployment failed — accidentally correct, but a skipped step and an explicit false result are indistinguishable here. Consider making the three states explicit:

Suggested change
if [[ "${HEALTHY}" == "true" ]]; then
if [[ "${HEALTHY}" == "true" ]]; then
echo "✅ Status: deployment successful"
elif [[ "${HEALTHY}" == "false" ]]; then
echo "❌ Status: health check failed"
else
echo "❌ Status: deployment failed before health check"
fi

env:
GH_REPO: ${{ github.repository }}
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_ID: ${{ github.run_id }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GITHUB_RUN_ID is already a default environment variable injected by GitHub Actions into every step — no need to re-declare it.

Suggested change
GITHUB_RUN_ID: ${{ github.run_id }}
GH_REPO: ${{ github.repository }}
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
STAGING_APP_NAME: ${{ needs.promote-to-production.outputs.staging_app_name }}
PRODUCTION_APP_NAME: ${{ needs.promote-to-production.outputs.production_app_name }}

Comment thread bin/pin-cpflow-github-ref Outdated
text = File.read(path)
updated = text
.gsub(%r{(uses:\s+shakacode/control-plane-flow/\.github/workflows/[^@\s]+@)[^\s]+}, "\\1#{ref}")
.gsub(%r{(repository:\s+shakacode/control-plane-flow\n\s+ref:\s+)[^\s]+}, "\\1#{ref}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex assumes ref: is the line immediately after repository: in the with: block. If the fields are ever reordered (e.g., path: moved above ref: in the checkout step), the pattern silently produces no match and the production checkout ref is left at the old value with no warning. Consider adding a post-update verification — e.g., grep the file for the old ref after the substitution loop and abort if any remain.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7a1005c20

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

abort "#{promote_path}:promote-to-production must pin the Checkout control-plane-flow actions step"
end

refs[checkout_ref] << "#{promote_path}:promote-to-production"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check the setup action ref too

When the production workflow is repinned, this consistency check only adds the checkout step's ref to refs, but the workflow also has a second pin in cpflow-setup-environment's control_plane_flow_ref. If that setup ref is left on an older tag while the checkout/ref wrappers are updated, bin/test-cpflow-github-flow still reports a single ref even though production promotion can later fail in repos that set CPFLOW_VERSION, because the setup action validates that input against the checked-out control-plane-flow tag. Please include the setup action's control_plane_flow_ref in the same ref set.

Useful? React with 👍 / 👎.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review: Fix production promotion environment secret access

Overview

Well-reasoned architectural change. Because GitHub does not expose caller-repo environment secrets to cross-repo reusable workflows, production promotion has to run as a direct job in this repo. The PR correctly implements this with appropriate security defaults, parallel rollback polling, and variable-scoped permissions.

Strengths

  • environment: production at the job level — the core fix, correctly applied
  • persist-credentials: false on both checkouts — good hygiene
  • cancel-in-progress: false on the concurrency group — essential; a half-rolled deployment is worse than a queued one
  • CPLN_TOKEN_PRODUCTION and the upstream staging token passed via env: not CLI args — prevents exposure in /proc/<pid>/cmdline
  • Randomized heredoc delimiter (EOF_$(openssl rand -hex 8)) in the rollback state output — guards against injection from a workload name or image ref containing a literal "EOF" line
  • Parallel rollback readiness polling with wait aggregation — correct; worst-case wall time during an incident is retries × interval, not n_workloads × retries × interval
  • set -uo pipefail (intentionally without -e) in the rollback step — documented and correct; a single cpln hiccup shouldn't abort the loop before other workloads are attempted
  • contents: write scoped to the create-github-release job rather than global — correct least-privilege split

Issues

1. Ruby warn outputs to stderr — GHA annotations only parse stdout (Medium)

Lines 157–158 use Ruby's warn, which writes to stderr. GitHub Actions only parses ::error:: / ::warning:: annotations from stdout. When an app is absent from controlplane.yml, the failure message won't surface as a job annotation in the Actions UI — operators would need to scan raw logs to find it.

See inline comment on line 157.

2. "Wait for rollback readiness" shares the rollback step's if: condition (Low)

Both the rollback step and the readiness-wait step have the same condition. If rollback exits 1 early (e.g., the JSON state can't be parsed — line 389), the readiness-wait step still runs and polls workloads that were never actually rolled back. Adding an id: to the rollback step and including steps.<id>.outcome != 'failure' in the wait condition would make the intent explicit and avoid misleading output during a production incident.

See inline comment on line 441.

3. Hardcoded actions/checkout SHA in test script (Low)

EXPECTED_CPFLOW_CHECKOUT_ACTION in bin/test-cpflow-github-flow is a hardcoded constant that bin/pin-cpflow-github-ref does not update (it only handles shakacode/control-plane-flow refs). When the workflow's actions/checkout SHA is bumped for a security patch, this test constant will also need a manual update — without any tooling prompting that.

See inline comment on line 47 of bin/test-cpflow-github-flow.

Comment on lines +157 to +158
warn "Error: app '#{app}' is not defined under `apps:` in `.controlplane/controlplane.yml`."
warn " Fix the PRODUCTION_APP_NAME repository variable or add the app to controlplane.yml."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub Actions only parses ::error:: and ::warning:: annotations from stdout. Ruby's warn writes to stderr, so this message won't appear as a job annotation in the Actions UI when PRODUCTION_APP_NAME refers to an undefined app.

Suggested change
warn "Error: app '#{app}' is not defined under `apps:` in `.controlplane/controlplane.yml`."
warn " Fix the PRODUCTION_APP_NAME repository variable or add the app to controlplane.yml."
$stdout.puts "::error::app '#{app}' is not defined under `apps:` in `.controlplane/controlplane.yml`. Fix the PRODUCTION_APP_NAME repository variable or add the app to controlplane.yml."

One $stdout.puts with an ::error:: prefix covers both the annotation and the explanation, and keeps the pattern consistent with the puts "::error::..." calls below.

Comment on lines +441 to +442
- name: Wait for rollback readiness
if: failure() && steps.capture-current.outputs.rollback_state != '' && steps.capture-current.outputs.rollback_state != '{}'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step shares the same if: condition as the rollback step above. When the rollback step exits 1 early — for example because the JSON state couldn't be parsed (line 389) — this step still runs and polls workloads that were never actually rolled back, producing misleading output during a production incident.

Consider adding an id: rollback to the "Roll back on failure" step and extending this condition:

      - name: Wait for rollback readiness
        if: failure() && steps.rollback.outcome != 'failure' && steps.capture-current.outputs.rollback_state != '' && steps.capture-current.outputs.rollback_state != '{}'

That way readiness polling only runs when the rollback step actually attempted to update workloads.


CONTROL_PLANE_FLOW_WORKFLOW = %r{\Ashakacode/control-plane-flow/\.github/workflows/[^@\s]+@([^\s]+)\z}
PROMOTE_WORKFLOW = %r{\Ashakacode/control-plane-flow/\.github/workflows/cpflow-promote-staging-to-production\.yml@[^\s]+\z}
EXPECTED_CPFLOW_CHECKOUT_ACTION = "actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SHA is hardcoded and won't be updated by bin/pin-cpflow-github-ref, which only targets shakacode/control-plane-flow refs. When the workflow's actions/checkout SHA is bumped (e.g. for a security patch), this constant will also need a manual update — and there's nothing in the tooling to prompt that.

Consider either:

  • Adding a comment here noting it must be kept in sync with the uses: line in the workflow manually, or
  • Extending bin/pin-cpflow-github-ref to accept an --actions-checkout-sha option that updates both the workflow YAML and this constant.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

Reviewed by Cursor Bugbot for commit a7a1005. Configure here.

Comment thread bin/test-cpflow-github-flow

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bin/test-cpflow-github-flow`:
- Around line 129-134: The abort message when setup_ref fails PROMOTE_WORKFLOW
should be more actionable: update the abort call in the block that reads
setup_ref = setup_step.fetch("with", {})["control_plane_flow_ref"] (and uses
PROMOTE_WORKFLOW and promote_path) to include the expected fully-qualified
workflow ref format (for example:
owner/repo/.github/workflows/<workflow>.yml@<version>) so the error clearly
shows a concrete example of the required .../.github/workflows/...yml@<version>
pinned ref.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6df16c81-86ed-4297-bf04-1d91e13e5a79

📥 Commits

Reviewing files that changed from the base of the PR and between a7a1005 and 68a4a6c.

📒 Files selected for processing (2)
  • .github/workflows/cpflow-promote-staging-to-production.yml
  • bin/test-cpflow-github-flow
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/cpflow-promote-staging-to-production.yml

Comment thread bin/test-cpflow-github-flow
@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR correctly addresses a real GitHub Actions constraint: environment secrets are not exposed to cross-repo reusable workflows, so CPLN_TOKEN_PRODUCTION couldn't reach the upstream promotion workflow. Moving to a caller-owned job with environment: production is the right fix.

Strengths

  • Least-privilege permissions: dropping to contents: read globally and scoping contents: write only to the release job is a security improvement.
  • Pinned action SHAs: both actions/checkout calls use a full commit SHA, which prevents tag-mutable supply chain attacks.
  • Concurrency control: cancel-in-progress: false is the correct choice — cancelling a half-finished promotion is worse than letting it complete.
  • Best-effort rollback: the rollback aggregates failures across all workloads rather than aborting on the first one, which prevents a partial-rollback state.
  • Token not passed on command line: CPLN_UPSTREAM_TOKEN via env in the copy step avoids the token appearing in /proc/<pid>/cmdline.
  • Random heredoc delimiter for rollback_state output prevents injection through data returned by the Control Plane API.

Issues

1. GITHUB_OUTPUT injection risk — staging image output

echo "image=${staging_image}" (line 334 in the workflow) writes an API-sourced value directly to GITHUB_OUTPUT without a random-delimiter heredoc. Unlike rollback_state, this path uses echo, so a newline in the image ref (however unlikely from a well-formed registry) could inject additional output variables or truncate the key. The rollback_state capture already uses the correct pattern — apply the same to staging_image, current_image, and current_version.

2. Hardcoded actions/checkout SHA in the test script

EXPECTED_CPFLOW_CHECKOUT_ACTION in bin/test-cpflow-github-flow is hardcoded to a specific SHA. When the workflow is updated to a newer actions/checkout release the constant must be updated manually and in sync. The test should derive this value from the workflow file itself rather than duplicating it:

checkout_action = Array(promote_job["steps"]).find { |s| s["name"] == "Checkout control-plane-flow actions" }&.fetch("uses", nil)
# then compare checkout_step["uses"] == checkout_action rather than a constant

Or at minimum add a comment directing maintainers to update this constant when pin-cpflow-github-ref is run.

3. Mixed stdout/stderr in Ruby error messages

In the Resolve production app workloads Ruby heredoc (lines 172–173), the workflow annotation (::error::) is emitted with puts (stdout) but the hint continuation is emitted with warn (stderr). GitHub Actions only processes workflow commands from stdout, so the hint line will appear in raw runner logs but not as part of the annotation. Either use puts for both, or consolidate into a single puts "::error::..." with the hint included.

4. Promotion summary is misleading on early failures

The Promotion summary step runs with if: always() and evaluates steps.health-check.outputs.healthy. If the job fails before the health-check step runs (e.g., during image copy or deploy), HEALTHY is empty, the [[ "${HEALTHY}" == "true" ]] branch falls through, and the summary reports "❌ Status: deployment failed" — accurate but vague. A quick improvement: also echo which step failed, or distinguish "deployment never reached health check" from "deployment reached health check but failed":

if [[ "${HEALTHY}" == "true" ]]; then
  echo "✅ Status: deployment successful"
elif [[ -z "${HEALTHY}" ]]; then
  echo "❌ Status: deployment failed before health check"
else
  echo "❌ Status: deployment failed health check"
fi

Minor Notes

  • The env-var parity check compares variable names only. Values aren't compared (they're secrets), which is intentional but worth a comment so future maintainers don't wonder if it's an oversight.
  • The rollback restores container images but not GVC-level config (env vars, etc.). If a deploy ever changes GVC config, rollback will be incomplete. Worth a comment in the rollback step noting this scope.
  • timeout-minutes: 45 is generous relative to the configured defaults (≤10 min health check + rollback window). Not harmful, but a comment explaining the headroom would help.

Verdict

The architectural change is sound and the implementation handles the failure path more carefully than the reusable workflow it replaces. Address the GITHUB_OUTPUT injection risk (issue 1) and the hardcoded SHA fragility (issue 2) before merging; the other items are improvements but not blockers.

CONTROL_PLANE_FLOW_WORKFLOW = %r{\Ashakacode/control-plane-flow/\.github/workflows/[^@\s]+@([^\s]+)\z}
PROMOTE_WORKFLOW = %r{\Ashakacode/control-plane-flow/\.github/workflows/cpflow-promote-staging-to-production\.yml@[^\s]+\z}
PROMOTE_WORKFLOW = %r{\Ashakacode/control-plane-flow/\.github/workflows/cpflow-promote-staging-to-production\.yml@([^\s]+)\z}
EXPECTED_CPFLOW_CHECKOUT_ACTION = "actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SHA must stay manually in sync with the actions/checkout ref in the workflow file — if pin-cpflow-github-ref (or a manual edit) updates the checkout SHA in the workflow, this constant needs a matching update or the test will fail with a confusing mismatch message.

Consider deriving the expected value from the workflow file itself so there's a single source of truth:

Suggested change
EXPECTED_CPFLOW_CHECKOUT_ACTION = "actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd"
# Read from the workflow file so this stays in sync automatically.
EXPECTED_CPFLOW_CHECKOUT_ACTION = begin
require "yaml"
promote = YAML.load_file(".github/workflows/cpflow-promote-staging-to-production.yml", aliases: true)
step = Array(promote.dig("jobs", "promote-to-production", "steps")).find { |s| s["name"] == "Checkout control-plane-flow actions" }
step&.fetch("uses") or abort "Could not find Checkout control-plane-flow actions step in promotion workflow"
end

Or at minimum add an inline comment: # Keep in sync with the checkout SHA in cpflow-promote-staging-to-production.yml.

staging_image="${staging_image_ref}"
fi

echo "image=${staging_image}" >> "$GITHUB_OUTPUT"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

staging_image is derived from a CPLN API response via jq -r, which strips JSON quoting but preserves any embedded newlines in the value. Writing it with a bare echo means a newline in the image ref could inject additional key=value pairs into GITHUB_OUTPUT or silently truncate the value.

The rollback_state capture a few steps earlier already handles this correctly with a randomized heredoc delimiter. Apply the same pattern here for consistency and safety:

Suggested change
echo "image=${staging_image}" >> "$GITHUB_OUTPUT"
delim="EOF_$(openssl rand -hex 8)"
{
echo "image<<${delim}"
echo "${staging_image}"
echo "${delim}"
} >> "$GITHUB_OUTPUT"

Same applies to current_image and current_version in the Capture current production image step (lines 279–280).

Comment on lines +172 to +173
puts "::error::PRIMARY_WORKLOAD is not configured and app '#{app}' has multiple workloads: #{workloads.join(', ')}."
warn " Set the PRIMARY_WORKLOAD repository variable to one of these workloads."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub Actions workflow commands (::error::) are processed from stdout, but warn writes to stderr. The hint on line 173 will appear in the raw runner log but won't be attached to the annotation — operators checking the Actions UI will only see the first line.

Consolidate into a single puts call so the full message is captured as one annotation:

Suggested change
puts "::error::PRIMARY_WORKLOAD is not configured and app '#{app}' has multiple workloads: #{workloads.join(', ')}."
warn " Set the PRIMARY_WORKLOAD repository variable to one of these workloads."
puts "::error::PRIMARY_WORKLOAD is not configured and app '#{app}' has multiple workloads: #{workloads.join(', ')}. Set the PRIMARY_WORKLOAD repository variable to one of these workloads."

# via the HEALTH_CHECK_ACCEPTED_STATUSES repo variable to tighten this for apps that
# expose a dedicated health endpoint (e.g. "200" for a plain /health, or "200 401 403"
# for apps that auth-gate / without redirecting).
HEALTH_CHECK_ACCEPTED_STATUSES: ${{ vars.HEALTH_CHECK_ACCEPTED_STATUSES || '200 301 302' }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 200 301 302 default is well-explained in the comment, but a redirect from a maintenance page or an auth-gate passes this check even when the app is genuinely broken. The recommendation to expose /health is buried in a comment — consider making this more visible by emitting a ::warning:: at runtime when the default value is in use:

if [[ "${HEALTH_CHECK_ACCEPTED_STATUSES}" == "200 301 302" ]]; then
  echo "::warning::Using default HEALTH_CHECK_ACCEPTED_STATUSES (200 301 302). Consider exposing a /health endpoint and setting HEALTH_CHECK_ACCEPTED_STATUSES=200 in repository variables."
fi

This surfaces the guidance precisely when it matters rather than requiring operators to read the workflow source.

run: |
set -euo pipefail

staging_vars="$(CPLN_TOKEN="${CPLN_TOKEN_STAGING}" cpln gvc get "${STAGING_APP_NAME}" --org "${CPLN_ORG_STAGING}" -o json | jq -r '.spec.env // [] | .[].name' | sort)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parity check only compares GVC-level environment variables (.spec.env on the GVC object). Workload-level env vars configured directly on individual workloads — rather than the GVC — are not captured by cpln gvc get. Teams that configure env vars at the workload level would get a false "all clear" here. A comment noting this scope limitation would prevent future confusion:

# Note: this check covers GVC-level env vars only. Workload-level env vars
# (set on individual workloads rather than the GVC) are not compared here.

Comment on lines +326 to +332
if [[ "${staging_image_ref}" == /org/*/image/* ]]; then
staging_image="${staging_image_ref##*/image/}"
elif [[ "${staging_image_ref}" == *.registry.cpln.io/* ]]; then
staging_image="${staging_image_ref#*.registry.cpln.io/}"
else
staging_image="${staging_image_ref}"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else fallback silently passes the raw reference through as-is. If the image format is neither /org/*/image/* nor *.registry.cpln.io/*, cpflow copy-image-from-upstream --image will receive the full reference string, which may fail with an opaque error rather than a clear diagnostic. Consider adding an explicit warning:

if [[ "${staging_image_ref}" == /org/*/image/* ]]; then
  staging_image="${staging_image_ref##*/image/}"
elif [[ "${staging_image_ref}" == *.registry.cpln.io/* ]]; then
  staging_image="${staging_image_ref#*.registry.cpln.io/}"
else
  echo "::warning::Unrecognized image reference format '${staging_image_ref}'; passing through as-is."
  staging_image="${staging_image_ref}"
fi

create-github-release:
needs: promote-to-production
if: needs.promote-to-production.result == 'success'
runs-on: ubuntu-latest

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The create-github-release job has no timeout-minutes. The default GitHub Actions timeout is 6 hours, so a hung gh release create (e.g. transient API outage) would block the workflow slot for hours after the production deploy has already succeeded. A short timeout is appropriate here:

  create-github-release:
    needs: promote-to-production
    if: needs.promote-to-production.result == 'success'
    runs-on: ubuntu-latest
    timeout-minutes: 5
    permissions:

status_name="${workload_name//\//_}"
status_file="${status_dir}/${status_name}"
if [[ ! -f "${status_file}" ]] || [[ "$(<"${status_file}")" != "ready" ]]; then
echo "::warning::Workload '${workload_name}' did not report ready after rollback."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "wait for rollback readiness" step emits only ::warning:: for workloads that never become ready — it does not exit non-zero. Since this step already executes in the failure() path, the overall job outcome is already a failure, so the warning won't silently hide a broken rollback. However, an operator scanning the step list might see a green checkmark on this step and believe the rollback completed cleanly. Consider adding a comment explaining the intent, or tracking a failure count and exiting non-zero so the step icon itself reflects the unready state.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review — PR #754: Fix production promotion environment secret access

Overview

This PR addresses a real and important GitHub Actions limitation: environment secrets set on a protected environment: are only exposed to the job that directly declares that environment — they are not propagated to cross-repo reusable workflows called by that job. The previous approach (passing production_environment: production to the upstream reusable workflow) relied on behavior that may not have worked as expected. Inlining the promotion job as a caller-owned step with environment: production is the correct fix.

The 550-line net addition is large but warranted: the inline job replicates and extends the upstream workflow's full promotion pipeline with local control.


Security ✅

  • persist-credentials: false on both actions/checkout calls — good practice.
  • Production token passed via env var, never interpolated into shell command arguments (see Copy image from staging).
  • contents: write is scoped only to the create-github-release job; the main job uses contents: read.
  • cancel-in-progress: false on the concurrency group is the right call — aborting a half-finished deploy-image is more dangerous than queuing.
  • The explicit check for CPLN_TOKEN_PRODUCTION emptiness before proceeding is a good early-exit guard.
  • The docs correctly warn against storing CPLN_TOKEN_PRODUCTION as a repo/org secret, and explain why (a broader secret with the same name masks a missing environment secret with no error).

Code Quality ✅

  • Rollback state uses a randomized heredoc delimiter (EOF_$(openssl rand -hex 8)) to prevent premature GITHUB_OUTPUT termination — this is the correct approach.
  • The parallel rollback-readiness polling with background subshells + status files is well-structured and correctly uses || true on wait to prevent one failing workload from aborting the parent.
  • YAML.safe_load(..., aliases: true) in the inline Ruby script is the right choice — safe_load prevents YAML code execution while aliases: true supports YAML anchors. The comment noting Ruby >= 3.1 is needed explains the ordering dependency clearly.
  • bin/test-cpflow-github-flow now forbids the old reusable-workflow promotion pattern outright, preventing regression.
  • bin/pin-cpflow-github-ref correctly updates all three ref locations (reusable-workflow uses:, production checkout ref:, and setup control_plane_flow_ref:).

Issues (see inline comments)

Severity Location Issue
Low HEALTH_CHECK_ACCEPTED_STATUSES (line 33) Default 200 301 302 can pass even when the app is broken — "strongly recommended" advice is easy to miss
Low Verify production environment variables (line 207) Check only covers GVC-level env vars; workload-level vars are not compared
Low Image ref parsing else fallback (lines 326–332) Unrecognized format passes through silently — could produce opaque downstream error
Low create-github-release job (line 537) Missing timeout-minutes; default is 6 hours
Nit Wait for rollback readiness (line 507) Step exits cleanly even if workloads remain unready — ::warning:: only, no step-level failure signal

None of these are blockers, but the create-github-release timeout and the image-ref fallback warning are straightforward to add.


What's Well Done

  • The architectural decision (caller-owned job, not cross-repo delegation) is the right call and is clearly documented with the why.
  • Docs in three separate files consistently explain the constraint and the anti-pattern to avoid.
  • The rollback implementation is robust: it validates that container sets haven't changed before patching images, aggregates failures without aborting early, and handles the multi-workload case correctly.
  • The GITHUB_STEP_SUMMARY promotion summary gives operators immediate context without digging through logs.

Overall this is a high-quality, well-reasoned change. The inline comments above are suggestions for hardening, not blocking issues.

@justin808 justin808 merged commit 0358b99 into master Jun 2, 2026
18 checks passed
@justin808 justin808 deleted the jg-codex/pass-production-env-secret branch June 2, 2026 05:35
@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

✅ Review App Deleted

Review app for PR #754 is deleted

🎮 Control Plane Console
📋 View Workflow Logs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant