feat: migrate app pipelines to release-resource based workflow #4506
blarghmatey wants to merge 5 commits
Pull request overview
Migrates the k8s app docker+Pulumi Concourse pipeline generator away from the legacy release-candidate/release branch promotion model to a release-resource-driven workflow (release creation, QA deploy, GitHub issue gate, production deploy).
Changes:
- Simplifies git resources to just app main + `ol-infrastructure`, removing RC/release branch git resources.
- Introduces release workflow resources (release, release-gate, release-issue, GitHub Deployments) and a new `build-{app}-release-image` job that bumps version + creates the release + publishes versioned images.
- Rewires QA/Production Pulumi deployment chain to use the release resources for checklist/issue creation and deployment status reporting.
    release_gate = github_issues(
        name=Identifier(f"{app_name}-release-gate"),
        repository=github_repo,
        issue_prefix=f"Release {app_name}",
        issue_title_template=f"Release {app_name}",
        issue_state="closed",
        skip_if_labeled=["abandoned"],
`release_gate` matches any closed issue with prefix/title `Release {app_name}`, and Production uses only that as the manual gate. Because the issue title/prefix don't include the release version, closing an older outstanding release issue (or having multiple releases in flight) can promote the wrong image tag to production. Include the version in the issue title/prefix (e.g., `Release {app_name} vX.Y.Z`) and/or pass the gated version through so Production deploys the exact release being approved.
Good catch — this is a real design risk worth documenting clearly. Here's the detailed breakdown and the options for addressing it.
The concrete failure scenario
1. Release 0.24.0 ships: QA deploys, issue "Release {app}" is opened, team closes it → Production deploys 0.24.0. ✅
2. Release 0.25.0 ships the next day. QA deploys, a new "Release {app}" issue is opened.
3. Someone notices a stale "Release {app}" issue from an earlier cycle that was never closed (for example, an old abandoned release issue) and closes it.
4. `release_gate` fires on that closure → Production deploys... but which image? The shared `dependencies` image `GetStep` has `passed=[qa_job_name]`, so Concourse picks the latest image version that successfully passed through QA, which at this point is 0.25.0. So closing the stale issue accidentally triggers a 0.25.0 prod deploy before the team signed off.
The passed constraint prevents deploying untested images, but it doesn't pin the gate to the specific release being approved.
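The scenario above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the pipeline's code: `ReleaseIssue`, `select_production_image`, and `deploy_on_close` are hypothetical names, and the only behaviour modelled is that the gate fires on any newly closed issue while the image is chosen as the latest version that passed QA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseIssue:
    """Hypothetical stand-in for a release issue; not the github_issues schema."""

    number: int
    intended_version: str  # what the person closing the issue believes they approve


def select_production_image(qa_passed_versions: list[str]) -> str:
    """Model `passed=[qa_job_name]`: pick the latest image that passed QA.

    The closed issue is not consulted at all, which is the gap described above.
    """
    return qa_passed_versions[-1]


def deploy_on_close(issue: ReleaseIssue, qa_passed_versions: list[str]) -> tuple[str, str]:
    """Return (approved_version, deployed_version) for one gate trigger."""
    return issue.intended_version, select_production_image(qa_passed_versions)


# Both releases have passed QA; a stale issue associated with 0.24.0 is closed.
qa_passed = ["0.24.0", "0.25.0"]
stale_issue = ReleaseIssue(number=1, intended_version="0.24.0")
approved, deployed = deploy_on_close(stale_issue, qa_passed)
print(f"approved={approved} deployed={deployed}")
```

The mismatch between `approved` and `deployed` is the timing-control problem: the image is always QA-validated, but it is not tied to the issue that was closed.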
Partial mitigation already in place
- The `skip_if_labeled: [abandoned]` on `release_gate` handles the explicit-abandonment case.
- Any prod deploy triggered prematurely still deploys a QA-validated image, so correctness isn't broken, just the timing control is.
Approaches to fully resolve it
Option A — Include version in the issue title (recommended)
Pass ((.:release_version)) into the release_issue put step's title so issues are named "Release {app} v0.25.0". The release_gate resource would need a matching issue_prefix, but since the prefix must be static at pipeline-generation time, this only works if we update the github_issues resource to support a runtime-variable prefix — or if we use set_pipeline to re-emit the pipeline with the version baked in (complex).
Verdict: Clean conceptually but requires either a dynamic pipeline re-emit or a github_issues resource that supports runtime prefixes. Not straightforward today.
Option B — Use version: every on the release_gate
Add `version="every"` to `GetStep(get=release_gate.name)`. This tells Concourse to process each new closed-issue version separately rather than jumping to the latest, so each release issue closure triggers at most one Production deployment. It doesn't fully prevent the wrong-issue problem but ensures no closure is silently skipped or batched with another.
Verdict: Low-effort improvement, reduces (but doesn't eliminate) the race. Easy to add now.
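As a rough illustration of Option B, the get step can be written as a plain dictionary mirroring the Concourse `get` step fields (`get`, `trigger`, `version`). The `release_gate_get_step` helper is hypothetical and deliberately avoids the ol-concourse model classes, whose exact signatures are not shown in this thread.

```python
import json


def release_gate_get_step(app_name: str) -> dict:
    """Build a Concourse-style get step for the release gate (illustrative)."""
    return {
        "get": f"{app_name}-release-gate",
        "trigger": True,
        # `version: every` asks Concourse to run the job once per resource
        # version instead of skipping ahead to the latest one.
        "version": "every",
    }


step = release_gate_get_step("mitxonline")
print(json.dumps(step, indent=2))
```

Serialising the step this way makes it easy to confirm in the generated pipeline JSON that the gate carries `version: every`.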
Option C — Make release_gate the same resource as release_issue with passed constraint
If release_gate and release_issue are the same underlying resource (same name), a GetStep on the gate with passed=[qa_job_name] would require the specific issue version that was created in QA to now be in the closed state. This is the most watertight solution.
Concretely: rename release_issue to e.g. {app}-release-checklist and make release_gate the single canonical github_issues resource (open when written, closed when gated). Use passed on the Production gate GetStep to pin it to the version created by the QA PutStep.
Verdict: Architecturally correct and idiomatic Concourse. Requires restructuring the resource names so QA writes and Production reads the same versioned resource.
Option D — Accept and document (current state)
The risk only materialises when a stale release issue is closed out-of-band. The skip_if_labeled: [abandoned] label discipline partially covers this. Documenting the operational rule ("always label un-merged release issues as 'abandoned' before closing") reduces real-world risk to near-zero without code changes.
Verdict: Acceptable for an initial rollout; Option B (version: every) is a cheap complement.
My recommendation: land Option B as a quick hardening measure in this PR, and track Option C as a follow-up once the pipeline is running in production and we can verify the passed-constraint approach works with the github_issues resource.
Update: Option C is not viable with this resource.
After verifying with the resource implementation: the github_issues resource version is purely ID-based (derived from the issue number/ID), and is not affected by state transitions. Opening and closing the same issue produce the same version fingerprint.
This breaks Option C completely. Here's the failure mode:
- QA job: `PutStep` creates issue #142, implicit get registers version `{id: 142}` as consumed by the QA job.
- Production job: `GetStep(passed=[qa_job])` asks Concourse: "has version `{id: 142}` been seen by the QA job?" YES, immediately, because QA just consumed it.
- Production triggers right after QA finishes. The gate is completely bypassed.
The passed mechanism works when a version needs to flow through jobs sequentially (e.g., an image built in job A used in job B). It cannot model "wait for this specific thing to change state" because state changes don't produce new versions here.
The two-resource architecture is therefore correct by necessity. release_gate must be a separate resource watching for newly-closed issues because that state transition is the only event that produces a new version for Concourse to trigger on. There is no single-resource equivalent.
What would be needed for Option C to work: the github_issues resource would need to version on state transitions — e.g., version {id: 142, state: open} and {id: 142, state: closed} as distinct versions. That would require a change to the resource itself, which is out of scope here.
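The difference between the two versioning schemes can be sketched as follows. Both functions are hypothetical illustrations, not the `github_issues` resource's code: one derives the version from the issue ID alone, as described above, and the other also includes the state.

```python
def version_by_id(issue: dict) -> dict:
    """ID-only versioning: open and closed states collapse to one version."""
    return {"id": str(issue["id"])}


def version_by_id_and_state(issue: dict) -> dict:
    """State-aware versioning: each state transition yields a new version."""
    return {"id": str(issue["id"]), "state": issue["state"]}


opened = {"id": 142, "state": "open"}
closed = {"id": 142, "state": "closed"}

# With ID-only versions, a single resource sees no new version on closure.
same_version = version_by_id(opened) == version_by_id(closed)

# With state-aware versions, the closure is a distinct, triggerable version.
distinct_versions = version_by_id_and_state(opened) != version_by_id_and_state(closed)

print(same_version, distinct_versions)
```

Only the second scheme would let a `passed` constraint distinguish "issue created in QA" from "issue closed by a reviewer" on a single resource.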
Current best posture: Option B (version: every, already implemented) + operational discipline with the abandoned label. This is the ceiling of what's achievable without modifying the resource type.
Set VERSION in Django settings to the new YYYY.MM.DD.N format and add [tool.bumpver] config pointing at the settings file. The Concourse release pipeline (mitodl/ol-infrastructure#4506) runs `bumpver update --set-version $VERSION --no-commit --no-fetch` on each release, so bumpver must be configured before the pipeline can manage versions automatically. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
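For illustration only, a version string in that format can be parsed and checked as below. The `parse_release_version` helper and the sample version are hypothetical, and the sketch assumes one- or two-digit month and day fields and a non-negative integer `N`; the `[tool.bumpver]` pattern in each app repo remains the source of truth.

```python
import re
from datetime import date

# Assumed shape of YYYY.MM.DD.N; the padding rules are an assumption.
VERSION_RE = re.compile(r"^(\d{4})\.(\d{1,2})\.(\d{1,2})\.(\d+)$")


def parse_release_version(version: str) -> tuple[date, int]:
    """Split a YYYY.MM.DD.N string into its calendar date and counter N."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a YYYY.MM.DD.N version: {version!r}")
    year, month, day, counter = (int(part) for part in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day), counter


released_on, counter = parse_release_version("2025.01.15.2")
print(released_on.isoformat(), counter)
```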
Replace the Doof-era release-candidate/release branch model with a
Concourse-native release workflow using the ol-concourse release resource,
GitHub Deployments, and GitHub Issues as the production gate.
Changes:
- Remove repo_rc_branch / repo_release_branch from AppPipelineParams; add
github_repo (defaults to mitodl/{repo_name})
- Simplify _define_git_resources: drop RC and release branch git resources;
only main and ol-infra repos remain
- Add _define_release_resources: creates release resource (trigger),
release-gate (closed-issue prod trigger), release-issue (open-issue post
to QA), and GitHub Deployment resources for RC and Production environments
- Replace _build_image_job(branch_type=release_candidate) with
_build_release_image_job which:
- Gets the release resource (trigger) and main repo source
- Runs bump_version_task to propagate the version into app source
- Puts the release resource (action=create) to commit the bump and
create the release branch + tag in GitHub
- Builds and pushes a versioned Docker image to DockerHub and ECR
- Wire custom_dependencies into pulumi_jobs_chain:
- QA (index 0): get release resource for checklist.md, put
deployment-rc action=start
- Production (index 1): get release-gate (trigger), put
deployment-prod action=start
- Wire additional_post_steps:
- QA: put release-issue (body from checklist.md), put
deployment-rc action=finish
- Production: put deployment-prod action=finish
- Import release_resource_type and github_deployments_resource from
ol-concourse 0.7.0 (already specified as dependency)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
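The QA and Production wiring listed in this commit message can be pictured as plain step dictionaries. This is only a sketch: `stack_steps` is a hypothetical helper, the short resource names are taken from the bullet text rather than from the generated pipeline, and the real code presumably passes ol-concourse `GetStep`/`PutStep` objects into `pulumi_jobs_chain`.

```python
def stack_steps(stage: str) -> dict[str, list[dict]]:
    """Return illustrative pre- and post-deploy steps for one stack.

    Resource names and the `action` param follow the commit message;
    the exact step shapes in ol-concourse may differ.
    """
    if stage == "qa":
        return {
            "custom_dependencies": [
                {"get": "release"},  # provides checklist.md
                {"put": "deployment-rc", "params": {"action": "start"}},
            ],
            "additional_post_steps": [
                {"put": "release-issue"},  # body taken from checklist.md
                {"put": "deployment-rc", "params": {"action": "finish"}},
            ],
        }
    if stage == "production":
        return {
            "custom_dependencies": [
                {"get": "release-gate", "trigger": True},
                {"put": "deployment-prod", "params": {"action": "start"}},
            ],
            "additional_post_steps": [
                {"put": "deployment-prod", "params": {"action": "finish"}},
            ],
        }
    raise ValueError(f"unknown stage: {stage}")


for stage in ("qa", "production"):
    steps = stack_steps(stage)
    print(stage, len(steps["custom_dependencies"]), len(steps["additional_post_steps"]))
```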
- Fix stale git_ref: load from release resource AFTER put(action=create) so the image is stamped with the actual release commit SHA, not the pre-release commit.
- Fix build context: switch container_build_task inputs/CONTEXT/DOCKERFILE from main_repo to release_res so the image is built from the release commit tree (which includes the version bump).
- Fix additional_tags: source short_ref from release resource instead of main_repo for consistency with the above.
- Fix hardcoded branch: pass repo_main_branch into _define_release_resources so apps using 'master' (micromasters, xpro, ocw-studio, odl-video-service) get a correctly configured release resource.
- Fix job name: derive _build_image_job name from the actual branch (git_repo_resource.source['branch']) instead of hardcoding 'main'.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pipeline_parameters.github_repo is typed str | None (set via model_validator) so use an explicit fallback to satisfy mypy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ensures each closed release issue is processed as a distinct trigger version rather than Concourse jumping to the latest closed issue. Prevents a stale issue closure from silently batching with (and stealing the slot of) the intended release gate. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What are the relevant tickets?
N/A — linked to discussion https://github.com/mitodl/hq/discussions/10465
Description (What does it do?)
Replaces the Doof-era `release-candidate`/`release` git branch model for app deployments with a Concourse-native release workflow.
Before: Doof pushed commits to `release-candidate` to trigger RC builds, and pushed to `release` to trigger production deploys.
After: A dedicated `release` Concourse resource (from the ol-concourse release resource) drives the full lifecycle:
- The `build-{app}-release-image` job runs `bump_version_task` to update version strings in app source
- The job puts the release resource (`action: create`) to commit the version bump and create a release branch + tag in GitHub
- The QA deploy posts a release issue (body from `checklist.md`) and marks the RC GitHub Deployment as successful
- The `release-gate` resource (github-issues, closed state) triggers production when the issue is closed
Key changes to `docker_pulumi.py`:
- `AppPipelineParams`: removed `repo_rc_branch`/`repo_release_branch`; added `github_repo` (defaults to `mitodl/{repo_name}`)
- `_define_git_resources`: simplified to main + ol-infra repos only
- `_define_release_resources`: creates release, release-gate, release-issue, and GitHub Deployment resources
- Replaced `_build_image_job(branch_type=release_candidate)` with `_build_release_image_job`
- `pulumi_jobs_chain` wired with `custom_dependencies` (per-stack) and `additional_post_steps` for GitHub Deployments and Release Issues
- Imports `release_resource_type` and `github_deployments_resource` from ol-concourse 0.7.0 (already pinned as dependency)
How can this be tested?
Generated pipeline JSON was validated by running the script for mitxonline and verifying the jobs/resources/resource_types structure is correct. A test pipeline was created at https://cicd.odl.mit.edu/teams/infrastructure/pipelines/docker-pulumi-micromasters-relesae-test
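One way to make that structural check repeatable is a short script along these lines. It is a sketch that assumes the generated document has top-level `jobs`, `resources`, and `resource_types` lists whose entries carry a `name` key, as Concourse pipeline definitions do; `missing_names`, `check_definition`, and the expected names passed to them are illustrative.

```python
import json
from pathlib import Path


def missing_names(pipeline: dict, expected: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per top-level section, the expected names that are absent."""
    missing = {}
    for section, names in expected.items():
        present = {entry["name"] for entry in pipeline.get(section, [])}
        absent = names - present
        if absent:
            missing[section] = absent
    return missing


def check_definition(path: Path, expected: dict[str, set[str]]) -> dict[str, set[str]]:
    """Load a generated pipeline JSON file and report missing names."""
    return missing_names(json.loads(path.read_text()), expected)


# In-memory example; real use would call check_definition(Path("definition.json"), ...).
sample = {
    "resource_types": [],
    "resources": [{"name": "mitxonline-release-gate"}],
    "jobs": [{"name": "build-mitxonline-release-image"}],
}
expected = {
    "jobs": {"build-mitxonline-release-image"},
    "resources": {"mitxonline-release-gate", "mitxonline-release-issue"},
}
print(missing_names(sample, expected))
```

An empty result means every expected job and resource is present; anything else names the section and the entries that were not generated.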
To validate locally:

    uv run python src/ol_concourse/pipelines/infrastructure/k8s_apps/docker_pulumi.py mitxonline
    # Inspect definition.json for expected jobs, resources, and resource_types

Additional Context
This PR requires ol-concourse 0.7.0 (which adds `release_resource_type()`), already reflected in `pyproject.toml`. App repos will need `bumpver` configured for `bump_version_task` to succeed; teams will be notified separately.