feat(ci): release process fixes from testing #4195

Open
sveitser wants to merge 14 commits into main from ma/version-scheme-2

@sveitser
Collaborator

Tag push events triggered by GITHUB_TOKEN don't trigger other workflows, but workflow_dispatch does. Dispatch build.yml explicitly after creating the release so Docker images get built without needing a PAT or GitHub App.

Workflow dispatch requires actions:write permission; contents:write
alone is not enough.
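
For readers unfamiliar with the mechanism, the dispatch corresponds to GitHub's REST endpoint for creating a workflow dispatch event. The following is a minimal sketch that only builds the request and does not send it. The helper name, repository, tag, and token are placeholders for illustration and are not taken from this PR, which performs the dispatch from within create-release.yml.

```python
import json
import urllib.request


def build_dispatch_request(
    repo: str, workflow: str, ref: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for `workflow` at `ref`."""
    url = f"https://api.github.com/repos/{repo}/actions/workflows/{workflow}/dispatches"
    # `ref` is the branch or tag the dispatched workflow runs against.
    body = json.dumps({"ref": ref}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


# Hypothetical values, for illustration only.
req = build_dispatch_request("example-org/example-repo", "build.yml", "20260408", "TOKEN")
```

A token that only has contents:write would be rejected for this call, which is why the workflow needs actions:write as well.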
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the documentation in doc/software-releases.md to clarify the release workflow. It adds a description explaining that create-release.yml dispatches build.yml to ensure Docker images are built, as tag pushes via GITHUB_TOKEN do not automatically trigger other workflows. I have no feedback to provide as there were no review comments.

- Drop YYYYMMDD.rcN git tag concept. Versions are defined once
  (YYYYMMDD), and if broken a new version is cut (YYYYMMDD.1 or
  next day).
- Rename decaf.rc/mainnet.rc floating tags to decaf.canary/mainnet.canary
  since they no longer represent RC versions, just early-adopter tiers.
- Classify YYYYMMDD.N same-day releases as Release (not Pre-release).
- Reject YYYYMMDD-* internal tags from promotion.
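
The tag rules above can be sketched in Python as follows. The Release pattern follows the `(\..+)?$` suffix mentioned in the review of create-release.yml; the stricter promotion pattern and both function names are assumptions for illustration, and the actual regexes in the workflows may differ.

```python
import re

# YYYYMMDD, optionally followed by ".something" (e.g. 20260408.1).
RELEASE_RE = re.compile(r"^\d{8}(\..+)?$")
# Assumed promotion rule: YYYYMMDD or YYYYMMDD.N only, so YYYYMMDD-* is rejected.
PROMOTABLE_RE = re.compile(r"^\d{8}(\.\d+)?$")


def classify(tag: str) -> str:
    """Return "Release" for YYYYMMDD and YYYYMMDD.N tags, else "Pre-release"."""
    return "Release" if RELEASE_RE.match(tag) else "Pre-release"


def can_promote(tag: str) -> bool:
    """Internal tags such as YYYYMMDD-foo must not be promoted."""
    return PROMOTABLE_RE.match(tag) is not None
```

Under these assumptions, `20260408` and `20260408.1` are both classified as Release and are promotable, while a tag such as `20260408-test` is neither.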
Adds a step-by-step walkthrough section referencing screenshots in
doc/assets/, including a note to always select main in the
"Use workflow from" dropdown.
@sveitser sveitser changed the title feat(ci): dispatch build.yml from create-release feat(ci): release process fixes from testing Apr 17, 2026
Shows current floating Docker tag targets, recent YYYYMMDD git tags
(with Release/Pre-release classification and originating branch), and
active release-* branches with tags not on main. Python stdlib only;
requires gh with read:packages scope.

Adds a `just release-status` recipe and a short section in the release
versioning doc.
Switch from gh api (which requires read:packages scope) to docker
buildx imagetools inspect for floating-tag-to-version mapping -- works
anonymously for public packages. Parallelize docker inspects and
per-tag git branch --contains calls via ThreadPoolExecutor. Full run
drops from ~30s to ~5s.
Compute git tag --merged for main and each release branch once at
startup (parallelized), then look up tag->branch via dict instead of
per-tag git branch --contains calls. Also fetch gh API calls in
parallel. Runs drop from ~5s to ~1.3s.
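
A rough sketch of the lookup-table approach described in this commit, assuming one `git tag --merged <branch>` call per branch and a preference for the first branch listed (for example main). The function names and the tie-breaking rule are illustrative assumptions, not the script's actual code.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def merged_tags(branch: str) -> list[str]:
    """Tags reachable from `branch`, obtained with a single git call."""
    out = subprocess.run(
        ["git", "tag", "--merged", branch],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def index_tags(merged: dict[str, list[str]]) -> dict[str, str]:
    """Map each tag to the first branch (in dict order) that contains it."""
    index: dict[str, str] = {}
    for branch, tags in merged.items():
        for tag in tags:
            index.setdefault(tag, branch)
    return index


def build_index(branches: list[str]) -> dict[str, str]:
    """Run the per-branch git calls in parallel, then build the tag index."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(merged_tags, branches))
    return index_tags(dict(zip(branches, results)))


# Pure-function example with made-up data (no git required).
example = index_tags({
    "main": ["20260401", "20260408"],
    "release-example": ["20260408", "20260408.1"],
})
```

The cost is one git process per branch instead of one per tag, after which each tag lookup is a dictionary access.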
- Replace `docker buildx imagetools inspect` with direct HTTP to the
  GHCR OCI registry. Anonymous token is fetched once. Removes ~400ms
  per call; full run drops from ~1.3s to ~0.7s, --floating to ~0.4s.
- Add --loglevel debug which prints per-command and per-stage timings
  to stderr.
- Skip launching unnecessary work when --floating is passed.
- Hide release branches with no release tags by default. Add
  --all-branches to include them.
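
For illustration, a direct registry lookup might look like the sketch below. It assumes the standard OCI distribution manifest endpoint and that the registry returns the digest in the `Docker-Content-Digest` response header. The repository name, function names, and the list of accepted media types are assumptions, not copied from the script.

```python
import urllib.request
from typing import Optional

REGISTRY = "https://ghcr.io"
# Accept both OCI and Docker media types, for indexes and single manifests.
ACCEPT = ", ".join([
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
])


def manifest_request(repo: str, tag: str, token: str) -> urllib.request.Request:
    """Build a HEAD request for the manifest of repo:tag."""
    return urllib.request.Request(
        f"{REGISTRY}/v2/{repo}/manifests/{tag}",
        method="HEAD",
        headers={"Authorization": f"Bearer {token}", "Accept": ACCEPT},
    )


def tag_digest(repo: str, tag: str, token: str) -> Optional[str]:
    """Return the manifest digest for repo:tag (performs a network call)."""
    with urllib.request.urlopen(manifest_request(repo, tag, token), timeout=10) as resp:
        return resp.headers.get("Docker-Content-Digest")


# Hypothetical repository and token; only the request is built here.
req = manifest_request("example-org/example-image", "decaf", "TOKEN")
```

Two floating tags that resolve to the same digest point at the same image, which is how a floating tag can be mapped back to a dated version without pulling anything.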
…nches

- Default window reduced from 180 to 60 days.
- git fetch --tags is off by default (use --fetch to enable). Network
  latency makes it the slowest step and many runs don't need fresh
  data.
- Hide release branches with no release tags (use --all-branches to
  include them).
- Doc updated with new flags.
The repo is public so SSH auth is unnecessary. Fetch over HTTPS
(0.3-0.5s) instead of SSH (1.6s) and only fetch YYYYMMDD tags plus
release/main branches. Full run with fetch drops from ~2.3s to
~1.3s. Restore fetching as the default.
- Drop the --paginate flag and the "][" JSON concatenation hack; for
  the endpoints we call, per_page=100 is enough.
- Collapse fetch_promote_runs and fetch_release_classification's
  list/dict fallbacks.
- section_floating takes one digests dict instead of two.
Type annotations throughout (PEP 585/604, Python 3.9+). Introduce
ReleaseBranch NamedTuple instead of positional tuples. Start git fetch
concurrently with floating tag digest lookups and gh workflow runs;
candidate digest lookups start after fetch completes. Full run with
fetch drops from ~1.3s to ~1.0s.
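
The ordering described in this commit can be sketched as below: the fetch and the floating-tag lookups start together, and candidate lookups wait for the fetch because the candidate tags are only known afterwards. All names are hypothetical, and the three callables stand in for the script's real steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_pipeline(
    fetch: Callable[[], None],
    floating_digests: Callable[[], dict[str, str]],
    candidate_digests: Callable[[], dict[str, str]],
) -> tuple[dict[str, str], dict[str, str]]:
    with ThreadPoolExecutor() as pool:
        fetch_future = pool.submit(fetch)                 # starts immediately
        floating_future = pool.submit(floating_digests)   # runs alongside the fetch
        fetch_future.result()                             # wait: candidates depend on fresh tags
        candidates = candidate_digests()
        return floating_future.result(), candidates


# Stand-in steps that record the order in which they ran.
order: list[str] = []


def fake_fetch() -> None:
    order.append("fetch")


def fake_floating() -> dict[str, str]:
    return {"decaf": "sha256:aaa"}


def fake_candidates() -> dict[str, str]:
    order.append("candidates")
    return {"20260408": "sha256:aaa"}


floating, candidates = run_pipeline(fake_fetch, fake_floating, fake_candidates)
```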
@sveitser sveitser marked this pull request as ready for review April 27, 2026 14:57
@claude
Contributor

claude Bot commented Apr 27, 2026

Claude encountered an error


I'll analyze this and get back to you.

@github-actions
Contributor

Nextest failures (1) in this run

| Test | Attempts | Time (s) | Main history |
| --- | --- | --- | --- |
| hotshot-testing::tests_5::tests_5::combined_network::test_combined_network_cdn_crash | 1 | 76.30 | passing |

See the step summary for flaky tests and slowest tests.

@claude
Contributor

claude Bot commented Apr 29, 2026

Claude finished @sveitser's task in 2m 18s


Code Review: PR #4195 — Release process fixes from testing

  • Read existing review comments
  • Review CI workflow changes
  • Review documentation changes
  • Review new release-status script
  • Review justfile changes
  • Post review feedback

Summary

Clean PR that solves a real GitHub limitation (tag pushes from GITHUB_TOKEN don't trigger other workflows) and renames the rc concept to canary for clarity. The documentation rewrite is a significant improvement — the step-by-step walkthrough with screenshots will save people a lot of trial and error. The release-status script is a thoughtful operational tool with nice performance work (parallelized fetches, GHCR HTTP API, etc.).

CI Workflows

create-release.yml — The core fix is well-designed: after gh release create, dispatch build.yml via workflow_dispatch against the new tag. Adding actions: write permission is necessary and correctly scoped. The classify regex update ((\..+)?$ instead of just $) correctly makes YYYYMMDD.1 a Release rather than Pre-release.

promote-docker-tag.yml — The rc → canary rename is consistent throughout. The validation regex tightening to reject YYYYMMDD-* tags from promotion is a good safety measure, since those are internal-only. Progression chain is correctly updated.

Documentation

The walkthrough is clear and well-structured. Good call adding the "If Something Goes Wrong" section and the note about floating tags vs. manual operator steps. The "always select main in the workflow dropdown" instruction will prevent a common source of confusion.

release-status script

Well-written overall — good use of ThreadPoolExecutor for parallelizing network calls, type hints, and structured output. Two issues noted via inline comments:

  1. promote_run_for substring matching (moderate) — in-based matching on display titles causes false positives: "decaf" matches "decaf.canary", and "20260408" matches "20260408.1". Using the workflow run's inputs field would give exact matching.

  2. ghcr_token() thread-safety (minor) — the global token is read/written from multiple threads without synchronization. Harmless in practice (worst case: duplicate fetches), but easy to fix.

Minor note: fetch_promote_runs and fetch_release_classification both cap at per_page=100. This is fine for now, but could silently truncate results as the project accumulates workflow runs over time.

Justfile

Simple passthrough — looks good.

Overall

The PR is in good shape. The only item I'd suggest addressing before merge is the promote_run_for substring matching, since it can produce visibly wrong output (showing the wrong promote date or source tag).

Comment thread scripts/release-status
Comment on lines +183 to +189
    matches = [
        r for r in runs
        if r.get("conclusion") == "success"
        and floating_tag in (r.get("display_title") or r.get("name") or "")
        and release_tag in (r.get("display_title") or r.get("name") or "")
    ]
    return max(matches, key=lambda r: r["created_at"]) if matches else None

Bug: substring matching causes false positives

Both checks use in (substring containment), which leads to incorrect matches in at least two ways:

  1. Floating tag collision — searching for decaf matches a run titled "… decaf.canary …", and mainnet matches mainnet.canary.
  2. Release tag collision — searching for 20260408 matches "… 20260408.1 …".

So a promote of 20260408.1 to decaf.canary would incorrectly appear as a match when looking for 20260408 promoted to decaf.

Consider using word-boundary matching or parsing the workflow run inputs instead. The inputs are available on workflow_dispatch runs:

# e.g. r.get("inputs", {}).get("floating-tag") == floating_tag
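
As an illustration of the word-boundary option, here is a sketch that works on display titles alone. It treats word characters, `.` and `-` as characters that can extend a tag, so a match is only accepted when none of them is adjacent. The function name and the sample title format are assumptions; the real titles produced by the promote workflow may look different.

```python
import re


def mentions_tag(title: str, tag: str) -> bool:
    """True if `tag` appears in `title` as a whole token.

    Word characters, '.' and '-' may not sit directly before or after the
    match, so 'decaf' does not match inside 'decaf.canary'.
    """
    pattern = rf"(?<![\w.-]){re.escape(tag)}(?![\w.-])"
    return re.search(pattern, title) is not None


# Hypothetical display title.
title = "Promote 20260408.1 to decaf.canary"
```

With this helper, `mentions_tag(title, "decaf")` and `mentions_tag(title, "20260408")` are both false for the title above, while the exact tags `decaf.canary` and `20260408.1` still match. One limitation: a tag immediately followed by sentence punctuation such as a trailing period would not match.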

Comment thread scripts/release-status
Comment on lines +100 to +109
_ghcr_token: str | None = None


def ghcr_token() -> str:
    global _ghcr_token
    if _ghcr_token is None:
        url = f"https://ghcr.io/token?scope=repository:{GHCR_REPO}:pull"
        with urllib.request.urlopen(url, timeout=10) as resp:
            _ghcr_token = json.loads(resp.read())["token"]
    return _ghcr_token

Nit: ghcr_token() is called from multiple threads (via tag_digest submitted to the ThreadPoolExecutor), but the global _ghcr_token check-then-set isn't synchronized. In practice this just means a few redundant token fetches on the first batch, not a correctness bug. A threading.Lock would make it explicit, or you could fetch the token once before submitting the digest futures.
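
A minimal sketch of the lock-based variant, written as a small cache class so the fetch function can be swapped out. The class and the fake fetcher are hypothetical; in the script the fetch would be the existing request to the GHCR token endpoint.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class TokenCache:
    """Fetch a token at most once, even when called from several threads."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._token: Optional[str] = None

    def get(self) -> str:
        # The check and the assignment happen under the same lock, so only
        # the first caller performs the fetch; later callers reuse the result.
        with self._lock:
            if self._token is None:
                self._token = self._fetch()
            return self._token


# Demonstration with a fake fetcher that counts how often it is called.
calls: list[int] = []


def fake_fetch() -> str:
    calls.append(1)
    return "tok"


cache = TokenCache(fake_fetch)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda _: cache.get(), range(32)))
```

Holding the lock during the fetch serializes the first batch of callers, which is acceptable here because they would all need the token before proceeding anyway.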
