feat(imagefamily): implement GKE release channel awareness #342
Merged
dm3ch
commented
May 19, 2026
bcc0c15 to
f84badd
Compare
2 tasks
dm3ch
added a commit
to dm3ch/karpenter-provider-gcp
that referenced
this pull request
May 19, 2026
PR cloudpilot-ai#327 introduced image-management.md with alias as the primary approach. PR cloudpilot-ai#342 adds the structured family/channel/version fields which supersede it. Rewrite the doc to lead with the new structured selection modes (channel tracking, version pin), keep the unique operational content (gcloud version-discovery commands, disruption budget patterns, cluster upgrade notes), and demote the alias format to a Legacy section. Also add a forward pointer in MIGRATION.md's alias section pointing users to the new structured fields. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
f84badd to
1fb816d
Compare
02ab052 to
3e599aa
Compare
3 tasks
Collaborator
Author
E2E Test Results
Cluster: `asia-southeast1-b` | Commit: `23e1edda` | Total duration: ~12m (provisioning), ~5m (drift/gpu/channel)
Overall: ✅ all pass |
thameezb
pushed a commit
to dm3ch/karpenter-provider-gcp
that referenced
this pull request
May 27, 2026
PR cloudpilot-ai#327 introduced image-management.md with alias as the primary approach. PR cloudpilot-ai#342 adds the structured family/channel/version fields which supersede it. Rewrite the doc to lead with the new structured selection modes (channel tracking, version pin), keep the unique operational content (gcloud version-discovery commands, disruption budget patterns, cluster upgrade notes), and demote the alias format to a Legacy section. Also add a forward pointer in MIGRATION.md's alias section pointing users to the new structured fields. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
23e1edd to
becd88f
Compare
…ource Instead of requiring pre-created karpenter-specific pools by name, discover the first RUNNING or RUNNING_WITH_ERROR pool alphabetically (preferring default-pool). Create karpenter-fallback immediately on Sync failure rather than waiting N retry cycles — the pool is zero-node, so no compute cost. Add DefaultNodePoolTemplateName option to pin a specific pool when the cluster has multiple candidates. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…rovisioning Add PatchKubeEnvForOSType to adjust gke-os-distribution and BFQ scheduler settings when the target image family differs from the source pool's OS. Add PatchKubeEnvForArch (kubeenv_arch.go) to update SERVER_BINARY_TAR_URL and SERVER_BINARY_TAR_HASH when the target arch differs from the source pool's arch. The SHA-512 hash is fetched from the public GCS sidecar file and cached. Wire both patches in instance.go via a patchKubeEnv helper (reduces cyclomatic complexity of setupInstanceMetadata). Add unit tests for all cross-arch/OS paths. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Allow cgpv1 images so provisioned nodes can match the cgroup version the cluster uses. Improve Ubuntu image pagination to avoid accumulating thousands of deprecated images in memory on large clusters. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…Helm value Adds DEFAULT_NODEPOOL_TEMPLATE_NAME operator flag and matching Helm value so operators can pin Karpenter to a specific existing node pool for bootstrap metadata rather than relying on auto-discovery. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…rap pools Design record for PR cloudpilot-ai#263. Status: Implemented.
- proposals/ added to docs-lint CI path trigger and Makefile DOCS_FILES
- lefthook.yaml added to .gitignore (local-only file)
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…tions
- ResolveVersionForChannel: extract findChannelConfig and highestVersionForMinor helpers to reduce cyclomatic complexity below threshold
- image.go: add //nolint:staticcheck on term.Alias accesses (deprecated field kept for backward compat); add ctx to resolveImage for cancellation support
- CRDs: bump controller-gen annotation to v0.21.0 (local toolchain)
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
New test suite covers:
- COS family with channel: stable (resolves via GKE server config)
- COS family with version: latest (equivalent to existing alias path)
- Ubuntu2404 family with version: latest
New helpers in test/pkg/environment/helpers.go:
- CreateNodeClassWithFamilyChannel: imageSelectorTerm with family+channel
- CreateNodeClassWithFamilyVersion: imageSelectorTerm with family+version
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Document two divergences between proposal and initial implementation:
1. Success-path Status.Conditions messages are not yet emitted (only error paths produce condition messages); normal resolution path logging is deferred to a follow-up.
2. Ubuntu version-pin catalog miss currently surfaces as empty Status.Images (ImagesReady=False) rather than a specific error message; explicit catalog lookup for pinned dates is deferred to a follow-up.
Both cases set ImagesReady=False correctly; only the message granularity differs from the proposal spec.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Missed during the rebase of bcc0c15 onto origin/main. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Remove unused cosVersionRe, extra blank line in instance.go, and realign variable block in utils.go per gofmt. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Picks up gpuDriverVersion field and alias CEL validation rules that were missing from the feature branch CRDs. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…Images After rebase, the cosVersionRe guard was removed when resolving the conflict for commit a69458d. The test TestResolveImages_COS_PinnedVersion_InvalidFormat (from main via PR cloudpilot-ai#327) expects invalid milestone formats to return imageResolutionError before hitting the compute API. Restore the check in the milestone pin path, after parseGKEVersion() returns false. GKE version strings (e.g. 1.34.6-gke.1068000) still take the exact-build path and are unaffected. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…n_Valid The test was written before the Ubuntu struct gained a release field. With release="", deriveArm64Image falls through to the 2204 path which inserts -arm64- before the version suffix instead of replacing -amd64-, producing -amd64-arm64-v... instead of -arm64-v.... Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
PR cloudpilot-ai#327 introduced image-management.md with alias as the primary approach. PR cloudpilot-ai#342 adds the structured family/channel/version fields which supersede it. Rewrite the doc to lead with the new structured selection modes (channel tracking, version pin), keep the unique operational content (gcloud version-discovery commands, disruption budget patterns, cluster upgrade notes), and demote the alias format to a Legacy section. Also add a forward pointer in MIGRATION.md's alias section pointing users to the new structured fields. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
PULL_REQUEST_TEMPLATE.md and .github/workflows/docs.yaml were inadvertently modified during branch development. Restore to origin/main state. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
CEL string literals don't accept \. as an escape sequence; use [.] for literal dot matches. Also split the single Ubuntu family-in-list rule (which exceeded cost budget by 2.8x) into two per-family rules. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Required for release-channel image resolution (GetServerConfig API call). Also adds a dedicated MIGRATION.md section for the release channel feature with the IAM note for manual-role users. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
CEL regex cost is proportional to the maximum string length. Without a MaxLength constraint the Kubernetes cost estimator uses a very large default, causing the .matches() rules to exceed the per-rule and schema-wide cost budgets. 32 chars covers all valid version strings (milestone: 19 chars, Ubuntu date: 9 chars, 'latest': 6 chars). Also update MIGRATION.md: IAM note is informational not "action required" since this release ships the permission in the standard role; alias deprecation note explicitly says no forced removal in this release. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Pure computation (ResolveVersionForChannel and helpers) separated from the API client in gke.go. Also demote Option 2 to non-recommended in image-management.md — channel tracking is the single recommended path. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Mirrors the gke.go → version.go production split. Pure logic tests (ResolveVersionForChannel) are now separate from the HTTP/cache tests. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Consolidates the family→ImageFamily mapping into a single map in the constructor, removing the duplicate switch in getImageFamilyProvider. Adding a new family is now a one-liner in NewDefaultProvider. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
… notice Lead with the deprecation and migration path rather than describing the new feature. Show before/after alias → structured field examples. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Accidentally dropped in an earlier commit. No functional change. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Mark alias as deprecated in troubleshooting; update all example NodeClass and doc YAML snippets to use the structured fields. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…ions; remove no-op TrimPrefix filterExistingImages now returns imageResolutionError when candidates were resolved but none exist in GCP, restoring the ImagesReady=False diagnostic for pinned-version misses that the old alias path provided. Also remove the redundant TrimPrefix(m, "") in deriveArm64Image (2204 path) — TrimPrefix with an empty prefix is a no-op. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Add post-Ready image version assertions:
- COS/version:latest: resolved image must equal independently-queried latest
- Ubuntu2404/version:latest: resolved image must contain current version token
- channel tests: resolved image must be from the expected GCP project
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
resolveExactBuildCOSImage already documented "a miss should surface as ImagesReady=False" but returned a plain fmt.Errorf, bypassing the IsImageResolutionError check in the status controller and causing silent exponential backoff instead of an ImagesReady=False condition. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…st, verify channel build
- Drop COS/version:latest spec: covered by provisioning suite's basic COS test
- Export ParseGKEVersion so tests can parse GKE version strings
- Add GetChannelVersion helper to Environment (fetches server config, calls
ResolveVersionForChannel) so tests can determine the expected GKE build
- Replace weak project-prefix assertion for channel tests with exact build
verification: resolved image must embed gke-{k8sKey}-gke{build}- from the
version the server config returns for the requested channel
- Keep Ubuntu2404/version:latest with version-token assertion (different naming
convention and project from COS; no equivalent provisioning coverage)
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
ResolveVersionForChannel errors (unknown channel, no version for cluster minor) and the UNSPECIFIED cluster channel case are all user-config failures. Wrapping them as imageResolutionError surfaces them as ImagesReady=False instead of triggering controller-runtime exponential backoff. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
GCE VMs appear in instances.list within seconds of the create call, but the lifecycle controller only writes providerID back to the NodeClaim after the zone operation completes (30–90s). The previous 30s grace was too short: a VM taking ~35s to provision would pass the grace window before its NodeClaim was updated, causing the GC to delete it as "orphaned" and leaving the NodeClaim permanently stuck with Launched=True but no backing instance. 3 minutes covers the full provisioning cycle in all cases and costs at most one extra GC cycle before a genuine orphan is collected. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…balanced c4a machines require hyperdisk-balanced, not pd-balanced. All e2e NodeClasses use pd-balanced by default, so every c4a attempt fails with a GCP 400 error before falling through to t2a. Remove c4a from all four arm64 specs (COS and Ubuntu, on-demand and spot) and add t2a-standard-4 as a second option. Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
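The grace-period reasoning in the orphan GC commit above can be illustrated with a minimal sketch. It is not the provider's actual GC code: the `instance` type, the `isOrphan` function, and the map of claimed provider IDs are hypothetical names introduced here, and only the 3-minute value comes from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// gcGracePeriod mirrors the 3-minute value from the commit message.
const gcGracePeriod = 3 * time.Minute

// instance is a hypothetical, simplified view of a GCE VM for this sketch.
type instance struct {
	Name      string
	CreatedAt time.Time
}

// isOrphan reports whether a VM may be garbage-collected: it must have no
// NodeClaim referencing it AND be older than the grace period. The age check
// protects VMs whose providerID has not yet been written back.
func isOrphan(vm instance, claimed map[string]bool, now time.Time) bool {
	if claimed[vm.Name] {
		return false
	}
	return now.Sub(vm.CreatedAt) >= gcGracePeriod
}

func main() {
	// Arbitrary creation time, chosen only for the demonstration.
	created := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	vm := instance{Name: "vm-a", CreatedAt: created}
	noClaims := map[string]bool{}

	// 35 s after creation: still inside the grace window, so not collected.
	fmt.Println(isOrphan(vm, noClaims, created.Add(35*time.Second))) // false
	// 4 min after creation with no NodeClaim: treated as a genuine orphan.
	fmt.Println(isOrphan(vm, noClaims, created.Add(4*time.Minute))) // true
}
```

With a 30 s constant, the first call would return true, which is the premature deletion the commit describes.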
becd88f to
02a2261
Compare
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Implements proposal/0002-release-channel-awareness: adds structured `family`/`channel`/`version` fields to `ImageSelectorTerm`, allowing image selection by GKE release channel or explicit version rather than only by the (now-deprecated) `alias` string.

New `ImageSelectorTerm` fields:
- `family`: image OS family (`ContainerOptimizedOS`, `Ubuntu2404`, `Ubuntu2204`)
- `channel`: GKE release channel (`rapid`, `regular`, `stable`, `extended`, or `cluster` to follow the cluster's enrolled channel)
- `version`: explicit version string or `latest`

How it works:
- When `channel:` is set, the provider calls `GetServerConfig` (Container API, 30-min cache) to resolve the highest K8s minor version shipped for that channel, then derives the corresponding GCE image via an exact GKE build filter (`gke-{k8sKey}-gke{build}-*`).
- `channel: cluster` reads the cluster's own enrolled channel from `GetCluster`.
- When `version:` is set, the milestone (`125.19216.104.126`) or date (`v20260416`) is used directly; invalid formats are rejected at admission.
- Resolution failures surface as `ImagesReady=False` with reason `ImageResolutionFailed` rather than silently falling back.

Ubuntu 2204 support: `isUsableUbuntu2204Image` handles the 2204 naming convention (no `-amd64-` suffix on amd64 images); arm64 derivation inserts `-arm64-` before the `-v{date}` version suffix.

`alias` backward compatibility: The existing `alias` field continues to work unchanged. It is now soft-deprecated; new configurations should use `family`/`channel`/`version`.

Docs: New `docs/image-selection.md` (selection mode reference); `docs/image-management.md` rewritten to lead with the new paradigm (alias demoted to a Legacy section).

GC race fix (included in this PR): Increased the orphan garbage-collection grace period from 30 s to 3 min. GCE VMs appear in `instances.list` within seconds of the create call, but the lifecycle controller only writes `providerID` back to the NodeClaim after the zone operation completes (30–90 s). The previous 30 s grace was narrow enough to delete a legitimately-launching VM before its NodeClaim was updated, leaving the NodeClaim permanently stuck with `Launched=True` but no backing instance.

E2E fix (included in this PR): Dropped `c4a` from all four arm64 provisioning specs. `c4a` machines require `hyperdisk-balanced` but e2e NodeClasses use `pd-balanced` by default, so every `c4a` attempt fails with a GCP 400 before falling through to `t2a`. Removing `c4a` makes the arm64 specs reliable without narrowing what Karpenter itself supports.

Related proposal (if applicable):
`proposals/0002-release-channel-awareness.md`

Which issue(s) this PR fixes:
Fixes #330
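The channel resolution step under "How it works" can be sketched roughly as follows. This is an illustration, not the PR's `ResolveVersionForChannel`: the `channelConfig` type, the lower-case function names, and the sample version strings other than `1.34.6-gke.1068000` are assumptions made for the example. It follows the helper names in the commit list (find the channel entry, then pick the highest version for the cluster's minor) and compares components numerically so that patch `10` sorts above patch `6`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// channelConfig is a hypothetical stand-in for one release-channel entry
// in the GKE server config.
type channelConfig struct {
	Channel       string
	ValidVersions []string // e.g. "1.34.6-gke.1068000"
}

// gkeVersion holds the numeric components of a GKE version string.
type gkeVersion struct{ Major, Minor, Patch, Build int }

// parseGKEVersion parses "MAJOR.MINOR.PATCH-gke.BUILD"; ok is false otherwise.
func parseGKEVersion(s string) (gkeVersion, bool) {
	core, build, found := strings.Cut(s, "-gke.")
	if !found {
		return gkeVersion{}, false
	}
	parts := strings.Split(core, ".")
	if len(parts) != 3 {
		return gkeVersion{}, false
	}
	nums := make([]int, 0, 4)
	for _, p := range append(parts, build) {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return gkeVersion{}, false
		}
		nums = append(nums, n)
	}
	return gkeVersion{nums[0], nums[1], nums[2], nums[3]}, true
}

// resolveVersionForChannel returns the highest version in the named channel
// that matches the given major.minor, or an error if the channel is unknown
// or ships nothing for that minor.
func resolveVersionForChannel(channels []channelConfig, channel string, major, minor int) (string, error) {
	for _, c := range channels {
		if !strings.EqualFold(c.Channel, channel) {
			continue
		}
		var best gkeVersion
		bestRaw := ""
		for _, raw := range c.ValidVersions {
			v, ok := parseGKEVersion(raw)
			if !ok || v.Major != major || v.Minor != minor {
				continue
			}
			if bestRaw == "" || v.Patch > best.Patch ||
				(v.Patch == best.Patch && v.Build > best.Build) {
				best, bestRaw = v, raw
			}
		}
		if bestRaw == "" {
			return "", fmt.Errorf("channel %q has no version for %d.%d", channel, major, minor)
		}
		return bestRaw, nil
	}
	return "", fmt.Errorf("unknown channel %q", channel)
}

func main() {
	// Sample data; only the first version string appears in this PR.
	channels := []channelConfig{{
		Channel:       "STABLE",
		ValidVersions: []string{"1.34.6-gke.1068000", "1.34.10-gke.1000000", "1.35.1-gke.2000000"},
	}}
	fmt.Println(resolveVersionForChannel(channels, "stable", 1, 34))
}
```

Both error returns correspond to the user-configuration failures that the commit list says are wrapped as `imageResolutionError`.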
Docs and examples
- `docs/` and `examples/` updated (or this PR does not affect user-facing behaviour, configuration, or APIs)
- `MIGRATION.md` updated (or this PR does not require any migration steps)

Special notes for your reviewer:
- `alias` CEL validation rules from the previous release are preserved unchanged for backward compatibility.
- `GetServerConfig` results are cached for 30 minutes to avoid hammering the GKE Container API on every reconcile.
- `test/suites/channel-image-selection/` validates channel-based, version-pinned, and `version: latest` selection end-to-end against a live GKE cluster.
- `//nolint:staticcheck` on `term.Alias` accesses in `image.go`: the `Alias` field carries a `// Deprecated:` doc comment so that IDEs and `go vet` surface the deprecation to users. That same annotation causes `staticcheck` (SA1019) to flag our own backward-compat handler; the suppression is intentional, since we own both the deprecated field and the code that must keep reading it.
- GC grace period raised from 30 s to 3 min: with the previous 30 s grace, a VM could pass the grace window before the lifecycle controller wrote `providerID` back to the NodeClaim, causing the GC to delete it as an orphan. This was observed in e2e: the `COS / amd64 / spot` provisioning spec failed because the GC deleted the VM at `16:44:15` (35 s after creation) while the lifecycle controller completed at `16:44:15.842` (117 ms later). 3 min covers the full provisioning cycle at the cost of one extra GC cycle before a genuine orphan is collected.
- `c4a` dropped from arm64 e2e specs: `c4a` machines are incompatible with `pd-balanced` disk (they require `hyperdisk-balanced`). The provisioning tests now use only `t2a-standard-{2,4}` for arm64, which is what actually works in `asia-southeast1` with standard NodeClass config. This is a test-only change; Karpenter itself continues to surface `c4a` offerings.

Greptile Summary
This PR implements GKE release channel awareness for image selection, adding structured `family`/`channel`/`version` fields to `ImageSelectorTerm` as an alternative to the now-deprecated `alias` string. It also includes a GC grace-period fix (30 s → 3 min) to prevent premature orphan deletion of legitimately-launching VMs, and drops `c4a` from arm64 e2e specs to fix flaky test failures.
- `resolveFamilyTerm` routes through a new `resolveVersion` helper that calls `GetServerConfig` (30-min cache) when `channel:` is set and `GetCluster` for `channel: cluster`, then delegates to per-family `ResolveImages` implementations. A new `resolveExactBuildCOSImage` path uses an exact GKE-build filter and surfaces misses as `imageResolutionError`.
- `isUsableUbuntu2204Image` identifies amd64 images by excluding `-arm64-`, and `deriveArm64Image` inserts `-arm64-` before the `-v{date}` suffix for that series.
- `gcGracePeriod` raised to 3 minutes to cover the full VM provisioning cycle before the lifecycle controller writes `providerID` back to the NodeClaim.

Confidence Score: 5/5
Safe to merge; the new channel-based image resolution path is well-structured, imageResolutionError is correctly propagated in all new code paths, and the GC grace-period change is conservative and well-justified.
The core logic — channel resolution, version lookup, image filtering, and error classification — all look correct. Unit tests cover the key resolution scenarios, edge cases (UNSPECIFIED cluster channel, missing builds, semver ordering), and the Ubuntu 2204 naming conventions. The two findings are design-level observations rather than present defects.
pkg/providers/imagefamily/ubuntu.go — the ubuntuVersionRe replacement scope; pkg/apis/v1alpha1/gcenodeclass.go — mixed-family ImageSelectorTerms behaviour in ImageFamily().
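The error classification the review refers to can be pictured with a small sketch. It is an assumption-laden illustration rather than the repository's code: the type layout, `newImageResolutionError`, `withContext`, `classify`, and `errTransient` are hypothetical names. Only the idea comes from the PR: a typed error lets the status controller set `ImagesReady=False` for user-configuration failures, while any other error goes through the normal retry backoff.

```go
package main

import (
	"errors"
	"fmt"
)

// imageResolutionError marks failures caused by user configuration
// (unknown channel, pinned version that does not exist, and so on).
type imageResolutionError struct{ msg string }

func (e *imageResolutionError) Error() string { return e.msg }

// newImageResolutionError is a hypothetical constructor for the typed error.
func newImageResolutionError(format string, args ...any) error {
	return &imageResolutionError{msg: fmt.Sprintf(format, args...)}
}

// isImageResolutionError reports whether err, or anything it wraps,
// is an imageResolutionError.
func isImageResolutionError(err error) bool {
	var target *imageResolutionError
	return errors.As(err, &target)
}

// withContext wraps err with %w so the typed error stays discoverable.
func withContext(err error, what string) error {
	return fmt.Errorf("%s: %w", what, err)
}

// errTransient stands in for a non-configuration failure such as an API timeout.
var errTransient = errors.New("compute API timeout")

// classify mimics the decision a status controller might make.
func classify(err error) string {
	switch {
	case err == nil:
		return "ImagesReady=True"
	case isImageResolutionError(err):
		return "ImagesReady=False"
	default:
		return "requeue with backoff"
	}
}

func main() {
	cfgErr := withContext(newImageResolutionError("unknown channel %q", "beta"), "resolving channel")
	fmt.Println(classify(cfgErr))       // ImagesReady=False
	fmt.Println(classify(errTransient)) // requeue with backoff
}
```

Returning a plain `fmt.Errorf` without wrapping the typed error would fall into the default branch, which matches the silent-backoff bug described in the `resolveExactBuildCOSImage` commit.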
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant NC as NodeClass Reconciler
    participant IP as ImageProvider
    participant GKE as GKE Provider
    participant GCP as GCP Compute API
    NC->>IP: List(nodeClass)
    IP->>IP: hash(ImageSelectorTerms) → cache key
    alt cache hit
        IP-->>NC: cached Images
    else cache miss
        loop each ImageSelectorTerm
            alt term.Alias set
                IP->>IP: resolveAliasTerm(term)
            else term.ID set
                IP->>IP: resolveIDTerm(term)
            else term.Family set
                IP->>IP: resolveFamilyTerm(term)
                alt term.Version set
                    IP->>IP: "version = term.Version"
                else "term.Channel == cluster"
                    IP->>GKE: GetClusterConfig() [30m cache]
                    GKE-->>IP: cluster.ReleaseChannel.Channel
                    IP->>GKE: GetServerConfig() [30m cache]
                    GKE-->>IP: ServerConfig.Channels
                    IP->>IP: ResolveVersionForChannel
                    IP-->>IP: gkeVersion
                else term.Channel set
                    IP->>GKE: GetServerConfig() [30m cache]
                    GKE-->>IP: ServerConfig.Channels
                    IP->>IP: ResolveVersionForChannel
                    IP-->>IP: gkeVersion
                end
                IP->>IP: provider.ResolveImages(ctx, version)
                alt GKE version format
                    IP->>GCP: "Images.List(gke-k8sKey-gkebuild-*)"
                    GCP-->>IP: matching COS images
                    IP->>IP: pick best, build amd64/arm64/gpu URLs
                else latest or milestone pin
                    IP->>GCP: "Images.List(gke-k8sKey-*)"
                    GCP-->>IP: candidates
                    IP->>IP: pick latest, derive variants
                end
                IP->>GCP: Images.Get(each candidate)
                GCP-->>IP: exists or 404
                IP->>IP: imageResolutionError if all absent
            end
        end
        IP->>IP: cache.Set(key, result, 5m)
        IP-->>NC: Images
    end

Reviews (7): Last reviewed commit: "test(provisioning): drop c4a from arm64 ..."
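The branch order in the sequence diagram can be condensed into a short dispatch sketch. The `term` struct and `resolutionPath` function are hypothetical simplifications of `ImageSelectorTerm` and the provider's routing, the alias value is a placeholder, and the diagram does not show what happens to a family-only term, so the final `unresolvable` label is a placeholder rather than documented behaviour.

```go
package main

import "fmt"

// term is a hypothetical, trimmed-down ImageSelectorTerm for this sketch.
type term struct {
	Alias, ID, Family, Channel, Version string
}

// resolutionPath names the branch the diagram takes for a term:
// Alias first, then ID, then Family; within Family, an explicit Version
// wins over "cluster" channel tracking, which wins over a named channel.
func resolutionPath(t term) string {
	switch {
	case t.Alias != "":
		return "alias"
	case t.ID != "":
		return "id"
	case t.Family != "":
		switch {
		case t.Version != "":
			return "family/version"
		case t.Channel == "cluster":
			return "family/cluster-channel"
		case t.Channel != "":
			return "family/channel"
		}
	}
	// Placeholder: not covered by the diagram.
	return "unresolvable"
}

func main() {
	terms := []term{
		{Alias: "placeholder-alias"},
		{Family: "ContainerOptimizedOS", Channel: "stable"},
		{Family: "ContainerOptimizedOS", Channel: "cluster"},
		{Family: "Ubuntu2404", Version: "latest"},
	}
	for _, t := range terms {
		fmt.Println(resolutionPath(t))
	}
}
```

Only the two channel branches need the GKE server config; the version branch goes straight to the per-family `ResolveImages` call.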