
feat(imagefamily): implement GKE release channel awareness#342

Merged
thameezb merged 52 commits into
cloudpilot-ai:mainfrom
dm3ch:feat/release-channel-impl
May 27, 2026

Conversation

@dm3ch
Collaborator

@dm3ch dm3ch commented May 14, 2026

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Implements proposals/0002-release-channel-awareness: adds structured family/channel/version fields to ImageSelectorTerm, allowing image selection by GKE release channel or explicit version rather than only by the (now-deprecated) alias string.

New ImageSelectorTerm fields:

  • family — image OS family (ContainerOptimizedOS, Ubuntu2404, Ubuntu2204)
  • channel — GKE release channel (rapid, regular, stable, extended, or cluster to follow the cluster's enrolled channel)
  • version — explicit version string or latest

How it works:

  • When channel: is set, the provider calls GetServerConfig (Container API, 30-min cache) to resolve the highest K8s minor version shipped for that channel, then derives the corresponding GCE image via an exact GKE build filter (gke-{k8sKey}-gke{build}-*).
  • channel: cluster reads the cluster's own enrolled channel from GetCluster.
  • When version: is set, the milestone (125.19216.104.126) or date (v20260416) is used directly; invalid formats are rejected at admission.
  • A build miss (no image for the channel's GKE version) surfaces as ImagesReady=False with reason ImageResolutionFailed rather than silently falling back.
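A minimal sketch of how the exact build filter described above could be derived from a GKE version string. The function name `exactBuildFilter` and the rule that `k8sKey` is the major, minor, and patch digits concatenated are assumptions made for illustration; only the `gke-{k8sKey}-gke{build}-*` shape comes from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// gkeVersionRe matches GKE version strings such as "1.34.6-gke.1068000".
var gkeVersionRe = regexp.MustCompile(`^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$`)

// exactBuildFilter turns a GKE version into an image-name glob of the form
// gke-{k8sKey}-gke{build}-*.
// Assumption: k8sKey is the major, minor and patch digits concatenated
// (1.34.6 becomes 1346). The real derivation may differ.
func exactBuildFilter(version string) (string, bool) {
	m := gkeVersionRe.FindStringSubmatch(version)
	if m == nil {
		// Not a GKE version; a caller would treat it as a milestone or date pin.
		return "", false
	}
	return fmt.Sprintf("gke-%s%s%s-gke%s-*", m[1], m[2], m[3], m[4]), true
}

func main() {
	for _, v := range []string{"1.34.6-gke.1068000", "125.19216.104.126", "latest"} {
		filter, ok := exactBuildFilter(v)
		fmt.Printf("%s -> filter=%q ok=%t\n", v, filter, ok)
	}
}
```

Returning a boolean rather than an error keeps the "is this a GKE version at all" question separate from the later "does an image exist for this build" question, which this PR reports as ImageResolutionFailed.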

Ubuntu 2204 support: isUsableUbuntu2204Image handles the 2204 naming convention (no -amd64- suffix on amd64 images); arm64 derivation inserts -arm64- before the -v{date} version suffix.
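The 2204 naming rule can be sketched as follows. The helper names and the sample image name `ubuntu-gke-2204-1-34-v20260416` are hypothetical; only the two rules (amd64 images lack an `-amd64-` token, and arm64 names carry `-arm64-` before the `-v{date}` suffix) come from the description above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dateSuffixRe matches the trailing -v{date} version suffix, e.g. "-v20260416".
var dateSuffixRe = regexp.MustCompile(`-v\d{8}$`)

// isUsable2204Amd64 reports whether a 2204 image name refers to an amd64 image.
// amd64 names carry no architecture token in this series, so the check is
// simply the absence of "-arm64-".
func isUsable2204Amd64(name string) bool {
	return !strings.Contains(name, "-arm64-")
}

// deriveArm64Name2204 inserts "-arm64" immediately before the -v{date} suffix.
// The name is returned unchanged when no date suffix is present.
func deriveArm64Name2204(amd64Name string) string {
	loc := dateSuffixRe.FindStringIndex(amd64Name)
	if loc == nil {
		return amd64Name
	}
	return amd64Name[:loc[0]] + "-arm64" + amd64Name[loc[0]:]
}

func main() {
	amd64 := "ubuntu-gke-2204-1-34-v20260416" // hypothetical image name
	arm64 := deriveArm64Name2204(amd64)
	fmt.Println(amd64, isUsable2204Amd64(amd64))
	fmt.Println(arm64, isUsable2204Amd64(arm64))
}
```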

alias backward compatibility: The existing alias field continues to work unchanged. It is now soft-deprecated — new configurations should use family/channel/version.

Docs: New docs/image-selection.md (selection mode reference); docs/image-management.md rewritten to lead with the new paradigm (alias demoted to a Legacy section).

GC race fix (included in this PR): Increased the orphan garbage-collection grace period from 30 s to 3 min. GCE VMs appear in instances.list within seconds of the create call, but the lifecycle controller only writes providerID back to the NodeClaim after the zone operation completes (30–90 s). The previous 30 s grace was short enough that the GC could delete a legitimately-launching VM before its NodeClaim was updated, leaving the NodeClaim permanently stuck with Launched=True but no backing instance.
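The race can be illustrated with a small sketch of a grace-period check. The function `eligibleForGC` and the strict `>` comparison are assumptions for illustration; the 30 s and 3 min values and the roughly 35 s provisioning time are the figures reported in this PR.

```go
package main

import (
	"fmt"
	"time"
)

const (
	oldGrace    = 30 * time.Second // grace period before this PR
	newGrace    = 3 * time.Minute  // grace period after this PR
	observedAge = 35 * time.Second // VM age when the GC ran in the failing e2e run
)

// eligibleForGC reports whether an instance that has no matching NodeClaim
// providerID is old enough to be treated as an orphan.
// Assumption: the real controller may compare differently; this only
// illustrates the timing window.
func eligibleForGC(age, grace time.Duration) bool {
	return age > grace
}

func main() {
	fmt.Println("30s grace, 35s old VM deleted:", eligibleForGC(observedAge, oldGrace))
	fmt.Println("3m grace, 35s old VM deleted: ", eligibleForGC(observedAge, newGrace))
}
```

With the old value, any VM whose zone operation takes longer than the grace period is exposed to deletion; with the new value, the 30–90 s provisioning range sits well inside the window.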

E2E fix (included in this PR): Dropped c4a from all four arm64 provisioning specs. c4a machines require hyperdisk-balanced but e2e NodeClasses use pd-balanced by default, so every c4a attempt fails with a GCP 400 before falling through to t2a. Removing c4a makes the arm64 specs reliable without narrowing what Karpenter itself supports.

Related proposal (if applicable):

proposals/0002-release-channel-awareness.md

Which issue(s) this PR fixes:

Fixes #330

Docs and examples

  • docs/ and examples/ updated (or this PR does not affect user-facing behaviour, configuration, or APIs)
  • MIGRATION.md updated (or this PR does not require any migration steps)

Special notes for your reviewer:

  • alias CEL validation rules from the previous release are preserved unchanged for backward compatibility.
  • GetServerConfig results are cached for 30 minutes to avoid hammering the GKE Container API on every reconcile.
  • E2E suite test/suites/channel-image-selection/ validates channel-based, version-pinned, and version: latest selection end-to-end against a live GKE cluster.
  • //nolint:staticcheck on term.Alias accesses in image.go: the Alias field carries a // Deprecated: doc comment so that IDEs and go vet surface the deprecation to users. That same annotation causes staticcheck (SA1019) to flag our own backward-compat handler; the suppression is intentional — we own both the deprecated field and the code that must keep reading it.
  • GC grace period (30 s → 3 min): The previous 30 s grace was too short — a VM taking ~35 s to provision would pass the grace window before the lifecycle controller wrote the providerID back to the NodeClaim, causing the GC to delete it as an orphan. This was observed in e2e: the COS / amd64 / spot provisioning spec failed because the GC deleted the VM at 16:44:15 (35 s after creation) while the lifecycle controller completed at 16:44:15.842 (117 ms later). 3 min covers the full provisioning cycle at the cost of one extra GC cycle before a genuine orphan is collected.
  • c4a dropped from arm64 e2e specs: c4a machines are incompatible with pd-balanced disk (they require hyperdisk-balanced). The provisioning tests now use only t2a-standard-{2,4} for arm64 — which is what actually works in asia-southeast1 with standard NodeClass config. This is a test-only change; Karpenter itself continues to surface c4a offerings.
  • This PR was prepared with AI assistance (Claude Code). All changes have been reviewed and verified by the author.
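As an illustration of the 30-minute caching mentioned in the notes above, here is a minimal single-value TTL cache. The `ttlCache` type, its injectable clock, and the `demo` helper are all hypothetical; the provider's actual cache implementation is not shown in this PR description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ttlCache caches a single value and refetches it once the TTL has elapsed.
type ttlCache[T any] struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time  // injectable clock, so the demo needs no sleeping
	fetch   func() (T, error) // hypothetical stand-in for the GetServerConfig call
	value   T
	expires time.Time
}

// Get returns the cached value while it is fresh and refetches otherwise.
// A failed fetch is not cached, so the next call retries.
func (c *ttlCache[T]) Get() (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.now().Before(c.expires) {
		return c.value, nil
	}
	v, err := c.fetch()
	if err != nil {
		var zero T
		return zero, err
	}
	c.value, c.expires = v, c.now().Add(c.ttl)
	return v, nil
}

// demo counts upstream fetches for two calls inside the TTL and a third after it.
func demo() (withinTTL, afterTTL int) {
	calls := 0
	clock := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	c := &ttlCache[string]{
		ttl:   30 * time.Minute,
		now:   func() time.Time { return clock },
		fetch: func() (string, error) { calls++; return "server-config", nil },
	}
	_, _ = c.Get()
	_, _ = c.Get() // served from cache
	withinTTL = calls
	clock = clock.Add(31 * time.Minute)
	_, _ = c.Get() // TTL elapsed, so this refetches
	return withinTTL, calls
}

func main() {
	within, after := demo()
	fmt.Printf("fetches within TTL: %d, total after TTL expiry: %d\n", within, after)
}
```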

Greptile Summary

This PR implements GKE release channel awareness for image selection, adding structured family/channel/version fields to ImageSelectorTerm as an alternative to the now-deprecated alias string. It also includes a GC grace-period fix (30 s → 3 min) to prevent premature orphan deletion of legitimately-launching VMs, and drops c4a from arm64 e2e specs to fix flaky test failures.

  • Image selection redesign: resolveFamilyTerm routes through a new resolveVersion helper that calls GetServerConfig (30-min cache) when channel: is set and GetCluster for channel: cluster, then delegates to per-family ResolveImages implementations. A new resolveExactBuildCOSImage path uses an exact GKE-build filter and surfaces misses as imageResolutionError.
  • Ubuntu 2204 support: isUsableUbuntu2204Image identifies amd64 images by excluding -arm64-, and deriveArm64Image inserts -arm64- before the -v{date} suffix for that series.
  • GC grace period increase: gcGracePeriod raised to 3 minutes to cover the full VM provisioning cycle before the lifecycle controller writes providerID back to the NodeClaim.

Confidence Score: 5/5

Safe to merge; the new channel-based image resolution path is well-structured, imageResolutionError is correctly propagated in all new code paths, and the GC grace-period change is conservative and well-justified.

The core logic — channel resolution, version lookup, image filtering, and error classification — all look correct. Unit tests cover the key resolution scenarios, edge cases (UNSPECIFIED cluster channel, missing builds, semver ordering), and the Ubuntu 2204 naming conventions. The two findings are design-level observations rather than present defects.

pkg/providers/imagefamily/ubuntu.go — the ubuntuVersionRe replacement scope; pkg/apis/v1alpha1/gcenodeclass.go — mixed-family ImageSelectorTerms behaviour in ImageFamily().

Important Files Changed

| Filename | Overview |
| --- | --- |
| pkg/providers/imagefamily/image.go | Core dispatch layer; adds resolveFamilyTerm and resolveVersion with correct imageResolutionError wrapping for channel/cluster lookup failures and empty-candidate GCP results. |
| pkg/providers/imagefamily/containeroptimizedos.go | Adds ParseGKEVersion and resolveExactBuildCOSImage for channel-based resolution; build-miss is correctly classified as imageResolutionError; arm64 derivation regex requires exactly 4 COS version groups. |
| pkg/providers/imagefamily/ubuntu.go | Adds Ubuntu 2204 support via release-parameterized struct; isUsableUbuntu2204Image correctly identifies amd64 images by excluding -arm64-; deriveArm64Image handles the 2204 naming convention correctly. |
| pkg/providers/gke/version.go | New file implementing ResolveVersionForChannel; correctly uses semver comparison (not lexicographic) and handles defaultVersion/validVersions fallback; well-tested. |
| pkg/providers/gke/gke.go | Adds GetClusterConfig and GetServerConfig with separate 30-min caches; both paths are straightforward GKE API wrappers. |
| pkg/apis/v1alpha1/gcenodeclass.go | Adds Family/Channel/Version fields to ImageSelectorTerm with comprehensive CEL validation; ImageFamily() extended to handle new family terms mapping Ubuntu2404/2204 to ImageFamilyUbuntu for downstream compatibility. |
| pkg/controllers/nodeclaim/garbagecollection/controller.go | GC grace period increased from 30 s to 3 min; well-documented rationale; isOrphaned logic is unchanged and correct. |
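To show why the version.go entry stresses numeric rather than lexicographic comparison, here is a hand-rolled sketch. It is not the PR's ResolveVersionForChannel and does not use a semver library; the helpers `parse`, `less`, and `highest` are hypothetical and compare the four numeric components of a GKE version directly.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

var gkeVersionRe = regexp.MustCompile(`^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$`)

// parse splits "1.34.10-gke.1000" into its four numeric components.
func parse(v string) ([4]int, bool) {
	var out [4]int
	m := gkeVersionRe.FindStringSubmatch(v)
	if m == nil {
		return out, false
	}
	for i := range out {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// less compares two parsed versions component by component.
func less(a, b [4]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// highest returns the numerically highest parseable version, or "" if none parse.
func highest(versions []string) string {
	best, bestParsed, found := "", [4]int{}, false
	for _, v := range versions {
		p, ok := parse(v)
		if !ok {
			continue // skip strings that are not GKE versions
		}
		if !found || less(bestParsed, p) {
			best, bestParsed, found = v, p, true
		}
	}
	return best
}

func main() {
	versions := []string{"1.34.9-gke.2000", "1.34.10-gke.1000"}

	// Plain string ordering ranks 1.34.9 above 1.34.10 because '9' > '1'.
	byString := append([]string(nil), versions...)
	sort.Strings(byString)
	fmt.Println("lexicographic max:", byString[len(byString)-1])

	fmt.Println("numeric max:      ", highest(versions))
}
```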

Sequence Diagram

sequenceDiagram
    participant NC as NodeClass Reconciler
    participant IP as ImageProvider
    participant GKE as GKE Provider
    participant GCP as GCP Compute API

    NC->>IP: List(nodeClass)
    IP->>IP: hash(ImageSelectorTerms) → cache key
    alt cache hit
        IP-->>NC: cached Images
    else cache miss
        loop each ImageSelectorTerm
            alt term.Alias set
                IP->>IP: resolveAliasTerm(term)
            else term.ID set
                IP->>IP: resolveIDTerm(term)
            else term.Family set
                IP->>IP: resolveFamilyTerm(term)
                alt term.Version set
                    IP->>IP: "version = term.Version"
                else "term.Channel == cluster"
                    IP->>GKE: GetClusterConfig() [30m cache]
                    GKE-->>IP: cluster.ReleaseChannel.Channel
                    IP->>GKE: GetServerConfig() [30m cache]
                    GKE-->>IP: ServerConfig.Channels
                    IP->>IP: ResolveVersionForChannel
                    IP-->>IP: gkeVersion
                else term.Channel set
                    IP->>GKE: GetServerConfig() [30m cache]
                    GKE-->>IP: ServerConfig.Channels
                    IP->>IP: ResolveVersionForChannel
                    IP-->>IP: gkeVersion
                end
                IP->>IP: provider.ResolveImages(ctx, version)
                alt GKE version format
                    IP->>GCP: "Images.List(gke-k8sKey-gkebuild-*)"
                    GCP-->>IP: matching COS images
                    IP->>IP: pick best, build amd64/arm64/gpu URLs
                else latest or milestone pin
                    IP->>GCP: "Images.List(gke-k8sKey-*)"
                    GCP-->>IP: candidates
                    IP->>IP: pick latest, derive variants
                end
                IP->>GCP: Images.Get(each candidate)
                GCP-->>IP: exists or 404
                IP->>IP: imageResolutionError if all absent
            end
        end
        IP->>IP: cache.Set(key, result, 5m)
        IP-->>NC: Images
    end
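The version-resolution branch of the diagram above can be sketched in Go as follows. The `term` struct, the `gkeClient` interface, and `fakeGKE` are hypothetical stand-ins; the lowercase normalisation of the cluster's channel name is also an assumption. Only the branching order (explicit version, then `channel: cluster`, then a named channel) follows the diagram.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// term is a reduced, hypothetical view of an ImageSelectorTerm.
type term struct{ Channel, Version string }

// gkeClient is a hypothetical stand-in for the GKE provider in the diagram.
type gkeClient interface {
	ClusterChannel() (string, error)                  // cluster's enrolled channel
	VersionForChannel(channel string) (string, error) // server config lookup
}

// resolveVersion follows the diagram: explicit version first, then
// channel: cluster, then a named channel.
func resolveVersion(t term, gke gkeClient) (string, error) {
	if t.Version != "" {
		return t.Version, nil
	}
	channel := t.Channel
	if channel == "cluster" {
		c, err := gke.ClusterChannel()
		if err != nil {
			return "", err
		}
		if c == "" || c == "UNSPECIFIED" {
			return "", errors.New("cluster is not enrolled in a release channel")
		}
		channel = strings.ToLower(c) // assumption: API enum is upper-case
	}
	if channel == "" {
		return "", errors.New("term sets neither version nor channel")
	}
	return gke.VersionForChannel(channel)
}

// fakeGKE is an in-memory gkeClient for the demo.
type fakeGKE struct {
	clusterChannel string
	versions       map[string]string
}

func (f fakeGKE) ClusterChannel() (string, error) { return f.clusterChannel, nil }

func (f fakeGKE) VersionForChannel(ch string) (string, error) {
	v, ok := f.versions[ch]
	if !ok {
		return "", fmt.Errorf("unknown channel %q", ch)
	}
	return v, nil
}

func main() {
	gke := fakeGKE{
		clusterChannel: "REGULAR",
		versions:       map[string]string{"regular": "1.34.6-gke.1068000"},
	}
	for _, t := range []term{{Version: "latest"}, {Channel: "cluster"}, {Channel: "stable"}} {
		v, err := resolveVersion(t, gke)
		fmt.Printf("%+v -> version=%q err=%v\n", t, v, err)
	}
}
```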

Reviews (7): Last reviewed commit: "test(provisioning): drop c4a from arm64 ..." | Re-trigger Greptile

@gitautomator
Contributor

gitautomator Bot commented May 14, 2026

Thanks for your contribution! The maintainers will review it as soon as they can.

@dm3ch dm3ch added the kind feature and kind api-change labels May 14, 2026
@gitautomator
Contributor

gitautomator Bot commented May 14, 2026

The release note is either empty or incomplete, please consider: Adds GKE release channel awareness to GCENodeClass image selection, allowing users to specify image family, GKE release channel (e.g., rapid, regular, stable), or Kubernetes version (including 'latest') for automatic resolution of appropriate COS or Ubuntu images.

@gitautomator gitautomator Bot requested review from jwcesign and thameezb May 14, 2026 02:41
@gitautomator gitautomator Bot added the enhancement label May 14, 2026
Comment thread charts/karpenter/templates/deployment.yaml
Comment thread docs/reference/gcenodeclass.md
@dm3ch dm3ch force-pushed the feat/release-channel-impl branch from bcc0c15 to f84badd Compare May 19, 2026 14:17
dm3ch added a commit to dm3ch/karpenter-provider-gcp that referenced this pull request May 19, 2026
PR cloudpilot-ai#327 introduced image-management.md with alias as the primary approach.
PR cloudpilot-ai#342 adds the structured family/channel/version fields which supersede it.

Rewrite the doc to lead with the new structured selection modes (channel
tracking, version pin), keep the unique operational content (gcloud
version-discovery commands, disruption budget patterns, cluster upgrade
notes), and demote the alias format to a Legacy section.

Also add a forward pointer in MIGRATION.md's alias section pointing users
to the new structured fields.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
@dm3ch dm3ch force-pushed the feat/release-channel-impl branch from f84badd to 1fb816d Compare May 19, 2026 14:22
@dm3ch dm3ch changed the title feat(imagefamily): implement GKE release channel awareness (proposal/0002) feat(imagefamily): implement GKE release channel awareness May 19, 2026
@dm3ch dm3ch marked this pull request as ready for review May 19, 2026 14:44
Comment thread pkg/providers/imagefamily/image.go
Comment thread pkg/providers/imagefamily/ubuntu.go Outdated
dm3ch added a commit to dm3ch/karpenter-provider-gcp that referenced this pull request May 19, 2026
@dm3ch dm3ch force-pushed the feat/release-channel-impl branch from 02ab052 to 3e599aa Compare May 19, 2026 15:13
Comment thread pkg/providers/imagefamily/containeroptimizedos.go
Comment thread pkg/providers/imagefamily/image.go Outdated
@dm3ch
Collaborator Author

dm3ch commented May 19, 2026

E2E Test Results

Cluster: `asia-southeast1-b` | Commit: `23e1edda` | Total duration: ~12m (provisioning), ~5m (drift/gpu/channel)

| Suite | Spec | Result | Duration | Notes |
| --- | --- | --- | --- | --- |
| Provisioning | COS / amd64 / on-demand | ✅ pass | ~3m | |
| Provisioning | COS / amd64 / spot | ✅ pass | 2m 18s | Previously failing due to GC race; fixed by grace period 30s→3m |
| Provisioning | COS / arm64 / on-demand | ✅ pass | ~3m | |
| Provisioning | COS / arm64 / spot | ✅ pass | ~3m | |
| Provisioning | Ubuntu / amd64 / on-demand | ✅ pass | ~3m | |
| Provisioning | Ubuntu / amd64 / spot | ✅ pass | ~3m | |
| Provisioning | Ubuntu / arm64 / on-demand | ✅ pass | 3m 31s | Previously failing due to c4a + pd-balanced incompatibility; fixed by dropping c4a from arm64 specs |
| Provisioning | Ubuntu / arm64 / spot | ✅ pass | ~3m | |
| Provisioning | Pinned COS version | ✅ pass | ~3m | |
| Provisioning | Pinned Ubuntu version | ✅ pass | ~3m | |
| Provisioning | NodeClass drift | ✅ pass | ~3m | |
| Drift | should replace a node when GCENodeClass metadata changes | ✅ pass | 5m 17s | |
| GPU | should provision a GPU node with correct taint, labels, and allocatable GPU resource | ✅ pass | 3m 8s | |
| ChannelImageSelection | COS / channel: stable / amd64 / on-demand | ✅ pass | ~3m | Channel version verified against server config |
| ChannelImageSelection | Ubuntu2404 / version: latest / amd64 / on-demand | ✅ pass | ~3m | Ubuntu version verified in resolved image |

Overall: ✅ all pass

Collaborator

@thameezb thameezb left a comment

Great work

@thameezb thameezb enabled auto-merge (squash) May 27, 2026 12:21
thameezb pushed a commit to dm3ch/karpenter-provider-gcp that referenced this pull request May 27, 2026
@thameezb thameezb force-pushed the feat/release-channel-impl branch from 23e1edd to becd88f Compare May 27, 2026 12:22
dm3ch added 5 commits May 27, 2026 14:27
…ource

Instead of requiring pre-created karpenter-specific pools by name, discover
the first RUNNING or RUNNING_WITH_ERROR pool alphabetically (preferring
default-pool). Create karpenter-fallback immediately on Sync failure rather
than waiting N retry cycles — the pool is zero-node, so no compute cost.

Add DefaultNodePoolTemplateName option to pin a specific pool when the cluster
has multiple candidates.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
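A minimal sketch of the pool-discovery rule described in the commit message above. The `nodePool` type and `pickTemplatePool` function are hypothetical, and the behaviour of the pinned name when the pinned pool is not usable is an assumption. A `false` result corresponds to the case where the commit creates karpenter-fallback.

```go
package main

import (
	"fmt"
	"sort"
)

// nodePool is a reduced, hypothetical view of a GKE node pool.
type nodePool struct{ Name, Status string }

// pickTemplatePool returns the pool to use as the bootstrap source.
// Order: the pinned name if set, then default-pool, then the alphabetically
// first pool. Only RUNNING and RUNNING_WITH_ERROR pools are considered.
func pickTemplatePool(pools []nodePool, pinned string) (string, bool) {
	var usable []string
	for _, p := range pools {
		if p.Status == "RUNNING" || p.Status == "RUNNING_WITH_ERROR" {
			usable = append(usable, p.Name)
		}
	}
	sort.Strings(usable)

	if pinned != "" {
		// Assumption: a pinned pool that is not usable yields no result.
		for _, n := range usable {
			if n == pinned {
				return n, true
			}
		}
		return "", false
	}
	for _, n := range usable {
		if n == "default-pool" {
			return n, true
		}
	}
	if len(usable) > 0 {
		return usable[0], true
	}
	return "", false // caller would create the fallback pool
}

func main() {
	pools := []nodePool{
		{"zeta", "RUNNING"},
		{"alpha", "PROVISIONING"},
		{"beta", "RUNNING_WITH_ERROR"},
	}
	name, ok := pickTemplatePool(pools, "")
	fmt.Println(name, ok)
}
```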
…rovisioning

Add PatchKubeEnvForOSType to adjust gke-os-distribution and BFQ scheduler
settings when the target image family differs from the source pool's OS.

Add PatchKubeEnvForArch (kubeenv_arch.go) to update SERVER_BINARY_TAR_URL and
SERVER_BINARY_TAR_HASH when the target arch differs from the source pool's arch.
The SHA-512 hash is fetched from the public GCS sidecar file and cached.

Wire both patches in instance.go via a patchKubeEnv helper (reduces cyclomatic
complexity of setupInstanceMetadata). Add unit tests for all cross-arch/OS paths.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Allow cgpv1 images so provisioned nodes can match the cgroup version the
cluster uses. Improve Ubuntu image pagination to avoid accumulating thousands
of deprecated images in memory on large clusters.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…Helm value

Adds DEFAULT_NODEPOOL_TEMPLATE_NAME operator flag and matching Helm value
so operators can pin Karpenter to a specific existing node pool for bootstrap
metadata rather than relying on auto-discovery.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…rap pools

Design record for PR cloudpilot-ai#263. Status: Implemented.
- proposals/ added to docs-lint CI path trigger and Makefile DOCS_FILES
- lefthook.yaml added to .gitignore (local-only file)
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
dm3ch added 28 commits May 27, 2026 14:27
…tions

- ResolveVersionForChannel: extract findChannelConfig and highestVersionForMinor
  helpers to reduce cyclomatic complexity below threshold
- image.go: add //nolint:staticcheck on term.Alias accesses (deprecated field
  kept for backward compat); add ctx to resolveImage for cancellation support
- CRDs: bump controller-gen annotation to v0.21.0 (local toolchain)

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
New test suite covers:
- COS family with channel: stable (resolves via GKE server config)
- COS family with version: latest (equivalent to existing alias path)
- Ubuntu2404 family with version: latest

New helpers in test/pkg/environment/helpers.go:
- CreateNodeClassWithFamilyChannel: imageSelectorTerm with family+channel
- CreateNodeClassWithFamilyVersion: imageSelectorTerm with family+version

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Document two divergences between proposal and initial implementation:

1. Success-path Status.Conditions messages are not yet emitted (only error
   paths produce condition messages); normal resolution path logging is
   deferred to a follow-up.

2. Ubuntu version-pin catalog miss currently surfaces as empty Status.Images
   (ImagesReady=False) rather than a specific error message; explicit
   catalog lookup for pinned dates is deferred to a follow-up.

Both cases set ImagesReady=False correctly; only the message granularity
differs from the proposal spec.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Missed during the rebase of bcc0c15 onto origin/main.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Remove unused cosVersionRe, extra blank line in instance.go, and
realign variable block in utils.go per gofmt.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Picks up gpuDriverVersion field and alias CEL validation rules that
were missing from the feature branch CRDs.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…Images

After rebase, the cosVersionRe guard was removed when resolving the
conflict for commit a69458d. The test TestResolveImages_COS_PinnedVersion_InvalidFormat
(from main via PR cloudpilot-ai#327) expects invalid milestone formats to return
imageResolutionError before hitting the compute API.

Restore the check in the milestone pin path, after parseGKEVersion()
returns false. GKE version strings (e.g. 1.34.6-gke.1068000) still
take the exact-build path and are unaffected.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…n_Valid

The test was written before the Ubuntu struct gained a release field.
With release="", deriveArm64Image falls through to the 2204 path which
inserts -arm64- before the version suffix instead of replacing -amd64-,
producing -amd64-arm64-v... instead of -arm64-v....

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
PR cloudpilot-ai#327 introduced image-management.md with alias as the primary approach.
PR cloudpilot-ai#342 adds the structured family/channel/version fields which supersede it.

Rewrite the doc to lead with the new structured selection modes (channel
tracking, version pin), keep the unique operational content (gcloud
version-discovery commands, disruption budget patterns, cluster upgrade
notes), and demote the alias format to a Legacy section.

Also add a forward pointer in MIGRATION.md's alias section pointing users
to the new structured fields.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
PULL_REQUEST_TEMPLATE.md and .github/workflows/docs.yaml were inadvertently
modified during branch development. Restore to origin/main state.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
CEL string literals don't accept \. as an escape sequence; use [.]
for literal dot matches. Also split the single Ubuntu family-in-list
rule (which exceeded cost budget by 2.8x) into two per-family rules.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
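The `[.]` workaround in the commit above can be checked outside CEL, since CEL's `matches` and Go's `regexp` package both use RE2 syntax. The pattern below is a hypothetical four-group milestone pattern written for illustration, not the rule shipped in the CRD.

```go
package main

import (
	"fmt"
	"regexp"
)

// milestoneRe uses [.] for a literal dot, which needs no backslash and so
// survives being embedded in a CEL string literal unchanged.
// Assumption: this pattern is illustrative, not the CRD's actual rule.
var milestoneRe = regexp.MustCompile(`^[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+$`)

// isMilestone reports whether s looks like a four-group milestone version.
func isMilestone(s string) bool {
	return milestoneRe.MatchString(s)
}

func main() {
	for _, s := range []string{"125.19216.104.126", "125x19216x104x126", "v20260416"} {
		fmt.Printf("%s -> %t\n", s, isMilestone(s))
	}
}
```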
Required for release-channel image resolution (GetServerConfig API call).
Also adds a dedicated MIGRATION.md section for the release channel feature
with the IAM note for manual-role users.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
CEL regex cost is proportional to the maximum string length. Without a
MaxLength constraint the Kubernetes cost estimator uses a very large default,
causing the .matches() rules to exceed the per-rule and schema-wide cost
budgets. 32 chars covers all valid version strings (milestone: 19 chars,
Ubuntu date: 9 chars, 'latest': 6 chars).

Also update MIGRATION.md: IAM note is informational not "action required"
since this release ships the permission in the standard role; alias
deprecation note explicitly says no forced removal in this release.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Pure computation (ResolveVersionForChannel and helpers) separated from
the API client in gke.go. Also demote Option 2 to non-recommended in
image-management.md — channel tracking is the single recommended path.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Mirrors the gke.go → version.go production split. Pure logic tests
(ResolveVersionForChannel) are now separate from the HTTP/cache tests.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Consolidates the family→ImageFamily mapping into a single map in the
constructor, removing the duplicate switch in getImageFamilyProvider.
Adding a new family is now a one-liner in NewDefaultProvider.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
… notice

Lead with the deprecation and migration path rather than describing the
new feature. Show before/after alias → structured field examples.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Accidentally dropped in an earlier commit. No functional change.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Mark alias as deprecated in troubleshooting; update all example
NodeClass and doc YAML snippets to use the structured fields.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…ions; remove no-op TrimPrefix

filterExistingImages now returns imageResolutionError when candidates
were resolved but none exist in GCP, restoring the ImagesReady=False
diagnostic for pinned-version misses that the old alias path provided.

Also remove the redundant TrimPrefix(m, "") in deriveArm64Image (2204
path) — TrimPrefix with an empty prefix is a no-op.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
Add post-Ready image version assertions:
- COS/version:latest: resolved image must equal independently-queried latest
- Ubuntu2404/version:latest: resolved image must contain current version token
- channel tests: resolved image must be from the expected GCP project
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
resolveExactBuildCOSImage already documented "a miss should surface as
ImagesReady=False" but returned a plain fmt.Errorf, bypassing the
IsImageResolutionError check in the status controller and causing silent
exponential backoff instead of an ImagesReady=False condition.
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
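The difference between a plain error and a typed one, as described in the commit above, can be sketched like this. The struct layout and the body of `IsImageResolutionError` are assumptions; only the type and function names appear in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// imageResolutionError marks failures that should surface as
// ImagesReady=False rather than being retried with backoff.
type imageResolutionError struct{ msg string }

func (e *imageResolutionError) Error() string { return e.msg }

// IsImageResolutionError reports whether err is, or wraps, an
// imageResolutionError. Assumption: the real check may be implemented
// differently.
func IsImageResolutionError(err error) bool {
	var target *imageResolutionError
	return errors.As(err, &target)
}

// plainMiss mimics the pre-fix behaviour: an untyped error.
func plainMiss() error {
	return fmt.Errorf("no image for build")
}

// typedMiss mimics the post-fix behaviour: a typed error.
func typedMiss() error {
	return &imageResolutionError{msg: "no image for build"}
}

// wrappedMiss shows that wrapping with %w keeps the error classifiable.
func wrappedMiss() error {
	return fmt.Errorf("resolving COS image: %w", typedMiss())
}

func main() {
	fmt.Println("plain:  ", IsImageResolutionError(plainMiss()))
	fmt.Println("typed:  ", IsImageResolutionError(typedMiss()))
	fmt.Println("wrapped:", IsImageResolutionError(wrappedMiss()))
}
```

Using `errors.As` instead of a direct type assertion means intermediate layers can add context with `%w` without hiding the classification from the status controller.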
…st, verify channel build

- Drop COS/version:latest spec: covered by provisioning suite's basic COS test
- Export ParseGKEVersion so tests can parse GKE version strings
- Add GetChannelVersion helper to Environment (fetches server config, calls
  ResolveVersionForChannel) so tests can determine the expected GKE build
- Replace weak project-prefix assertion for channel tests with exact build
  verification: resolved image must embed gke-{k8sKey}-gke{build}- from the
  version the server config returns for the requested channel
- Keep Ubuntu2404/version:latest with version-token assertion (different naming
  convention and project from COS; no equivalent provisioning coverage)
Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
ResolveVersionForChannel errors (unknown channel, no version for cluster
minor) and the UNSPECIFIED cluster channel case are all user-config
failures. Wrapping them as imageResolutionError surfaces them as
ImagesReady=False instead of triggering controller-runtime exponential
backoff.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
GCE VMs appear in instances.list within seconds of the create call, but
the lifecycle controller only writes providerID back to the NodeClaim after
the zone operation completes (30–90s). The previous 30s grace was too short:
a VM taking ~35s to provision would pass the grace window before its
NodeClaim was updated, causing the GC to delete it as "orphaned" and leaving
the NodeClaim permanently stuck with Launched=True but no backing instance.

3 minutes covers the full provisioning cycle in all cases and costs at most
one extra GC cycle before a genuine orphan is collected.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
…balanced

c4a machines require hyperdisk-balanced, not pd-balanced. All e2e NodeClasses
use pd-balanced by default, so every c4a attempt fails with a GCP 400 error
before falling through to t2a. Remove c4a from all four arm64 specs (COS and
Ubuntu, on-demand and spot) and add t2a-standard-4 as a second option.

Signed-off-by: Dmitry Chepurovskiy <me@dm3ch.net>
@thameezb thameezb force-pushed the feat/release-channel-impl branch from becd88f to 02a2261 Compare May 27, 2026 12:27
@thameezb thameezb merged commit 82a1324 into cloudpilot-ai:main May 27, 2026
11 checks passed

Labels

enhancement, kind api-change, kind feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: support GKE Upgrade Release Channels in node image selection

2 participants