Skip to content

feat: add Kubernetes environment (run game servers as Pods)#193

Open
Exonical wants to merge 6 commits into
pelican-dev:mainfrom
Exonical:feat/kubernetes-environment
Open

feat: add Kubernetes environment (run game servers as Pods)#193
Exonical wants to merge 6 commits into
pelican-dev:mainfrom
Exonical:feat/kubernetes-environment

Conversation

@Exonical

@Exonical Exonical commented Jun 20, 2026

Copy link
Copy Markdown

Adds an opt-in Kubernetes backend alongside Docker, gated behind kubernetes.enabled (default false). A new environment/kubernetes package implements the ProcessEnvironment interface: Pod lifecycle/power, SPDY attach/exec console, hostport/nodeport/loadbalancer networking, hostpath/PVC storage, optional ResourceQuota/LimitRange, metrics-API stats, and egg installation via a Job + ConfigMap.

Environment selection is gated in server/manager.go and server/install.go; system info and /api/system/ips become backend-aware in system/system.go and router/router_system.go. Docker behavior is unchanged when kubernetes.enabled is false.

Ships raw RBAC manifests under kubernetes/ and a Helm chart under chart/pelican-wings/ for deploying Wings in-cluster, plus a helm-lint workflow. The Kubernetes client libraries (k8s.io/* v0.36.1) require Go >= 1.26.0, so the go directive is bumped accordingly.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Kubernetes mode for Pelican Wings, including pod lifecycle (start/stop/terminate), attach/log streaming, NodePort/LoadBalancer networking, per-server storage (PVC/HostPath), resource quotas/limits, and resource stats with metrics-server and optional kubelet fallback.
    • Added Kubernetes-based installation via Jobs.
    • Added a Pelican Wings Helm chart with configurable RBAC, Service, networking/storage options, and runtime configuration.
  • Documentation
    • Added/updated Helm chart README and install/usage notes; expanded Kubernetes RBAC documentation.
  • CI/Tests/Chores
    • Added Helm linting in CI (chart changes only) and substantially expanded Kubernetes mode test coverage.
    • Bumped Go toolchain and refreshed dependencies.

Adds an opt-in Kubernetes backend alongside Docker, gated behind
kubernetes.enabled (default false). A new environment/kubernetes package
implements the ProcessEnvironment interface: Pod lifecycle/power, SPDY
attach/exec console, hostport/nodeport/loadbalancer networking,
hostpath/PVC storage, optional ResourceQuota/LimitRange, metrics-API
stats, and egg installation via a Job + ConfigMap.

Environment selection is gated in server/manager.go and server/install.go;
system info and /api/system/ips become backend-aware in system/system.go
and router/router_system.go. Docker behavior is unchanged when
kubernetes.enabled is false.

Ships raw RBAC manifests under kubernetes/ and a Helm chart under
chart/pelican-wings/ for deploying Wings in-cluster, plus a helm-lint
workflow. The Kubernetes client libraries (k8s.io/* v0.36.1) require
Go >= 1.26.0, so the go directive is bumped accordingly.
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60696606-30d9-4805-843a-dedb9a265d50

📥 Commits

Reviewing files that changed from the base of the PR and between 58c5717 and 8e5c303.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (55)
  • .github/workflows/helm-lint.yaml
  • chart/pelican-wings/Chart.yaml
  • chart/pelican-wings/README.md
  • chart/pelican-wings/templates/NOTES.txt
  • chart/pelican-wings/templates/_helpers.tpl
  • chart/pelican-wings/templates/clusterrole-metrics.yaml
  • chart/pelican-wings/templates/clusterrolebinding-metrics.yaml
  • chart/pelican-wings/templates/configmap.yaml
  • chart/pelican-wings/templates/deployment.yaml
  • chart/pelican-wings/templates/limitrange.yaml
  • chart/pelican-wings/templates/namespace.yaml
  • chart/pelican-wings/templates/pvc.yaml
  • chart/pelican-wings/templates/resourcequota.yaml
  • chart/pelican-wings/templates/role.yaml
  • chart/pelican-wings/templates/rolebinding.yaml
  • chart/pelican-wings/templates/service.yaml
  • chart/pelican-wings/templates/serviceaccount.yaml
  • chart/pelican-wings/values.yaml
  • config/config.go
  • config/config_kubernetes.go
  • environment/kubernetes/api.go
  • environment/kubernetes/container.go
  • environment/kubernetes/container_test.go
  • environment/kubernetes/environment.go
  • environment/kubernetes/environment_test.go
  • environment/kubernetes/identity.go
  • environment/kubernetes/installer.go
  • environment/kubernetes/installer_test.go
  • environment/kubernetes/network.go
  • environment/kubernetes/network_test.go
  • environment/kubernetes/node.go
  • environment/kubernetes/power.go
  • environment/kubernetes/power_test.go
  • environment/kubernetes/quota.go
  • environment/kubernetes/quota_test.go
  • environment/kubernetes/stats.go
  • environment/kubernetes/stats_test.go
  • environment/kubernetes/storage.go
  • environment/kubernetes/storage_test.go
  • environment/kubernetes/testutil_test.go
  • go.mod
  • kubernetes/README.md
  • kubernetes/clusterrole-metrics-kubelet.yaml
  • kubernetes/clusterrole-metrics.yaml
  • kubernetes/clusterrolebinding-metrics.yaml
  • kubernetes/namespace.yaml
  • kubernetes/role.yaml
  • kubernetes/rolebinding.yaml
  • kubernetes/serviceaccount.yaml
  • router/router_system.go
  • server/install.go
  • server/install_kubernetes.go
  • server/manager.go
  • server/update.go
  • system/system.go

📝 Walkthrough

Walkthrough

This PR introduces a full Kubernetes execution backend for Pelican Wings. It adds config types, a Kubernetes Environment implementation (pod lifecycle, power management, networking, storage, quota, stats, identity file provisioning, installer Job orchestration), wiring into the server manager and router, a Helm chart for deploying Wings itself, static RBAC manifests, and a Helm lint CI workflow.

Changes

Kubernetes Go Runtime Backend

Layer / File(s) Summary
Kubernetes config types and struct registration
config/config_kubernetes.go, config/config.go
Introduces KubernetesStorageMode/KubernetesNetworkMode enums with hostpath/pvc/hostport/nodeport/loadbalancer constants, the full KubernetesConfiguration struct with all YAML/JSON-tagged fields, KubeResourceQuota, KubeLimitRange, KubeToleration structs, and registers the Kubernetes field on the top-level Configuration struct.
Kubernetes client singleton and Environment struct
environment/kubernetes/api.go, environment/kubernetes/environment.go, environment/kubernetes/environment_test.go, environment/kubernetes/testutil_test.go
Adds Client() with sync.Once initialization (kubeconfig or in-cluster), the Metadata struct, Environment implementing ProcessEnvironment, New() constructor, state accessors with event publication, Exists/IsRunning/ExitState/InSituUpdate, and tests covering all accessors and lifecycle queries via a fake client.
Pod Create/Destroy, identity, and volume helpers
environment/kubernetes/container.go, environment/kubernetes/identity.go, environment/kubernetes/container_test.go
Implements resolveImagePullPolicy, Environment.Create (Pod spec with init container, resource limits, identity ConfigMap, quota/PVC pre-creation, NodePort service), Environment.Destroy (zero-grace delete with best-effort cleanup), Attach/SendCommand/Readlog, buildResources/buildVolumes/buildContainerPorts helpers, and identity ConfigMap provisioning with passwd/group/machine-id generation. Tests cover Create idempotency, resource conversions, volume naming, and port generation.
Pod power lifecycle: Start/Stop/WaitForStop/Terminate
environment/kubernetes/power.go, environment/kubernetes/power_test.go
Implements OnBeforeStart (delete+recreate), Start (attach if running, else create+watch+attach with CrashLoopBackOff/ImagePullBackOff detection), Stop (signal/command/grace-period based on stop config), WaitForStop (short-circuit when gone, timeout coordination, force SIGKILL on deadline), Terminate (force grace=0), waitForPodRunning, and waitForPodDeletion. Tests validate early-exit behavior for missing/non-running pods.
Service management: NodePort/LoadBalancer/HostPort
environment/kubernetes/network.go, environment/kubernetes/node.go, environment/kubernetes/network_test.go
Implements EnsureService (deduplicated TCP+UDP ports, NodePort/LB service type, LB annotations with IP-pinning/sharing-key, mergeServicePorts to preserve NodePort assignments), DeleteService, GetServiceNodePorts, GetServiceExternalIP, WaitForLoadBalancerIP, GetNodeIPs, GetLoadBalancerIPs, and port name sanitization helpers. Tests cover all network modes and annotation edge cases.
PVC lifecycle: EnsurePVC/DeletePVC/GetPVCStatus/ResizePVC
environment/kubernetes/storage.go, environment/kubernetes/storage_test.go
Implements pvcName helper, EnsurePVC (mode guard, storage-class/access-mode wiring, idempotent create), DeletePVC (not-found tolerant), GetPVCStatus, ResizePVC, and buildVolumes interaction with PVC vs HostPath modes. Tests cover all PVC mode transitions and invalid input handling.
ResourceQuota and LimitRange enforcement
environment/kubernetes/quota.go, environment/kubernetes/quota_test.go
Defines quota/limitrange name constants, implements EnsureResourceQuota and EnsureLimitRange (get-then-create-or-update), buildResourceQuota and buildLimitRange helpers populating only non-empty config fields. Tests cover disabled no-op, creation, update, and sparse-field builder behavior.
Resource stats polling: Metrics API and kubelet fallback
environment/kubernetes/stats.go, environment/kubernetes/stats_test.go
Implements Uptime, pollResources (ticker loop, Metrics API primary, kubelet stats/summary fallback, ResourceEvent publication), metrics parsing helpers for both sources, parseResourceQuantity, and parseCPUToPercent. Tests cover all CPU/memory formats, nil-pointer safety, and error paths.
Installer Job orchestration and server wiring
environment/kubernetes/installer.go, environment/kubernetes/installer_test.go, server/install.go, server/install_kubernetes.go
Defines InstallerProcess and NewInstallerProcess, implements WriteInstallScript (CRLF-normalized ConfigMap), Run (Job creation with volumes/env/node-selector/tolerations, async log streaming, completion polling), Cleanup, buildJobVolumes/buildServerDataMount, and internalInstallKubernetes which routes Kubernetes-enabled installations to a Job instead of a Docker container. Tests validate Job spec, ConfigMap wiring, storage mode variations, and cleanup behavior.
Server manager, router, and system utility wiring
server/manager.go, router/router_system.go, system/system.go, go.mod, server/update.go
Updates InitServer to branch on Kubernetes.Enabled for environment construction, adds Kubernetes-aware IP collection to getSystemIps (LB/node IPs + static list), adds GetSystemInformationWithOptions with a Kubernetes path that skips Docker, updates GetSystemIps to accept all non-loopback non-link-local IPs, bumps Go toolchain to 1.26 with Kubernetes client-go module additions, and updates SyncWithEnvironment for Kubernetes environment image/stop sync.

Helm Chart and Static RBAC Manifests

Layer / File(s) Summary
Chart metadata, template helpers, and values
chart/pelican-wings/Chart.yaml, chart/pelican-wings/templates/_helpers.tpl, chart/pelican-wings/values.yaml
Defines chart identity/versioning, all _helpers.tpl named templates (name/fullname/chart/labels/selectorLabels/serviceAccountName), and the full values.yaml with defaults for image, RBAC, game namespace, wings settings, security contexts, persistence, ResourceQuota, and LimitRange.
Core workload templates: Deployment, ConfigMap, Service, NOTES
chart/pelican-wings/templates/deployment.yaml, chart/pelican-wings/templates/configmap.yaml, chart/pelican-wings/templates/service.yaml, chart/pelican-wings/templates/NOTES.txt
Adds Deployment template (init container copying config, wings container with probes, optional PVC mount, security contexts, extra volumes), ConfigMap template generating config.yml from Helm values with conditional Kubernetes sections, Service template exposing api+sftp ports, and deployment-guidance NOTES.txt with conditional NodePort/ClusterIP URL generation.
RBAC, namespace, and quota templates
chart/pelican-wings/templates/namespace.yaml, chart/pelican-wings/templates/serviceaccount.yaml, chart/pelican-wings/templates/role.yaml, chart/pelican-wings/templates/rolebinding.yaml, chart/pelican-wings/templates/clusterrole-metrics.yaml, chart/pelican-wings/templates/clusterrolebinding-metrics.yaml, chart/pelican-wings/templates/resourcequota.yaml, chart/pelican-wings/templates/limitrange.yaml, chart/pelican-wings/templates/pvc.yaml
Adds conditional Helm templates for Namespace, ServiceAccount, Role (pods/services/jobs/configmaps/PVC permissions), RoleBinding, ClusterRole for metrics access, ClusterRoleBinding, ResourceQuota, LimitRange, and Wings-data PVC—all gated by their respective Values.* toggles.
Chart README, static RBAC manifests, RBAC docs, and CI workflow
chart/pelican-wings/README.md, kubernetes/namespace.yaml, kubernetes/serviceaccount.yaml, kubernetes/role.yaml, kubernetes/rolebinding.yaml, kubernetes/clusterrole-metrics.yaml, kubernetes/clusterrolebinding-metrics.yaml, kubernetes/clusterrole-metrics-kubelet.yaml, kubernetes/README.md, .github/workflows/helm-lint.yaml
Adds the Helm chart README with quick-start, storage/networking/image-pull docs, and uninstall instructions; adds static raw RBAC manifests in kubernetes/ (namespace, serviceaccount, role, rolebinding, clusterrole-metrics, clusterrolebinding-metrics, optional clusterrole-metrics-kubelet) with a README covering permission tables and customization; adds the helm-lint GitHub Actions CI workflow triggered on chart file changes.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Hoppity-hop through the cluster so wide,
Pods spinning up with a Wings daemon inside!
ConfigMaps carry the passwd and group,
A NodePort Service closes the loop.
Helm charts and RBAC, all tidy and neat—
This bunny ships Kubernetes, what a feat! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'feat: add Kubernetes environment (run game servers as Pods)' clearly and directly describes the main change: adding a Kubernetes backend environment for running game servers as Kubernetes Pods.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@Exonical Exonical marked this pull request as ready for review June 20, 2026 04:17
@Exonical Exonical requested a review from a team as a code owner June 20, 2026 04:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (29)
environment/kubernetes/container.go-255-259 (1)

255-259: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only transition to offline after successful Pod deletion handling.

At Line 255, state is set to offline before checking whether pod deletion actually failed. On non-NotFound delete errors, control-plane state becomes inaccurate while the pod may still be running.

Proposed fix
-	e.SetState(environment.ProcessOfflineState)
-
 	if err != nil && !isNotFound(err) {
 		return errors.Wrap(err, "environment/kubernetes: failed to delete pod")
 	}
+	e.SetState(environment.ProcessOfflineState)

 	return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/container.go` around lines 255 - 259, The state is
being set to ProcessOfflineState before validating the pod deletion outcome,
which causes control-plane state to become inaccurate when deletion fails with a
non-NotFound error. Move the e.SetState(environment.ProcessOfflineState) call to
execute only after the error check succeeds, so that the offline state
transition happens only when deletion is successful or when the pod is already
gone (NotFound error). This ensures state accuracy by deferring the state change
until after confirming the deletion operation completed as expected.
environment/kubernetes/container.go-49-50 (1)

49-50: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle Exists() errors explicitly in Create().

At Line 49, the error from e.Exists() is discarded. If the API call fails (RBAC/network/etc.), Create() continues and masks the root cause with a secondary create error.

Proposed fix
-	if exists, _ := e.Exists(); exists {
+	exists, err := e.Exists()
+	if err != nil {
+		return errors.Wrap(err, "environment/kubernetes: failed to check pod existence")
+	}
+	if exists {
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/container.go` around lines 49 - 50, In the Create()
method, the error returned by e.Exists() is being ignored using a blank
identifier. Instead of discarding this error with the underscore, capture it in
a variable and explicitly check if it is not nil. If an error occurs during the
Exists() call (due to RBAC, network issues, etc.), return that error immediately
rather than continuing execution, which would otherwise mask the root cause with
a potentially different create error downstream.
environment/kubernetes/container.go-562-564 (1)

562-564: ⚠️ Potential issue | 🟠 Major

Use typed Kubernetes NotFound checks instead of string matching.

The current implementation at lines 562-563 matches "not found" via err.Error(), which is brittle and can misclassify errors. Use apierrors.IsNotFound(err) from k8s.io/apimachinery/pkg/api/errors instead. This function is widely used across the codebase (13 call sites), so fixing it improves error handling consistency.

Add the import:

apierrors "k8s.io/apimachinery/pkg/api/errors"

Then update the function:

func isNotFound(err error) bool {
	return apierrors.IsNotFound(err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/container.go` around lines 562 - 564, The isNotFound
function uses brittle string matching to check for Kubernetes not found errors
instead of using the proper API errors package. Add the import for apierrors
from k8s.io/apimachinery/pkg/api/errors at the top of the file, then replace the
string matching logic in the isNotFound function with a call to
apierrors.IsNotFound(err) which provides consistent and reliable error type
checking across the codebase.
environment/kubernetes/container.go-517-542 (1)

517-542: ⚠️ Potential issue | 🟠 Major

Deduplicate allocation ports in container spec to avoid Kubernetes update failures.

buildContainerPorts() appends ports per map entry without deduping. When the same port number appears under multiple allocation IPs, duplicate TCP/UDP entries are created. While Kubernetes may accept such duplicates at Pod creation, subsequent update or patch operations frequently fail due to merge key conflicts in the reconciliation logic. This mirrors the deduplication pattern already applied in EnsureService() (network.go:44-51).

Proposed fix
 func (e *Environment) buildContainerPorts() []corev1.ContainerPort {
 	cfg := config.Get()
 	allocs := e.Configuration.Allocations()
 	var ports []corev1.ContainerPort
+	seen := make(map[int]struct{})

 	for _, allocPorts := range allocs.Mappings {
 		for _, port := range allocPorts {
 			if port < 1 || port > 65535 {
 				continue
 			}
+			if _, ok := seen[port]; ok {
+				continue
+			}
+			seen[port] = struct{}{}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/container.go` around lines 517 - 542, The
buildContainerPorts() function creates duplicate TCP/UDP ContainerPort entries
when the same port number appears under multiple allocation IPs, which causes
Kubernetes update/patch operations to fail. Implement port deduplication in
buildContainerPorts() by tracking which ports have already been processed (using
a map keyed by port number) before appending tcpPort and udpPort to the ports
slice, following the same deduplication pattern already applied in
EnsureService() function to ensure each unique port is only added once
regardless of how many allocation entries reference it.
environment/kubernetes/power.go-91-93 (1)

91-93: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Signal-based stop currently force-kills immediately.

Line 92 calls Terminate(..., "SIGTERM"), but Terminate (Line 187-190) always uses GracePeriodSeconds=0. This skips graceful shutdown and can cause abrupt process termination.

Suggested fix
-	// If using a signal-based stop, terminate with that signal.
+	// If using signal-based stop, map to graceful/force delete semantics.
 	if s.Type == "" || s.Type == remote.ProcessStopSignal {
-		return e.Terminate(ctx, "SIGTERM")
+		gracePeriod := int64(30)
+		if strings.EqualFold(s.Value, "SIGKILL") {
+			gracePeriod = 0
+		}
+		err := e.client.CoreV1().Pods(e.namespace()).Delete(ctx, e.Id, metav1.DeleteOptions{
+			GracePeriodSeconds: &gracePeriod,
+		})
+		if err != nil && !isNotFound(err) {
+			return errors.Wrap(err, "environment/kubernetes: failed to stop pod")
+		}
+		return nil
 	}

Also applies to: 169-190

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/power.go` around lines 91 - 93, The issue is that when
stopping a process with a signal (when s.Type is empty or equals
remote.ProcessStopSignal), the code calls Terminate with "SIGTERM" but the
Terminate method always sets GracePeriodSeconds=0, which force-kills the process
immediately instead of allowing graceful shutdown. To fix this, modify the
Terminate method to accept a grace period parameter (or create a separate
graceful termination method) and pass an appropriate grace period value (e.g.,
30 seconds) when calling Terminate from the signal-based stop path, ensuring
that SIGTERM signals allow time for graceful process shutdown rather than
immediate termination.
environment/kubernetes/power.go-156-163 (1)

156-163: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

WaitForStop should treat non-running Pods as stopped, not only deleted.

After command-based stop, the Pod can remain in Succeeded/Failed without being deleted. Line 157 currently waits only for deletion, so stop/restart can block until timeout and then force terminate.

Suggested fix
-	// Wait for the Pod to be gone.
-	if err := e.waitForPodDeletion(tctx, duration); err != nil {
+	// Wait for the Pod to be gone or no longer running.
+	if err := e.waitForPodStoppedOrDeleted(tctx, duration); err != nil {
 		if terminate {
 			e.log().Warn("pod did not terminate in time, forcing deletion")
 			return e.Terminate(ctx, "SIGKILL")
@@
 	return nil
 }
+
+func (e *Environment) waitForPodStoppedOrDeleted(ctx context.Context, timeout time.Duration) error {
+	dctx, cancel := context.WithTimeout(ctx, timeout)
+	defer cancel()
+
+	ticker := time.NewTicker(500 * time.Millisecond)
+	defer ticker.Stop()
+
+	for {
+		select {
+		case <-dctx.Done():
+			return dctx.Err()
+		case <-ticker.C:
+			pod, err := e.getPod(dctx)
+			if err != nil && isNotFound(err) {
+				return nil
+			}
+			if err == nil && !isPodRunning(pod) {
+				e.markOffline()
+				return nil
+			}
+		}
+	}
+}

Also applies to: 264-282

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/power.go` around lines 156 - 163, The
waitForPodDeletion call in the WaitForStop method only waits for complete pod
deletion but does not account for pods that have transitioned to terminal states
like Succeeded or Failed. Modify the logic to check not only for pod deletion
but also for the pod being in a non-running terminal state (Succeeded/Failed
status). The waiting condition should return success when either the pod is
deleted OR the pod reaches a terminal non-running state, preventing unnecessary
blocking until timeout when the pod has already stopped executing.
environment/kubernetes/power.go-23-29 (1)

23-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle Pod delete/wait errors explicitly before recreate.

Line 23 and Line 28 currently drop errors; this can mask RBAC/API failures and turn them into opaque Create failures later.

Suggested fix
 	gracePeriod := int64(0)
-	_ = e.client.CoreV1().Pods(e.namespace()).Delete(ctx, e.Id, metav1.DeleteOptions{
+	if err := e.client.CoreV1().Pods(e.namespace()).Delete(ctx, e.Id, metav1.DeleteOptions{
 		GracePeriodSeconds: &gracePeriod,
-	})
+	}); err != nil && !isNotFound(err) {
+		return errors.Wrap(err, "environment/kubernetes: failed to delete existing pod before start")
+	}
 
 	// Wait briefly for deletion to propagate.
-	e.waitForPodDeletion(ctx, 10*time.Second)
+	if err := e.waitForPodDeletion(ctx, 10*time.Second); err != nil {
+		return errors.Wrap(err, "environment/kubernetes: timed out waiting for old pod deletion")
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/power.go` around lines 23 - 29, The Delete call on
e.client.CoreV1().Pods(e.namespace()).Delete() and the waitForPodDeletion()
method both currently ignore errors by using underscore assignments. Instead of
discarding these errors, capture the return values from both the Delete
operation and the waitForPodDeletion call, then add explicit error handling to
check if either operation returns an error. If an error occurs, handle it
appropriately (such as logging and returning early) to prevent masking RBAC or
API failures that would otherwise appear later as opaque errors during
subsequent Create operations.
kubernetes/clusterrole-metrics.yaml-20-24 (1)

20-24: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make kubelet proxy permissions opt-in.

Granting nodes/proxy (get) cluster-wide materially increases blast radius; a compromised Wings identity can proxy kubelet endpoints across nodes. Keep metrics.k8s.io/pods as default and move kubelet proxy access to a separate optional manifest/binding.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kubernetes/clusterrole-metrics.yaml` around lines 20 - 24, Remove the kubelet
proxy permission block (the apiGroups, resources, and verbs entries for
nodes/proxy) from the clusterrole-metrics.yaml file to make it the default
minimal permission set. Create a separate optional manifest file for kubelet
proxy access that users can apply when they need to proxy kubelet endpoints
across nodes. This keeps the default configuration least-privileged while
allowing users to opt-in to the higher blast radius permission if needed.

Source: Linters/SAST tools

kubernetes/README.md-112-114 (1)

112-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Multiple-instance guidance is incorrect for ClusterRoleBinding usage.

Line 113 says to share the same ClusterRoleBinding across namespaces, but ClusterRoleBinding subjects are namespaced ServiceAccounts. Additional Wings instances in other namespaces need either separate ClusterRoleBindings or additional subjects entries.

Suggested README wording
-If you run multiple Wings nodes targeting different namespaces, create a Role and
-RoleBinding per namespace, but share the ClusterRole/ClusterRoleBinding (it's
-namespace-independent).
+If you run multiple Wings nodes targeting different namespaces, create a Role and
+RoleBinding per namespace and bind each ServiceAccount to the shared ClusterRole
+(either one ClusterRoleBinding per ServiceAccount or a single binding with multiple subjects).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kubernetes/README.md` around lines 112 - 114, The guidance in the section
about running multiple Wings nodes targeting different namespaces is incorrect
regarding ClusterRoleBinding usage. Update lines 112-114 to clarify that
ClusterRoleBinding subjects reference namespaced ServiceAccounts, so when
running multiple Wings instances in different namespaces, you must either create
separate ClusterRoleBindings for each namespace or add multiple subject entries
to a single shared ClusterRoleBinding that references the ServiceAccount from
each namespace. Remove the incorrect suggestion that a single ClusterRoleBinding
can be shared across namespaces without modification.
chart/pelican-wings/README.md-15-22 (1)

15-22: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid putting Panel credentials directly in CLI flags.

Using --set wings.token=... and --set wings.tokenId=... exposes credentials in shell history and process lists. Prefer a local values file (or secret-backed workflow) for the quick-start example.

Suggested README update
-helm install wings ./chart/pelican-wings \
-  --set wings.panelUrl=https://panel.example.com \
-  --set wings.token=YOUR_TOKEN \
-  --set wings.tokenId=YOUR_TOKEN_ID \
-  --set wings.uuid=YOUR_NODE_UUID
+# values.local.yaml (do not commit)
+wings:
+  panelUrl: https://panel.example.com
+  token: YOUR_TOKEN
+  tokenId: YOUR_TOKEN_ID
+  uuid: YOUR_NODE_UUID
+
+helm install wings ./chart/pelican-wings -f values.local.yaml
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/README.md` around lines 15 - 22, The helm install example
exposes sensitive credentials in CLI flags (wings.token and wings.tokenId) which
appear in shell history and process lists. Replace the direct --set flags
approach with an example using a local values file that keeps credentials
separate from the command line invocation. Create a sample values file snippet
in the README showing how to define these sensitive parameters (wings.token,
wings.tokenId, wings.uuid) securely, then update the helm install command to
reference this file using the -f flag instead of exposing the values directly.
kubernetes/README.md-34-51 (1)

34-51: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Required-permissions tables are out of sync with the manifests.

The table omits granted permissions (pods: watch, services: list, configmaps: update) and entirely excludes cluster-scoped nodes/nodes/proxy access defined in kubernetes/clusterrole-metrics.yaml. This weakens security review accuracy for operators.

Suggested README table corrections
-| pods                      | get, create, delete, list      | Game server Pod lifecycle                |
+| pods                      | get, create, delete, list, watch | Game server Pod lifecycle              |
 ...
-| services                  | get, create, update, delete    | NodePort Service management              |
+| services                  | get, create, update, delete, list | NodePort/LB Service management        |
 ...
-| configmaps                | get, create, delete            | Install script storage (multi-node)      |
+| configmaps                | get, create, update, delete    | Install script + identity management     |

 ### Cluster-scoped
-| Resource                              | Verbs | Purpose                         |
-|---------------------------------------|-------|---------------------------------|
-| pods (metrics.k8s.io/v1beta1)         | get   | CPU/memory usage polling        |
+| Resource                              | Verbs | Purpose                                   |
+|---------------------------------------|-------|-------------------------------------------|
+| pods (metrics.k8s.io/v1beta1)         | get   | CPU/memory usage polling                  |
+| nodes                                 | get   | Node IP discovery                         |
+| nodes/proxy                           | get   | Kubelet stats fallback (if enabled)       |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kubernetes/README.md` around lines 34 - 51, Update the Kubernetes RBAC
permissions table in the README.md to match the actual manifests defined in
kubernetes/clusterrole-metrics.yaml. Add the missing verbs to existing
resources: pods (add watch), services (add list), and configmaps (add update) in
the namespace-scoped section. Additionally, add a new row in the cluster-scoped
section documenting the nodes and nodes/proxy resources with their granted verbs
(get for nodes and get for nodes/proxy) to ensure the documentation accurately
reflects all permissions granted by the cluster role.
.github/workflows/helm-lint.yaml-21-21 (1)

21-21: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable credential persistence in checkout step.

Line 21 should set persist-credentials: false to avoid leaving the GitHub token in local git config for later steps.

Suggested hardening patch
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v7
+        with:
+          persist-credentials: false
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/helm-lint.yaml at line 21, The actions/checkout@v7 step on
line 21 needs to include the persist-credentials parameter set to false to
prevent the GitHub token from remaining in the local git configuration after the
checkout completes. Add a with block to the actions/checkout@v7 step and
configure persist-credentials: false within it to disable credential persistence
and improve security.

Source: Linters/SAST tools

.github/workflows/helm-lint.yaml-21-23 (1)

21-23: ⚠️ Potential issue | 🟠 Major

Pin GitHub Actions to immutable commit SHAs.

Lines 21 and 23 use mutable tags (@v7, @v5), which are a supply-chain risk. Pinning to commit SHAs ensures workflows execute only reviewed code and complies with security hardening standards.

Suggested hardening patch
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@1044a6dea927916f2c38ba5aeffbc0a847b1221a # v7.0.0

-      - uses: azure/setup-helm@v5
+      - uses: azure/setup-helm@dda3372f752e03dde6b3237bc9431cdc2f7a02a2 # v5.0.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/helm-lint.yaml around lines 21 - 23, Replace the mutable
version tags with immutable commit SHAs for the GitHub Actions used in the
workflow. For the actions/checkout action (currently pinned to `@v7`) and the
azure/setup-helm action (currently pinned to `@v5`), replace the `@v` version tags
with their corresponding full commit SHAs. This ensures the workflow executes
only the reviewed code and complies with security hardening standards by
preventing unexpected changes from action updates.

Source: Linters/SAST tools

chart/pelican-wings/templates/configmap.yaml-12-15 (1)

12-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not store panel credentials in a ConfigMap.

Line 14 writes token into config.yml inside a ConfigMap. That exposes sensitive credentials through non-secret config objects.

🛡️ Suggested secure approach
-apiVersion: v1
-kind: ConfigMap
+apiVersion: v1
+kind: Secret
+type: Opaque
 ...
-data:
+stringData:
   config.yml: |

Also update chart/pelican-wings/templates/deployment.yaml volume source from configMap to secret for config-source.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/templates/configmap.yaml` around lines 12 - 15, The token
field in the ConfigMap exposes sensitive credentials through a non-secret
configuration object, which is a security risk. Move the sensitive `token` field
from the ConfigMap in configmap.yaml to a Kubernetes Secret resource instead.
Additionally, update the volume source for `config-source` in the
deployment.yaml file to reference the new Secret instead of the ConfigMap,
ensuring that sensitive credentials are properly protected as secret objects
rather than exposed in plain ConfigMaps.

Source: Linters/SAST tools

chart/pelican-wings/templates/configmap.yaml-29-30 (1)

29-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent namespace drift between chart resources and runtime namespace config.

Line 29 uses .Values.wings.kubernetes.namespace, while chart resources are rendered into .Values.gameNamespace (Line 5). If those diverge, Wings can target a namespace where RBAC/resources are not provisioned.

✅ Suggested contract fix
-      namespace: {{ .Values.wings.kubernetes.namespace | quote }}
+      namespace: {{ .Values.gameNamespace | quote }}
+{{- if ne .Values.gameNamespace .Values.wings.kubernetes.namespace }}
+{{- fail "gameNamespace and wings.kubernetes.namespace must match" }}
+{{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/templates/configmap.yaml` around lines 29 - 30, The
namespace configuration in the Wings runtime setup uses
`.Values.wings.kubernetes.namespace` on line 29, but chart resources are
deployed into `.Values.gameNamespace` (line 5), creating a namespace drift where
Wings might target a namespace without the necessary RBAC or resources. Replace
the reference from `.Values.wings.kubernetes.namespace` with
`.Values.gameNamespace` in the namespace field to ensure runtime and provisioned
resources target the same namespace consistently.
chart/pelican-wings/templates/limitrange.yaml-1-38 (1)

1-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when LimitRange is enabled with no limits configured.

limitRange.enabled=true currently renders even when all limit fields are empty, producing a near-empty LimitRange item. Add a precondition requiring at least one configured field to avoid invalid/no-op installs.

Suggested fix
 {{- if .Values.limitRange.enabled }}
+{{- if not (or .Values.limitRange.defaultCPULimit .Values.limitRange.defaultMemoryLimit .Values.limitRange.defaultCPURequest .Values.limitRange.defaultMemoryRequest .Values.limitRange.maxCPU .Values.limitRange.maxMemory) }}
+{{- fail "limitRange.enabled=true requires at least one limitRange.* field" }}
+{{- end }}
 apiVersion: v1
 kind: LimitRange
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/templates/limitrange.yaml` around lines 1 - 38, The
LimitRange template currently renders whenever limitRange.enabled is true, even
if all limit configuration fields are empty, resulting in an invalid or no-op
resource. Modify the initial condition check (where `if
.Values.limitRange.enabled` is evaluated) to add an additional validation that
ensures at least one of the limit fields is configured (defaultCPULimit,
defaultMemoryLimit, defaultCPURequest, defaultMemoryRequest, maxCPU, or
maxMemory) before rendering the entire LimitRange resource. This will fail fast
and prevent creating empty LimitRange objects.
chart/pelican-wings/templates/clusterrolebinding-metrics.yaml-9-11 (1)

9-11: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent implicit binding to the default ServiceAccount.

With the current helper fallback, serviceAccount.create=false + empty serviceAccount.name binds this ClusterRole to default, unintentionally granting cluster-level node/metrics access to any pod using that SA in gameNamespace. Require an explicit SA name when not creating one.

Suggested fix
+{{- if and (not .Values.serviceAccount.create) (not .Values.serviceAccount.name) }}
+{{- fail "serviceAccount.name must be set when serviceAccount.create=false" }}
+{{- end }}
 {{- if .Values.rbac.create }}
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/templates/clusterrolebinding-metrics.yaml` around lines 9
- 11, The serviceAccountName helper function in the ClusterRoleBinding has a
fallback behavior that implicitly binds to the default ServiceAccount when
serviceAccount.create is false and serviceAccount.name is empty, which
unintentionally grants cluster-level permissions. Modify the serviceAccountName
helper function to require an explicit service account name when
serviceAccount.create is set to false, preventing any fallback to the default
ServiceAccount and raising an error or validation failure if the name is not
explicitly provided in the values configuration.
chart/pelican-wings/templates/resourcequota.yaml-1-31 (1)

1-31: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard against empty spec.hard when quota is enabled.

If resourceQuota.enabled=true but all resource fields are empty, this renders an empty hard block and the resource can fail validation at apply time. Add a precondition requiring at least one hard limit field.

Suggested fix
 {{- if .Values.resourceQuota.enabled }}
+{{- if not (or .Values.resourceQuota.cpuLimit .Values.resourceQuota.memoryLimit .Values.resourceQuota.cpuRequest .Values.resourceQuota.memoryRequest .Values.resourceQuota.maxPods .Values.resourceQuota.maxPVCs .Values.resourceQuota.maxStorage) }}
+{{- fail "resourceQuota.enabled=true requires at least one resourceQuota.* limit/request field" }}
+{{- end }}
 apiVersion: v1
 kind: ResourceQuota
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/templates/resourcequota.yaml` around lines 1 - 31, The
ResourceQuota template renders an empty spec.hard block when
resourceQuota.enabled is true but all individual resource quota fields
(cpuLimit, memoryLimit, cpuRequest, memoryRequest, maxPods, maxPVCs, maxStorage)
are disabled or empty, causing Kubernetes validation failures. Modify the outer
condition that checks .Values.resourceQuota.enabled to also verify that at least
one of the hard limit fields is configured before rendering the entire
ResourceQuota resource, preventing the creation of invalid manifests with empty
hard specifications.
router/router_system.go-111-114 (1)

111-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Kubernetes IP discovery failures are silently converted into a successful empty response.

When node/LB queries fail, the handler only logs warnings and still returns 200 (Line 142), unlike the non-Kubernetes path that aborts on discovery errors. This can surface as “no assignable IPs” without an actionable API error.

Also applies to: 142-142

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@router/router_system.go` around lines 111 - 114, The kubernetes.GetNodeIPs()
call in the handler is logging warnings on errors but still proceeding to return
a successful 200 response with empty results, creating inconsistency with the
non-Kubernetes path which aborts on discovery failures. Modify the error
handling in the block where kubernetes.GetNodeIPs(ctx) is called to return an
appropriate error response (likely an HTTP error status) instead of silently
continuing when the error occurs, ensuring consistent behavior between the
Kubernetes and non-Kubernetes code paths and preventing empty IP lists from
being treated as successful responses.
server/manager.go-215-217 (1)

215-217: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Kubernetes runtime metadata is missing stop configuration.

Line 215 initializes kubernetes.Metadata with only Image. The Kubernetes metadata contract includes stop settings used by lifecycle logic; omitting it can change shutdown behavior (signal/timeout) versus server config.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/manager.go` around lines 215 - 217, The kubernetes.Metadata
initialization is missing stop configuration which controls shutdown behavior.
In the metadata struct creation where Image is set, add the stop settings from
the server configuration (likely s.Config().Stop or similar) to ensure the
Kubernetes metadata includes the complete lifecycle settings including signal
and timeout values that match the server config.
environment/kubernetes/installer.go-304-310 (1)

304-310: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

WriteInstallScript ignores caller cancellation/timeouts.

Line 304 hardcodes context.Background(), so ConfigMap operations cannot be canceled with the server/install context. This can hang installs during API server/network issues.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/installer.go` around lines 304 - 310, The
WriteInstallScript method hardcodes context.Background() for ConfigMap
operations, which ignores the caller's context and prevents cancellation or
timeout propagation. Replace the hardcoded context.Background() assignment with
the appropriate context that's passed to the function through its parameters or
receiver, and ensure this context is used in both the Delete and Create calls on
the ConfigMaps client to respect the caller's cancellation signals and timeouts.
environment/kubernetes/installer.go-374-380 (1)

374-380: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SubPath may become invalid when data/root paths diverge.

On Line 375, filepath.Rel can produce values like ../... when System.Data is outside System.RootDirectory. Using that as VolumeMount.SubPath makes the Pod invalid and the installer will not start.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/installer.go` around lines 374 - 380, The relative
path calculated by filepath.Rel can contain `..` components when System.Data is
located outside System.RootDirectory, which makes it invalid for Kubernetes
VolumeMount.SubPath. In the else block where vm.SubPath is assigned the rel
value, add validation to check if the relative path contains `..` components. If
it does, fall back to using the default path format (filepath.Join("volumes",
ip.ServerID)) instead of the invalid relative path, similar to what is done in
the error case.
environment/kubernetes/installer.go-66-69 (1)

66-69: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Delete-then-create can race on installer Job name reuse.

Line 66 uses background deletion and Line 169 immediately recreates the same Job name. If the old Job is still terminating, Create can fail with AlreadyExists, causing transient install failures.

Also applies to: 169-170

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/installer.go` around lines 66 - 69, The Delete call at
line 66 uses propagationBackground() which causes asynchronous deletion,
creating a race condition when the same Job is immediately recreated at line
169. The old Job may still be terminating when Create is called, resulting in
AlreadyExists errors. Change the propagation policy from propagationBackground()
to a foreground/blocking policy (such as propagationForeground()) to ensure the
Job deletion completes before the Delete call returns, allowing the subsequent
Create call to safely recreate the Job without conflicts.
go.mod-3-3 (1)

3-3: ⚠️ Potential issue | 🟠 Major

Update Dockerfile and CI workflow Go version to 1.26.0 to match go.mod.

The Dockerfile uses golang:1.25.7-alpine and at least one CI workflow is hardcoded to go-version: "1.25.7", but go.mod now requires Go 1.26.0. This mismatch can cause build failures and inconsistent behavior across local, CI, and production environments.

Current misalignment
go.mod:             go 1.26.0
Dockerfile:         FROM golang:1.25.7-alpine
CI workflow:        go-version: "1.25.7"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go.mod` at line 3, The Go version in go.mod is now 1.26.0 but the Dockerfile
and CI workflow still reference the older 1.25.7 version, causing environment
mismatches. Update the Dockerfile's FROM statement to use golang:1.26.0-alpine
instead of golang:1.25.7-alpine, and update any CI workflow files (such as those
containing go-version configuration) to set go-version to "1.26.0" instead of
"1.25.7" to ensure consistency across all build environments.
environment/kubernetes/quota.go-30-44 (1)

30-44: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only create quota resources on NotFound; fail fast on other Get errors.

In both EnsureResourceQuota and EnsureLimitRange, create is attempted after any Get failure. On Line 30-Line 44 and Line 61-Line 75, non-NotFound errors (RBAC, transport, timeout) should return immediately instead of falling through to create.

Proposed fix
  existing, err := e.client.CoreV1().ResourceQuotas(ns).Get(ctx, quotaName, metav1.GetOptions{})
  if err == nil {
      existing.Spec = quota.Spec
      _, err = e.client.CoreV1().ResourceQuotas(ns).Update(ctx, existing, metav1.UpdateOptions{})
      if err != nil {
          return errors.Wrap(err, "environment/kubernetes: failed to update ResourceQuota")
      }
      return nil
  }
+ if !isNotFound(err) {
+     return errors.Wrap(err, "environment/kubernetes: failed to get ResourceQuota")
+ }

  _, err = e.client.CoreV1().ResourceQuotas(ns).Create(ctx, quota, metav1.CreateOptions{})

Apply the same pattern to EnsureLimitRange.

Also applies to: 61-75

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/quota.go` around lines 30 - 44, The ResourceQuota Get
operation in EnsureResourceQuota currently falls through to create when any
error occurs, but it should only attempt creation if the error is specifically a
NotFound error. After the Get call for ResourceQuotas, add a check: if err is
not nil and not a NotFound error (using apierrors.IsNotFound), return the error
immediately instead of proceeding to the Create call. Apply this same pattern to
the EnsureLimitRange method where it performs a similar Get and Create flow on
LimitRanges, ensuring that non-NotFound errors fail fast rather than attempting
the create operation.
environment/kubernetes/storage.go-142-149 (1)

142-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Respect shared DataPVC mode in ResizePVC.

ResizePVC on Line 143-Line 149 misses the same DataPVC short-circuit used by EnsurePVC/DeletePVC. In shared-PVC mode, this path tries resizing a per-server claim that is intentionally not created.

Proposed fix
 func (e *Environment) ResizePVC(ctx context.Context, newSize string) error {
 	cfg := config.Get()
 	if cfg.Kubernetes.StorageMode != config.KubeStoragePVC {
 		return nil
 	}
+	if cfg.Kubernetes.DataPVC != "" {
+		return nil
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/storage.go` around lines 142 - 149, The ResizePVC
method on the Environment type is missing the shared DataPVC mode check that is
present in EnsurePVC and DeletePVC. Add an early return condition in ResizePVC
to check for shared DataPVC mode (using the same pattern as the existing storage
mode check) so that the function exits before attempting to resize a per-server
claim that is not created in shared-PVC mode.
environment/kubernetes/network.go-141-151 (1)

141-151: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update existing Service type and annotations during reconciliation.

On Line 143-Line 151, the update path never assigns existing.Spec.Type = desired.Spec.Type. Switching between nodeport and loadbalancer can leave the Service in the wrong type. Also, annotations are only overwritten in LB mode, so stale LB annotations can persist after switching back to NodePort.

Proposed fix
  // Service exists; update it with the desired spec while preserving
  // existing NodePort assignments where possible.
+ existing.Spec.Type = desired.Spec.Type
  existing.Spec.Selector = desired.Spec.Selector
  existing.Spec.Ports = mergeServicePorts(existing.Spec.Ports, desired.Spec.Ports)
  existing.Labels = labels
+ existing.Annotations = annotations // nil clears stale LB annotations in non-LB mode

  e.log().WithField("service", svcName).Infof("updating %s service for server", svcType)
- if isLB {
-   existing.Annotations = annotations
- }
  _, err = e.client.CoreV1().Services(ns).Update(ctx, existing, metav1.UpdateOptions{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/network.go` around lines 141 - 151, The Service
reconciliation logic is missing the Type assignment and conditionally applies
annotations. To fix this, ensure that existing.Spec.Type is always assigned from
desired.Spec.Type (add this line after the Selector assignment) to properly
handle Service type transitions between NodePort and LoadBalancer. Additionally,
move the annotations assignment outside of the isLB conditional block so that
existing.Annotations = annotations is always applied, preventing stale
annotations from persisting when switching Service types.
environment/kubernetes/stats.go-83-94 (1)

83-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Outbound Kubernetes calls need per-request timeouts.

Do(ctx) and getPod(ctx) currently use the long-lived polling context. If an API call hangs, the polling loop can block indefinitely and stop publishing fresh stats.

💡 Suggested fix
+const statsRequestTimeout = 3 * time.Second
+
 			if !useKubelet {
-				stats, err = e.getMetricsAPIStats(ctx)
+				reqCtx, cancel := context.WithTimeout(ctx, statsRequestTimeout)
+				stats, err = e.getMetricsAPIStats(reqCtx)
+				cancel()
 				if err != nil {
 					if !loggedSource {
 						e.log().WithField("error", err).Info("metrics API unavailable, falling back to kubelet stats")
 					}
 					useKubelet = true
 				}
 			}

 			if useKubelet {
-				stats, err = e.getKubeletPodStats(ctx)
+				reqCtx, cancel := context.WithTimeout(ctx, statsRequestTimeout)
+				stats, err = e.getKubeletPodStats(reqCtx)
+				cancel()
 				if err != nil {
 					if !loggedSource {
 						e.log().WithField("error", err).Warn("kubelet stats also unavailable, resource metrics will not be reported")
 						loggedSource = true
 					}
 				}
 			}
...
-			pod, err := e.getPod(ctx)
+			reqCtx, cancel := context.WithTimeout(ctx, statsRequestTimeout)
+			pod, err := e.getPod(reqCtx)
+			cancel()

Also applies to: 120-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/stats.go` around lines 83 - 94, The methods
getMetricsAPIStats(ctx), getKubeletPodStats(ctx), and getPod(ctx) are using the
long-lived polling context directly, which means if any individual API call
hangs, the entire polling loop blocks indefinitely and stops publishing fresh
stats. Create a new context with a per-request timeout for each individual API
call and pass that timeout context to these methods instead of the long-lived
polling context. This applies to all three methods: getMetricsAPIStats,
getKubeletPodStats, and getPod.
environment/kubernetes/stats.go-82-99 (1)

82-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fallback source becomes sticky after a single transient Metrics API error.

After the first Metrics API failure, useKubelet is set to true and never retried. A transient blip can permanently disable Metrics API usage for the rest of the session.

💡 Suggested fix
 			if !useKubelet {
 				stats, err = e.getMetricsAPIStats(ctx)
 				if err != nil {
 					if !loggedSource {
 						e.log().WithField("error", err).Info("metrics API unavailable, falling back to kubelet stats")
 					}
 					useKubelet = true
 				}
 			}

 			if useKubelet {
 				stats, err = e.getKubeletPodStats(ctx)
 				if err != nil {
 					if !loggedSource {
 						e.log().WithField("error", err).Warn("kubelet stats also unavailable, resource metrics will not be reported")
 						loggedSource = true
 					}
+					// Retry Metrics API on next tick in case the original failure was transient.
+					useKubelet = false
 				}
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/stats.go` around lines 82 - 99, The issue is that the
`useKubelet` flag is set to `true` after the first Metrics API failure and is
never reset, causing the code to permanently skip getMetricsAPIStats and always
use getKubeletPodStats even if the Metrics API becomes available again. To fix
this, ensure that the `useKubelet` flag is reset at the beginning of each
iteration or loop cycle so that the Metrics API is retried on each attempt,
allowing the code to recover from transient failures. Alternatively, restructure
the logic to always attempt getMetricsAPIStats first without relying on a
persistent flag, and only fall back to getKubeletPodStats if that attempt fails.
🟡 Minor comments (1)
chart/pelican-wings/README.md-36-37 (1)

36-37: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

networkMode options are incomplete in the Key Values table.

Line 36 documents only hostport and nodeport, but chart values also support loadbalancer. This can mislead operators during setup.

Suggested README update
-| `wings.kubernetes.networkMode` | Port exposure: `hostport` or `nodeport` | `nodeport` |
+| `wings.kubernetes.networkMode` | Port exposure: `hostport`, `nodeport`, or `loadbalancer` | `nodeport` |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/README.md` around lines 36 - 37, The documentation table
for `wings.kubernetes.networkMode` is incomplete and currently only lists
`hostport` and `nodeport` as port exposure options, but the chart also supports
`loadbalancer`. Update the `wings.kubernetes.networkMode` row in the Key Values
table to include all three supported options (hostport, nodeport, and
loadbalancer) so operators have accurate information about available network
modes during setup.
🧹 Nitpick comments (7)
environment/kubernetes/environment_test.go (1)

509-566: ⚡ Quick win

Add a duplicate-allocation regression test for buildContainerPorts.

Current cases don’t cover repeated port numbers across different allocation IPs, which is a high-risk edge for Pod spec generation. A focused test here will prevent regressions once deduping logic is added.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/environment_test.go` around lines 509 - 566, The
buildContainerPorts test suite in the "buildContainerPorts" describe block is
missing a regression test for duplicate port allocations across different IPs.
Add a new g.It test case that creates an environment.Allocations with the same
port number mapped to multiple different IP addresses, then verify that
buildContainerPorts correctly deduplicates these ports when generating the
container port configuration. This will ensure the deduplication logic handles
this high-risk edge case correctly.
chart/pelican-wings/values.yaml (1)

127-131: ⚡ Quick win

Harden default pod/container security context values.

Line 128 and Line 131 default to {}, so the chart ships permissive security defaults. Set secure baseline defaults and keep overrides available for compatibility.

🔐 Suggested default hardening values
-podSecurityContext: {}
+podSecurityContext:
+  runAsNonRoot: true
+  fsGroup: 1000

-securityContext: {}
+securityContext:
+  allowPrivilegeEscalation: false
+  readOnlyRootFilesystem: true
+  capabilities:
+    drop:
+      - ALL
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/values.yaml` around lines 127 - 131, The
podSecurityContext and securityContext fields are currently set to empty
dictionaries, which results in the chart shipping with permissive security
defaults that do not enforce any security constraints. Replace the empty
dictionary values for both podSecurityContext and securityContext with secure
baseline defaults that implement hardening measures such as running as non-root
users, setting appropriate capabilities, disabling privilege escalation, and
enforcing read-only root filesystems. Ensure these values remain overridable in
the chart to maintain compatibility with deployments that may have specific
security requirements.

Source: Linters/SAST tools

environment/kubernetes/installer_test.go (1)

183-185: ⚡ Quick win

Assert the cancellation error explicitly in the Run cancellation test.

The test currently does <-errCh without checking value, so non-cancellation errors can pass silently.

Suggested change
-				cancel()
-				<-errCh
+				cancel()
+				err = <-errCh
+				g.Assert(err).Equal(context.Canceled)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/installer_test.go` around lines 183 - 185, In the Run
cancellation test, after canceling the context and receiving the error from
errCh, explicitly assert that the received error is a context cancellation error
rather than just consuming the value without validation. Check that the error
equals context.Canceled to ensure the test properly validates the expected
cancellation behavior and prevents non-cancellation errors from passing
silently.
system/system.go (1)

113-116: ⚡ Quick win

Make osrelease.Read() best-effort in Kubernetes mode.

Lines 113-116 return immediately on read failure, so the Line 124-131 fallback (NAME/GOOS) is never reached. This can fail system-info endpoints on minimal images unnecessarily.

Also applies to: 124-131

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@system/system.go` around lines 113 - 116, The error from osrelease.Read()
causes an immediate return, preventing the fallback code that uses NAME and GOOS
from being reached. Make osrelease.Read() best-effort by removing the early
return on error and instead allowing the function to continue to the fallback
logic (the code at lines 124-131 that constructs the response using NAME and
GOOS values). This way, if osrelease.Read() fails on minimal Kubernetes images,
the system will still provide system-info via the fallback mechanism rather than
failing entirely.
environment/kubernetes/network_test.go (1)

92-130: ⚡ Quick win

Add a regression test for Service type transitions on update.

The update case only validates port count. It doesn’t assert behavior when mode changes (e.g., existing NodePort service updated under LoadBalancer mode), which is a key reconciliation path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/network_test.go` around lines 92 - 130, The test case
"should update an existing service" only validates the port count after the
update but doesn't assert that the Service type is correctly updated when the
network mode configuration changes. Add an assertion to verify that
svc.Spec.Type matches the expected type based on the configured network mode
(KubeNetworkNodePort should result in ServiceTypeNodePort), or add a separate
regression test case that explicitly tests a service type transition scenario
where an existing service of one type (e.g., NodePort) is updated under a
different mode configuration (e.g., LoadBalancer) to ensure the service type is
correctly reconciled.
environment/kubernetes/storage_test.go (1)

215-284: ⚡ Quick win

Add ResizePVC coverage for shared DataPVC mode.

Please add a case where StorageMode is pvc and DataPVC is non-empty, and assert ResizePVC is a no-op. This keeps shared-PVC semantics consistent with EnsurePVC/DeletePVC.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/storage_test.go` around lines 215 - 284, Add a new
test case within the ResizePVC describe block that tests the shared DataPVC mode
scenario. The test should create an Environment with a Kubernetes client, set
the config's StorageMode to config.KubeStoragePVC and set a non-empty DataPVC
value (to indicate shared PVC mode), then call env.ResizePVC and assert that it
returns nil (is a no-op). This ensures ResizePVC behavior is consistent with
EnsurePVC and DeletePVC when operating in shared DataPVC mode.
environment/kubernetes/quota_test.go (1)

259-309: ⚡ Quick win

Add regression tests for invalid quantity and Get-failure paths.

Please add tests that cover:

  • invalid quantity strings in quota/limitrange config (expect returned error, not panic),
  • non-NotFound failures from Get in EnsureResourceQuota/EnsureLimitRange (expect immediate error).

These are currently untested high-risk branches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@environment/kubernetes/quota_test.go` around lines 259 - 309, Add regression
test cases for error handling paths in the quota and limitrange tests. Within
the existing "buildResourceQuota" Describe block, add a test case that passes an
invalid quantity string (e.g., "invalid") to the CPULimit field and verifies
that an error is returned rather than a panic. Similarly, within the
"buildLimitRange" Describe block, add a test case with invalid quantity strings
for DefaultCPULimit or MaxCPU fields and verify error handling. Additionally,
add test cases for the EnsureResourceQuota and EnsureLimitRange functions that
mock the Get operation to return errors other than NotFound (for example, a
permission denied or internal server error), and verify that these errors are
immediately returned to the caller rather than being ignored or retried.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 30a4dac6-e075-4e39-a906-262d00b4493f

📥 Commits

Reviewing files that changed from the base of the PR and between 58c5717 and eaf734e.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (53)
  • .github/workflows/helm-lint.yaml
  • chart/pelican-wings/Chart.yaml
  • chart/pelican-wings/README.md
  • chart/pelican-wings/templates/NOTES.txt
  • chart/pelican-wings/templates/_helpers.tpl
  • chart/pelican-wings/templates/clusterrole-metrics.yaml
  • chart/pelican-wings/templates/clusterrolebinding-metrics.yaml
  • chart/pelican-wings/templates/configmap.yaml
  • chart/pelican-wings/templates/deployment.yaml
  • chart/pelican-wings/templates/limitrange.yaml
  • chart/pelican-wings/templates/namespace.yaml
  • chart/pelican-wings/templates/pvc.yaml
  • chart/pelican-wings/templates/resourcequota.yaml
  • chart/pelican-wings/templates/role.yaml
  • chart/pelican-wings/templates/rolebinding.yaml
  • chart/pelican-wings/templates/service.yaml
  • chart/pelican-wings/templates/serviceaccount.yaml
  • chart/pelican-wings/values.yaml
  • config/config.go
  • config/config_kubernetes.go
  • environment/kubernetes/api.go
  • environment/kubernetes/container.go
  • environment/kubernetes/container_test.go
  • environment/kubernetes/environment.go
  • environment/kubernetes/environment_test.go
  • environment/kubernetes/identity.go
  • environment/kubernetes/installer.go
  • environment/kubernetes/installer_test.go
  • environment/kubernetes/network.go
  • environment/kubernetes/network_test.go
  • environment/kubernetes/node.go
  • environment/kubernetes/power.go
  • environment/kubernetes/power_test.go
  • environment/kubernetes/quota.go
  • environment/kubernetes/quota_test.go
  • environment/kubernetes/stats.go
  • environment/kubernetes/stats_test.go
  • environment/kubernetes/storage.go
  • environment/kubernetes/storage_test.go
  • environment/kubernetes/testutil_test.go
  • go.mod
  • kubernetes/README.md
  • kubernetes/clusterrole-metrics.yaml
  • kubernetes/clusterrolebinding-metrics.yaml
  • kubernetes/namespace.yaml
  • kubernetes/role.yaml
  • kubernetes/rolebinding.yaml
  • kubernetes/serviceaccount.yaml
  • router/router_system.go
  • server/install.go
  • server/install_kubernetes.go
  • server/manager.go
  • system/system.go

Comment thread environment/kubernetes/api.go Outdated
Comment thread environment/kubernetes/container.go Outdated
Comment thread environment/kubernetes/quota.go Outdated
Address three robustness issues in the Kubernetes backend:

- api.go: Client() stored the init error in a function-local variable, so
  after a failed first call the sync.Once never re-ran and subsequent calls
  returned (nil, nil), masking the failure. Persist the error in a package
  variable and return it on every call.
- container.go: SendCommand() checked IsAttached() before taking the lock
  that guards e.stream, so the stream could be cleared in between and cause
  a nil-pointer write. Nil-check e.stream under the read lock instead.
- quota.go: buildResourceQuota/buildLimitRange used resource.MustParse on
  user-configurable quantity strings, panicking the process on invalid
  values. Parse with resource.ParseQuantity and propagate errors through
  EnsureResourceQuota/EnsureLimitRange.
@devin-ai-integration devin-ai-integration Bot force-pushed the feat/kubernetes-environment branch from 84fe43d to 3be8dda Compare June 20, 2026 04:50
Exonical added 2 commits June 20, 2026 05:19
Harden the Kubernetes environment and Helm chart per the full CodeRabbit
review:

environment/kubernetes:
- container.go: surface Exists() errors in Create(), order Destroy() offline
  state after the delete check, use apierrors.IsNotFound, and deduplicate
  container ports shared across allocation IPs.
- power.go: handle delete/wait errors in OnBeforeStart, give signal-based
  stops a graceful SIGTERM grace period (immediate for SIGKILL), and treat a
  terminal (non-running) Pod as stopped so restart-while-off does not hang.
- installer.go: propagate caller context into WriteInstallScript, use
  foreground propagation + wait when recreating the installer Job, and reject
  SubPath values that escape the root directory.
- quota.go: fail fast on non-NotFound Get errors instead of falling through
  to Create.
- storage.go: ResizePVC is a no-op when a shared DataPVC is configured.
- network.go: reconcile Service type and annotations on a networkMode switch.
- stats.go: bound each metrics/stats request with a timeout and retry the
  metrics API every tick instead of sticking to the kubelet fallback.

server/router/system:
- update.go: sync image and stop configuration to the Kubernetes environment.
- router_system.go: return an error when no assignable IPs are discovered.
- system.go: treat a missing os-release as best-effort in Kubernetes mode.

chart/pelican-wings:
- Render credentials into a Secret instead of a ConfigMap.
- Guard wings.kubernetes.namespace against gameNamespace drift.
- Fail rendering when ResourceQuota/LimitRange are enabled but empty.
- Require an explicit serviceAccount.name when not creating one.
- Gate the broad nodes/proxy kubelet fallback behind rbac.kubeletMetricsFallback.

docs/manifests:
- Split the optional nodes/proxy ClusterRole into clusterrole-metrics-kubelet.yaml.
- Expand the chart and kubernetes RBAC READMEs (credentials, networkMode,
  multi-instance guidance, RBAC tables).

Add regression tests for port deduplication, Service type reconciliation,
ResizePVC with a shared DataPVC, and quota Get-failure fail-fast.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
chart/pelican-wings/README.md (1)

30-45: ⚠️ Potential issue | 🟡 Minor

Add resourceQuota and limitRange parameters to the "Key Values" documentation table.

The documented configuration keys match values.yaml, but the README's "Key Values" section is missing the ResourceQuota and LimitRange parameters introduced in this PR. Both resourceQuota (with enabled, cpuLimit, memoryLimit) and limitRange (with enabled, defaultCPULimit, defaultMemoryLimit) are defined in values.yaml but not documented in the table. These should be added for completeness since they control resource enforcement in the game server namespace.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/README.md` around lines 30 - 45, The README's "Key
Values" documentation table is missing the resourceQuota and limitRange
configuration parameters that are defined in values.yaml. Add two new parameter
entries to the documentation table: first for resourceQuota with sub-parameters
enabled, cpuLimit, and memoryLimit that control resource quotas in the game
server namespace, and second for limitRange with sub-parameters enabled,
defaultCPULimit, and defaultMemoryLimit that control default resource limits.
Follow the same table format as the existing parameter entries, providing clear
descriptions and their default values from values.yaml.
🧹 Nitpick comments (1)
chart/pelican-wings/templates/configmap.yaml (1)

1-12: 💤 Low value

Consider renaming file to secret.yaml for clarity.

The template creates a Secret resource but is named configmap.yaml. While functionally correct (and good that credentials are stored as a Secret), the filename mismatch could confuse maintainers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/pelican-wings/templates/configmap.yaml` around lines 1 - 12, The file
is named configmap.yaml but defines a Kubernetes Secret resource (kind: Secret),
creating a naming mismatch that could confuse maintainers. Rename the file from
configmap.yaml to secret.yaml to accurately reflect the resource type it
contains.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@chart/pelican-wings/README.md`:
- Around line 30-45: The README's "Key Values" documentation table is missing
the resourceQuota and limitRange configuration parameters that are defined in
values.yaml. Add two new parameter entries to the documentation table: first for
resourceQuota with sub-parameters enabled, cpuLimit, and memoryLimit that
control resource quotas in the game server namespace, and second for limitRange
with sub-parameters enabled, defaultCPULimit, and defaultMemoryLimit that
control default resource limits. Follow the same table format as the existing
parameter entries, providing clear descriptions and their default values from
values.yaml.

---

Nitpick comments:
In `@chart/pelican-wings/templates/configmap.yaml`:
- Around line 1-12: The file is named configmap.yaml but defines a Kubernetes
Secret resource (kind: Secret), creating a naming mismatch that could confuse
maintainers. Rename the file from configmap.yaml to secret.yaml to accurately
reflect the resource type it contains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bf4caab9-0551-4889-b062-1c9e2c53fc91

📥 Commits

Reviewing files that changed from the base of the PR and between 3be8dda and 6365175.

📒 Files selected for processing (28)
  • .github/workflows/helm-lint.yaml
  • chart/pelican-wings/README.md
  • chart/pelican-wings/templates/_helpers.tpl
  • chart/pelican-wings/templates/clusterrole-metrics.yaml
  • chart/pelican-wings/templates/configmap.yaml
  • chart/pelican-wings/templates/deployment.yaml
  • chart/pelican-wings/templates/limitrange.yaml
  • chart/pelican-wings/templates/resourcequota.yaml
  • chart/pelican-wings/values.yaml
  • environment/kubernetes/container.go
  • environment/kubernetes/environment_test.go
  • environment/kubernetes/installer.go
  • environment/kubernetes/installer_test.go
  • environment/kubernetes/network.go
  • environment/kubernetes/network_test.go
  • environment/kubernetes/power.go
  • environment/kubernetes/quota.go
  • environment/kubernetes/quota_test.go
  • environment/kubernetes/stats.go
  • environment/kubernetes/storage.go
  • environment/kubernetes/storage_test.go
  • kubernetes/README.md
  • kubernetes/clusterrole-metrics-kubelet.yaml
  • kubernetes/clusterrole-metrics.yaml
  • router/router_system.go
  • server/install_kubernetes.go
  • server/update.go
  • system/system.go
✅ Files skipped from review due to trivial changes (1)
  • kubernetes/README.md
🚧 Files skipped from review as they are similar to previous changes (19)
  • chart/pelican-wings/templates/limitrange.yaml
  • chart/pelican-wings/templates/clusterrole-metrics.yaml
  • .github/workflows/helm-lint.yaml
  • chart/pelican-wings/templates/deployment.yaml
  • chart/pelican-wings/values.yaml
  • environment/kubernetes/storage_test.go
  • environment/kubernetes/environment_test.go
  • environment/kubernetes/quota.go
  • environment/kubernetes/quota_test.go
  • environment/kubernetes/installer_test.go
  • server/install_kubernetes.go
  • system/system.go
  • environment/kubernetes/storage.go
  • environment/kubernetes/network_test.go
  • router/router_system.go
  • environment/kubernetes/network.go
  • environment/kubernetes/power.go
  • environment/kubernetes/stats.go
  • environment/kubernetes/container.go
📜 Review details
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-03-02T13:53:08.995Z
Learnt from: parkervcp
Repo: pelican-dev/wings PR: 171
File: server/power.go:190-203
Timestamp: 2026-03-02T13:53:08.995Z
Learning: In the server package, when quotas are enabled via config.Get().System.Quotas.Enabled, the disk space check using used >= s.DiskSpace() does not require a special guard for unlimited-disk scenarios (DiskSpace() <= 0). The filesystem handles such cases, so the existing check is sufficient. Apply this pattern to similar quota-related disk checks in the server package and ensure tests/docs reflect that unlimited-disk behavior is governed by the filesystem, not by an extra guard in code.

Applied to files:

  • server/update.go
🪛 markdownlint-cli2 (0.22.1)
chart/pelican-wings/README.md

[warning] 51-51: Blank line inside blockquote

(MD028, no-blanks-blockquote)

🪛 Trivy (0.69.3)
kubernetes/clusterrole-metrics-kubelet.yaml

[error] 21-23: Do not allow privilege escalation from node proxy

Role permits privilege escalation from node proxy

Rule: KSV-0047

Learn more

(IaC/Kubernetes)

🪛 YAMLlint (1.37.1)
chart/pelican-wings/templates/configmap.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (11)
chart/pelican-wings/templates/_helpers.tpl (1)

54-61: LGTM!

chart/pelican-wings/templates/configmap.yaml (1)

13-80: LGTM!

chart/pelican-wings/templates/resourcequota.yaml (1)

1-35: LGTM!

kubernetes/clusterrole-metrics.yaml (1)

1-24: LGTM!

kubernetes/clusterrole-metrics-kubelet.yaml (1)

1-40: Trivy KSV-0047 acknowledged: nodes/proxy privilege escalation risk is documented and opt-in.

The static analysis correctly flags nodes/proxy as a privilege escalation vector. However, this is intentionally documented in the header comments (lines 1-10) as an optional, explicitly opted-into fallback for clusters without metrics-server. The separation from the default clusterrole-metrics.yaml follows the principle of least privilege.

The hardcoded pelican-wings / pelican references in the binding (lines 34-35) are appropriate for static manifests, and line 10 instructs users to adjust them.

chart/pelican-wings/README.md (1)

47-55: 💤 Low value

Fix markdown lint warning: blank line between blockquotes (MD028).

Markdownlint flags a blank line at line 51 separating two blockquotes. To comply with the no-blanks-blockquote rule, remove the blank line or use an inline separator. The visual appearance remains clear either way.

♻️ Proposed fix (option: join with no blank line)
 > **Namespace:** Wings schedules game-server workloads into `gameNamespace`, and
 > the chart creates the namespaced RBAC there. `wings.kubernetes.namespace` is
 > therefore derived from `gameNamespace`; if you set it explicitly it must match
 > `gameNamespace` or the chart will fail to render.
->
 > **Credentials:** `wings.token`, `wings.tokenId`, and `wings.uuid` are rendered
 > into a Kubernetes **Secret** (not a ConfigMap). Supply them via a private
 > values file or `--set`, e.g. `helm install ... -f my-creds.yaml`, and keep that
 > file out of version control.

Source: Linters/SAST tools

environment/kubernetes/installer.go (4)

64-189: LGTM!


293-319: LGTM!


372-391: LGTM!


393-419: LGTM!

server/update.go (1)

7-7: LGTM!

Also applies to: 43-49

… test

- server/manager.go: initialize Kubernetes Metadata with the server stop
  configuration so command/signal stop behavior is correct from first boot.
- kubernetes/README.md: sync the namespace-scoped RBAC verb table with role.yaml
  (pods watch, services list, configmaps update).
- chart README: use a local values file for credentials in the quick-start
  instead of leaking them via --set.
- installer_test.go: assert the Run cancellation path returns context.Canceled.
@parkervcp

Copy link
Copy Markdown
Member

This is going to take some time to process and work through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants