feat: add Kubernetes environment (run game servers as Pods)#193
feat: add Kubernetes environment (run game servers as Pods)#193Exonical wants to merge 6 commits into
Conversation
Adds an opt-in Kubernetes backend alongside Docker, gated behind kubernetes.enabled (default false). A new environment/kubernetes package implements the ProcessEnvironment interface: Pod lifecycle/power, SPDY attach/exec console, hostport/nodeport/loadbalancer networking, hostpath/PVC storage, optional ResourceQuota/LimitRange, metrics-API stats, and egg installation via a Job + ConfigMap. Environment selection is gated in server/manager.go and server/install.go; system info and /api/system/ips become backend-aware in system/system.go and router/router_system.go. Docker behavior is unchanged when kubernetes.enabled is false. Ships raw RBAC manifests under kubernetes/ and a Helm chart under chart/pelican-wings/ for deploying Wings in-cluster, plus a helm-lint workflow. The Kubernetes client libraries (k8s.io/* v0.36.1) require Go >= 1.26.0, so the go directive is bumped accordingly.
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (55)
📝 WalkthroughWalkthroughThis PR introduces a full Kubernetes execution backend for Pelican Wings. It adds config types, a Kubernetes ChangesKubernetes Go Runtime Backend
Helm Chart and Static RBAC Manifests
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (29)
environment/kubernetes/container.go-255-259 (1)
255-259:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOnly transition to offline after successful Pod deletion handling.
At Line 255, state is set to offline before checking whether pod deletion actually failed. On non-NotFound delete errors, control-plane state becomes inaccurate while the pod may still be running.
Proposed fix
- e.SetState(environment.ProcessOfflineState) - if err != nil && !isNotFound(err) { return errors.Wrap(err, "environment/kubernetes: failed to delete pod") } + e.SetState(environment.ProcessOfflineState) return nil🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/container.go` around lines 255 - 259, The state is being set to ProcessOfflineState before validating the pod deletion outcome, which causes control-plane state to become inaccurate when deletion fails with a non-NotFound error. Move the e.SetState(environment.ProcessOfflineState) call to execute only after the error check succeeds, so that the offline state transition happens only when deletion is successful or when the pod is already gone (NotFound error). This ensures state accuracy by deferring the state change until after confirming the deletion operation completed as expected.environment/kubernetes/container.go-49-50 (1)
49-50:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle
Exists()errors explicitly inCreate().At Line 49, the error from
e.Exists()is discarded. If the API call fails (RBAC/network/etc.),Create()continues and masks the root cause with a secondary create error.Proposed fix
- if exists, _ := e.Exists(); exists { + exists, err := e.Exists() + if err != nil { + return errors.Wrap(err, "environment/kubernetes: failed to check pod existence") + } + if exists { return nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/container.go` around lines 49 - 50, In the Create() method, the error returned by e.Exists() is being ignored using a blank identifier. Instead of discarding this error with the underscore, capture it in a variable and explicitly check if it is not nil. If an error occurs during the Exists() call (due to RBAC, network issues, etc.), return that error immediately rather than continuing execution, which would otherwise mask the root cause with a potentially different create error downstream.environment/kubernetes/container.go-562-564 (1)
562-564:⚠️ Potential issue | 🟠 MajorUse typed Kubernetes NotFound checks instead of string matching.
The current implementation at lines 562-563 matches
"not found"viaerr.Error(), which is brittle and can misclassify errors. Useapierrors.IsNotFound(err)fromk8s.io/apimachinery/pkg/api/errorsinstead. This function is widely used across the codebase (13 call sites), so fixing it improves error handling consistency.Add the import:
apierrors "k8s.io/apimachinery/pkg/api/errors"Then update the function:
func isNotFound(err error) bool { return apierrors.IsNotFound(err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/container.go` around lines 562 - 564, The isNotFound function uses brittle string matching to check for Kubernetes not found errors instead of using the proper API errors package. Add the import for apierrors from k8s.io/apimachinery/pkg/api/errors at the top of the file, then replace the string matching logic in the isNotFound function with a call to apierrors.IsNotFound(err) which provides consistent and reliable error type checking across the codebase.environment/kubernetes/container.go-517-542 (1)
517-542:⚠️ Potential issue | 🟠 MajorDeduplicate allocation ports in container spec to avoid Kubernetes update failures.
buildContainerPorts()appends ports per map entry without deduping. When the same port number appears under multiple allocation IPs, duplicate TCP/UDP entries are created. While Kubernetes may accept such duplicates at Pod creation, subsequent update or patch operations frequently fail due to merge key conflicts in the reconciliation logic. This mirrors the deduplication pattern already applied inEnsureService()(network.go:44-51).Proposed fix
func (e *Environment) buildContainerPorts() []corev1.ContainerPort { cfg := config.Get() allocs := e.Configuration.Allocations() var ports []corev1.ContainerPort + seen := make(map[int]struct{}) for _, allocPorts := range allocs.Mappings { for _, port := range allocPorts { if port < 1 || port > 65535 { continue } + if _, ok := seen[port]; ok { + continue + } + seen[port] = struct{}{}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/container.go` around lines 517 - 542, The buildContainerPorts() function creates duplicate TCP/UDP ContainerPort entries when the same port number appears under multiple allocation IPs, which causes Kubernetes update/patch operations to fail. Implement port deduplication in buildContainerPorts() by tracking which ports have already been processed (using a map keyed by port number) before appending tcpPort and udpPort to the ports slice, following the same deduplication pattern already applied in EnsureService() function to ensure each unique port is only added once regardless of how many allocation entries reference it.environment/kubernetes/power.go-91-93 (1)
91-93:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSignal-based stop currently force-kills immediately.
Line 92 calls
Terminate(..., "SIGTERM"), butTerminate(Line 187-190) always usesGracePeriodSeconds=0. This skips graceful shutdown and can cause abrupt process termination.Suggested fix
- // If using a signal-based stop, terminate with that signal. + // If using signal-based stop, map to graceful/force delete semantics. if s.Type == "" || s.Type == remote.ProcessStopSignal { - return e.Terminate(ctx, "SIGTERM") + gracePeriod := int64(30) + if strings.EqualFold(s.Value, "SIGKILL") { + gracePeriod = 0 + } + err := e.client.CoreV1().Pods(e.namespace()).Delete(ctx, e.Id, metav1.DeleteOptions{ + GracePeriodSeconds: &gracePeriod, + }) + if err != nil && !isNotFound(err) { + return errors.Wrap(err, "environment/kubernetes: failed to stop pod") + } + return nil }Also applies to: 169-190
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/power.go` around lines 91 - 93, The issue is that when stopping a process with a signal (when s.Type is empty or equals remote.ProcessStopSignal), the code calls Terminate with "SIGTERM" but the Terminate method always sets GracePeriodSeconds=0, which force-kills the process immediately instead of allowing graceful shutdown. To fix this, modify the Terminate method to accept a grace period parameter (or create a separate graceful termination method) and pass an appropriate grace period value (e.g., 30 seconds) when calling Terminate from the signal-based stop path, ensuring that SIGTERM signals allow time for graceful process shutdown rather than immediate termination.environment/kubernetes/power.go-156-163 (1)
156-163:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWaitForStop should treat non-running Pods as stopped, not only deleted.
After command-based stop, the Pod can remain in
Succeeded/Failedwithout being deleted. Line 157 currently waits only for deletion, so stop/restart can block until timeout and then force terminate.Suggested fix
- // Wait for the Pod to be gone. - if err := e.waitForPodDeletion(tctx, duration); err != nil { + // Wait for the Pod to be gone or no longer running. + if err := e.waitForPodStoppedOrDeleted(tctx, duration); err != nil { if terminate { e.log().Warn("pod did not terminate in time, forcing deletion") return e.Terminate(ctx, "SIGKILL") @@ return nil } + +func (e *Environment) waitForPodStoppedOrDeleted(ctx context.Context, timeout time.Duration) error { + dctx, cancel := context.WithTimeout(ctx, timeout) + defer cancel() + + ticker := time.NewTicker(500 * time.Millisecond) + defer ticker.Stop() + + for { + select { + case <-dctx.Done(): + return dctx.Err() + case <-ticker.C: + pod, err := e.getPod(dctx) + if err != nil && isNotFound(err) { + return nil + } + if err == nil && !isPodRunning(pod) { + e.markOffline() + return nil + } + } + } +}Also applies to: 264-282
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/power.go` around lines 156 - 163, The waitForPodDeletion call in the WaitForStop method only waits for complete pod deletion but does not account for pods that have transitioned to terminal states like Succeeded or Failed. Modify the logic to check not only for pod deletion but also for the pod being in a non-running terminal state (Succeeded/Failed status). The waiting condition should return success when either the pod is deleted OR the pod reaches a terminal non-running state, preventing unnecessary blocking until timeout when the pod has already stopped executing.environment/kubernetes/power.go-23-29 (1)
23-29:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle Pod delete/wait errors explicitly before recreate.
Line 23 and Line 28 currently drop errors; this can mask RBAC/API failures and turn them into opaque
Createfailures later.Suggested fix
gracePeriod := int64(0) - _ = e.client.CoreV1().Pods(e.namespace()).Delete(ctx, e.Id, metav1.DeleteOptions{ + if err := e.client.CoreV1().Pods(e.namespace()).Delete(ctx, e.Id, metav1.DeleteOptions{ GracePeriodSeconds: &gracePeriod, - }) + }); err != nil && !isNotFound(err) { + return errors.Wrap(err, "environment/kubernetes: failed to delete existing pod before start") + } // Wait briefly for deletion to propagate. - e.waitForPodDeletion(ctx, 10*time.Second) + if err := e.waitForPodDeletion(ctx, 10*time.Second); err != nil { + return errors.Wrap(err, "environment/kubernetes: timed out waiting for old pod deletion") + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/power.go` around lines 23 - 29, The Delete call on e.client.CoreV1().Pods(e.namespace()).Delete() and the waitForPodDeletion() method both currently ignore errors by using underscore assignments. Instead of discarding these errors, capture the return values from both the Delete operation and the waitForPodDeletion call, then add explicit error handling to check if either operation returns an error. If an error occurs, handle it appropriately (such as logging and returning early) to prevent masking RBAC or API failures that would otherwise appear later as opaque errors during subsequent Create operations.kubernetes/clusterrole-metrics.yaml-20-24 (1)
20-24:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftMake kubelet proxy permissions opt-in.
Granting
nodes/proxy(get) cluster-wide materially increases blast radius; a compromised Wings identity can proxy kubelet endpoints across nodes. Keepmetrics.k8s.io/podsas default and move kubelet proxy access to a separate optional manifest/binding.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kubernetes/clusterrole-metrics.yaml` around lines 20 - 24, Remove the kubelet proxy permission block (the apiGroups, resources, and verbs entries for nodes/proxy) from the clusterrole-metrics.yaml file to make it the default minimal permission set. Create a separate optional manifest file for kubelet proxy access that users can apply when they need to proxy kubelet endpoints across nodes. This keeps the default configuration least-privileged while allowing users to opt-in to the higher blast radius permission if needed.Source: Linters/SAST tools
kubernetes/README.md-112-114 (1)
112-114:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMultiple-instance guidance is incorrect for ClusterRoleBinding usage.
Line 113 says to share the same ClusterRoleBinding across namespaces, but ClusterRoleBinding subjects are namespaced ServiceAccounts. Additional Wings instances in other namespaces need either separate ClusterRoleBindings or additional
subjectsentries.Suggested README wording
-If you run multiple Wings nodes targeting different namespaces, create a Role and -RoleBinding per namespace, but share the ClusterRole/ClusterRoleBinding (it's -namespace-independent). +If you run multiple Wings nodes targeting different namespaces, create a Role and +RoleBinding per namespace and bind each ServiceAccount to the shared ClusterRole +(either one ClusterRoleBinding per ServiceAccount or a single binding with multiple subjects).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kubernetes/README.md` around lines 112 - 114, The guidance in the section about running multiple Wings nodes targeting different namespaces is incorrect regarding ClusterRoleBinding usage. Update lines 112-114 to clarify that ClusterRoleBinding subjects reference namespaced ServiceAccounts, so when running multiple Wings instances in different namespaces, you must either create separate ClusterRoleBindings for each namespace or add multiple subject entries to a single shared ClusterRoleBinding that references the ServiceAccount from each namespace. Remove the incorrect suggestion that a single ClusterRoleBinding can be shared across namespaces without modification.chart/pelican-wings/README.md-15-22 (1)
15-22:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid putting Panel credentials directly in CLI flags.
Using
--set wings.token=...and--set wings.tokenId=...exposes credentials in shell history and process lists. Prefer a local values file (or secret-backed workflow) for the quick-start example.Suggested README update
-helm install wings ./chart/pelican-wings \ - --set wings.panelUrl=https://panel.example.com \ - --set wings.token=YOUR_TOKEN \ - --set wings.tokenId=YOUR_TOKEN_ID \ - --set wings.uuid=YOUR_NODE_UUID +# values.local.yaml (do not commit) +wings: + panelUrl: https://panel.example.com + token: YOUR_TOKEN + tokenId: YOUR_TOKEN_ID + uuid: YOUR_NODE_UUID + +helm install wings ./chart/pelican-wings -f values.local.yaml🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/README.md` around lines 15 - 22, The helm install example exposes sensitive credentials in CLI flags (wings.token and wings.tokenId) which appear in shell history and process lists. Replace the direct --set flags approach with an example using a local values file that keeps credentials separate from the command line invocation. Create a sample values file snippet in the README showing how to define these sensitive parameters (wings.token, wings.tokenId, wings.uuid) securely, then update the helm install command to reference this file using the -f flag instead of exposing the values directly.kubernetes/README.md-34-51 (1)
34-51:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequired-permissions tables are out of sync with the manifests.
The table omits granted permissions (
pods: watch,services: list,configmaps: update) and entirely excludes cluster-scopednodes/nodes/proxyaccess defined inkubernetes/clusterrole-metrics.yaml. This weakens security review accuracy for operators.Suggested README table corrections
-| pods | get, create, delete, list | Game server Pod lifecycle | +| pods | get, create, delete, list, watch | Game server Pod lifecycle | ... -| services | get, create, update, delete | NodePort Service management | +| services | get, create, update, delete, list | NodePort/LB Service management | ... -| configmaps | get, create, delete | Install script storage (multi-node) | +| configmaps | get, create, update, delete | Install script + identity management | ### Cluster-scoped -| Resource | Verbs | Purpose | -|---------------------------------------|-------|---------------------------------| -| pods (metrics.k8s.io/v1beta1) | get | CPU/memory usage polling | +| Resource | Verbs | Purpose | +|---------------------------------------|-------|-------------------------------------------| +| pods (metrics.k8s.io/v1beta1) | get | CPU/memory usage polling | +| nodes | get | Node IP discovery | +| nodes/proxy | get | Kubelet stats fallback (if enabled) |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kubernetes/README.md` around lines 34 - 51, Update the Kubernetes RBAC permissions table in the README.md to match the actual manifests defined in kubernetes/clusterrole-metrics.yaml. Add the missing verbs to existing resources: pods (add watch), services (add list), and configmaps (add update) in the namespace-scoped section. Additionally, add a new row in the cluster-scoped section documenting the nodes and nodes/proxy resources with their granted verbs (get for nodes and get for nodes/proxy) to ensure the documentation accurately reflects all permissions granted by the cluster role..github/workflows/helm-lint.yaml-21-21 (1)
21-21:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDisable credential persistence in checkout step.
Line 21 should set
persist-credentials: falseto avoid leaving the GitHub token in local git config for later steps.Suggested hardening patch
- - uses: actions/checkout@v7 + - uses: actions/checkout@v7 + with: + persist-credentials: false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/helm-lint.yaml at line 21, The actions/checkout@v7 step on line 21 needs to include the persist-credentials parameter set to false to prevent the GitHub token from remaining in the local git configuration after the checkout completes. Add a with block to the actions/checkout@v7 step and configure persist-credentials: false within it to disable credential persistence and improve security.Source: Linters/SAST tools
.github/workflows/helm-lint.yaml-21-23 (1)
21-23:⚠️ Potential issue | 🟠 MajorPin GitHub Actions to immutable commit SHAs.
Lines 21 and 23 use mutable tags (
@v7,@v5), which are a supply-chain risk. Pinning to commit SHAs ensures workflows execute only reviewed code and complies with security hardening standards.Suggested hardening patch
- - uses: actions/checkout@v7 + - uses: actions/checkout@1044a6dea927916f2c38ba5aeffbc0a847b1221a # v7.0.0 - - uses: azure/setup-helm@v5 + - uses: azure/setup-helm@dda3372f752e03dde6b3237bc9431cdc2f7a02a2 # v5.0.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/helm-lint.yaml around lines 21 - 23, Replace the mutable version tags with immutable commit SHAs for the GitHub Actions used in the workflow. For the actions/checkout action (currently pinned to `@v7`) and the azure/setup-helm action (currently pinned to `@v5`), replace the `@v` version tags with their corresponding full commit SHAs. This ensures the workflow executes only the reviewed code and complies with security hardening standards by preventing unexpected changes from action updates.Source: Linters/SAST tools
chart/pelican-wings/templates/configmap.yaml-12-15 (1)
12-15:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not store panel credentials in a ConfigMap.
Line 14 writes
tokenintoconfig.ymlinside a ConfigMap. That exposes sensitive credentials through non-secret config objects.🛡️ Suggested secure approach
-apiVersion: v1 -kind: ConfigMap +apiVersion: v1 +kind: Secret +type: Opaque ... -data: +stringData: config.yml: |Also update
chart/pelican-wings/templates/deployment.yamlvolume source fromconfigMaptosecretforconfig-source.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/templates/configmap.yaml` around lines 12 - 15, The token field in the ConfigMap exposes sensitive credentials through a non-secret configuration object, which is a security risk. Move the sensitive `token` field from the ConfigMap in configmap.yaml to a Kubernetes Secret resource instead. Additionally, update the volume source for `config-source` in the deployment.yaml file to reference the new Secret instead of the ConfigMap, ensuring that sensitive credentials are properly protected as secret objects rather than exposed in plain ConfigMaps.Source: Linters/SAST tools
chart/pelican-wings/templates/configmap.yaml-29-30 (1)
29-30:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPrevent namespace drift between chart resources and runtime namespace config.
Line 29 uses
.Values.wings.kubernetes.namespace, while chart resources are rendered into.Values.gameNamespace(Line 5). If those diverge, Wings can target a namespace where RBAC/resources are not provisioned.✅ Suggested contract fix
- namespace: {{ .Values.wings.kubernetes.namespace | quote }} + namespace: {{ .Values.gameNamespace | quote }}+{{- if ne .Values.gameNamespace .Values.wings.kubernetes.namespace }} +{{- fail "gameNamespace and wings.kubernetes.namespace must match" }} +{{- end }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/templates/configmap.yaml` around lines 29 - 30, The namespace configuration in the Wings runtime setup uses `.Values.wings.kubernetes.namespace` on line 29, but chart resources are deployed into `.Values.gameNamespace` (line 5), creating a namespace drift where Wings might target a namespace without the necessary RBAC or resources. Replace the reference from `.Values.wings.kubernetes.namespace` with `.Values.gameNamespace` in the namespace field to ensure runtime and provisioned resources target the same namespace consistently.chart/pelican-wings/templates/limitrange.yaml-1-38 (1)
1-38:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast when LimitRange is enabled with no limits configured.
limitRange.enabled=truecurrently renders even when all limit fields are empty, producing a near-emptyLimitRangeitem. Add a precondition requiring at least one configured field to avoid invalid/no-op installs.Suggested fix
{{- if .Values.limitRange.enabled }} +{{- if not (or .Values.limitRange.defaultCPULimit .Values.limitRange.defaultMemoryLimit .Values.limitRange.defaultCPURequest .Values.limitRange.defaultMemoryRequest .Values.limitRange.maxCPU .Values.limitRange.maxMemory) }} +{{- fail "limitRange.enabled=true requires at least one limitRange.* field" }} +{{- end }} apiVersion: v1 kind: LimitRange🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/templates/limitrange.yaml` around lines 1 - 38, The LimitRange template currently renders whenever limitRange.enabled is true, even if all limit configuration fields are empty, resulting in an invalid or no-op resource. Modify the initial condition check (where `if .Values.limitRange.enabled` is evaluated) to add an additional validation that ensures at least one of the limit fields is configured (defaultCPULimit, defaultMemoryLimit, defaultCPURequest, defaultMemoryRequest, maxCPU, or maxMemory) before rendering the entire LimitRange resource. This will fail fast and prevent creating empty LimitRange objects.chart/pelican-wings/templates/clusterrolebinding-metrics.yaml-9-11 (1)
9-11:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPrevent implicit binding to the
defaultServiceAccount.With the current helper fallback,
serviceAccount.create=false+ emptyserviceAccount.namebinds this ClusterRole todefault, unintentionally granting cluster-level node/metrics access to any pod using that SA ingameNamespace. Require an explicit SA name when not creating one.Suggested fix
+{{- if and (not .Values.serviceAccount.create) (not .Values.serviceAccount.name) }} +{{- fail "serviceAccount.name must be set when serviceAccount.create=false" }} +{{- end }} {{- if .Values.rbac.create }} apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/templates/clusterrolebinding-metrics.yaml` around lines 9 - 11, The serviceAccountName helper function in the ClusterRoleBinding has a fallback behavior that implicitly binds to the default ServiceAccount when serviceAccount.create is false and serviceAccount.name is empty, which unintentionally grants cluster-level permissions. Modify the serviceAccountName helper function to require an explicit service account name when serviceAccount.create is set to false, preventing any fallback to the default ServiceAccount and raising an error or validation failure if the name is not explicitly provided in the values configuration.chart/pelican-wings/templates/resourcequota.yaml-1-31 (1)
1-31:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard against empty
spec.hardwhen quota is enabled.If
resourceQuota.enabled=truebut all resource fields are empty, this renders an emptyhardblock and the resource can fail validation at apply time. Add a precondition requiring at least one hard limit field.Suggested fix
{{- if .Values.resourceQuota.enabled }} +{{- if not (or .Values.resourceQuota.cpuLimit .Values.resourceQuota.memoryLimit .Values.resourceQuota.cpuRequest .Values.resourceQuota.memoryRequest .Values.resourceQuota.maxPods .Values.resourceQuota.maxPVCs .Values.resourceQuota.maxStorage) }} +{{- fail "resourceQuota.enabled=true requires at least one resourceQuota.* limit/request field" }} +{{- end }} apiVersion: v1 kind: ResourceQuota🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/templates/resourcequota.yaml` around lines 1 - 31, The ResourceQuota template renders an empty spec.hard block when resourceQuota.enabled is true but all individual resource quota fields (cpuLimit, memoryLimit, cpuRequest, memoryRequest, maxPods, maxPVCs, maxStorage) are disabled or empty, causing Kubernetes validation failures. Modify the outer condition that checks .Values.resourceQuota.enabled to also verify that at least one of the hard limit fields is configured before rendering the entire ResourceQuota resource, preventing the creation of invalid manifests with empty hard specifications.router/router_system.go-111-114 (1)
111-114:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKubernetes IP discovery failures are silently converted into a successful empty response.
When node/LB queries fail, the handler only logs warnings and still returns 200 (Line 142), unlike the non-Kubernetes path that aborts on discovery errors. This can surface as “no assignable IPs” without an actionable API error.
Also applies to: 142-142
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@router/router_system.go` around lines 111 - 114, The kubernetes.GetNodeIPs() call in the handler is logging warnings on errors but still proceeding to return a successful 200 response with empty results, creating inconsistency with the non-Kubernetes path which aborts on discovery failures. Modify the error handling in the block where kubernetes.GetNodeIPs(ctx) is called to return an appropriate error response (likely an HTTP error status) instead of silently continuing when the error occurs, ensuring consistent behavior between the Kubernetes and non-Kubernetes code paths and preventing empty IP lists from being treated as successful responses.server/manager.go-215-217 (1)
215-217:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKubernetes runtime metadata is missing stop configuration.
Line 215 initializes
kubernetes.Metadatawith onlyImage. The Kubernetes metadata contract includes stop settings used by lifecycle logic; omitting it can change shutdown behavior (signal/timeout) versus server config.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/manager.go` around lines 215 - 217, The kubernetes.Metadata initialization is missing stop configuration which controls shutdown behavior. In the metadata struct creation where Image is set, add the stop settings from the server configuration (likely s.Config().Stop or similar) to ensure the Kubernetes metadata includes the complete lifecycle settings including signal and timeout values that match the server config.environment/kubernetes/installer.go-304-310 (1)
304-310:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
WriteInstallScriptignores caller cancellation/timeouts.Line 304 hardcodes
context.Background(), so ConfigMap operations cannot be canceled with the server/install context. This can hang installs during API server/network issues.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/installer.go` around lines 304 - 310, The WriteInstallScript method hardcodes context.Background() for ConfigMap operations, which ignores the caller's context and prevents cancellation or timeout propagation. Replace the hardcoded context.Background() assignment with the appropriate context that's passed to the function through its parameters or receiver, and ensure this context is used in both the Delete and Create calls on the ConfigMaps client to respect the caller's cancellation signals and timeouts.environment/kubernetes/installer.go-374-380 (1)
374-380:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
SubPathmay become invalid when data/root paths diverge.On Line 375,
filepath.Relcan produce values like../...whenSystem.Datais outsideSystem.RootDirectory. Using that asVolumeMount.SubPathmakes the Pod invalid and the installer will not start.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/installer.go` around lines 374 - 380, The relative path calculated by filepath.Rel can contain `..` components when System.Data is located outside System.RootDirectory, which makes it invalid for Kubernetes VolumeMount.SubPath. In the else block where vm.SubPath is assigned the rel value, add validation to check if the relative path contains `..` components. If it does, fall back to using the default path format (filepath.Join("volumes", ip.ServerID)) instead of the invalid relative path, similar to what is done in the error case.environment/kubernetes/installer.go-66-69 (1)
66-69:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDelete-then-create can race on installer Job name reuse.
Line 66 uses background deletion and Line 169 immediately recreates the same Job name. If the old Job is still terminating,
Createcan fail withAlreadyExists, causing transient install failures.Also applies to: 169-170
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/installer.go` around lines 66 - 69, The Delete call at line 66 uses propagationBackground() which causes asynchronous deletion, creating a race condition when the same Job is immediately recreated at line 169. The old Job may still be terminating when Create is called, resulting in AlreadyExists errors. Change the propagation policy from propagationBackground() to a foreground/blocking policy (such as propagationForeground()) to ensure the Job deletion completes before the Delete call returns, allowing the subsequent Create call to safely recreate the Job without conflicts.go.mod-3-3 (1)
3-3:⚠️ Potential issue | 🟠 MajorUpdate Dockerfile and CI workflow Go version to 1.26.0 to match go.mod.
The Dockerfile uses
golang:1.25.7-alpineand at least one CI workflow is hardcoded togo-version: "1.25.7", but go.mod now requires Go 1.26.0. This mismatch can cause build failures and inconsistent behavior across local, CI, and production environments.Current misalignment
go.mod: go 1.26.0 Dockerfile: FROM golang:1.25.7-alpine CI workflow: go-version: "1.25.7"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@go.mod` at line 3, The Go version in go.mod is now 1.26.0 but the Dockerfile and CI workflow still reference the older 1.25.7 version, causing environment mismatches. Update the Dockerfile's FROM statement to use golang:1.26.0-alpine instead of golang:1.25.7-alpine, and update any CI workflow files (such as those containing go-version configuration) to set go-version to "1.26.0" instead of "1.25.7" to ensure consistency across all build environments.environment/kubernetes/quota.go-30-44 (1)
30-44:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOnly create quota resources on NotFound; fail fast on other Get errors.
In both
EnsureResourceQuotaandEnsureLimitRange, create is attempted after any Get failure. On Line 30-Line 44 and Line 61-Line 75, non-NotFound errors (RBAC, transport, timeout) should return immediately instead of falling through to create.Proposed fix
existing, err := e.client.CoreV1().ResourceQuotas(ns).Get(ctx, quotaName, metav1.GetOptions{}) if err == nil { existing.Spec = quota.Spec _, err = e.client.CoreV1().ResourceQuotas(ns).Update(ctx, existing, metav1.UpdateOptions{}) if err != nil { return errors.Wrap(err, "environment/kubernetes: failed to update ResourceQuota") } return nil } + if !isNotFound(err) { + return errors.Wrap(err, "environment/kubernetes: failed to get ResourceQuota") + } _, err = e.client.CoreV1().ResourceQuotas(ns).Create(ctx, quota, metav1.CreateOptions{})Apply the same pattern to
EnsureLimitRange.Also applies to: 61-75
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/quota.go` around lines 30 - 44, The ResourceQuota Get operation in EnsureResourceQuota currently falls through to create when any error occurs, but it should only attempt creation if the error is specifically a NotFound error. After the Get call for ResourceQuotas, add a check: if err is not nil and not a NotFound error (using apierrors.IsNotFound), return the error immediately instead of proceeding to the Create call. Apply this same pattern to the EnsureLimitRange method where it performs a similar Get and Create flow on LimitRanges, ensuring that non-NotFound errors fail fast rather than attempting the create operation.environment/kubernetes/storage.go-142-149 (1)
142-149:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRespect shared
DataPVCmode inResizePVC.
ResizePVCon Line 143-Line 149 misses the sameDataPVCshort-circuit used byEnsurePVC/DeletePVC. In shared-PVC mode, this path tries resizing a per-server claim that is intentionally not created.Proposed fix
func (e *Environment) ResizePVC(ctx context.Context, newSize string) error { cfg := config.Get() if cfg.Kubernetes.StorageMode != config.KubeStoragePVC { return nil } + if cfg.Kubernetes.DataPVC != "" { + return nil + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/storage.go` around lines 142 - 149, The ResizePVC method on the Environment type is missing the shared DataPVC mode check that is present in EnsurePVC and DeletePVC. Add an early return condition in ResizePVC to check for shared DataPVC mode (using the same pattern as the existing storage mode check) so that the function exits before attempting to resize a per-server claim that is not created in shared-PVC mode.environment/kubernetes/network.go-141-151 (1)
141-151:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate existing Service type and annotations during reconciliation.
On Line 143-Line 151, the update path never assigns
existing.Spec.Type = desired.Spec.Type. Switching betweennodeportandloadbalancercan leave the Service in the wrong type. Also, annotations are only overwritten in LB mode, so stale LB annotations can persist after switching back to NodePort.Proposed fix
// Service exists; update it with the desired spec while preserving // existing NodePort assignments where possible. + existing.Spec.Type = desired.Spec.Type existing.Spec.Selector = desired.Spec.Selector existing.Spec.Ports = mergeServicePorts(existing.Spec.Ports, desired.Spec.Ports) existing.Labels = labels + existing.Annotations = annotations // nil clears stale LB annotations in non-LB mode e.log().WithField("service", svcName).Infof("updating %s service for server", svcType) - if isLB { - existing.Annotations = annotations - } _, err = e.client.CoreV1().Services(ns).Update(ctx, existing, metav1.UpdateOptions{})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/network.go` around lines 141 - 151, The Service reconciliation logic is missing the Type assignment and conditionally applies annotations. To fix this, ensure that existing.Spec.Type is always assigned from desired.Spec.Type (add this line after the Selector assignment) to properly handle Service type transitions between NodePort and LoadBalancer. Additionally, move the annotations assignment outside of the isLB conditional block so that existing.Annotations = annotations is always applied, preventing stale annotations from persisting when switching Service types.environment/kubernetes/stats.go-83-94 (1)
83-94:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOutbound Kubernetes calls need per-request timeouts.
Do(ctx)andgetPod(ctx)currently use the long-lived polling context. If an API call hangs, the polling loop can block indefinitely and stop publishing fresh stats.💡 Suggested fix
+const statsRequestTimeout = 3 * time.Second + if !useKubelet { - stats, err = e.getMetricsAPIStats(ctx) + reqCtx, cancel := context.WithTimeout(ctx, statsRequestTimeout) + stats, err = e.getMetricsAPIStats(reqCtx) + cancel() if err != nil { if !loggedSource { e.log().WithField("error", err).Info("metrics API unavailable, falling back to kubelet stats") } useKubelet = true } } if useKubelet { - stats, err = e.getKubeletPodStats(ctx) + reqCtx, cancel := context.WithTimeout(ctx, statsRequestTimeout) + stats, err = e.getKubeletPodStats(reqCtx) + cancel() if err != nil { if !loggedSource { e.log().WithField("error", err).Warn("kubelet stats also unavailable, resource metrics will not be reported") loggedSource = true } } } ... - pod, err := e.getPod(ctx) + reqCtx, cancel := context.WithTimeout(ctx, statsRequestTimeout) + pod, err := e.getPod(reqCtx) + cancel()Also applies to: 120-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/stats.go` around lines 83 - 94, The methods getMetricsAPIStats(ctx), getKubeletPodStats(ctx), and getPod(ctx) are using the long-lived polling context directly, which means if any individual API call hangs, the entire polling loop blocks indefinitely and stops publishing fresh stats. Create a new context with a per-request timeout for each individual API call and pass that timeout context to these methods instead of the long-lived polling context. This applies to all three methods: getMetricsAPIStats, getKubeletPodStats, and getPod.environment/kubernetes/stats.go-82-99 (1)
82-99:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFallback source becomes sticky after a single transient Metrics API error.
After the first Metrics API failure,
useKubeletis set totrueand never retried. A transient blip can permanently disable Metrics API usage for the rest of the session.💡 Suggested fix
if !useKubelet { stats, err = e.getMetricsAPIStats(ctx) if err != nil { if !loggedSource { e.log().WithField("error", err).Info("metrics API unavailable, falling back to kubelet stats") } useKubelet = true } } if useKubelet { stats, err = e.getKubeletPodStats(ctx) if err != nil { if !loggedSource { e.log().WithField("error", err).Warn("kubelet stats also unavailable, resource metrics will not be reported") loggedSource = true } + // Retry Metrics API on next tick in case the original failure was transient. + useKubelet = false } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/stats.go` around lines 82 - 99, The issue is that the `useKubelet` flag is set to `true` after the first Metrics API failure and is never reset, causing the code to permanently skip getMetricsAPIStats and always use getKubeletPodStats even if the Metrics API becomes available again. To fix this, ensure that the `useKubelet` flag is reset at the beginning of each iteration or loop cycle so that the Metrics API is retried on each attempt, allowing the code to recover from transient failures. Alternatively, restructure the logic to always attempt getMetricsAPIStats first without relying on a persistent flag, and only fall back to getKubeletPodStats if that attempt fails.
🟡 Minor comments (1)
chart/pelican-wings/README.md-36-37 (1)
36-37:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
networkModeoptions are incomplete in the Key Values table.Line 36 documents only
hostportandnodeport, but chart values also supportloadbalancer. This can mislead operators during setup.Suggested README update
-| `wings.kubernetes.networkMode` | Port exposure: `hostport` or `nodeport` | `nodeport` | +| `wings.kubernetes.networkMode` | Port exposure: `hostport`, `nodeport`, or `loadbalancer` | `nodeport` |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/README.md` around lines 36 - 37, The documentation table for `wings.kubernetes.networkMode` is incomplete and currently only lists `hostport` and `nodeport` as port exposure options, but the chart also supports `loadbalancer`. Update the `wings.kubernetes.networkMode` row in the Key Values table to include all three supported options (hostport, nodeport, and loadbalancer) so operators have accurate information about available network modes during setup.
🧹 Nitpick comments (7)
environment/kubernetes/environment_test.go (1)
509-566: ⚡ Quick winAdd a duplicate-allocation regression test for
buildContainerPorts.Current cases don’t cover repeated port numbers across different allocation IPs, which is a high-risk edge for Pod spec generation. A focused test here will prevent regressions once deduping logic is added.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/environment_test.go` around lines 509 - 566, The buildContainerPorts test suite in the "buildContainerPorts" describe block is missing a regression test for duplicate port allocations across different IPs. Add a new g.It test case that creates an environment.Allocations with the same port number mapped to multiple different IP addresses, then verify that buildContainerPorts correctly deduplicates these ports when generating the container port configuration. This will ensure the deduplication logic handles this high-risk edge case correctly.chart/pelican-wings/values.yaml (1)
127-131: ⚡ Quick winHarden default pod/container security context values.
Line 128 and Line 131 default to
{}, so the chart ships permissive security defaults. Set secure baseline defaults and keep overrides available for compatibility.🔐 Suggested default hardening values
-podSecurityContext: {} +podSecurityContext: + runAsNonRoot: true + fsGroup: 1000 -securityContext: {} +securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: + - ALL🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/values.yaml` around lines 127 - 131, The podSecurityContext and securityContext fields are currently set to empty dictionaries, which results in the chart shipping with permissive security defaults that do not enforce any security constraints. Replace the empty dictionary values for both podSecurityContext and securityContext with secure baseline defaults that implement hardening measures such as running as non-root users, setting appropriate capabilities, disabling privilege escalation, and enforcing read-only root filesystems. Ensure these values remain overridable in the chart to maintain compatibility with deployments that may have specific security requirements.Source: Linters/SAST tools
environment/kubernetes/installer_test.go (1)
183-185: ⚡ Quick winAssert the cancellation error explicitly in the
Runcancellation test.The test currently does
<-errChwithout checking value, so non-cancellation errors can pass silently.Suggested change
- cancel() - <-errCh + cancel() + err = <-errCh + g.Assert(err).Equal(context.Canceled)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/installer_test.go` around lines 183 - 185, In the Run cancellation test, after canceling the context and receiving the error from errCh, explicitly assert that the received error is a context cancellation error rather than just consuming the value without validation. Check that the error equals context.Canceled to ensure the test properly validates the expected cancellation behavior and prevents non-cancellation errors from passing silently.system/system.go (1)
113-116: ⚡ Quick winMake
osrelease.Read()best-effort in Kubernetes mode.Lines 113-116 return immediately on read failure, so the Line 124-131 fallback (
NAME/GOOS) is never reached. This can fail system-info endpoints on minimal images unnecessarily.Also applies to: 124-131
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@system/system.go` around lines 113 - 116, The error from osrelease.Read() causes an immediate return, preventing the fallback code that uses NAME and GOOS from being reached. Make osrelease.Read() best-effort by removing the early return on error and instead allowing the function to continue to the fallback logic (the code at lines 124-131 that constructs the response using NAME and GOOS values). This way, if osrelease.Read() fails on minimal Kubernetes images, the system will still provide system-info via the fallback mechanism rather than failing entirely.environment/kubernetes/network_test.go (1)
92-130: ⚡ Quick winAdd a regression test for Service type transitions on update.
The update case only validates port count. It doesn’t assert behavior when mode changes (e.g., existing NodePort service updated under LoadBalancer mode), which is a key reconciliation path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/network_test.go` around lines 92 - 130, The test case "should update an existing service" only validates the port count after the update but doesn't assert that the Service type is correctly updated when the network mode configuration changes. Add an assertion to verify that svc.Spec.Type matches the expected type based on the configured network mode (KubeNetworkNodePort should result in ServiceTypeNodePort), or add a separate regression test case that explicitly tests a service type transition scenario where an existing service of one type (e.g., NodePort) is updated under a different mode configuration (e.g., LoadBalancer) to ensure the service type is correctly reconciled.environment/kubernetes/storage_test.go (1)
215-284: ⚡ Quick winAdd
ResizePVCcoverage for sharedDataPVCmode.Please add a case where
StorageModeispvcandDataPVCis non-empty, and assertResizePVCis a no-op. This keeps shared-PVC semantics consistent withEnsurePVC/DeletePVC.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/storage_test.go` around lines 215 - 284, Add a new test case within the ResizePVC describe block that tests the shared DataPVC mode scenario. The test should create an Environment with a Kubernetes client, set the config's StorageMode to config.KubeStoragePVC and set a non-empty DataPVC value (to indicate shared PVC mode), then call env.ResizePVC and assert that it returns nil (is a no-op). This ensures ResizePVC behavior is consistent with EnsurePVC and DeletePVC when operating in shared DataPVC mode.environment/kubernetes/quota_test.go (1)
259-309: ⚡ Quick winAdd regression tests for invalid quantity and Get-failure paths.
Please add tests that cover:
- invalid quantity strings in quota/limitrange config (expect returned error, not panic),
- non-NotFound failures from Get in
EnsureResourceQuota/EnsureLimitRange(expect immediate error).These are currently untested high-risk branches.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@environment/kubernetes/quota_test.go` around lines 259 - 309, Add regression test cases for error handling paths in the quota and limitrange tests. Within the existing "buildResourceQuota" Describe block, add a test case that passes an invalid quantity string (e.g., "invalid") to the CPULimit field and verifies that an error is returned rather than a panic. Similarly, within the "buildLimitRange" Describe block, add a test case with invalid quantity strings for DefaultCPULimit or MaxCPU fields and verify error handling. Additionally, add test cases for the EnsureResourceQuota and EnsureLimitRange functions that mock the Get operation to return errors other than NotFound (for example, a permission denied or internal server error), and verify that these errors are immediately returned to the caller rather than being ignored or retried.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 30a4dac6-e075-4e39-a906-262d00b4493f
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (53)
.github/workflows/helm-lint.yamlchart/pelican-wings/Chart.yamlchart/pelican-wings/README.mdchart/pelican-wings/templates/NOTES.txtchart/pelican-wings/templates/_helpers.tplchart/pelican-wings/templates/clusterrole-metrics.yamlchart/pelican-wings/templates/clusterrolebinding-metrics.yamlchart/pelican-wings/templates/configmap.yamlchart/pelican-wings/templates/deployment.yamlchart/pelican-wings/templates/limitrange.yamlchart/pelican-wings/templates/namespace.yamlchart/pelican-wings/templates/pvc.yamlchart/pelican-wings/templates/resourcequota.yamlchart/pelican-wings/templates/role.yamlchart/pelican-wings/templates/rolebinding.yamlchart/pelican-wings/templates/service.yamlchart/pelican-wings/templates/serviceaccount.yamlchart/pelican-wings/values.yamlconfig/config.goconfig/config_kubernetes.goenvironment/kubernetes/api.goenvironment/kubernetes/container.goenvironment/kubernetes/container_test.goenvironment/kubernetes/environment.goenvironment/kubernetes/environment_test.goenvironment/kubernetes/identity.goenvironment/kubernetes/installer.goenvironment/kubernetes/installer_test.goenvironment/kubernetes/network.goenvironment/kubernetes/network_test.goenvironment/kubernetes/node.goenvironment/kubernetes/power.goenvironment/kubernetes/power_test.goenvironment/kubernetes/quota.goenvironment/kubernetes/quota_test.goenvironment/kubernetes/stats.goenvironment/kubernetes/stats_test.goenvironment/kubernetes/storage.goenvironment/kubernetes/storage_test.goenvironment/kubernetes/testutil_test.gogo.modkubernetes/README.mdkubernetes/clusterrole-metrics.yamlkubernetes/clusterrolebinding-metrics.yamlkubernetes/namespace.yamlkubernetes/role.yamlkubernetes/rolebinding.yamlkubernetes/serviceaccount.yamlrouter/router_system.goserver/install.goserver/install_kubernetes.goserver/manager.gosystem/system.go
Address three robustness issues in the Kubernetes backend: - api.go: Client() stored the init error in a function-local variable, so after a failed first call the sync.Once never re-ran and subsequent calls returned (nil, nil), masking the failure. Persist the error in a package variable and return it on every call. - container.go: SendCommand() checked IsAttached() before taking the lock that guards e.stream, so the stream could be cleared in between and cause a nil-pointer write. Nil-check e.stream under the read lock instead. - quota.go: buildResourceQuota/buildLimitRange used resource.MustParse on user-configurable quantity strings, panicking the process on invalid values. Parse with resource.ParseQuantity and propagate errors through EnsureResourceQuota/EnsureLimitRange.
84fe43d to
3be8dda
Compare
Harden the Kubernetes environment and Helm chart per the full CodeRabbit review: environment/kubernetes: - container.go: surface Exists() errors in Create(), order Destroy() offline state after the delete check, use apierrors.IsNotFound, and deduplicate container ports shared across allocation IPs. - power.go: handle delete/wait errors in OnBeforeStart, give signal-based stops a graceful SIGTERM grace period (immediate for SIGKILL), and treat a terminal (non-running) Pod as stopped so restart-while-off does not hang. - installer.go: propagate caller context into WriteInstallScript, use foreground propagation + wait when recreating the installer Job, and reject SubPath values that escape the root directory. - quota.go: fail fast on non-NotFound Get errors instead of falling through to Create. - storage.go: ResizePVC is a no-op when a shared DataPVC is configured. - network.go: reconcile Service type and annotations on a networkMode switch. - stats.go: bound each metrics/stats request with a timeout and retry the metrics API every tick instead of sticking to the kubelet fallback. server/router/system: - update.go: sync image and stop configuration to the Kubernetes environment. - router_system.go: return an error when no assignable IPs are discovered. - system.go: treat a missing os-release as best-effort in Kubernetes mode. chart/pelican-wings: - Render credentials into a Secret instead of a ConfigMap. - Guard wings.kubernetes.namespace against gameNamespace drift. - Fail rendering when ResourceQuota/LimitRange are enabled but empty. - Require an explicit serviceAccount.name when not creating one. - Gate the broad nodes/proxy kubelet fallback behind rbac.kubeletMetricsFallback. docs/manifests: - Split the optional nodes/proxy ClusterRole into clusterrole-metrics-kubelet.yaml. - Expand the chart and kubernetes RBAC READMEs (credentials, networkMode, multi-instance guidance, RBAC tables). Add regression tests for port deduplication, Service type reconciliation, ResizePVC with a shared DataPVC, and quota Get-failure fail-fast.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
chart/pelican-wings/README.md (1)
30-45:⚠️ Potential issue | 🟡 MinorAdd resourceQuota and limitRange parameters to the "Key Values" documentation table.
The documented configuration keys match
values.yaml, but the README's "Key Values" section is missing the ResourceQuota and LimitRange parameters introduced in this PR. BothresourceQuota(withenabled,cpuLimit,memoryLimit) andlimitRange(withenabled,defaultCPULimit,defaultMemoryLimit) are defined invalues.yamlbut not documented in the table. These should be added for completeness since they control resource enforcement in the game server namespace.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/README.md` around lines 30 - 45, The README's "Key Values" documentation table is missing the resourceQuota and limitRange configuration parameters that are defined in values.yaml. Add two new parameter entries to the documentation table: first for resourceQuota with sub-parameters enabled, cpuLimit, and memoryLimit that control resource quotas in the game server namespace, and second for limitRange with sub-parameters enabled, defaultCPULimit, and defaultMemoryLimit that control default resource limits. Follow the same table format as the existing parameter entries, providing clear descriptions and their default values from values.yaml.
🧹 Nitpick comments (1)
chart/pelican-wings/templates/configmap.yaml (1)
1-12: 💤 Low valueConsider renaming file to
secret.yamlfor clarity.The template creates a
Secretresource but is namedconfigmap.yaml. While functionally correct (and good that credentials are stored as a Secret), the filename mismatch could confuse maintainers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/pelican-wings/templates/configmap.yaml` around lines 1 - 12, The file is named configmap.yaml but defines a Kubernetes Secret resource (kind: Secret), creating a naming mismatch that could confuse maintainers. Rename the file from configmap.yaml to secret.yaml to accurately reflect the resource type it contains.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@chart/pelican-wings/README.md`:
- Around line 30-45: The README's "Key Values" documentation table is missing
the resourceQuota and limitRange configuration parameters that are defined in
values.yaml. Add two new parameter entries to the documentation table: first for
resourceQuota with sub-parameters enabled, cpuLimit, and memoryLimit that
control resource quotas in the game server namespace, and second for limitRange
with sub-parameters enabled, defaultCPULimit, and defaultMemoryLimit that
control default resource limits. Follow the same table format as the existing
parameter entries, providing clear descriptions and their default values from
values.yaml.
---
Nitpick comments:
In `@chart/pelican-wings/templates/configmap.yaml`:
- Around line 1-12: The file is named configmap.yaml but defines a Kubernetes
Secret resource (kind: Secret), creating a naming mismatch that could confuse
maintainers. Rename the file from configmap.yaml to secret.yaml to accurately
reflect the resource type it contains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bf4caab9-0551-4889-b062-1c9e2c53fc91
📒 Files selected for processing (28)
.github/workflows/helm-lint.yamlchart/pelican-wings/README.mdchart/pelican-wings/templates/_helpers.tplchart/pelican-wings/templates/clusterrole-metrics.yamlchart/pelican-wings/templates/configmap.yamlchart/pelican-wings/templates/deployment.yamlchart/pelican-wings/templates/limitrange.yamlchart/pelican-wings/templates/resourcequota.yamlchart/pelican-wings/values.yamlenvironment/kubernetes/container.goenvironment/kubernetes/environment_test.goenvironment/kubernetes/installer.goenvironment/kubernetes/installer_test.goenvironment/kubernetes/network.goenvironment/kubernetes/network_test.goenvironment/kubernetes/power.goenvironment/kubernetes/quota.goenvironment/kubernetes/quota_test.goenvironment/kubernetes/stats.goenvironment/kubernetes/storage.goenvironment/kubernetes/storage_test.gokubernetes/README.mdkubernetes/clusterrole-metrics-kubelet.yamlkubernetes/clusterrole-metrics.yamlrouter/router_system.goserver/install_kubernetes.goserver/update.gosystem/system.go
✅ Files skipped from review due to trivial changes (1)
- kubernetes/README.md
🚧 Files skipped from review as they are similar to previous changes (19)
- chart/pelican-wings/templates/limitrange.yaml
- chart/pelican-wings/templates/clusterrole-metrics.yaml
- .github/workflows/helm-lint.yaml
- chart/pelican-wings/templates/deployment.yaml
- chart/pelican-wings/values.yaml
- environment/kubernetes/storage_test.go
- environment/kubernetes/environment_test.go
- environment/kubernetes/quota.go
- environment/kubernetes/quota_test.go
- environment/kubernetes/installer_test.go
- server/install_kubernetes.go
- system/system.go
- environment/kubernetes/storage.go
- environment/kubernetes/network_test.go
- router/router_system.go
- environment/kubernetes/network.go
- environment/kubernetes/power.go
- environment/kubernetes/stats.go
- environment/kubernetes/container.go
📜 Review details
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-03-02T13:53:08.995Z
Learnt from: parkervcp
Repo: pelican-dev/wings PR: 171
File: server/power.go:190-203
Timestamp: 2026-03-02T13:53:08.995Z
Learning: In the server package, when quotas are enabled via config.Get().System.Quotas.Enabled, the disk space check using used >= s.DiskSpace() does not require a special guard for unlimited-disk scenarios (DiskSpace() <= 0). The filesystem handles such cases, so the existing check is sufficient. Apply this pattern to similar quota-related disk checks in the server package and ensure tests/docs reflect that unlimited-disk behavior is governed by the filesystem, not by an extra guard in code.
Applied to files:
server/update.go
🪛 markdownlint-cli2 (0.22.1)
chart/pelican-wings/README.md
[warning] 51-51: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🪛 Trivy (0.69.3)
kubernetes/clusterrole-metrics-kubelet.yaml
[error] 21-23: Do not allow privilege escalation from node proxy
Role permits privilege escalation from node proxy
Rule: KSV-0047
(IaC/Kubernetes)
🪛 YAMLlint (1.37.1)
chart/pelican-wings/templates/configmap.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (11)
chart/pelican-wings/templates/_helpers.tpl (1)
54-61: LGTM!chart/pelican-wings/templates/configmap.yaml (1)
13-80: LGTM!chart/pelican-wings/templates/resourcequota.yaml (1)
1-35: LGTM!kubernetes/clusterrole-metrics.yaml (1)
1-24: LGTM!kubernetes/clusterrole-metrics-kubelet.yaml (1)
1-40: Trivy KSV-0047 acknowledged:nodes/proxyprivilege escalation risk is documented and opt-in.The static analysis correctly flags
nodes/proxyas a privilege escalation vector. However, this is intentionally documented in the header comments (lines 1-10) as an optional, explicitly opted-into fallback for clusters without metrics-server. The separation from the defaultclusterrole-metrics.yamlfollows the principle of least privilege.The hardcoded
pelican-wings/pelicanreferences in the binding (lines 34-35) are appropriate for static manifests, and line 10 instructs users to adjust them.chart/pelican-wings/README.md (1)
47-55: 💤 Low valueFix markdown lint warning: blank line between blockquotes (MD028).
Markdownlint flags a blank line at line 51 separating two blockquotes. To comply with the
no-blanks-blockquoterule, remove the blank line or use an inline separator. The visual appearance remains clear either way.♻️ Proposed fix (option: join with no blank line)
> **Namespace:** Wings schedules game-server workloads into `gameNamespace`, and > the chart creates the namespaced RBAC there. `wings.kubernetes.namespace` is > therefore derived from `gameNamespace`; if you set it explicitly it must match > `gameNamespace` or the chart will fail to render. -> > **Credentials:** `wings.token`, `wings.tokenId`, and `wings.uuid` are rendered > into a Kubernetes **Secret** (not a ConfigMap). Supply them via a private > values file or `--set`, e.g. `helm install ... -f my-creds.yaml`, and keep that > file out of version control.Source: Linters/SAST tools
environment/kubernetes/installer.go (4)
64-189: LGTM!
293-319: LGTM!
372-391: LGTM!
393-419: LGTM!server/update.go (1)
7-7: LGTM!Also applies to: 43-49
… test - server/manager.go: initialize Kubernetes Metadata with the server stop configuration so command/signal stop behavior is correct from first boot. - kubernetes/README.md: sync the namespace-scoped RBAC verb table with role.yaml (pods watch, services list, configmaps update). - chart README: use a local values file for credentials in the quick-start instead of leaking them via --set. - installer_test.go: assert the Run cancellation path returns context.Canceled.
|
This is going to take some time to process and work through. |
Adds an opt-in Kubernetes backend alongside Docker, gated behind kubernetes.enabled (default false). A new environment/kubernetes package implements the ProcessEnvironment interface: Pod lifecycle/power, SPDY attach/exec console, hostport/nodeport/loadbalancer networking, hostpath/PVC storage, optional ResourceQuota/LimitRange, metrics-API stats, and egg installation via a Job + ConfigMap.
Environment selection is gated in server/manager.go and server/install.go; system info and /api/system/ips become backend-aware in system/system.go and router/router_system.go. Docker behavior is unchanged when kubernetes.enabled is false.
Ships raw RBAC manifests under kubernetes/ and a Helm chart under chart/pelican-wings/ for deploying Wings in-cluster, plus a helm-lint workflow. The Kubernetes client libraries (k8s.io/* v0.36.1) require Go >= 1.26.0, so the go directive is bumped accordingly.
Summary by CodeRabbit
Release Notes