[WIP] CNTRLPLANE-3234: health monitor #2193
Conversation
Introduces pkg/operator/encryption/kms/health/:
- Probe calls Status() on an injected kmsservice.Service (mirrors preflight's pattern) and classifies the response as ok, unhealthy:not-ok:<truncated-body>, or unhealthy:rpc-error:<gRPC-code>. It never returns an error; failures are baked into HealthStatus.Healthz so the monitor loop stays flat.
- StatusWriter / ConfigMapWriter updates an existing ConfigMap in place. It hard-fails on a miss so misconfiguration is loud.
- Monitor drives the tick loop: probe, stamp observer pod, write, sleep on the healthy/unhealthy interval, repeat until ctx cancel. Write errors are logged and tolerated.
- health.NewCommand(ctx) wires flags, validates inputs, builds a rest.Config, dials the UDS via the same kmsv2 client preflight uses, and hands a Probe + ConfigMapWriter to the Monitor. --output-mode=condition is reserved and explicitly rejected so the OpenShift track can add it without silent behavior change.

Tests use FakeClock with the HasWaiters() barrier for race-free stepping through healthy -> unhealthy -> healthy transitions.
@ibihim: This pull request references CNTRLPLANE-3234, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough
Adds a KMSv2 health-monitoring subsystem: probe types and monitor loop, ConfigMap writer and container template generator, CLI command, tests, and a KIND-based test harness including a fake KMS plugin, manifests, build scripts, and verification tooling.
Sequence Diagram
sequenceDiagram
participant Monitor as Health Monitor
participant Probe as Probe (RPC wrapper)
participant KMS as Fake KMS Plugin (gRPC)
participant Writer as ConfigMapWriter
participant K8s as Kubernetes API
Monitor->>Probe: Probe(ctx)
Probe->>KMS: Status() RPC
alt Status OK
KMS-->>Probe: {Healthz: "ok", KeyID: "..."}
Probe-->>Monitor: HealthStatus{Healthz: OK, KeyIDHash: "...", Timestamp: t}
else Status NotOK or RPC error
KMS-->>Probe: error / non-ok body
Probe-->>Monitor: HealthStatus{Healthz: NotOK|RPCError, Detail: "...", Timestamp: t}
end
Monitor->>Writer: Write(ctx, HealthStatus)
Writer->>K8s: Patch/Create ConfigMap data[<class>]
K8s-->>Writer: Success
Monitor->>Monitor: sleep(healthyInterval|unhealthyInterval)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Important: pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (10 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: ibihim. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
// Cancel inline rather than via defer: defers in unbounded loops
// accumulate until the function returns.
writeCtx, writeCancel := context.WithTimeout(ctx, m.writeTimeout)
err := m.writer.Write(writeCtx, status)
TODO:
- Compare hysteresis like pkg/monitor/health?
- (nit) IIFE better than non-defer cancel?
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (2)
examples/kms-health-kind/manifests/static-pod.yaml (1)
17-28: ⚡ Quick win
Harden the pod/container security context for the harness. Add an explicit securityContext to disable privilege escalation and lock down defaults.
Suggested change
 spec:
+  securityContext:
+    seccompProfile:
+      type: RuntimeDefault
   containers:
   - name: fake-plugin
     image: kms-health-kind-fake-plugin:dev
     imagePullPolicy: Never
+    securityContext:
+      allowPrivilegeEscalation: false
+      readOnlyRootFilesystem: true
+      runAsNonRoot: true
+      capabilities:
+        drop: ["ALL"]
     volumeMounts:
     - name: sock-tmp
       mountPath: /tmp
examples/kms-health-kind/manifests/rbac.yaml (1)
16-19: ⚡ Quick win
Align Role verbs with writer operations (patch + create). The current Role grants update, but the writer path uses patch and may create on NotFound (pkg/operator/encryption/kms/health/writer_configmap.go, lines 74-99). Add create and drop the unused update to avoid fallback failures and keep least privilege tight.
Suggested change
- verbs: ["get", "update", "patch"]
+ verbs: ["get", "patch", "create"]
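The patch-with-create-fallback behavior this nitpick references can be sketched with stdlib-only stand-ins. The store interface, errNotFound sentinel, and writeStatus are hypothetical; the real writer in writer_configmap.go uses client-go's ConfigMaps(ns).Patch / Create and checks the typed Kubernetes NotFound error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound models apierrors.IsNotFound from client-go.
var errNotFound = errors.New("configmap not found")

// store abstracts the two API verbs the writer needs.
type store interface {
	Patch(name string, data map[string]string) error
	Create(name string, data map[string]string) error
}

// writeStatus patches in place and falls back to Create on NotFound,
// which is why the Role needs the patch and create verbs but not update.
func writeStatus(s store, name string, data map[string]string) error {
	err := s.Patch(name, data)
	if errors.Is(err, errNotFound) {
		return s.Create(name, data)
	}
	return err
}

// memStore is an in-memory fake for demonstration.
type memStore struct {
	objects map[string]map[string]string
}

func (m *memStore) Patch(name string, data map[string]string) error {
	if _, ok := m.objects[name]; !ok {
		return fmt.Errorf("patch %q: %w", name, errNotFound)
	}
	for k, v := range data {
		m.objects[name][k] = v
	}
	return nil
}

func (m *memStore) Create(name string, data map[string]string) error {
	m.objects[name] = data
	return nil
}

func main() {
	s := &memStore{objects: map[string]map[string]string{}}
	_ = writeStatus(s, "kms-health-status", map[string]string{"ok": "t0"})
	fmt.Println(len(s.objects)) // 1: created via the fallback
}
```

Under this shape, update is never issued, so granting it would only widen the Role's surface.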
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/kms-health-kind/manifests/configmap.yaml`:
- Around lines 1-2: The comment in examples/kms-health-kind/manifests/configmap.yaml is outdated: it claims the monitor's ConfigMapWriter "hard-fails" if the ConfigMap is missing, but writer_configmap.go implements a create fallback on NotFound. Update the comment to reflect that absence is tolerated because ConfigMapWriter (see pkg/operator/encryption/kms/health/writer_configmap.go) will attempt to create the ConfigMap on NotFound rather than hard-fail; the ConfigMap is pre-created for convenience, but the writer will create it if missing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 18bb6413-9d46-45e1-887d-b17e07b4b5f1
📒 Files selected for processing (26)
examples/kms-health-kind/Dockerfile.fake-plugin
examples/kms-health-kind/Dockerfile.monitor
examples/kms-health-kind/Makefile
examples/kms-health-kind/bin/down.sh
examples/kms-health-kind/bin/up.sh
examples/kms-health-kind/cmd/monitor/main.go
examples/kms-health-kind/fake-plugin/main.go
examples/kms-health-kind/manifests/configmap.yaml
examples/kms-health-kind/manifests/encryption-config.yaml
examples/kms-health-kind/manifests/kind.yaml
examples/kms-health-kind/manifests/monitor-deployment.yaml
examples/kms-health-kind/manifests/namespace.yaml
examples/kms-health-kind/manifests/rbac.yaml
examples/kms-health-kind/manifests/static-pod.yaml
examples/kms-health-kind/verify.sh
pkg/operator/encryption/kms/health/cmd.go
pkg/operator/encryption/kms/health/cmd_test.go
pkg/operator/encryption/kms/health/doc.go
pkg/operator/encryption/kms/health/monitor.go
pkg/operator/encryption/kms/health/monitor_test.go
pkg/operator/encryption/kms/health/probe.go
pkg/operator/encryption/kms/health/probe_test.go
pkg/operator/encryption/kms/health/types.go
pkg/operator/encryption/kms/health/writer.go
pkg/operator/encryption/kms/health/writer_configmap.go
pkg/operator/encryption/kms/health/writer_configmap_test.go
select {
case <-sig:
	log.Print("fake-plugin: signal received, shutting down")
case err := <-serverErr:
	log.Fatalf("fake-plugin: server exited: %v", err)
}
server.Shutdown()
Don't fatal on the shutdown path.
ListenAndServe() is observed on a separate goroutine, so a normal SIGTERM can race the serverErr case. log.Fatalf will then exit non-zero even when the plugin is stopping intentionally, which makes the KIND harness teardown flaky.
🔧 Suggested fix
select {
case <-sig:
log.Print("fake-plugin: signal received, shutting down")
case err := <-serverErr:
- log.Fatalf("fake-plugin: server exited: %v", err)
+ if err != nil {
+ log.Fatalf("fake-plugin: server exited: %v", err)
+ }
+ return
}
 server.Shutdown()
spec:
  serviceAccountName: kms-health-monitor
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  containers:
  - name: monitor
    image: kms-health-kind-monitor:dev
    imagePullPolicy: Never
    # Harness overrides the committed production numbers so
    # verify.sh completes in a reasonable wall-clock budget.
    args:
    - --kms-socket=/tmp/kms.sock
    # Harness intervals are tighter than production (60s/10s/3s)
    # so verify.sh completes in a reasonable wall-clock budget.
    # Assertion 6 (rpc-error detection) does not depend on these:
    # fake-plugin's staydown marker forces a 3s outage on restart,
    # which is long enough for any reasonable probe-timeout.
    - --probe-interval=2s
    - --probe-interval-unhealthy=1s
    - --probe-timeout=1s
    - --configmap-namespace=kms-health-test
    - --configmap-name=kms-health-status
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    volumeMounts:
    - name: sock-tmp
      mountPath: /tmp
      readOnly: true
  volumes:
  - name: sock-tmp
    hostPath:
      path: /tmp
      type: Directory
Lock down the monitor pod security context.
This pod currently inherits the default security context, so it can run with a writable root filesystem and without an explicit no-escalation guarantee. The container doesn't need either, so tightening this should be a small change.
🔒 Suggested hardening
spec:
+ securityContext:
+ runAsNonRoot: true
+ seccompProfile:
+ type: RuntimeDefault
serviceAccountName: kms-health-monitor
nodeSelector:
node-role.kubernetes.io/control-plane: ""
@@
containers:
- name: monitor
image: kms-health-kind-monitor:dev
imagePullPolicy: Never
+ securityContext:
+ allowPrivilegeEscalation: false
+ readOnlyRootFilesystem: true
🧰 Tools
🪛 Checkov (3.2.525)
[medium] 5-58: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 5-58: Minimize the admission of root containers
(CKV_K8S_23)
🪛 Trivy (0.69.3)
[error] 28-53: Root file system is not read-only
Container 'monitor' of Deployment 'kms-health-monitor' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 28-53: Default security context configured
container kms-health-monitor in kms-health-test namespace is using the default security context
Rule: KSV-0118
(IaC/Kubernetes)
[error] 19-58: Default security context configured
deployment kms-health-monitor in kms-health-test namespace is using the default security context, which allows root privileges
Rule: KSV-0118
(IaC/Kubernetes)
  tier: control-plane
  component: kms-fake-plugin
spec:
  hostNetwork: true
Remove hostNetwork: true unless strictly required.
Line 16 broadens network exposure but this pod is coordinating through a Unix socket on /tmp, so host networking appears unnecessary.
Suggested change
spec:
- hostNetwork: true
 containers:
if o.writeTimeout <= 0 {
	return fmt.Errorf("--write-timeout must be positive")
}
Enforce the unhealthy cadence budget.
monitor.go documents that probeTimeout + writeTimeout must stay below probeIntervalUnhealthy, but validate() only checks that writeTimeout is positive. A caller can currently configure a write timeout that pushes the loop past the unhealthy cadence and defeats the fast-retry behavior.
♻️ Suggested fix
if o.writeTimeout <= 0 {
return fmt.Errorf("--write-timeout must be positive")
}
+ if o.probeTimeout+o.writeTimeout >= o.probeIntervalUnhealthy {
+ return fmt.Errorf(
+ "--probe-timeout + --write-timeout must be smaller than --probe-interval-unhealthy",
+ )
+ }
if err != nil {
	return fmt.Errorf("dial KMS plugin at %q: %w", endpoint, err)
}
probe := NewProbe(service, 0)
Pass the configured probe timeout into NewProbe.
--probe-timeout is validated and used when dialing the KMS client, but the probe itself is still constructed with 0, so the monitor loop never applies the configured deadline to Status(). If the plugin hangs, the monitor can stall indefinitely and the timeout flag becomes misleading.
🔧 Suggested fix
- probe := NewProbe(service, 0)
+ probe := NewProbe(service, o.probeTimeout)
cancel()
<-done
Bound the shutdown wait to avoid hanging the test run.
Line 130 blocks indefinitely if Run fails to observe cancellation. Use a timeout-backed select (as done in the other tests) so failures surface quickly instead of stalling CI.
Suggested fix
cancel()
- <-done
+ select {
+ case <-done:
+ case <-time.After(2 * time.Second):
+ t.Fatal("Monitor did not stop within 2s of context cancel")
+ }
ardaguclu left a comment
I know it is not ready for review yet. Nevertheless, I dropped a comment that may be useful. Also, the design needs to go here: https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/kms-encryption-foundations.md
@@ -0,0 +1,238 @@
package health
What this package is supposed to do is very similar to https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/kms/preflight/checker.go. I think it is better to mimic it.
The only difference is that the health monitor will periodically write the status somewhere.
What are the differences? Both are cobra commands. They have flags, validations and a run method.
I create a reader, a writer and put them into a monitor. Writer is an interface, because I want to pivot to operator conditions.
Yeah, it seems there is room for reuse. It would be better to have a single binary with custom logic.
Actionable comments posted: 7
♻️ Duplicate comments (1)
examples/kms-health-kind/fake-plugin/main.go (1)
118-124: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don't fatal on the normal shutdown path. serverErr can race with the SIGTERM/SIGINT branch, so a clean shutdown can still hit log.Fatalf and make the KIND harness exit non-zero.
Suggested fix
 select {
 case <-sig:
 	log.Print("fake-plugin: signal received, shutting down")
 case err := <-serverErr:
-	log.Fatalf("fake-plugin: server exited: %v", err)
+	if err != nil {
+		log.Fatalf("fake-plugin: server exited: %v", err)
+	}
+	return
 }
 server.Shutdown()
🧹 Nitpick comments (1)
examples/kms-health-kind/manifests/static-pod.yaml (1)
15-28: ⚡ Quick win
Add an explicit securityContext for this static pod. This pod runs in kube-system and mounts a hostPath, but it still inherits the default root-capable security settings. Even for the kind harness, it is worth pinning this down with an explicit pod/container securityContext (seccompProfile: RuntimeDefault, allowPrivilegeEscalation: false, dropped capabilities, and runAsNonRoot if the image supports it).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/kms-health-kind/bin/up.sh`:
- Around line 50-70: The script stops the fake-plugin static pod (FAKE_CID via
crictl stop) but then immediately applies the monitor Deployment, which can roll
out before the static pod/socket is ready; add a polling loop after docker exec
"${NODE}" crictl stop "${FAKE_CID}" that waits (with a short sleep and overall
timeout) for the fake-plugin container to reappear and be Running (e.g., poll
docker exec "${NODE}" crictl ps --name fake-plugin -q and check crictl inspect
--output go-template --template '{{.status.state}}' or use a socket probe like
test -S /path/to/socket inside the node) before proceeding to apply the
namespace/RBAC/monitor and running kubectl rollout status; reference FAKE_CID,
NODE, and the subsequent kubectl rollout status
deployment/kms-health-monitor-fake when adding the wait.
In `@examples/kms-health-kind/Makefile`:
- Around line 47-55: The Makefile invocation of render-container uses the
--command flag set to /usr/bin/kms-health-monitor which doesn't match the binary
location installed by Dockerfile.monitor (/monitor); update the --command
argument in the render-container call (replace
--command=/usr/bin/kms-health-monitor) to point to /monitor so the pod spec
references the actual binary path used in the image.
In `@examples/kms-health-kind/manifests/monitor-deployment.yaml`:
- Around line 28-53: The container definition for kms-health-monitor-fake
currently only specifies args, leaving the image entrypoint implicit; add an
explicit command stanza to match the shared template so the pod no longer
depends on the image entrypoint drift (locate the container with name:
kms-health-monitor-fake and add the same command array used by the shared
template above the args block), or replace this hand-copied container block with
the generated/shared template to ensure the command and args are kept in sync
with the harness expectations.
In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml`:
- Around line 13-15: The ConfigMap name is hard-coded as
--configmap-name=kms-health-{{.KMSPluginName}}, which violates ConfigMapWriter's
one-per-monitor-instance contract and causes writers to clobber each other;
change the template to include unique pod/instance identity (for example append
the pod name or UID templated value) or remove the --configmap-name flag so the
cmd-layer default can generate a per-instance name; update the kms-health
container args in kms-health-container.yaml accordingly (referencing the
--configmap-name flag and ConfigMapWriter/ConfigMapNamespace/ConfigMapName
symbols) so each sidecar writes to a distinct ConfigMap.
In `@pkg/operator/encryption/kms/health/container_template.go`:
- Around line 92-96: The current ContainerOptions.validate() only validates
KMSPluginName itself but not the derived resource names, causing Kubernetes to
reject pod/volume names like "kms-health-monitor-<name>" or
"kms-plugin-socket-<name>" when they exceed DNS-1123 limits; update
ContainerOptions.validate() to construct the derived names (e.g.,
"kms-health-monitor-"+o.KMSPluginName and "kms-plugin-socket-"+o.KMSPluginName)
and run validation.IsDNS1123Label() (or check length <= 63 and allowed chars) on
each, returning a clear fmt.Errorf referencing the failing derived name if
validation fails so podspec generation won’t produce invalid names.
- Around line 100-102: The current check only ensures o.OperatorCommand has a
non-zero length but allows empty/whitespace elements; update the validation
where OperatorCommand is checked to iterate over each element, trim whitespace
(strings.TrimSpace) and reject any empty entries by returning an error (e.g.,
fmt.Errorf) that names the invalid element(s) or their indices; ensure the check
happens before rendering/using OperatorCommand so values like "" or " " (or
malformed CSV-split inputs) are rejected rather than producing an invalid
container entrypoint.
In `@pkg/operator/encryption/kms/health/writer_configmap.go`:
- Around line 61-66: The Timestamp stored in classEntry currently uses
status.Timestamp.UTC().Format(time.RFC3339) which drops fractional seconds;
update the serialization in the classEntry construction (field Timestamp) to
preserve sub-second precision by formatting with time.RFC3339Nano or by writing
a numeric epoch (e.g., status.Timestamp.UTC().UnixNano()) so rapid transitions
within the same second remain totally ordered when readers compare timestamps.
---
Duplicate comments:
In `@examples/kms-health-kind/fake-plugin/main.go`:
- Around line 118-124: The select currently calls log.Fatalf on any value from
serverErr which can race with the sig shutdown path; change the handling in main
so you do not call log.Fatalf on the normal shutdown path: replace
log.Fatalf("fake-plugin: server exited: %v", err) with a non-fatal log (e.g.,
log.Printf) and only treat non-nil, unexpected errors as fatal if appropriate,
then call server.Shutdown() and drain/wait for serverErr to ensure the server
goroutine has finished. Update the select case for err := <-serverErr and the
shutdown sequence around server.Shutdown() so normal SIGTERM/SIGINT leads to a
clean, non-zero-exit-free shutdown.
---
Nitpick comments:
In `@examples/kms-health-kind/manifests/static-pod.yaml`:
- Around line 15-28: Add an explicit pod and container securityContext for the
static pod (under spec and the container named "fake-plugin"): set pod-level
securityContext.seccompProfile.type to "RuntimeDefault" and at the container
level set securityContext.allowPrivilegeEscalation=false,
securityContext.capabilities.drop=["ALL"], and securityContext.runAsNonRoot=true
(only enable runAsNonRoot if the image supports it); apply these to the spec and
the container named "fake-plugin" so the hostPath-mounted static pod in
kube-system runs with constrained privileges.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d618721a-cbe2-4be7-821e-5d31f663463d
📒 Files selected for processing (14)
- examples/kms-health-kind/Makefile
- examples/kms-health-kind/bin/up.sh
- examples/kms-health-kind/cmd/render-container/main.go
- examples/kms-health-kind/fake-plugin/main.go
- examples/kms-health-kind/manifests/encryption-config.yaml
- examples/kms-health-kind/manifests/monitor-deployment.yaml
- examples/kms-health-kind/manifests/rbac.yaml
- examples/kms-health-kind/manifests/static-pod.yaml
- examples/kms-health-kind/verify.sh
- pkg/operator/encryption/kms/health/assets/kms-health-container.yaml
- pkg/operator/encryption/kms/health/bindata.go
- pkg/operator/encryption/kms/health/container_template.go
- pkg/operator/encryption/kms/health/container_template_test.go
- pkg/operator/encryption/kms/health/writer_configmap.go
✅ Files skipped from review due to trivial changes (2)
- examples/kms-health-kind/manifests/encryption-config.yaml
- examples/kms-health-kind/manifests/rbac.yaml
```sh
# 3) Force the static pod to restart so it picks up the freshly-loaded
#    image (kubelet may have cached an ErrImageNeverPull from before).
echo "[up.sh] bouncing fake-plugin container so it picks up loaded image"
FAKE_CID="$(docker exec "${NODE}" crictl ps --name fake-plugin -q 2>/dev/null | head -1 || true)"
if [ -n "${FAKE_CID}" ]; then
  docker exec "${NODE}" crictl stop "${FAKE_CID}" >/dev/null || true
fi

# 4) Apply namespace + RBAC + monitor Deployment. The status ConfigMap
#    is created by the writer itself on first observation; nothing
#    pre-creates it.
echo "[up.sh] applying namespace, RBAC, monitor Deployment"
kubectl --context "${CTX}" apply -f manifests/namespace.yaml
kubectl --context "${CTX}" apply -f manifests/rbac.yaml
kubectl --context "${CTX}" apply -f manifests/monitor-deployment.yaml

# 5) Wait for the monitor Deployment to be Available. Apiserver is NOT
#    wired to KMS here (see manifests/kind.yaml) — we're only
#    validating the monitor, not apiserver/KMS integration.
echo "[up.sh] waiting for monitor Deployment"
kubectl --context "${CTX}" -n kms-health-test rollout status deployment/kms-health-monitor-fake --timeout=120s
```
Wait for the fake plugin to come back before rolling out the monitor.
After crictl stop, the script immediately deploys the monitor and only waits for the Deployment to be Available. Because the monitor has no readiness gate on the socket, rollout can succeed while the static pod is still restarting, which makes the first observed health state racey and can flake the harness. Poll the static pod or socket path before continuing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/kms-health-kind/bin/up.sh` around lines 50 - 70, The script stops
the fake-plugin static pod (FAKE_CID via crictl stop) but then immediately
applies the monitor Deployment, which can roll out before the static pod/socket
is ready; add a polling loop after docker exec "${NODE}" crictl stop
"${FAKE_CID}" that waits (with a short sleep and overall timeout) for the
fake-plugin container to reappear and be Running (e.g., poll docker exec
"${NODE}" crictl ps --name fake-plugin -q and check crictl inspect --output
go-template --template '{{.status.state}}' or use a socket probe like test -S
/path/to/socket inside the node) before proceeding to apply the
namespace/RBAC/monitor and running kubectl rollout status; reference FAKE_CID,
NODE, and the subsequent kubectl rollout status
deployment/kms-health-monitor-fake when adding the wait.
```yaml
containers:
- name: kms-health-monitor-fake
  image: kms-health-kind-monitor:dev
  imagePullPolicy: Never
  # Tighter intervals than library defaults so verify.sh fits in budget.
  args:
  - --kms-socket=/var/run/kmsplugin-fake/kms.sock
  - --probe-interval=2s
  - --probe-interval-unhealthy=1s
  - --probe-timeout=1s
  - --write-timeout=5s
  - --output-mode=configmap
  - --configmap-namespace=kms-health-test
  - --configmap-name=kms-health-fake
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  volumeMounts:
  - name: kms-plugin-socket-fake
    mountPath: /var/run/kmsplugin-fake
  resources:
    requests:
      memory: 50Mi
      cpu: 5m
```
Make the monitor command explicit here too.
The shared template always renders a command, but this manifest only mirrors the args. That makes the harness depend on the image entrypoint matching the template and weakens the "drift makes kind-verify go red" guarantee in the header. Add the same command stanza here, or generate this block from the shared template instead of hand-copying it.
🧰 Tools
🪛 Trivy (0.69.3)
[error] 29-53: Root file system is not read-only
Container 'kms-health-monitor-fake' of Deployment 'kms-health-monitor-fake' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 29-53: Default security context configured
container kms-health-monitor-fake in kms-health-test namespace is using the default security context
Rule: KSV-0118
(IaC/Kubernetes)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/kms-health-kind/manifests/monitor-deployment.yaml` around lines 28 -
53, The container definition for kms-health-monitor-fake currently only
specifies args, leaving the image entrypoint implicit; add an explicit command
stanza to match the shared template so the pod no longer depends on the image
entrypoint drift (locate the container with name: kms-health-monitor-fake and
add the same command array used by the shared template above the args block), or
replace this hand-copied container block with the generated/shared template to
ensure the command and args are kept in sync with the harness expectations.
```yaml
- --output-mode=configmap
- --configmap-namespace={{.ConfigMapNamespace}}
- --configmap-name=kms-health-{{.KMSPluginName}}
```
Don't hard-code a shared ConfigMap name here.
ConfigMapWriter's concurrency contract requires one ConfigMap per monitor instance, but this template makes every sidecar for the same plugin write kms-health-{{.KMSPluginName}}. During rollouts or multiple replicas, those writers will clobber each other's class keys and observerPod. Either encode pod identity in the name here or let the cmd-layer default choose the per-instance name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml` around
lines 13 - 15, The ConfigMap name is hard-coded as
--configmap-name=kms-health-{{.KMSPluginName}}, which violates ConfigMapWriter's
one-per-monitor-instance contract and causes writers to clobber each other;
change the template to include unique pod/instance identity (for example append
the pod name or UID templated value) or remove the --configmap-name flag so the
cmd-layer default can generate a per-instance name; update the kms-health
container args in kms-health-container.yaml accordingly (referencing the
--configmap-name flag and ConfigMapWriter/ConfigMapNamespace/ConfigMapName
symbols) so each sidecar writes to a distinct ConfigMap.
```go
func (o ContainerOptions) validate() error {
	if errs := validation.IsDNS1123Label(o.KMSPluginName); len(errs) > 0 {
		return fmt.Errorf("KMSPluginName %q is not a valid DNS-1123 label: %s",
			o.KMSPluginName, strings.Join(errs, "; "))
	}
```
Validate the derived names, not just KMSPluginName.
A 63-character plugin name passes IsDNS1123Label, but kms-health-monitor-<name> and kms-plugin-socket-<name> then exceed the 63-character DNS label limit for container and volume names. That means validation succeeds here and the PodSpec fails later when Kubernetes rejects those derived names.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/encryption/kms/health/container_template.go` around lines 92 -
96, The current ContainerOptions.validate() only validates KMSPluginName itself
but not the derived resource names, causing Kubernetes to reject pod/volume
names like "kms-health-monitor-<name>" or "kms-plugin-socket-<name>" when they
exceed DNS-1123 limits; update ContainerOptions.validate() to construct the
derived names (e.g., "kms-health-monitor-"+o.KMSPluginName and
"kms-plugin-socket-"+o.KMSPluginName) and run validation.IsDNS1123Label() (or
check length <= 63 and allowed chars) on each, returning a clear fmt.Errorf
referencing the failing derived name if validation fails so podspec generation
won’t produce invalid names.
```go
if len(o.OperatorCommand) == 0 {
	return fmt.Errorf("OperatorCommand must have at least one element")
}
```
Reject empty command elements.
OperatorCommand only checks len > 0, so callers can still pass []string{"", ...} or a value like --command=a,,b through the render CLI and emit an invalid entrypoint. Validate each element after trimming whitespace before rendering.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/encryption/kms/health/container_template.go` around lines 100 -
102, The current check only ensures o.OperatorCommand has a non-zero length but
allows empty/whitespace elements; update the validation where OperatorCommand is
checked to iterate over each element, trim whitespace (strings.TrimSpace) and
reject any empty entries by returning an error (e.g., fmt.Errorf) that names the
invalid element(s) or their indices; ensure the check happens before
rendering/using OperatorCommand so values like "" or " " (or malformed CSV-split
inputs) are rejected rather than producing an invalid container entrypoint.
```go
entry := classEntry{
	Timestamp:   status.Timestamp.UTC().Format(time.RFC3339),
	ObserverPod: status.ObserverPod,
	Detail:      status.Healthz.Detail,
	KeyIDHash:   status.KeyIDHash,
}
```
Preserve sub-second precision in the stored timestamp.
Readers pick the current class by comparing timestamps, but Format(time.RFC3339) drops fractional seconds. Two transitions inside the same second then serialize identically, so ok/not-ok/rpc-error ordering becomes ambiguous. RFC3339Nano or a numeric epoch would avoid that.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/encryption/kms/health/writer_configmap.go` around lines 61 - 66,
The Timestamp stored in classEntry currently uses
status.Timestamp.UTC().Format(time.RFC3339) which drops fractional seconds;
update the serialization in the classEntry construction (field Timestamp) to
preserve sub-second precision by formatting with time.RFC3339Nano or by writing
a numeric epoch (e.g., status.Timestamp.UTC().UnixNano()) so rapid transitions
within the same second remain totally ordered when readers compare timestamps.
GenerateContainerTemplate renders a corev1.Container sidecar. KMSPluginName suffixes container name, volumeMount, kms-socket path, and configmap name so multi-plugin coexists in one pod. Drive-by: ConfigMapWriter godoc updated to match the actual patch-then-create flow.
Self-contained throwaway harness for validating the health package
against a real KIND cluster. Defaults to cluster name kms-health-claude.
- `fake-plugin/`: ~90-line k8s.io/kms/pkg/service.Service with AES-GCM and a file-marker unhealthy toggle.
- `cmd/monitor/`: wraps health.NewCommand.
- `cmd/render-container/`: wraps health.GenerateContainerTemplate; the harness Deployment mirrors its output.
- `manifests/`: fake-plugin runs as a static pod in kube-system; monitor runs as a Deployment in kms-health-test pinned to the control-plane node; both share the host's /tmp bind-mounted at /var/run/kmsplugin-fake/ for the UDS. RBAC splits into two rules since k8s ignores create on rules with resourceNames.
- `verify.sh`: six end-to-end assertions.
Drop this commit before merging to keep examples/ out of the library.
90eaad7 to b0d6ae6
♻️ Duplicate comments (5)
pkg/operator/encryption/kms/health/container_template.go (2)
Lines 100-102: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject empty/whitespace `OperatorCommand` elements.

`len(o.OperatorCommand) > 0` is not sufficient; values like `[]string{"", "kms-health-monitor"}` pass validation and can produce invalid container command execution.

Suggested fix:

```diff
 if len(o.OperatorCommand) == 0 {
 	return fmt.Errorf("OperatorCommand must have at least one element")
 }
+for i, arg := range o.OperatorCommand {
+	if strings.TrimSpace(arg) == "" {
+		return fmt.Errorf("OperatorCommand[%d] must be non-empty", i)
+	}
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/encryption/kms/health/container_template.go` around lines 100 - 102, The current check in the validation for OperatorCommand (the block that only checks len(o.OperatorCommand) == 0) allows empty or whitespace-only entries; update the validation in the same function to iterate over o.OperatorCommand, trim whitespace for each element (e.g., strings.TrimSpace) and return an error if any element is empty after trimming, including the offending index/value in the error message so callers know which command entry is invalid.
Lines 92-96: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate derived Kubernetes names, not only `KMSPluginName`.

A plugin name can be DNS-1123 valid while derived names (e.g., `kms-health-monitor-<name>`, `kms-plugin-socket-<name>`) become invalid for label-length/name rules and fail later at Pod creation.

Suggested fix:

```diff
 func (o ContainerOptions) validate() error {
 	if errs := validation.IsDNS1123Label(o.KMSPluginName); len(errs) > 0 {
 		return fmt.Errorf("KMSPluginName %q is not a valid DNS-1123 label: %s",
 			o.KMSPluginName, strings.Join(errs, "; "))
 	}
+	for _, derived := range []string{
+		"kms-health-monitor-" + o.KMSPluginName,
+		"kms-plugin-socket-" + o.KMSPluginName,
+	} {
+		if errs := validation.IsDNS1123Label(derived); len(errs) > 0 {
+			return fmt.Errorf("derived name %q is not a valid DNS-1123 label: %s",
+				derived, strings.Join(errs, "; "))
+		}
+	}
 	if o.OperatorImage == "" {
 		return fmt.Errorf("OperatorImage is required")
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/encryption/kms/health/container_template.go` around lines 92 - 96, The validate method on ContainerOptions currently only checks KMSPluginName; update ContainerOptions.validate to also construct the derived resource names (e.g., "kms-health-monitor-"+o.KMSPluginName and "kms-plugin-socket-"+o.KMSPluginName and any other names used by this package) and run the same DNS/length validation (validation.IsDNS1123Label and/or explicit length checks) on each derived name, returning a clear fmt.Errorf naming which derived name failed, so failures are caught before Pod creation.

pkg/operator/encryption/kms/health/assets/kms-health-container.yaml (1)
Lines 13-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a per-instance ConfigMap name (line 15 currently causes writer collisions).

`--configmap-name=kms-health-{{.KMSPluginName}}` makes replicas of the same plugin write to the same ConfigMap, which breaks the monitor's one-writer-per-instance expectation during rollouts.

Suggested fix:

```diff
 args:
 - --kms-socket=/var/run/kmsplugin-{{.KMSPluginName}}/kms.sock
 - --probe-interval={{.ProbeInterval}}
 - --probe-interval-unhealthy={{.ProbeIntervalUnhealthy}}
 - --probe-timeout={{.ProbeTimeout}}
 - --write-timeout={{.WriteTimeout}}
 - --output-mode=configmap
 - --configmap-namespace={{.ConfigMapNamespace}}
-- --configmap-name=kms-health-{{.KMSPluginName}}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml` around lines 13 - 15, The current flag --configmap-name uses kms-health-{{.KMSPluginName}} which causes all replicas to write the same ConfigMap; change the template to include a per-instance identifier (for example --configmap-name=kms-health-{{.KMSPluginName}}-{{.PodName}} or -{{.PodUID}}) so each pod/instance has its own ConfigMap; ensure the rendering context provides that variable and verify the monitor expects the new per-instance name.

examples/kms-health-kind/manifests/static-pod.yaml (1)
Line 16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove unnecessary host network usage.

Line 16 enables host networking even though this harness coordinates via a Unix socket on the shared hostPath volume; this unnecessarily increases network exposure.

Suggested change:

```diff
 spec:
-  hostNetwork: true
   containers:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/kms-health-kind/manifests/static-pod.yaml` at line 16, Remove the unnecessary hostNetwork setting by deleting or setting hostNetwork: false in the Pod spec; keep the hostPath volume that provides the Unix socket for coordination, adjust any container ports or network assumptions to use the socket instead of host networking, and update any readiness/liveness checks or env vars that previously referenced host IPs.

examples/kms-health-kind/manifests/monitor-deployment.yaml (1)
Lines 20-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lock down pod/container security context defaults.

This deployment currently runs with default security context settings. Please harden both pod and container (non-root, seccomp runtime default, no privilege escalation, read-only root fs where compatible) to reduce unnecessary privilege in the harness.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@examples/kms-health-kind/manifests/static-pod.yaml`:
- Line 16: Remove the unnecessary hostNetwork setting by deleting or setting
hostNetwork: false in the Pod spec (the manifest key shown as "hostNetwork:
true"); keep the hostPath volume that provides the Unix socket for coordination
and ensure any container ports or network assumptions are adjusted to use the
socket instead of host networking. Confirm the containers that relied on
hostNetwork now bind to the shared hostPath socket and update any related
readiness/liveness checks or env vars that previously referenced host IPs.
In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml`:
- Around line 13-15: The current flag --configmap-name uses
kms-health-{{.KMSPluginName}} which causes all replicas to write the same
ConfigMap; change the template to include a per-instance identifier (for example
--configmap-name=kms-health-{{.KMSPluginName}}-{{.PodName}} or -{{.PodUID}}) so
each pod/instance has its own ConfigMap; update the template variables used by
the manifest (ensure {{.PodName}} or {{.PodUID}} is provided by the rendering
context) and verify the monitor expects the new per-instance name.
In `@pkg/operator/encryption/kms/health/container_template.go`:
- Around line 100-102: The current check in the validation for OperatorCommand
(the block that only checks len(o.OperatorCommand) == 0) allows empty or
whitespace-only entries; update the validation in the same function (the
OperatorCommand validation in container_template.go) to iterate over
o.OperatorCommand, trim whitespace for each element (e.g., strings.TrimSpace)
and return an error if any element is empty after trimming, including the
offending index/value in the error message so callers know which command entry
is invalid.
- Around line 92-96: The validate method on ContainerOptions currently only
checks KMSPluginName; update ContainerOptions.validate to also construct the
derived resource names (e.g., "kms-health-monitor-"+o.KMSPluginName and
"kms-plugin-socket-"+o.KMSPluginName and any other names used by this package)
and run the same DNS/length validation (use validation.IsDNS1123Label and/or
explicit length checks) on each derived name, returning a clear fmt.Errorf if
any derived name is invalid and naming which derived name failed; reference the
ContainerOptions.validate function and the KMSPluginName field when implementing
this additional validation so failures are caught before Pod creation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8730aa2e-bcd1-4d84-ab7e-e82b3679f449
📒 Files selected for processing (20)
- examples/kms-health-kind/Dockerfile.fake-plugin
- examples/kms-health-kind/Dockerfile.monitor
- examples/kms-health-kind/Makefile
- examples/kms-health-kind/bin/down.sh
- examples/kms-health-kind/bin/up.sh
- examples/kms-health-kind/cmd/monitor/main.go
- examples/kms-health-kind/cmd/render-container/main.go
- examples/kms-health-kind/fake-plugin/main.go
- examples/kms-health-kind/manifests/encryption-config.yaml
- examples/kms-health-kind/manifests/kind.yaml
- examples/kms-health-kind/manifests/monitor-deployment.yaml
- examples/kms-health-kind/manifests/namespace.yaml
- examples/kms-health-kind/manifests/rbac.yaml
- examples/kms-health-kind/manifests/static-pod.yaml
- examples/kms-health-kind/verify.sh
- pkg/operator/encryption/kms/health/assets/kms-health-container.yaml
- pkg/operator/encryption/kms/health/bindata.go
- pkg/operator/encryption/kms/health/container_template.go
- pkg/operator/encryption/kms/health/container_template_test.go
- pkg/operator/encryption/kms/health/writer_configmap.go
✅ Files skipped from review due to trivial changes (7)
- examples/kms-health-kind/manifests/namespace.yaml
- examples/kms-health-kind/manifests/kind.yaml
- examples/kms-health-kind/manifests/encryption-config.yaml
- pkg/operator/encryption/kms/health/bindata.go
- examples/kms-health-kind/bin/down.sh
- examples/kms-health-kind/cmd/render-container/main.go
- examples/kms-health-kind/manifests/rbac.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
- examples/kms-health-kind/Dockerfile.monitor
- examples/kms-health-kind/Dockerfile.fake-plugin
- examples/kms-health-kind/bin/up.sh
- pkg/operator/encryption/kms/health/writer_configmap.go
- examples/kms-health-kind/fake-plugin/main.go
@ibihim: The following test failed, say

Full PR test history. Your PR dashboard.

Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What
Create package with cobra command to run a health monitor for KMS plugin.
Why
We need some introspection into the state of the KMS plugin.
Preparing to create proper signaling for users.
Notes
Output:
```json
{
  "not-ok": {
    "timestamp": "2026-04-29T21:48:46Z",
    "observerPod": "kms-health-monitor-55789c9cc6-62vw7",
    "detail": "unhealthy:test-forced",
    "keyIDHash": "127fb4517dfd...77458d3"
  },
  "ok": {
    "timestamp": "2026-04-29T22:01:43Z",
    "observerPod": "kms-health-monitor-55789c9cc6-62vw7",
    "keyIDHash": "127fb4517dfd...77458d3"
  },
  "rpc-error": {
    "timestamp": "2026-04-29T21:49:04Z",
    "observerPod": "kms-health-monitor-55789c9cc6-62vw7",
    "detail": "DeadlineExceeded"
  }
}
```

Summary by CodeRabbit
New Features
Documentation
Tests