
[WIP] CNTRLPLANE-3234: health monitor #2193

Open

ibihim wants to merge 3 commits into openshift:master from ibihim:CNTRLPLANE-3234-health-monitor

Conversation

Contributor

@ibihim ibihim commented Apr 30, 2026

What

Create package with cobra command to run a health monitor for KMS plugin.

Why

We need some introspection into the state of the KMS plugin.
Preparing to create proper signaling for users.

Notes

kubectl --context kind-kms-health-claude -n kms-health-test \
  get cm kms-health-status -o json | jq '.data | map_values(fromjson)'

Output:

{
  "not-ok": {
    "timestamp": "2026-04-29T21:48:46Z",
    "observerPod": "kms-health-monitor-55789c9cc6-62vw7",
    "detail": "unhealthy:test-forced",
    "keyIDHash": "127fb4517dfd...77458d3"
  },
  "ok": {
    "timestamp": "2026-04-29T22:01:43Z",
    "observerPod": "kms-health-monitor-55789c9cc6-62vw7",
    "keyIDHash": "127fb4517dfd...77458d3"
  },
  "rpc-error": {
    "timestamp": "2026-04-29T21:49:04Z",
    "observerPod": "kms-health-monitor-55789c9cc6-62vw7",
    "detail": "DeadlineExceeded"
  }
}

Summary by CodeRabbit

  • New Features

    • KMS v2 health monitor and companion CLI + container template that publish classified, timestamped health statuses to ConfigMaps.
  • Documentation

    • Complete example KIND-based harness with manifests, runnable scripts, Makefile, fake plugin, and render tooling.
  • Tests

    • New unit and integration tests covering probe, monitor, writer, and container-generation behaviors.

Introduces pkg/operator/encryption/kms/health/:

- Probe calls Status() on an injected kmsservice.Service (mirrors
  preflight's pattern) and classifies the response as ok,
  unhealthy:not-ok:<truncated-body>, or
  unhealthy:rpc-error:<gRPC-code>. Never returns an error; failures
  are baked into HealthStatus.Healthz so the monitor loop stays flat.

- StatusWriter / ConfigMapWriter updates an existing ConfigMap in
  place. Hard-fails on miss so misconfiguration is loud.

- Monitor drives the tick loop: probe, stamp observer pod, write,
  sleep on healthy/unhealthy interval, repeat until ctx cancel.
  Write errors are logged and tolerated.

- health.NewCommand(ctx) wires flags, validates inputs, builds a
  rest.Config, dials the UDS via the same kmsv2 client preflight
  uses, and hands a Probe + ConfigMapWriter to the Monitor.

- --output-mode=condition is reserved and explicitly rejected so
  the OpenShift track can add it without silent behavior change.

- Tests use FakeClock with the HasWaiters() barrier for race-free
  stepping through healthy -> unhealthy -> healthy transitions.

openshift-ci-robot commented Apr 30, 2026

@ibihim: This pull request references CNTRLPLANE-3234 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What

Create package with cobra command to run a health monitor for KMS plugin.

Why

We need some introspection into the state of the KMS plugin.
Preparing to create proper signaling for users.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 30, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2026

coderabbitai Bot commented Apr 30, 2026

Walkthrough

Adds a KMSv2 health-monitoring subsystem: probe types and monitor loop, ConfigMap writer and container template generator, CLI command, tests, and a KIND-based test harness including a fake KMS plugin, manifests, build scripts, and verification tooling.

Changes

Cohort / File(s) Summary
Core health package
pkg/operator/encryption/kms/health/types.go, probe.go, monitor.go, writer.go, writer_configmap.go, container_template.go, bindata.go, doc.go
New health domain types, Probe wrapper for KMS Status RPC, Monitor loop that probes and writes statuses, StatusWriter interface and ConfigMap writer implementation, and template-driven container generation with embedded assets.
CLI / runtime
pkg/operator/encryption/kms/health/cmd.go, examples/kms-health-kind/cmd/monitor/main.go, examples/kms-health-kind/cmd/render-container/main.go
Adds a Cobra command to run the monitor with configurable socket/timings/output and two small CLI utilities: monitor binary entrypoint and a render-container helper.
Tests
pkg/operator/encryption/kms/health/*_test.go (all test files), pkg/operator/encryption/kms/health/container_template_test.go
Comprehensive unit tests for Probe, Monitor, ConfigMap writer, CLI option validation, container template generation, and goroutine/leak/time deterministic behaviors.
KIND harness images & build
examples/kms-health-kind/Dockerfile.fake-plugin, examples/kms-health-kind/Dockerfile.monitor, examples/kms-health-kind/Makefile
Dockerfiles and Makefile to build fake-plugin and monitor images, load into KIND, and orchestrate harness actions.
KIND harness scripts
examples/kms-health-kind/bin/up.sh, examples/kms-health-kind/bin/down.sh, examples/kms-health-kind/verify.sh
Cluster lifecycle scripts: bring KIND up/down, load images, apply manifests, flip health states via host markers, and an acceptance test script that validates health transitions and timestamps.
Fake KMS plugin
examples/kms-health-kind/fake-plugin/main.go
Implements a disposable KMSv2-compatible gRPC service reporting status (driven by filesystem markers) and AES-256-GCM Encrypt/Decrypt using an in-code key; supports stay-down marker and socket management.
KIND manifests
examples/kms-health-kind/manifests/* (e.g., encryption-config.yaml, kind.yaml, monitor-deployment.yaml, namespace.yaml, rbac.yaml, static-pod.yaml)
Kubernetes manifests for the test harness: KIND cluster config with static-pod mount, encryption-config pointing to plugin socket, namespace, RBAC, static pod for fake-plugin, and a Deployment for the monitor.
Assets
pkg/operator/encryption/kms/health/assets/kms-health-container.yaml, examples/kms-health-kind/Dockerfile.*
Embedded container YAML asset used by template generation and Dockerfiles for building runtime images.

Sequence Diagram

sequenceDiagram
    participant Monitor as Health Monitor
    participant Probe as Probe (RPC wrapper)
    participant KMS as Fake KMS Plugin (gRPC)
    participant Writer as ConfigMapWriter
    participant K8s as Kubernetes API

    Monitor->>Probe: Probe(ctx)
    Probe->>KMS: Status() RPC
    alt Status OK
        KMS-->>Probe: {Healthz: "ok", KeyID: "..."}
        Probe-->>Monitor: HealthStatus{Healthz: OK, KeyIDHash: "...", Timestamp: t}
    else Status NotOK or RPC error
        KMS-->>Probe: error / non-ok body
        Probe-->>Monitor: HealthStatus{Healthz: NotOK|RPCError, Detail: "...", Timestamp: t}
    end
    Monitor->>Writer: Write(ctx, HealthStatus)
    Writer->>K8s: Patch/Create ConfigMap data[<class>]
    K8s-->>Writer: Success
    Monitor->>Monitor: sleep(healthyInterval|unhealthyInterval)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): coverage is 28.07%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Stable And Deterministic Test Names (❓ Inconclusive): test files referenced in the PR summary cannot be located in the repository, and it is unclear whether the Ginkgo testing framework is used. Resolution: verify the test files are included in the PR, confirm whether the repository uses Ginkgo, and re-run the check with access to the committed test code.
✅ Passed checks (10 passed)
  • Description Check: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: the PR title references CNTRLPLANE-3234 and mentions "health monitor", which aligns with the changeset's primary focus of adding a health monitoring system for KMS plugins.
  • Linked Issues Check: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: check skipped because no linked issues were found for this pull request.
  • Test Structure And Quality: tests use the standard Go testing.T framework with excellent quality: single responsibility, proper setup/cleanup, comprehensive timeouts, and descriptive assertions matching codebase patterns.
  • Microshift Test Compatibility: this PR does not introduce any Ginkgo e2e tests; all test files are standard Go unit tests, and verify.sh is a Bash acceptance test script.
  • Single Node Openshift (SNO) Test Compatibility: the pull request does not add any Ginkgo e2e tests; all test files use the standard Go testing framework.
  • Topology-Aware Scheduling Compatibility: the PR adds a reusable health monitoring library with no topology-unfriendly scheduling constraints in production code.
  • OTE Binary Stdout Contract: the monitor binary complies with the contract; klog v2 writes to stderr by default, error messages explicitly use os.Stderr, and no non-JSON content goes to stdout.
  • IPv6 And Disconnected Network Test Compatibility: not applicable; the PR only adds standard Go unit tests, not Ginkgo e2e tests.

Contributor

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ibihim
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from dgrisonnet and p0lyn0mial April 30, 2026 09:27
// Cancel inline rather than via defer: defers in unbounded loops
// accumulate until the function returns.
writeCtx, writeCancel := context.WithTimeout(ctx, m.writeTimeout)
err := m.writer.Write(writeCtx, status)
Contributor Author

@ibihim ibihim Apr 30, 2026


TODO:

  • Compare hysteresis like pkg/monitor/health?
  • (nit) IIFE better than non-defer cancel?


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
examples/kms-health-kind/manifests/static-pod.yaml (1)

17-28: ⚡ Quick win

Harden pod/container security context for the harness.

Add explicit securityContext to disable privilege escalation and lock down defaults.

Suggested change
 spec:
+  securityContext:
+    seccompProfile:
+      type: RuntimeDefault
   containers:
   - name: fake-plugin
     image: kms-health-kind-fake-plugin:dev
     imagePullPolicy: Never
+    securityContext:
+      allowPrivilegeEscalation: false
+      readOnlyRootFilesystem: true
+      runAsNonRoot: true
+      capabilities:
+        drop: ["ALL"]
     volumeMounts:
     - name: sock-tmp
       mountPath: /tmp
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/manifests/static-pod.yaml` around lines 17 - 28, Add
explicit pod- and container-level securityContext entries for the fake-plugin
container to disable privilege escalation and tighten defaults: under the pod
spec add a podSecurityContext with runAsNonRoot: true (and optionally runAsUser
and fsGroup), and for the container named "fake-plugin" add a securityContext
with allowPrivilegeEscalation: false, readOnlyRootFilesystem: true,
runAsNonRoot: true, capabilities: { drop: ["ALL"] }, and seccompProfile: { type:
"RuntimeDefault" } so the container and mounted volume "sock-tmp" are locked
down.
examples/kms-health-kind/manifests/rbac.yaml (1)

16-19: ⚡ Quick win

Align Role verbs with writer operations (patch + create).

Current Role grants update but the writer path uses patch and may create on NotFound (pkg/operator/encryption/kms/health/writer_configmap.go, Line 74-99). Add create and drop unused update to avoid fallback failures and keep least privilege tight.

Suggested change
-  verbs: ["get", "update", "patch"]
+  verbs: ["get", "patch", "create"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/manifests/rbac.yaml` around lines 16 - 19, The Role
currently grants verbs ["get","update","patch"] for resourceNames
["kms-health-status"]; update it to reflect the writer behavior in
pkg/operator/encryption/kms/health/writer_configmap.go (lines ~74-99) by
removing "update" and adding "create" so the verbs become
["get","patch","create"], ensuring the writer can patch or create on NotFound
while keeping least privilege.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/kms-health-kind/fake-plugin/main.go`:
- Around line 118-124: The shutdown select should not call log.Fatalf on server
exit because a normal SIGTERM can race with the serverErr case; update the
select handling around sig and serverErr so that when err from the serverErr
channel is http.ErrServerClosed (or shutdown was triggered by sig) you treat it
as a normal shutdown (log.Info/Print) instead of exiting non-zero, otherwise log
the unexpected error; use the existing serverErr, sig, ListenAndServe and
server.Shutdown symbols to implement this check and ensure server.Shutdown() is
still called for graceful termination.

In `@examples/kms-health-kind/manifests/configmap.yaml`:
- Around line 1-2: The comment in
examples/kms-health-kind/manifests/configmap.yaml is outdated: it claims the
monitor's ConfigMapWriter "hard-fails" if the ConfigMap is missing, but
writer_configmap.go implements a create fallback on NotFound; update the comment
to reflect that absence is tolerated because ConfigMapWriter (see
ConfigMapWriter in pkg/operator/encryption/kms/health/writer_configmap.go) will
attempt to create the ConfigMap on NotFound rather than hard-fail, so reword the
note to state the ConfigMap is pre-created for convenience but the writer will
create it if missing.
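The patch-then-create-on-NotFound flow that prompt describes can be sketched with an in-memory stand-in. Everything here is hypothetical scaffolding: store, memStore, and errNotFound stand in for the client-go ConfigMap interface and apierrors.IsNotFound that the real writer uses.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for apierrors.IsNotFound.
var errNotFound = errors.New("not found")

// store is a hypothetical stand-in for the client-go calls the writer makes.
type store interface {
	Patch(name, key, value string) error
	Create(name string) error
}

// writeStatus patches the status key and, per the described fallback,
// creates the ConfigMap first when the patch reports NotFound.
func writeStatus(s store, name, class, payload string) error {
	err := s.Patch(name, class, payload)
	if !errors.Is(err, errNotFound) {
		return err
	}
	if err := s.Create(name); err != nil {
		return err
	}
	return s.Patch(name, class, payload)
}

// memStore is an in-memory fake keyed by ConfigMap name.
type memStore struct{ cms map[string]map[string]string }

func (m *memStore) Patch(name, key, value string) error {
	cm, ok := m.cms[name]
	if !ok {
		return errNotFound
	}
	cm[key] = value
	return nil
}

func (m *memStore) Create(name string) error {
	m.cms[name] = map[string]string{}
	return nil
}

func main() {
	s := &memStore{cms: map[string]map[string]string{}}
	if err := writeStatus(s, "kms-health-status", "ok", `{"timestamp":"t"}`); err != nil {
		panic(err)
	}
	fmt.Println(s.cms["kms-health-status"]["ok"])
}
```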

In `@examples/kms-health-kind/manifests/monitor-deployment.yaml`:
- Around line 19-58: The pod lacks an explicit securityContext and the monitor
container has no container-level security hardening; add a pod-level
spec.securityContext and a containers[ name: "monitor" ].securityContext to
enforce non-root and immutable filesystem policies: set runAsNonRoot: true (and
runAsUser to a non-root UID if known), set readOnlyRootFilesystem: true, set
allowPrivilegeEscalation: false, drop all capabilities and add a conservative
seccompProfile (e.g., RuntimeDefault) and/or nonRootGroup as appropriate; ensure
these changes work with the existing volumeMount sock-tmp (which is already
readOnly: true) and apply the policies to the "monitor" container and pod
spec.securityContext.

In `@examples/kms-health-kind/manifests/static-pod.yaml`:
- Line 16: The pod manifest currently sets hostNetwork: true which unnecessarily
broadens network exposure for a pod that communicates via a Unix socket under
/tmp; remove the hostNetwork: true line (or set it to false) in the
static-pod.yaml Pod spec so the Pod uses the cluster networking namespace
instead, ensuring the pod still communicates over the /tmp Unix socket without
host network access.

In `@pkg/operator/encryption/kms/health/cmd.go`:
- Around line 143-145: validate() currently only checks writeTimeout > 0 and
does not enforce the documented cadence budget; update validate() (the method
that checks o.writeTimeout and o.probeTimeout) to also verify that
o.probeTimeout is positive and that the sum o.probeTimeout + o.writeTimeout is
strictly less than probeIntervalUnhealthy (as documented in monitor.go). If the
sum violates the budget, return a clear fmt.Errorf that the combined
probeTimeout+writeTimeout must be below probeIntervalUnhealthy so callers cannot
configure values that break the fast-retry unhealthy cadence.
- Line 207: The probe is constructed with a hardcoded 0 timeout which prevents
the monitor loop from applying the configured deadline to Probe.Status(); change
the NewProbe call (currently NewProbe(service, 0)) to pass the validated probe
timeout variable used when dialing the KMS client (the same "--probe-timeout"
value/variable) so the probe enforces that deadline; update the call site where
NewProbe(service, 0) is invoked to NewProbe(service,
<configuredProbeTimeoutVariable>) so Status() honors the configured timeout.

In `@pkg/operator/encryption/kms/health/monitor_test.go`:
- Around line 129-130: The test currently does a blocking receive on done after
calling cancel(), which can hang if monitor.Run fails to observe the
cancellation; change the shutdown wait to a timeout-backed select (as used in
other tests): after calling cancel(), wait on either <-done or a time.After
timeout and fail the test (t.Fatalf / t.Fatal) on timeout so the test surfaces
failures instead of blocking; update the test around the Run invocation and uses
of cancel() and done to implement this timed select.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 18bb6413-9d46-45e1-887d-b17e07b4b5f1

📥 Commits

Reviewing files that changed from the base of the PR and between d45f1f2 and 20d990f.

📒 Files selected for processing (26)
  • examples/kms-health-kind/Dockerfile.fake-plugin
  • examples/kms-health-kind/Dockerfile.monitor
  • examples/kms-health-kind/Makefile
  • examples/kms-health-kind/bin/down.sh
  • examples/kms-health-kind/bin/up.sh
  • examples/kms-health-kind/cmd/monitor/main.go
  • examples/kms-health-kind/fake-plugin/main.go
  • examples/kms-health-kind/manifests/configmap.yaml
  • examples/kms-health-kind/manifests/encryption-config.yaml
  • examples/kms-health-kind/manifests/kind.yaml
  • examples/kms-health-kind/manifests/monitor-deployment.yaml
  • examples/kms-health-kind/manifests/namespace.yaml
  • examples/kms-health-kind/manifests/rbac.yaml
  • examples/kms-health-kind/manifests/static-pod.yaml
  • examples/kms-health-kind/verify.sh
  • pkg/operator/encryption/kms/health/cmd.go
  • pkg/operator/encryption/kms/health/cmd_test.go
  • pkg/operator/encryption/kms/health/doc.go
  • pkg/operator/encryption/kms/health/monitor.go
  • pkg/operator/encryption/kms/health/monitor_test.go
  • pkg/operator/encryption/kms/health/probe.go
  • pkg/operator/encryption/kms/health/probe_test.go
  • pkg/operator/encryption/kms/health/types.go
  • pkg/operator/encryption/kms/health/writer.go
  • pkg/operator/encryption/kms/health/writer_configmap.go
  • pkg/operator/encryption/kms/health/writer_configmap_test.go

Comment on lines +118 to +124
select {
case <-sig:
	log.Print("fake-plugin: signal received, shutting down")
case err := <-serverErr:
	log.Fatalf("fake-plugin: server exited: %v", err)
}
server.Shutdown()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't fatal on the shutdown path.

ListenAndServe() is observed on a separate goroutine, so a normal SIGTERM can race the serverErr case. log.Fatalf will then exit non-zero even when the plugin is stopping intentionally, which makes the KIND harness teardown flaky.

🔧 Suggested fix
  select {
  case <-sig:
    log.Print("fake-plugin: signal received, shutting down")
  case err := <-serverErr:
-   log.Fatalf("fake-plugin: server exited: %v", err)
+   if err != nil {
+     log.Fatalf("fake-plugin: server exited: %v", err)
+   }
+   return
  }
  server.Shutdown()

Comment thread on examples/kms-health-kind/manifests/configmap.yaml (Outdated)
Comment on lines +19 to +58
    spec:
      serviceAccountName: kms-health-monitor
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: monitor
        image: kms-health-kind-monitor:dev
        imagePullPolicy: Never
        # Harness overrides the committed production numbers so
        # verify.sh completes in a reasonable wall-clock budget.
        args:
        - --kms-socket=/tmp/kms.sock
        # Harness intervals are tighter than production (60s/10s/3s)
        # so verify.sh completes in a reasonable wall-clock budget.
        # Assertion 6 (rpc-error detection) does not depend on these:
        # fake-plugin's staydown marker forces a 3s outage on restart,
        # which is long enough for any reasonable probe-timeout.
        - --probe-interval=2s
        - --probe-interval-unhealthy=1s
        - --probe-timeout=1s
        - --configmap-namespace=kms-health-test
        - --configmap-name=kms-health-status
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: sock-tmp
          mountPath: /tmp
          readOnly: true
      volumes:
      - name: sock-tmp
        hostPath:
          path: /tmp
          type: Directory

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lock down the monitor pod security context.

This pod currently inherits the default security context, so it can run with a writable root filesystem and without an explicit no-escalation guarantee. The container doesn't need either, so tightening this should be a small change.

🔒 Suggested hardening
     spec:
+      securityContext:
+        runAsNonRoot: true
+        seccompProfile:
+          type: RuntimeDefault
       serviceAccountName: kms-health-monitor
       nodeSelector:
         node-role.kubernetes.io/control-plane: ""
@@
       containers:
       - name: monitor
         image: kms-health-kind-monitor:dev
         imagePullPolicy: Never
+        securityContext:
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: true
🧰 Tools
🪛 Checkov (3.2.525)

[medium] 5-58: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 5-58: Minimize the admission of root containers

(CKV_K8S_23)

🪛 Trivy (0.69.3)

[error] 28-53: Root file system is not read-only

Container 'monitor' of Deployment 'kms-health-monitor' should set 'securityContext.readOnlyRootFilesystem' to true

Rule: KSV-0014

Learn more

(IaC/Kubernetes)


[error] 28-53: Default security context configured

container kms-health-monitor in kms-health-test namespace is using the default security context

Rule: KSV-0118

Learn more

(IaC/Kubernetes)


[error] 19-58: Default security context configured

deployment kms-health-monitor in kms-health-test namespace is using the default security context, which allows root privileges

Rule: KSV-0118

Learn more

(IaC/Kubernetes)


    tier: control-plane
    component: kms-fake-plugin
spec:
  hostNetwork: true

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove hostNetwork: true unless strictly required.

Line 16 broadens network exposure but this pod is coordinating through a Unix socket on /tmp, so host networking appears unnecessary.

Suggested change
 spec:
-  hostNetwork: true
   containers:

Comment on lines +143 to +145
if o.writeTimeout <= 0 {
return fmt.Errorf("--write-timeout must be positive")
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the unhealthy cadence budget.

monitor.go documents that probeTimeout + writeTimeout must stay below probeIntervalUnhealthy, but validate() only checks that writeTimeout is positive. A caller can currently configure a write timeout that pushes the loop past the unhealthy cadence and defeats the fast-retry behavior.

♻️ Suggested fix
   if o.writeTimeout <= 0 {
     return fmt.Errorf("--write-timeout must be positive")
   }
+  if o.probeTimeout+o.writeTimeout >= o.probeIntervalUnhealthy {
+    return fmt.Errorf(
+      "--probe-timeout + --write-timeout must be smaller than --probe-interval-unhealthy",
+    )
+  }

if err != nil {
return fmt.Errorf("dial KMS plugin at %q: %w", endpoint, err)
}
probe := NewProbe(service, 0)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass the configured probe timeout into NewProbe.

--probe-timeout is validated and used when dialing the KMS client, but the probe itself is still constructed with 0, so the monitor loop never applies the configured deadline to Status(). If the plugin hangs, the monitor can stall indefinitely and the timeout flag becomes misleading.

🔧 Suggested fix
-  probe := NewProbe(service, 0)
+  probe := NewProbe(service, o.probeTimeout)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/cmd.go` at line 207, The probe is
constructed with a hardcoded 0 timeout which prevents the monitor loop from
applying the configured deadline to Probe.Status(); change the NewProbe call
(currently NewProbe(service, 0)) to pass the validated probe timeout variable
used when dialing the KMS client (the same "--probe-timeout" value/variable) so
the probe enforces that deadline; update the call site where NewProbe(service,
0) is invoked to NewProbe(service, <configuredProbeTimeoutVariable>) so Status()
honors the configured timeout.

Comment on lines +129 to +130
cancel()
<-done

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound the shutdown wait to avoid hanging the test run.

Line 130 blocks indefinitely if Run fails to observe cancellation. Use a timeout-backed select (as done in the other tests) so failures surface quickly instead of stalling CI.

Suggested fix
 	cancel()
-	<-done
+	select {
+	case <-done:
+	case <-time.After(2 * time.Second):
+		t.Fatal("Monitor did not stop within 2s of context cancel")
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/monitor_test.go` around lines 129 - 130,
The test currently does a blocking receive on done after calling cancel(), which
can hang if monitor.Run fails to observe the cancellation; change the shutdown
wait to a timeout-backed select (as used in other tests): after calling
cancel(), wait on either <-done or a time.After timeout and fail the test
(t.Fatalf / t.Fatal) on timeout so the test surfaces failures instead of
blocking; update the test around the Run invocation and uses of cancel() and
done to implement this timed select.

Member

@ardaguclu ardaguclu left a comment


I know it is not ready for review yet. Nevertheless I dropped a comment that can be useful. Also design needs to go here https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/kms-encryption-foundations.md

@@ -0,0 +1,238 @@
package health
Member


What this package is supposed to do is very similar to https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/kms/preflight/checker.go. I think it is better to mimic from it.

The only difference is health monitor will periodically write the status to somewhere.

Contributor Author

@ibihim ibihim Apr 30, 2026


What are the differences? Both are cobra commands. They have flags, validations and a run method.

I create a reader, a writer and put them into a monitor. Writer is an interface, because I want to pivot to operator conditions.

Contributor


yeah, it seems there is room for reuse. it would be better to have a single binary with custom logic.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (1)
examples/kms-health-kind/fake-plugin/main.go (1)

118-124: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't fatal on the normal shutdown path.

serverErr can race with the SIGTERM/SIGINT branch, so a clean shutdown can still hit log.Fatalf and make the KIND harness exit non-zero.

Suggested fix
  select {
  case <-sig:
    log.Print("fake-plugin: signal received, shutting down")
  case err := <-serverErr:
-   log.Fatalf("fake-plugin: server exited: %v", err)
+   if err != nil {
+     log.Fatalf("fake-plugin: server exited: %v", err)
+   }
+   return
  }
  server.Shutdown()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/fake-plugin/main.go` around lines 118 - 124, The
select currently calls log.Fatalf on any value from serverErr which can race
with the sig shutdown path; change the handling in main so you do not call
log.Fatalf on the normal shutdown path: replace log.Fatalf("fake-plugin: server
exited: %v", err) with a non-fatal log (e.g., log.Printf) and only treat
non-nil, unexpected errors as fatal if appropriate, then call server.Shutdown()
and drain/wait for serverErr to ensure the server goroutine has finished. Update
the select case for err := <-serverErr and the shutdown sequence around
server.Shutdown() so normal SIGTERM/SIGINT leads to a clean, non-zero-exit-free
shutdown.
🧹 Nitpick comments (1)
examples/kms-health-kind/manifests/static-pod.yaml (1)

15-28: ⚡ Quick win

Add an explicit securityContext for this static pod.

This pod runs in kube-system and mounts a hostPath, but it still inherits the default root-capable security settings. Even for the kind harness, it is worth pinning this down with an explicit pod/container securityContext (seccompProfile: RuntimeDefault, allowPrivilegeEscalation: false, dropped capabilities, and runAsNonRoot if the image supports it).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/manifests/static-pod.yaml` around lines 15 - 28, Add
an explicit pod and container securityContext for the static pod (under spec and
the container named "fake-plugin"): set pod-level
securityContext.seccompProfile.type to "RuntimeDefault" and at the container
level set securityContext.allowPrivilegeEscalation=false,
securityContext.capabilities.drop=["ALL"], and securityContext.runAsNonRoot=true
(only enable runAsNonRoot if the image supports it); apply these to the spec and
the container named "fake-plugin" so the hostPath-mounted static pod in
kube-system runs with constrained privileges.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/kms-health-kind/Makefile`:
- Around line 47-55: The Makefile invocation of render-container uses the
--command flag set to /usr/bin/kms-health-monitor which doesn't match the binary
location installed by Dockerfile.monitor (/monitor); update the --command
argument in the render-container call (replace
--command=/usr/bin/kms-health-monitor) to point to /monitor so the pod spec
references the actual binary path used in the image.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d618721a-cbe2-4be7-821e-5d31f663463d

📥 Commits

Reviewing files that changed from the base of the PR and between 20d990f and 90eaad7.

📒 Files selected for processing (14)
  • examples/kms-health-kind/Makefile
  • examples/kms-health-kind/bin/up.sh
  • examples/kms-health-kind/cmd/render-container/main.go
  • examples/kms-health-kind/fake-plugin/main.go
  • examples/kms-health-kind/manifests/encryption-config.yaml
  • examples/kms-health-kind/manifests/monitor-deployment.yaml
  • examples/kms-health-kind/manifests/rbac.yaml
  • examples/kms-health-kind/manifests/static-pod.yaml
  • examples/kms-health-kind/verify.sh
  • pkg/operator/encryption/kms/health/assets/kms-health-container.yaml
  • pkg/operator/encryption/kms/health/bindata.go
  • pkg/operator/encryption/kms/health/container_template.go
  • pkg/operator/encryption/kms/health/container_template_test.go
  • pkg/operator/encryption/kms/health/writer_configmap.go
✅ Files skipped from review due to trivial changes (2)
  • examples/kms-health-kind/manifests/encryption-config.yaml
  • examples/kms-health-kind/manifests/rbac.yaml

Comment on lines +50 to +70
# 3) Force the static pod to restart so it picks up the freshly-loaded
# image (kubelet may have cached an ErrImageNeverPull from before).
echo "[up.sh] bouncing fake-plugin container so it picks up loaded image"
FAKE_CID="$(docker exec "${NODE}" crictl ps --name fake-plugin -q 2>/dev/null | head -1 || true)"
if [ -n "${FAKE_CID}" ]; then
docker exec "${NODE}" crictl stop "${FAKE_CID}" >/dev/null || true
fi

# 4) Apply namespace + RBAC + monitor Deployment. The status ConfigMap
# is created by the writer itself on first observation; nothing
# pre-creates it.
echo "[up.sh] applying namespace, RBAC, monitor Deployment"
kubectl --context "${CTX}" apply -f manifests/namespace.yaml
kubectl --context "${CTX}" apply -f manifests/rbac.yaml
kubectl --context "${CTX}" apply -f manifests/monitor-deployment.yaml

# 5) Wait for the monitor Deployment to be Available. Apiserver is NOT
# wired to KMS here (see manifests/kind.yaml) — we're only
# validating the monitor, not apiserver/KMS integration.
echo "[up.sh] waiting for monitor Deployment"
kubectl --context "${CTX}" -n kms-health-test rollout status deployment/kms-health-monitor-fake --timeout=120s

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the fake plugin to come back before rolling out the monitor.

After crictl stop, the script immediately deploys the monitor and only waits for the Deployment to be Available. Because the monitor has no readiness gate on the socket, the rollout can succeed while the static pod is still restarting, which makes the first observed health state racy and can flake the harness. Poll the static pod or socket path before continuing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/bin/up.sh` around lines 50 - 70, The script stops
the fake-plugin static pod (FAKE_CID via crictl stop) but then immediately
applies the monitor Deployment, which can roll out before the static pod/socket
is ready; add a polling loop after docker exec "${NODE}" crictl stop
"${FAKE_CID}" that waits (with a short sleep and overall timeout) for the
fake-plugin container to reappear and be Running (e.g., poll docker exec
"${NODE}" crictl ps --name fake-plugin -q and check crictl inspect --output
go-template --template '{{.status.state}}' or use a socket probe like test -S
/path/to/socket inside the node) before proceeding to apply the
namespace/RBAC/monitor and running kubectl rollout status; reference FAKE_CID,
NODE, and the subsequent kubectl rollout status
deployment/kms-health-monitor-fake when adding the wait.

Comment thread examples/kms-health-kind/Makefile
Comment on lines +28 to +53
containers:
- name: kms-health-monitor-fake
image: kms-health-kind-monitor:dev
imagePullPolicy: Never
# Tighter intervals than library defaults so verify.sh fits in budget.
args:
- --kms-socket=/var/run/kmsplugin-fake/kms.sock
- --probe-interval=2s
- --probe-interval-unhealthy=1s
- --probe-timeout=1s
- --write-timeout=5s
- --output-mode=configmap
- --configmap-namespace=kms-health-test
- --configmap-name=kms-health-fake
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: kms-plugin-socket-fake
mountPath: /var/run/kmsplugin-fake
resources:
requests:
memory: 50Mi
cpu: 5m

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the monitor command explicit here too.

The shared template always renders a command, but this manifest only mirrors the args. That makes the harness depend on the image entrypoint matching the template and weakens the "drift makes kind-verify go red" guarantee in the header. Add the same command stanza here, or generate this block from the shared template instead of hand-copying it.

🧰 Tools
🪛 Trivy (0.69.3)

[error] 29-53: Root file system is not read-only

Container 'kms-health-monitor-fake' of Deployment 'kms-health-monitor-fake' should set 'securityContext.readOnlyRootFilesystem' to true

Rule: KSV-0014


(IaC/Kubernetes)


[error] 29-53: Default security context configured

container kms-health-monitor-fake in kms-health-test namespace is using the default security context

Rule: KSV-0118


(IaC/Kubernetes)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/manifests/monitor-deployment.yaml` around lines 28 -
53, The container definition for kms-health-monitor-fake currently only
specifies args, leaving the image entrypoint implicit; add an explicit command
stanza to match the shared template so the pod no longer depends on the image
entrypoint drift (locate the container with name: kms-health-monitor-fake and
add the same command array used by the shared template above the args block), or
replace this hand-copied container block with the generated/shared template to
ensure the command and args are kept in sync with the harness expectations.

Comment on lines +13 to +15
- --output-mode=configmap
- --configmap-namespace={{.ConfigMapNamespace}}
- --configmap-name=kms-health-{{.KMSPluginName}}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't hard-code a shared ConfigMap name here.

ConfigMapWriter's concurrency contract requires one ConfigMap per monitor instance, but this template makes every sidecar for the same plugin write kms-health-{{.KMSPluginName}}. During rollouts or multiple replicas, those writers will clobber each other's class keys and observerPod. Either encode pod identity in the name here or let the cmd-layer default choose the per-instance name.
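One way to encode pod identity is sketched below as a hypothetical Go helper; the name pattern, the `POD_NAME`-derived suffix, and the 253-character clamp are assumptions for illustration, not part of the library's API:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// perInstanceConfigMapName is a hypothetical sketch: it appends the pod name
// so concurrent monitor replicas write distinct ConfigMaps, and clamps the
// result to the 253-character ConfigMap name limit, keeping a short hash
// suffix so truncated names remain unique.
func perInstanceConfigMapName(pluginName, podName string) string {
	name := fmt.Sprintf("kms-health-%s-%s", pluginName, podName)
	const maxLen = 253
	if len(name) <= maxLen {
		return name
	}
	sum := sha256.Sum256([]byte(name))
	return fmt.Sprintf("%s-%x", name[:maxLen-10], sum[:4])
}

func main() {
	// podName would come from the POD_NAME downward-API env var.
	fmt.Println(perInstanceConfigMapName("fake", "kms-health-monitor-55789c9cc6-62vw7"))
	// → kms-health-fake-kms-health-monitor-55789c9cc6-62vw7
}
```

The alternative mentioned above, dropping the flag so the cmd layer defaults the name, avoids templating pod identity entirely.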

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml` around
lines 13 - 15, The ConfigMap name is hard-coded as
--configmap-name=kms-health-{{.KMSPluginName}}, which violates ConfigMapWriter's
one-per-monitor-instance contract and causes writers to clobber each other;
change the template to include unique pod/instance identity (for example append
the pod name or UID templated value) or remove the --configmap-name flag so the
cmd-layer default can generate a per-instance name; update the kms-health
container args in kms-health-container.yaml accordingly (referencing the
--configmap-name flag and ConfigMapWriter/ConfigMapNamespace/ConfigMapName
symbols) so each sidecar writes to a distinct ConfigMap.

Comment on lines +92 to +96
func (o ContainerOptions) validate() error {
if errs := validation.IsDNS1123Label(o.KMSPluginName); len(errs) > 0 {
return fmt.Errorf("KMSPluginName %q is not a valid DNS-1123 label: %s",
o.KMSPluginName, strings.Join(errs, "; "))
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the derived names, not just KMSPluginName.

A 63-character plugin name passes IsDNS1123Label, but kms-health-monitor-<name> and kms-plugin-socket-<name> then exceed the 63-character DNS label limit for container and volume names. That means validation succeeds here and the PodSpec fails later when Kubernetes rejects those derived names.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/container_template.go` around lines 92 -
96, The current ContainerOptions.validate() only validates KMSPluginName itself
but not the derived resource names, causing Kubernetes to reject pod/volume
names like "kms-health-monitor-<name>" or "kms-plugin-socket-<name>" when they
exceed DNS-1123 limits; update ContainerOptions.validate() to construct the
derived names (e.g., "kms-health-monitor-"+o.KMSPluginName and
"kms-plugin-socket-"+o.KMSPluginName) and run validation.IsDNS1123Label() (or
check length <= 63 and allowed chars) on each, returning a clear fmt.Errorf
referencing the failing derived name if validation fails so podspec generation
won’t produce invalid names.

Comment on lines +100 to +102
if len(o.OperatorCommand) == 0 {
return fmt.Errorf("OperatorCommand must have at least one element")
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject empty command elements.

OperatorCommand only checks len > 0, so callers can still pass []string{"", ...} or a value like --command=a,,b through the render CLI and emit an invalid entrypoint. Validate each element after trimming whitespace before rendering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/container_template.go` around lines 100 -
102, The current check only ensures o.OperatorCommand has a non-zero length but
allows empty/whitespace elements; update the validation where OperatorCommand is
checked to iterate over each element, trim whitespace (strings.TrimSpace) and
reject any empty entries by returning an error (e.g., fmt.Errorf) that names the
invalid element(s) or their indices; ensure the check happens before
rendering/using OperatorCommand so values like "" or " " (or malformed CSV-split
inputs) are rejected rather than producing an invalid container entrypoint.

Comment on lines +61 to +66
entry := classEntry{
Timestamp: status.Timestamp.UTC().Format(time.RFC3339),
ObserverPod: status.ObserverPod,
Detail: status.Healthz.Detail,
KeyIDHash: status.KeyIDHash,
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve sub-second precision in the stored timestamp.

Readers pick the current class by comparing timestamps, but Format(time.RFC3339) drops fractional seconds. Two transitions inside the same second then serialize identically, so ok/not-ok/rpc-error ordering becomes ambiguous. RFC3339Nano or a numeric epoch would avoid that.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/writer_configmap.go` around lines 61 - 66,
The Timestamp stored in classEntry currently uses
status.Timestamp.UTC().Format(time.RFC3339) which drops fractional seconds;
update the serialization in the classEntry construction (field Timestamp) to
preserve sub-second precision by formatting with time.RFC3339Nano or by writing
a numeric epoch (e.g., status.Timestamp.UTC().UnixNano()) so rapid transitions
within the same second remain totally ordered when readers compare timestamps.

ibihim added 2 commits April 30, 2026 15:51
GenerateContainerTemplate renders a corev1.Container sidecar.
KMSPluginName suffixes container name, volumeMount, kms-socket path,
and configmap name so multi-plugin coexists in one pod.

Drive-by: ConfigMapWriter godoc updated to match the actual
patch-then-create flow.
Self-contained throwaway harness for validating the health package
against a real KIND cluster. Defaults to cluster name kms-health-claude.

  fake-plugin/             ~90-line k8s.io/kms/pkg/service.Service
                           with AES-GCM and a file-marker unhealthy
                           toggle.
  cmd/monitor/             wraps health.NewCommand.
  cmd/render-container/    wraps health.GenerateContainerTemplate;
                           the harness Deployment mirrors its output.
  manifests/               fake-plugin runs as a static pod in
                           kube-system; monitor runs as a Deployment
                           in kms-health-test pinned to the
                           control-plane node; both share the host's
                           /tmp bind-mounted at /var/run/kmsplugin-fake/
                           for the UDS. RBAC splits into two rules
                           since k8s ignores create on rules with
                           resourceNames.
  verify.sh                six end-to-end assertions.

Drop this commit before merging to keep examples/ out of the library.
@ibihim ibihim force-pushed the CNTRLPLANE-3234-health-monitor branch from 90eaad7 to b0d6ae6 Compare April 30, 2026 13:58

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (5)
pkg/operator/encryption/kms/health/container_template.go (2)

100-102: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject empty/whitespace OperatorCommand elements.

len(o.OperatorCommand) > 0 is not sufficient; values like []string{"", "kms-health-monitor"} pass validation and can produce invalid container command execution.

Suggested fix
 	if len(o.OperatorCommand) == 0 {
 		return fmt.Errorf("OperatorCommand must have at least one element")
 	}
+	for i, arg := range o.OperatorCommand {
+		if strings.TrimSpace(arg) == "" {
+			return fmt.Errorf("OperatorCommand[%d] must be non-empty", i)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/container_template.go` around lines 100 -
102, The current check in the validation for OperatorCommand (the block that
only checks len(o.OperatorCommand) == 0) allows empty or whitespace-only
entries; update the validation in the same function (the OperatorCommand
validation in container_template.go) to iterate over o.OperatorCommand, trim
whitespace for each element (e.g., strings.TrimSpace) and return an error if any
element is empty after trimming, including the offending index/value in the
error message so callers know which command entry is invalid.

92-96: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate derived Kubernetes names, not only KMSPluginName.

A plugin name can be DNS-1123 valid while derived names (e.g., kms-health-monitor-<name>, kms-plugin-socket-<name>) become invalid for label-length/name rules and fail later at Pod creation.

Suggested fix
 func (o ContainerOptions) validate() error {
 	if errs := validation.IsDNS1123Label(o.KMSPluginName); len(errs) > 0 {
 		return fmt.Errorf("KMSPluginName %q is not a valid DNS-1123 label: %s",
 			o.KMSPluginName, strings.Join(errs, "; "))
 	}
+	for _, derived := range []string{
+		"kms-health-monitor-" + o.KMSPluginName,
+		"kms-plugin-socket-" + o.KMSPluginName,
+	} {
+		if errs := validation.IsDNS1123Label(derived); len(errs) > 0 {
+			return fmt.Errorf("derived name %q is not a valid DNS-1123 label: %s",
+				derived, strings.Join(errs, "; "))
+		}
+	}
 	if o.OperatorImage == "" {
 		return fmt.Errorf("OperatorImage is required")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/container_template.go` around lines 92 -
96, The validate method on ContainerOptions currently only checks KMSPluginName;
update ContainerOptions.validate to also construct the derived resource names
(e.g., "kms-health-monitor-"+o.KMSPluginName and
"kms-plugin-socket-"+o.KMSPluginName and any other names used by this package)
and run the same DNS/length validation (use validation.IsDNS1123Label and/or
explicit length checks) on each derived name, returning a clear fmt.Errorf if
any derived name is invalid and naming which derived name failed; reference the
ContainerOptions.validate function and the KMSPluginName field when implementing
this additional validation so failures are caught before Pod creation.
pkg/operator/encryption/kms/health/assets/kms-health-container.yaml (1)

13-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a per-instance ConfigMap name (Line 15 currently causes writer collisions).

--configmap-name=kms-health-{{.KMSPluginName}} makes replicas of the same plugin write to the same ConfigMap, which breaks the monitor’s one-writer-per-instance expectation during rollouts.

Suggested fix
```diff
 args:
   - --kms-socket=/var/run/kmsplugin-{{.KMSPluginName}}/kms.sock
   - --probe-interval={{.ProbeInterval}}
   - --probe-interval-unhealthy={{.ProbeIntervalUnhealthy}}
   - --probe-timeout={{.ProbeTimeout}}
   - --write-timeout={{.WriteTimeout}}
   - --output-mode=configmap
   - --configmap-namespace={{.ConfigMapNamespace}}
-  - --configmap-name=kms-health-{{.KMSPluginName}}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml` around
lines 13 - 15, The current flag --configmap-name uses
kms-health-{{.KMSPluginName}} which causes all replicas to write the same
ConfigMap; change the template to include a per-instance identifier (for example
--configmap-name=kms-health-{{.KMSPluginName}}-{{.PodName}} or -{{.PodUID}}) so
each pod/instance has its own ConfigMap; update the template variables used by
the manifest (ensure {{.PodName}} or {{.PodUID}} is provided by the rendering
context) and verify the monitor expects the new per-instance name.
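If per-instance names are preferred over dropping the flag, the pod name can be wired in via the Downward API; a minimal sketch (the POD_NAME variable is an assumption for illustration, and `$(VAR)` expansion in args is standard Kubernetes behavior):

```yaml
# Sketch only: per-pod ConfigMap naming via the Downward API.
# POD_NAME is illustrative; the template would need to keep the other args.
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
args:
  - --output-mode=configmap
  - --configmap-namespace={{.ConfigMapNamespace}}
  - --configmap-name=kms-health-{{.KMSPluginName}}-$(POD_NAME)
```

Note that the monitor would also need to accept (and garbage-collect) per-pod ConfigMaps for this to work across rollouts.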
examples/kms-health-kind/manifests/static-pod.yaml (1)

16-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove unnecessary host network usage.

Line 16 enables host networking even though this harness coordinates via a Unix socket on the shared hostPath volume; this unnecessarily increases network exposure.

Suggested change
```diff
 spec:
-  hostNetwork: true
   containers:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/kms-health-kind/manifests/static-pod.yaml` at line 16, Remove the
unnecessary hostNetwork setting by deleting or setting hostNetwork: false in the
Pod spec (the manifest key shown as "hostNetwork: true"); keep the hostPath
volume that provides the Unix socket for coordination and ensure any container
ports or network assumptions are adjusted to use the socket instead of host
networking. Confirm the containers that relied on hostNetwork now bind to the
shared hostPath socket and update any related readiness/liveness checks or env
vars that previously referenced host IPs.
examples/kms-health-kind/manifests/monitor-deployment.yaml (1)

20-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lock down pod/container security context defaults.

This deployment currently runs with default security context settings. Harden both the pod- and container-level contexts (run as non-root, seccomp RuntimeDefault, no privilege escalation, read-only root filesystem where compatible) to reduce unnecessary privilege in the harness.
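A hardened spec along those lines might look like this (a sketch only; the container name is assumed, and the field values follow the common Kubernetes restricted-profile conventions rather than anything in this PR):

```yaml
# Sketch: restricted-profile security settings for the monitor Deployment.
spec:
  securityContext:                 # pod-level
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: kms-health-monitor     # name assumed for illustration
      securityContext:             # container-level
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```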

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@examples/kms-health-kind/manifests/static-pod.yaml`:
- Line 16: Remove the unnecessary hostNetwork setting by deleting or setting
hostNetwork: false in the Pod spec (the manifest key shown as "hostNetwork:
true"); keep the hostPath volume that provides the Unix socket for coordination
and ensure any container ports or network assumptions are adjusted to use the
socket instead of host networking. Confirm the containers that relied on
hostNetwork now bind to the shared hostPath socket and update any related
readiness/liveness checks or env vars that previously referenced host IPs.

In `@pkg/operator/encryption/kms/health/assets/kms-health-container.yaml`:
- Around line 13-15: The current flag --configmap-name uses
kms-health-{{.KMSPluginName}} which causes all replicas to write the same
ConfigMap; change the template to include a per-instance identifier (for example
--configmap-name=kms-health-{{.KMSPluginName}}-{{.PodName}} or -{{.PodUID}}) so
each pod/instance has its own ConfigMap; update the template variables used by
the manifest (ensure {{.PodName}} or {{.PodUID}} is provided by the rendering
context) and verify the monitor expects the new per-instance name.

In `@pkg/operator/encryption/kms/health/container_template.go`:
- Around line 100-102: The current check in the validation for OperatorCommand
(the block that only checks len(o.OperatorCommand) == 0) allows empty or
whitespace-only entries; update the validation in the same function (the
OperatorCommand validation in container_template.go) to iterate over
o.OperatorCommand, trim whitespace for each element (e.g., strings.TrimSpace)
and return an error if any element is empty after trimming, including the
offending index/value in the error message so callers know which command entry
is invalid.
- Around line 92-96: The validate method on ContainerOptions currently only
checks KMSPluginName; update ContainerOptions.validate to also construct the
derived resource names (e.g., "kms-health-monitor-"+o.KMSPluginName and
"kms-plugin-socket-"+o.KMSPluginName and any other names used by this package)
and run the same DNS/length validation (use validation.IsDNS1123Label and/or
explicit length checks) on each derived name, returning a clear fmt.Errorf if
any derived name is invalid and naming which derived name failed; reference the
ContainerOptions.validate function and the KMSPluginName field when implementing
this additional validation so failures are caught before Pod creation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8730aa2e-bcd1-4d84-ab7e-e82b3679f449

📥 Commits

Reviewing files that changed from the base of the PR and between 90eaad7 and b0d6ae6.

📒 Files selected for processing (20)
  • examples/kms-health-kind/Dockerfile.fake-plugin
  • examples/kms-health-kind/Dockerfile.monitor
  • examples/kms-health-kind/Makefile
  • examples/kms-health-kind/bin/down.sh
  • examples/kms-health-kind/bin/up.sh
  • examples/kms-health-kind/cmd/monitor/main.go
  • examples/kms-health-kind/cmd/render-container/main.go
  • examples/kms-health-kind/fake-plugin/main.go
  • examples/kms-health-kind/manifests/encryption-config.yaml
  • examples/kms-health-kind/manifests/kind.yaml
  • examples/kms-health-kind/manifests/monitor-deployment.yaml
  • examples/kms-health-kind/manifests/namespace.yaml
  • examples/kms-health-kind/manifests/rbac.yaml
  • examples/kms-health-kind/manifests/static-pod.yaml
  • examples/kms-health-kind/verify.sh
  • pkg/operator/encryption/kms/health/assets/kms-health-container.yaml
  • pkg/operator/encryption/kms/health/bindata.go
  • pkg/operator/encryption/kms/health/container_template.go
  • pkg/operator/encryption/kms/health/container_template_test.go
  • pkg/operator/encryption/kms/health/writer_configmap.go
✅ Files skipped from review due to trivial changes (7)
  • examples/kms-health-kind/manifests/namespace.yaml
  • examples/kms-health-kind/manifests/kind.yaml
  • examples/kms-health-kind/manifests/encryption-config.yaml
  • pkg/operator/encryption/kms/health/bindata.go
  • examples/kms-health-kind/bin/down.sh
  • examples/kms-health-kind/cmd/render-container/main.go
  • examples/kms-health-kind/manifests/rbac.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • examples/kms-health-kind/Dockerfile.monitor
  • examples/kms-health-kind/Dockerfile.fake-plugin
  • examples/kms-health-kind/bin/up.sh
  • pkg/operator/encryption/kms/health/writer_configmap.go
  • examples/kms-health-kind/fake-plugin/main.go

@openshift-ci Bot commented Apr 30, 2026

@ibihim: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify-deps | b0d6ae6 | link | true | /test verify-deps |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

- do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
