fix(site-agent): fix flowgrpc CrashLoopBackOff on startup#2908
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
Summary by CodeRabbit
WalkthroughThe PR adds a gated Flow deployment phase in ChangesFlow setup phases
Flow gRPC Prometheus registration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
24b2d8d to
9781cba
Compare
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-26 03:51:17 UTC | Commit: 9781cba |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 859-888: The `_wait_for_secret` helper currently only checks that
a Secret exists, but it should also verify the required data keys for each
secret type. Update `_wait_for_secret` to accept expected key names and validate
them with `kubectl` after the Secret is found, then pass `token` for
`psm-vault-token` and `nsm-vault-token`, and `username`/`password` for the DB
credential secrets in the calls near the existing secret wait loops. Keep the
existing fast-fail behavior and make the error message identify the missing key
clearly so failures stay actionable.
- Around line 809-816: The Helm invocation in setup.sh is passing the
base64-encoded registry secret through argv via
imagePullSecret.dockerconfigjson, which exposes sensitive data in process args
and logs. Update the flow where NICO_FLOW_ARGS is built so helm no longer
receives the secret with --set; instead write the docker config to a 0600
temporary values file or use --set-file, or pre-create the Kubernetes Secret and
only pass its name. Keep the existing registry auth assembly logic around
_flow_docker_cfg and the imagePullSecret settings, but change how helm consumes
it.
In `@rest-api/site-agent/pkg/components/managers/flowgrpc/init.go`:
- Around line 39-44: The MetricFlowStatus registration logic in init() is
treating any prometheus.AlreadyRegisteredError as safe, which can hide a foreign
collector registration and leave flowgrpc health status unreported. Keep the
existing retry-safe behavior, but when prometheus.Register(gauge) returns
AlreadyRegisteredError, verify the existing collector matches the Flow status
gauge for elektra_site_agent_flow_grpc_health_status; if it does not, fail
startup with the same panic path instead of silently continuing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 388abc0a-00d0-4ff1-bdf1-c314b65ccf17
📒 Files selected for processing (3)
helm-prereqs/setup.shrest-api/site-agent/pkg/components/managers/flowgrpc/init.gorest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
| _flow_docker_cfg="$(printf '{"auths":{"%s":{"username":"%s","password":"%s"}}}' \ | ||
| "${_flow_registry_server}" \ | ||
| "${REGISTRY_PULL_USERNAME:-\$oauthtoken}" \ | ||
| "${REGISTRY_PULL_SECRET}" | base64 | tr -d '\n')" | ||
| NICO_FLOW_ARGS+=( | ||
| --set "global.imagePullSecrets[0].name=image-pull-secret" | ||
| --set "imagePullSecret.dockerconfigjson=${_flow_docker_cfg}" | ||
| ) |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Do not pass the registry secret through Helm argv.
imagePullSecret.dockerconfigjson contains the base64-encoded REGISTRY_PULL_SECRET; passing it via --set exposes it through process arguments and possibly CI command traces. Prefer a 0600 temporary values file/--set-file, or pre-create the Kubernetes Secret and pass only its name to Helm.
As per path instructions, helm-prereqs/**: “Review prerequisite Helm resources and scripts for ... secret handling.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm-prereqs/setup.sh` around lines 809 - 816, The Helm invocation in
setup.sh is passing the base64-encoded registry secret through argv via
imagePullSecret.dockerconfigjson, which exposes sensitive data in process args
and logs. Update the flow where NICO_FLOW_ARGS is built so helm no longer
receives the secret with --set; instead write the docker config to a 0600
temporary values file or use --set-file, or pre-create the Kubernetes Secret and
only pass its name. Keep the existing registry auth assembly logic around
_flow_docker_cfg and the imagePullSecret settings, but change how helm consumes
it.
Source: Path instructions
| _wait_for_secret() { | ||
| local _name="$1" | ||
| local _ns="$2" | ||
| local _hint="$3" | ||
| for _i in $(seq 1 24); do | ||
| if kubectl get secret "${_name}" -n "${_ns}" >/dev/null 2>&1; then | ||
| echo " ${_name} ready" | ||
| return 0 | ||
| fi | ||
| echo " Waiting for ${_name} (${_i}/24)..." | ||
| sleep 5 | ||
| done | ||
| echo "ERROR: Secret ${_name} did not appear in namespace ${_ns} within 120s." | ||
| echo " ${_hint}" | ||
| return 1 | ||
| } | ||
|
|
||
| echo "Waiting for psm/nsm Vault tokens..." | ||
| for _s in psm-vault-token nsm-vault-token; do | ||
| _wait_for_secret "${_s}" "${NICO_FLOW_NAMESPACE}" \ | ||
| "Provisioned by the flow-vault-tokens helm hook in nico-prereqs. Check 'kubectl logs -n nico-system job/flow-vault-tokens' and confirm helm-prereqs/values.yaml::flow.enabled=true." | ||
| done | ||
|
|
||
| echo "Waiting for flow/psm/nsm DB credentials..." | ||
| for _s in flow.nico.nico-pg-cluster.credentials \ | ||
| psm.nico.nico-pg-cluster.credentials \ | ||
| nsm.nico.nico-pg-cluster.credentials; do | ||
| _wait_for_secret "${_s}" "${NICO_FLOW_NAMESPACE}" \ | ||
| "Synced by the flow-db-eso/psm-db-eso/nsm-db-eso ClusterExternalSecrets in nico-prereqs. Check 'kubectl describe clusterexternalsecret -A | grep flow' and confirm helm-prereqs/values.yaml::flow.enabled=true." | ||
| done |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Validate required Secret keys, not just Secret existence.
The Flow chart reads token from the Vault token Secrets and username/password from each DB credential Secret; a Secret that exists but lacks those keys will still fail later during pod startup. Extend _wait_for_secret to check the expected keys so this phase preserves the intended fast, clear failure mode.
As per path instructions, helm-prereqs/**: “Review prerequisite Helm resources and scripts for ... clear failure messages.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm-prereqs/setup.sh` around lines 859 - 888, The `_wait_for_secret` helper
currently only checks that a Secret exists, but it should also verify the
required data keys for each secret type. Update `_wait_for_secret` to accept
expected key names and validate them with `kubectl` after the Secret is found,
then pass `token` for `psm-vault-token` and `nsm-vault-token`, and
`username`/`password` for the DB credential secrets in the calls near the
existing secret wait loops. Keep the existing fast-fail behavior and make the
error message identify the missing key clearly so failures stay actionable.
Source: Path instructions
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
54672c1 to
538dca8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 819-845: The namespace ownership step in setup.sh can silently
steal Helm metadata because the kubectl annotate/label calls on namespace
${NICO_FLOW_NAMESPACE} use --overwrite unconditionally. Update the flow
pre-apply logic to check existing meta.helm.sh/release-name and
meta.helm.sh/release-namespace on the namespace before touching ownership
metadata, and refuse with a clear failure if it is already owned by a different
release. Keep the current annotate/label flow only when the namespace is unowned
or already owned by flow, and reference the existing kubectl annotate namespace
and kubectl label namespace commands in the setup.sh prereq block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e518a538-f689-4859-bb80-72264cc8c96d
📒 Files selected for processing (3)
helm-prereqs/setup.shrest-api/site-agent/pkg/components/managers/flowgrpc/init.gorest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
🚧 Files skipped from review as they are similar to previous changes (2)
- rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
- rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
| # Pre-apply Certificates so cert-manager can issue secrets before the pod schedules. | ||
| echo "Pre-applying flow Certificates (SPIFFE + Temporal client)..." | ||
| helm template flow "${NICO_FLOW_CHART}" \ | ||
| "${NICO_FLOW_ARGS[@]}" \ | ||
| --show-only templates/namespace.yaml | kubectl apply -f - | ||
| helm template flow "${NICO_FLOW_CHART}" \ | ||
| "${NICO_FLOW_ARGS[@]}" \ | ||
| --show-only templates/certificate.yaml | kubectl apply -f - | ||
| kubectl annotate certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \ | ||
| "meta.helm.sh/release-name=flow" \ | ||
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | ||
| kubectl annotate certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \ | ||
| "meta.helm.sh/release-name=flow" \ | ||
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | ||
| kubectl label certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \ | ||
| "app.kubernetes.io/managed-by=Helm" --overwrite | ||
| kubectl label certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \ | ||
| "app.kubernetes.io/managed-by=Helm" --overwrite | ||
|
|
||
| # Annotate/label the namespace itself — the flow-vault-tokens-job (nico-prereqs | ||
| # helm hook) creates this namespace ahead of the flow release. Without Helm | ||
| # ownership metadata, helm install refuses to adopt it. | ||
| kubectl annotate namespace "${NICO_FLOW_NAMESPACE}" \ | ||
| "meta.helm.sh/release-name=flow" \ | ||
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | ||
| kubectl label namespace "${NICO_FLOW_NAMESPACE}" \ | ||
| "app.kubernetes.io/managed-by=Helm" --overwrite |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Guard before overwriting namespace Helm ownership.
Line 841 uses --overwrite on cluster-scoped namespace ownership metadata. If flow already belongs to another Helm release, this silently steals ownership and can break uninstall/upgrade behavior. Refuse unless the namespace is unowned or already owned by flow.
Proposed guard
+ if kubectl get namespace "${NICO_FLOW_NAMESPACE}" >/dev/null 2>&1; then
+ _flow_ns_owner="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \
+ -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}')"
+ _flow_ns_owner_ns="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \
+ -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-namespace}')"
+ if [[ -n "${_flow_ns_owner}${_flow_ns_owner_ns}" ]] && \
+ [[ "${_flow_ns_owner}/${_flow_ns_owner_ns}" != "flow/${NICO_FLOW_NAMESPACE}" ]]; then
+ echo "ERROR: namespace ${NICO_FLOW_NAMESPACE} is already owned by Helm release ${_flow_ns_owner}/${_flow_ns_owner_ns}; refusing to overwrite ownership metadata."
+ exit 1
+ fi
+ fi
+
# Pre-apply Certificates so cert-manager can issue secrets before the pod schedules.As per path instructions, helm-prereqs/**: “Review prerequisite Helm resources and scripts for install ordering, cluster-scope permissions, secret handling, idempotency, and clear failure messages.”
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Pre-apply Certificates so cert-manager can issue secrets before the pod schedules. | |
| echo "Pre-applying flow Certificates (SPIFFE + Temporal client)..." | |
| helm template flow "${NICO_FLOW_CHART}" \ | |
| "${NICO_FLOW_ARGS[@]}" \ | |
| --show-only templates/namespace.yaml | kubectl apply -f - | |
| helm template flow "${NICO_FLOW_CHART}" \ | |
| "${NICO_FLOW_ARGS[@]}" \ | |
| --show-only templates/certificate.yaml | kubectl apply -f - | |
| kubectl annotate certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \ | |
| "meta.helm.sh/release-name=flow" \ | |
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | |
| kubectl annotate certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \ | |
| "meta.helm.sh/release-name=flow" \ | |
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | |
| kubectl label certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \ | |
| "app.kubernetes.io/managed-by=Helm" --overwrite | |
| kubectl label certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \ | |
| "app.kubernetes.io/managed-by=Helm" --overwrite | |
| # Annotate/label the namespace itself — the flow-vault-tokens-job (nico-prereqs | |
| # helm hook) creates this namespace ahead of the flow release. Without Helm | |
| # ownership metadata, helm install refuses to adopt it. | |
| kubectl annotate namespace "${NICO_FLOW_NAMESPACE}" \ | |
| "meta.helm.sh/release-name=flow" \ | |
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | |
| kubectl label namespace "${NICO_FLOW_NAMESPACE}" \ | |
| "app.kubernetes.io/managed-by=Helm" --overwrite | |
| if kubectl get namespace "${NICO_FLOW_NAMESPACE}" >/dev/null 2>&1; then | |
| _flow_ns_owner="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \ | |
| -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}')" | |
| _flow_ns_owner_ns="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \ | |
| -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-namespace}')" | |
| if [[ -n "${_flow_ns_owner}${_flow_ns_owner_ns}" ]] && \ | |
| [[ "${_flow_ns_owner}/${_flow_ns_owner_ns}" != "flow/${NICO_FLOW_NAMESPACE}" ]]; then | |
| echo "ERROR: namespace ${NICO_FLOW_NAMESPACE} is already owned by Helm release ${_flow_ns_owner}/${_flow_ns_owner_ns}; refusing to overwrite ownership metadata." | |
| exit 1 | |
| fi | |
| fi | |
| # Pre-apply Certificates so cert-manager can issue secrets before the pod schedules. | |
| echo "Pre-applying flow Certificates (SPIFFE + Temporal client)..." | |
| helm template flow "${NICO_FLOW_CHART}" \ | |
| "${NICO_FLOW_ARGS[@]}" \ | |
| --show-only templates/namespace.yaml | kubectl apply -f - | |
| helm template flow "${NICO_FLOW_CHART}" \ | |
| "${NICO_FLOW_ARGS[@]}" \ | |
| --show-only templates/certificate.yaml | kubectl apply -f - | |
| kubectl annotate certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \ | |
| "meta.helm.sh/release-name=flow" \ | |
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | |
| kubectl annotate certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \ | |
| "meta.helm.sh/release-name=flow" \ | |
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | |
| kubectl label certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \ | |
| "app.kubernetes.io/managed-by=Helm" --overwrite | |
| kubectl label certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \ | |
| "app.kubernetes.io/managed-by=Helm" --overwrite | |
| # Annotate/label the namespace itself — the flow-vault-tokens-job (nico-prereqs | |
| # helm hook) creates this namespace ahead of the flow release. Without Helm | |
| # ownership metadata, helm install refuses to adopt it. | |
| kubectl annotate namespace "${NICO_FLOW_NAMESPACE}" \ | |
| "meta.helm.sh/release-name=flow" \ | |
| "meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite | |
| kubectl label namespace "${NICO_FLOW_NAMESPACE}" \ | |
| "app.kubernetes.io/managed-by=Helm" --overwrite |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm-prereqs/setup.sh` around lines 819 - 845, The namespace ownership step
in setup.sh can silently steal Helm metadata because the kubectl annotate/label
calls on namespace ${NICO_FLOW_NAMESPACE} use --overwrite unconditionally.
Update the flow pre-apply logic to check existing meta.helm.sh/release-name and
meta.helm.sh/release-namespace on the namespace before touching ownership
metadata, and refuse with a clear failure if it is already owned by a different
release. Keep the current annotate/label flow only when the namespace is unowned
or already owned by flow, and reference the existing kubectl annotate namespace
and kubectl label namespace commands in the setup.sh prereq block.
Source: Path instructions
538dca8 to
a1a80c5
Compare
Three stacked bugs caused nico-rest-site-agent to crash-loop permanently when Flow gRPC was enabled: 1. setup.sh phase ordering: site-agent deployed in 7h before Flow in 7i, so flow.flow.svc.cluster.local:50051 didn't exist when site-agent connected. Fix: deploy Flow first (now 7h), site-agent after (now 7i). Also fixes --skip-flow: the old early `exit 0` aborted before site-agent deployed; replaced with an if/fi wrapper. 2. Non-idempotent metrics registration: CreateGrpcClient is retried in a loop, and makeGrpcClientMetrics()/newWorkflowMetrics() called prometheus.MustRegister on every attempt — panicking on the second call. Fix: use Register + AlreadyRegisteredError recovery to reuse the existing collector on retry. 3. Same MustRegister panic in Init() for the MetricFlowStatus GaugeFunc, which would fire if Init() were called more than once. Fixed with the same pattern. Type assertions on AlreadyRegisteredError.ExistingCollector use the two-value form to produce a clear panic message on a wrong-type collision instead of a confusing nil-pointer error.
a1a80c5 to
eeda13d
Compare
Fix
nico-rest-site-agentpermanent CrashLoopBackOff when Flow gRPC is enabled.Related issues
Followup to #2840 (same pattern, different package).
Type of Change
Breaking Changes
Testing
Additional Notes
Three stacked bugs fixed in one commit:
flow.flow.svc.cluster.local:50051didn't exist at startup. Swapped order: Flow is now 7h, site-agent 7i. Also fixed--skip-flowearlyexit 0that was aborting before site-agent deployed.MustRegisterpanic on retry —makeGrpcClientMetrics()andnewWorkflowMetrics()calledprometheus.MustRegisteron everyCreateGrpcClientretry, panicking on the second attempt. Replaced withRegister+AlreadyRegisteredErrorrecovery (same fix as fix(site-agent): mount Core-gRPC certs where the binary reads them + … #2840 applied tocoregrpc, now applied toflowgrpc).MustRegisterinInit()— same issue for theMetricFlowStatusGaugeFunc.