Skip to content

fix(site-agent): fix flowgrpc CrashLoopBackOff on startup#2908

Merged
shayan1995 merged 1 commit into
NVIDIA:mainfrom
shayan1995:fix/flowgrpc-site-agent-startup
Jun 26, 2026
Merged

fix(site-agent): fix flowgrpc CrashLoopBackOff on startup#2908
shayan1995 merged 1 commit into
NVIDIA:mainfrom
shayan1995:fix/flowgrpc-site-agent-startup

Conversation

@shayan1995

Copy link
Copy Markdown
Contributor

Fix nico-rest-site-agent permanent CrashLoopBackOff when Flow gRPC is enabled.

Related issues

Followup to #2840 (same pattern, different package).

Type of Change

  • Fix - Bug fixes

Breaking Changes

  • This PR contains breaking changes

Testing

  • Manual testing performed — dev6 bring-up: site-agent no longer crash-loops, Flow gRPC connection succeeds.

Additional Notes

Three stacked bugs fixed in one commit:

  1. setup.sh ordering — Flow (7i) deployed after site-agent (7h), so flow.flow.svc.cluster.local:50051 didn't exist at startup. Swapped order: Flow is now 7h, site-agent 7i. Also fixed --skip-flow early exit 0 that was aborting before site-agent deployed.
  2. MustRegister panic on retrymakeGrpcClientMetrics() and newWorkflowMetrics() called prometheus.MustRegister on every CreateGrpcClient retry, panicking on the second attempt. Replaced with Register + AlreadyRegisteredError recovery (same fix as fix(site-agent): mount Core-gRPC certs where the binary reads them + … #2840 applied to coregrpc, now applied to flowgrpc).
  3. MustRegister in Init() — same issue for the MetricFlowStatus GaugeFunc.

@shayan1995 shayan1995 requested review from a team as code owners June 26, 2026 03:48
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47956b3f-f52e-4fab-ae6f-a7da9b69b6b8

📥 Commits

Reviewing files that changed from the base of the PR and between a1a80c5 and eeda13d.

📒 Files selected for processing (3)
  • helm-prereqs/setup.sh
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go

Summary by CodeRabbit

  • New Features
    • Added a dedicated Flow deployment phase (Phase 7h) before the REST site-agent, including certificate/secret readiness checks.
    • Updated --skip-flow to skip only Flow installation while continuing with the remaining setup phases.
  • Bug Fixes
    • Improved resilience during startup retries by tolerating duplicate Prometheus metric registrations and reusing existing collectors when compatible.
  • Documentation
    • Updated setup phase numbering/labels, --skip-flow help text, and reordered the completion/health-check output to reflect the new execution flow.

Walkthrough

The PR adds a gated Flow deployment phase in helm-prereqs/setup.sh, renumbers the site-agent phase, and changes Flow gRPC metric registration to reuse existing Prometheus collectors on duplicate registration.

Changes

Flow setup phases

Layer / File(s) Summary
Flow deployment phase
helm-prereqs/setup.sh
The script adds the Phase 7h Flow block, waits for cert-manager, Vault, and ESO prerequisites, and installs the nico-flow chart.
Site-agent phase renumbering
helm-prereqs/setup.sh
The site-agent phase becomes Phase 7i, the Flow comment matches the new order, and the setup-complete output follows the updated path.

Flow gRPC Prometheus registration

Layer / File(s) Summary
Gauge registration fallback
rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
MetricFlowStatus now uses prometheus.Register with AlreadyRegisteredError handling instead of MustRegister.
Histogram collector reuse
rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
The gRPC client and workflow latency histograms reuse existing collectors when registration reports AlreadyRegisteredError.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2840 — Adjusts Flow-related Prometheus collector registration to tolerate duplicate registration and reuse existing collectors.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main fix: flowgrpc startup CrashLoopBackOff in site-agent.
Description check ✅ Passed The description directly matches the changeset and explains the setup, metrics, and retry fixes in detail.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@shayan1995 shayan1995 force-pushed the fix/flowgrpc-site-agent-startup branch from 24b2d8d to 9781cba Compare June 26, 2026 03:49
@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-26 03:51:17 UTC | Commit: 9781cba

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 859-888: The `_wait_for_secret` helper currently only checks that
a Secret exists, but it should also verify the required data keys for each
secret type. Update `_wait_for_secret` to accept expected key names and validate
them with `kubectl` after the Secret is found, then pass `token` for
`psm-vault-token` and `nsm-vault-token`, and `username`/`password` for the DB
credential secrets in the calls near the existing secret wait loops. Keep the
existing fast-fail behavior and make the error message identify the missing key
clearly so failures stay actionable.
- Around line 809-816: The Helm invocation in setup.sh is passing the
base64-encoded registry secret through argv via
imagePullSecret.dockerconfigjson, which exposes sensitive data in process args
and logs. Update the flow where NICO_FLOW_ARGS is built so helm no longer
receives the secret with --set; instead write the docker config to a 0600
temporary values file or use --set-file, or pre-create the Kubernetes Secret and
only pass its name. Keep the existing registry auth assembly logic around
_flow_docker_cfg and the imagePullSecret settings, but change how helm consumes
it.

In `@rest-api/site-agent/pkg/components/managers/flowgrpc/init.go`:
- Around line 39-44: The MetricFlowStatus registration logic in init() is
treating any prometheus.AlreadyRegisteredError as safe, which can hide a foreign
collector registration and leave flowgrpc health status unreported. Keep the
existing retry-safe behavior, but when prometheus.Register(gauge) returns
AlreadyRegisteredError, verify the existing collector matches the Flow status
gauge for elektra_site_agent_flow_grpc_health_status; if it does not, fail
startup with the same panic path instead of silently continuing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 388abc0a-00d0-4ff1-bdf1-c314b65ccf17

📥 Commits

Reviewing files that changed from the base of the PR and between 9e904ee and 9781cba.

📒 Files selected for processing (3)
  • helm-prereqs/setup.sh
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go

Comment thread helm-prereqs/setup.sh
Comment on lines +809 to +816
_flow_docker_cfg="$(printf '{"auths":{"%s":{"username":"%s","password":"%s"}}}' \
"${_flow_registry_server}" \
"${REGISTRY_PULL_USERNAME:-\$oauthtoken}" \
"${REGISTRY_PULL_SECRET}" | base64 | tr -d '\n')"
NICO_FLOW_ARGS+=(
--set "global.imagePullSecrets[0].name=image-pull-secret"
--set "imagePullSecret.dockerconfigjson=${_flow_docker_cfg}"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Do not pass the registry secret through Helm argv.

imagePullSecret.dockerconfigjson contains the base64-encoded REGISTRY_PULL_SECRET; passing it via --set exposes it through process arguments and possibly CI command traces. Prefer a 0600 temporary values file/--set-file, or pre-create the Kubernetes Secret and pass only its name to Helm.

As per path instructions, helm-prereqs/**: “Review prerequisite Helm resources and scripts for ... secret handling.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 809 - 816, The Helm invocation in
setup.sh is passing the base64-encoded registry secret through argv via
imagePullSecret.dockerconfigjson, which exposes sensitive data in process args
and logs. Update the flow where NICO_FLOW_ARGS is built so helm no longer
receives the secret with --set; instead write the docker config to a 0600
temporary values file or use --set-file, or pre-create the Kubernetes Secret and
only pass its name. Keep the existing registry auth assembly logic around
_flow_docker_cfg and the imagePullSecret settings, but change how helm consumes
it.

Source: Path instructions

Comment thread helm-prereqs/setup.sh
Comment on lines +859 to +888
_wait_for_secret() {
local _name="$1"
local _ns="$2"
local _hint="$3"
for _i in $(seq 1 24); do
if kubectl get secret "${_name}" -n "${_ns}" >/dev/null 2>&1; then
echo " ${_name} ready"
return 0
fi
echo " Waiting for ${_name} (${_i}/24)..."
sleep 5
done
echo "ERROR: Secret ${_name} did not appear in namespace ${_ns} within 120s."
echo " ${_hint}"
return 1
}

echo "Waiting for psm/nsm Vault tokens..."
for _s in psm-vault-token nsm-vault-token; do
_wait_for_secret "${_s}" "${NICO_FLOW_NAMESPACE}" \
"Provisioned by the flow-vault-tokens helm hook in nico-prereqs. Check 'kubectl logs -n nico-system job/flow-vault-tokens' and confirm helm-prereqs/values.yaml::flow.enabled=true."
done

echo "Waiting for flow/psm/nsm DB credentials..."
for _s in flow.nico.nico-pg-cluster.credentials \
psm.nico.nico-pg-cluster.credentials \
nsm.nico.nico-pg-cluster.credentials; do
_wait_for_secret "${_s}" "${NICO_FLOW_NAMESPACE}" \
"Synced by the flow-db-eso/psm-db-eso/nsm-db-eso ClusterExternalSecrets in nico-prereqs. Check 'kubectl describe clusterexternalsecret -A | grep flow' and confirm helm-prereqs/values.yaml::flow.enabled=true."
done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Validate required Secret keys, not just Secret existence.

The Flow chart reads token from the Vault token Secrets and username/password from each DB credential Secret; a Secret that exists but lacks those keys will still fail later during pod startup. Extend _wait_for_secret to check the expected keys so this phase preserves the intended fast, clear failure mode.

As per path instructions, helm-prereqs/**: “Review prerequisite Helm resources and scripts for ... clear failure messages.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 859 - 888, The `_wait_for_secret` helper
currently only checks that a Secret exists, but it should also verify the
required data keys for each secret type. Update `_wait_for_secret` to accept
expected key names and validate them with `kubectl` after the Secret is found,
then pass `token` for `psm-vault-token` and `nsm-vault-token`, and
`username`/`password` for the DB credential secrets in the calls near the
existing secret wait loops. Keep the existing fast-fail behavior and make the
error message identify the missing key clearly so failures stay actionable.

Source: Path instructions

Comment thread rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 285 6 26 102 7 144
machine-validation-runner 744 32 188 267 36 221
machine_validation 744 32 188 267 36 221
machine_validation-aarch64 744 32 188 267 36 221
nvmetal-carbide 744 32 188 267 36 221
TOTAL 3267 134 778 1176 151 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@shayan1995 shayan1995 force-pushed the fix/flowgrpc-site-agent-startup branch 2 times, most recently from 54672c1 to 538dca8 Compare June 26, 2026 18:25

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 819-845: The namespace ownership step in setup.sh can silently
steal Helm metadata because the kubectl annotate/label calls on namespace
${NICO_FLOW_NAMESPACE} use --overwrite unconditionally. Update the flow
pre-apply logic to check existing meta.helm.sh/release-name and
meta.helm.sh/release-namespace on the namespace before touching ownership
metadata, and refuse with a clear failure if it is already owned by a different
release. Keep the current annotate/label flow only when the namespace is unowned
or already owned by flow, and reference the existing kubectl annotate namespace
and kubectl label namespace commands in the setup.sh prereq block.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e518a538-f689-4859-bb80-72264cc8c96d

📥 Commits

Reviewing files that changed from the base of the PR and between 54672c1 and 538dca8.

📒 Files selected for processing (3)
  • helm-prereqs/setup.sh
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go

Comment thread helm-prereqs/setup.sh
Comment on lines +819 to +845
# Pre-apply Certificates so cert-manager can issue secrets before the pod schedules.
echo "Pre-applying flow Certificates (SPIFFE + Temporal client)..."
helm template flow "${NICO_FLOW_CHART}" \
"${NICO_FLOW_ARGS[@]}" \
--show-only templates/namespace.yaml | kubectl apply -f -
helm template flow "${NICO_FLOW_CHART}" \
"${NICO_FLOW_ARGS[@]}" \
--show-only templates/certificate.yaml | kubectl apply -f -
kubectl annotate certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl annotate certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl label certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
kubectl label certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite

# Annotate/label the namespace itself — the flow-vault-tokens-job (nico-prereqs
# helm hook) creates this namespace ahead of the flow release. Without Helm
# ownership metadata, helm install refuses to adopt it.
kubectl annotate namespace "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl label namespace "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Guard before overwriting namespace Helm ownership.

Line 841 uses --overwrite on cluster-scoped namespace ownership metadata. If flow already belongs to another Helm release, this silently steals ownership and can break uninstall/upgrade behavior. Refuse unless the namespace is unowned or already owned by flow.

Proposed guard
+    if kubectl get namespace "${NICO_FLOW_NAMESPACE}" >/dev/null 2>&1; then
+        _flow_ns_owner="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \
+            -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}')"
+        _flow_ns_owner_ns="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \
+            -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-namespace}')"
+        if [[ -n "${_flow_ns_owner}${_flow_ns_owner_ns}" ]] && \
+           [[ "${_flow_ns_owner}/${_flow_ns_owner_ns}" != "flow/${NICO_FLOW_NAMESPACE}" ]]; then
+            echo "ERROR: namespace ${NICO_FLOW_NAMESPACE} is already owned by Helm release ${_flow_ns_owner}/${_flow_ns_owner_ns}; refusing to overwrite ownership metadata."
+            exit 1
+        fi
+    fi
+
     # Pre-apply Certificates so cert-manager can issue secrets before the pod schedules.

As per path instructions, helm-prereqs/**: “Review prerequisite Helm resources and scripts for install ordering, cluster-scope permissions, secret handling, idempotency, and clear failure messages.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Pre-apply Certificates so cert-manager can issue secrets before the pod schedules.
echo "Pre-applying flow Certificates (SPIFFE + Temporal client)..."
helm template flow "${NICO_FLOW_CHART}" \
"${NICO_FLOW_ARGS[@]}" \
--show-only templates/namespace.yaml | kubectl apply -f -
helm template flow "${NICO_FLOW_CHART}" \
"${NICO_FLOW_ARGS[@]}" \
--show-only templates/certificate.yaml | kubectl apply -f -
kubectl annotate certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl annotate certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl label certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
kubectl label certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
# Annotate/label the namespace itself — the flow-vault-tokens-job (nico-prereqs
# helm hook) creates this namespace ahead of the flow release. Without Helm
# ownership metadata, helm install refuses to adopt it.
kubectl annotate namespace "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl label namespace "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
if kubectl get namespace "${NICO_FLOW_NAMESPACE}" >/dev/null 2>&1; then
_flow_ns_owner="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \
-o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}')"
_flow_ns_owner_ns="$(kubectl get namespace "${NICO_FLOW_NAMESPACE}" \
-o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-namespace}')"
if [[ -n "${_flow_ns_owner}${_flow_ns_owner_ns}" ]] && \
[[ "${_flow_ns_owner}/${_flow_ns_owner_ns}" != "flow/${NICO_FLOW_NAMESPACE}" ]]; then
echo "ERROR: namespace ${NICO_FLOW_NAMESPACE} is already owned by Helm release ${_flow_ns_owner}/${_flow_ns_owner_ns}; refusing to overwrite ownership metadata."
exit 1
fi
fi
# Pre-apply Certificates so cert-manager can issue secrets before the pod schedules.
echo "Pre-applying flow Certificates (SPIFFE + Temporal client)..."
helm template flow "${NICO_FLOW_CHART}" \
"${NICO_FLOW_ARGS[@]}" \
--show-only templates/namespace.yaml | kubectl apply -f -
helm template flow "${NICO_FLOW_CHART}" \
"${NICO_FLOW_ARGS[@]}" \
--show-only templates/certificate.yaml | kubectl apply -f -
kubectl annotate certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl annotate certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl label certificate/flow-certificate -n "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
kubectl label certificate/temporal-client-certs -n "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
# Annotate/label the namespace itself — the flow-vault-tokens-job (nico-prereqs
# helm hook) creates this namespace ahead of the flow release. Without Helm
# ownership metadata, helm install refuses to adopt it.
kubectl annotate namespace "${NICO_FLOW_NAMESPACE}" \
"meta.helm.sh/release-name=flow" \
"meta.helm.sh/release-namespace=${NICO_FLOW_NAMESPACE}" --overwrite
kubectl label namespace "${NICO_FLOW_NAMESPACE}" \
"app.kubernetes.io/managed-by=Helm" --overwrite
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 819 - 845, The namespace ownership step
in setup.sh can silently steal Helm metadata because the kubectl annotate/label
calls on namespace ${NICO_FLOW_NAMESPACE} use --overwrite unconditionally.
Update the flow pre-apply logic to check existing meta.helm.sh/release-name and
meta.helm.sh/release-namespace on the namespace before touching ownership
metadata, and refuse with a clear failure if it is already owned by a different
release. Keep the current annotate/label flow only when the namespace is unowned
or already owned by flow, and reference the existing kubectl annotate namespace
and kubectl label namespace commands in the setup.sh prereq block.

Source: Path instructions

@shayan1995 shayan1995 force-pushed the fix/flowgrpc-site-agent-startup branch from 538dca8 to a1a80c5 Compare June 26, 2026 19:00
Three stacked bugs caused nico-rest-site-agent to crash-loop permanently
when Flow gRPC was enabled:

1. setup.sh phase ordering: site-agent deployed in 7h before Flow in 7i,
   so flow.flow.svc.cluster.local:50051 didn't exist when site-agent
   connected. Fix: deploy Flow first (now 7h), site-agent after (now 7i).
   Also fixes --skip-flow: the old early `exit 0` aborted before
   site-agent deployed; replaced with an if/fi wrapper.

2. Non-idempotent metrics registration: CreateGrpcClient is retried in a
   loop, and makeGrpcClientMetrics()/newWorkflowMetrics() called
   prometheus.MustRegister on every attempt — panicking on the second
   call. Fix: use Register + AlreadyRegisteredError recovery to reuse the
   existing collector on retry.

3. Same MustRegister panic in Init() for the MetricFlowStatus GaugeFunc,
   which would fire if Init() were called more than once. Fixed with the
   same pattern.

Type assertions on AlreadyRegisteredError.ExistingCollector use the
two-value form to produce a clear panic message on a wrong-type collision
instead of a confusing nil-pointer error.
@shayan1995 shayan1995 force-pushed the fix/flowgrpc-site-agent-startup branch from a1a80c5 to eeda13d Compare June 26, 2026 19:06
@shayan1995 shayan1995 enabled auto-merge (squash) June 26, 2026 19:34
@shayan1995 shayan1995 merged commit 1fd0034 into NVIDIA:main Jun 26, 2026
113 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants