Commit c6cc37b

Harden helm-prereqs/setup.sh so that Temporal namespace setup fails loudly instead of silently continuing into worker CrashLoops.
Replace the fixed default REST site UUID with a generated UUID when `NICO_SITE_UUID` is unset, and remove the unused `nico-rest-site-agent` values block from `helm-prereqs/values/nico-rest.yaml`, because the site-agent is deployed from its standalone chart values file.

Signed-off-by: Parham Armani <parmani@nvidia.com>
1 parent 0b8770e commit c6cc37b

5 files changed

Lines changed: 90 additions & 58 deletions

book/src/configuration/configurability.md

Lines changed: 1 addition & 1 deletion
@@ -426,7 +426,7 @@ values files.
 | `REGISTRY_PULL_SECRET` | Raw registry API key | **Raw key string** (e.g. `nvapi-...`). Not a file path. Not a JSON dockerconfig. |
 | `REGISTRY_PULL_USERNAME` | Registry username | Defaults to `$oauthtoken` (correct for `nvcr.io`) |
 | `KUBECONFIG` | Cluster kubeconfig | Filesystem path |
-| `NICO_SITE_UUID` | Stable UUID for this site | UUIDv4. Defaults to a fixed dev UUID — override per real site. |
+| `NICO_SITE_UUID` | Stable UUID for this site | UUIDv4. If unset, `setup.sh` generates a random UUID each run. |
 | `PREFLIGHT_CHECK_IMAGE` | Image for per-node preflight checks | Defaults to `busybox:1.36`. Override for air-gapped clusters. |

 Inside the cluster, `nico-api` discovers Vault, Postgres, and SPIFFE settings

docs/getting-started/quick-start.md

Lines changed: 2 additions & 2 deletions
@@ -75,7 +75,7 @@ Obtain an NGC API key at [ngc.nvidia.com](https://ngc.nvidia.com) → **API Keys
 | `NICO_CORE_IMAGE_TAG` | **Yes** | NICo Core image tag (e.g. `v2025.12.30`). |
 | `NICO_REST_IMAGE_TAG` | **Yes** | NICo REST image tag (e.g. `v1.0.4`). |
 | `KUBECONFIG` | **Yes** | Path to your cluster kubeconfig. |
-| `NICO_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. |
+| `NICO_SITE_UUID` | No | Stable UUID for this site. If unset, `setup.sh` generates a random UUID each run. |

 ### 3b. Set your Site Name

@@ -231,7 +231,7 @@ All IPs must be within the `IPAddressPool` ranges you defined in `values/metallb

 ### 3i. (Optional) Set a Stable Site UUID

-If you want a specific site UUID instead of the default placeholder, set the `NICO_SITE_UUID` environment variable:
+If you want a specific site UUID instead of a random UUID generated by `setup.sh`, set the `NICO_SITE_UUID` environment variable:

 ```bash
 export NICO_SITE_UUID=<your-uuid> # must be a valid UUID v4

helm-prereqs/README.md

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ The tables below summarize the keys that must be set per site.
 | `NICO_IMAGE_REGISTRY` | Yes, unless `--skip-core --skip-rest` | Base image registry for all NICo images (e.g. `my-registry.example.com/nico`) |
 | `NICO_CORE_IMAGE_TAG` | Yes, unless `--skip-core` | NICo Core image tag (e.g. `v2025.12.30-rc1`) |
 | `NICO_REST_IMAGE_TAG` | Yes, unless `--skip-rest` | NICo REST image tag (e.g. `v1.0.4`) |
-| `NICO_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. |
+| `NICO_SITE_UUID` | No | Stable UUID for this site. If unset, `setup.sh` generates a random UUID each run. |
 | `NICO_MANAGE_DEFAULT_STORAGE_CLASS` | No | Whether `setup.sh` marks `local-path` as the default StorageClass. Defaults to `true`. Set to `false` when the cluster already has an operator-managed default StorageClass. |
 | `NICO_STORAGE_CLASS` | No | StorageClass used by Vault data/audit PVCs. Defaults to `local-path-persistent`. |
 | `PREFLIGHT_CHECK_IMAGE` | No | Image used for preflight per-node checks. Defaults to `busybox:1.36`; set to a local mirror for air-gapped clusters. |
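
Because an unset `NICO_SITE_UUID` now yields a different UUID on every run, an operator who wants a stable value needs to generate it once and reuse it. The following is a minimal sketch of one way to do that. The `load_or_create_site_uuid` helper and the state-file path are hypothetical and are not part of `setup.sh`; only the `python3` one-liner is taken from the commit.

```bash
# Hypothetical helper, not part of setup.sh: generate a site UUID once and
# reuse it on later runs so the value stays stable.
load_or_create_site_uuid() {
    local file="$1"
    if [ ! -s "${file}" ]; then
        # Same generator the commit adds to setup.sh.
        python3 -c 'import uuid; print(uuid.uuid4())' > "${file}" || return 1
    fi
    cat "${file}"
}

# Example usage (the path is an assumption chosen for illustration):
#   export NICO_SITE_UUID="$(load_or_create_site_uuid "${HOME}/.nico-site-uuid")"
```

The helper only writes the file when it is missing or empty, so repeated calls return the first generated value.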

helm-prereqs/setup.sh

Lines changed: 86 additions & 12 deletions
@@ -37,8 +37,8 @@
 # preloaded, or use existing imagePullSecrets.
 # REGISTRY_PULL_USERNAME Username for generated pull secrets.
 # Default: $oauthtoken
-# NICO_SITE_UUID Stable REST site UUID. Used only when REST is
-# deployed. Default is a dev placeholder.
+# NICO_SITE_UUID REST site UUID. Used only when REST is deployed.
+# If unset, setup generates a random UUID each run.
 # NICO_MANAGE_DEFAULT_STORAGE_CLASS
 # Whether setup annotates local-path as the default
 # StorageClass. Default: true.
@@ -629,13 +629,82 @@ _TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/t
   --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
   --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
   --tls-server-name interservice.server.temporal.local"
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+_wait_for_temporal() {
+    local _output=""
+
+    echo "Waiting for Temporal frontend and admin tools..."
+    kubectl rollout status deploy/temporal-frontend -n temporal --timeout=120s
+    kubectl rollout status deploy/temporal-admintools -n temporal --timeout=120s
+
+    for _i in $(seq 1 24); do
+        if _output="$(kubectl exec -n temporal deploy/temporal-admintools -- \
+            sh -c "temporal operator namespace list --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>&1)"; then
+            echo "Temporal frontend ready"
+            return
+        fi
+        echo " Waiting for Temporal API (${_i}/24)..."
+        sleep 5
+    done
+
+    echo "ERROR: Temporal frontend is not ready for namespace operations" >&2
+    echo "${_output}" >&2
+    exit 1
+}
+
+_create_temporal_namespace() {
+    local _namespace="$1"
+    local _output
+
+    if _output="$(kubectl exec -n temporal deploy/temporal-admintools -- \
+        sh -c "temporal operator namespace create -n \"\$1\" --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" \
+        sh "${_namespace}" 2>&1)"; then
+        echo "Temporal namespace ${_namespace} ready"
+        return
+    fi
+
+    if printf "%s" "${_output}" | grep -qi "already exists"; then
+        echo "Temporal namespace ${_namespace} already exists"
+        return
+    fi
+
+    echo "ERROR: failed to create Temporal namespace ${_namespace}" >&2
+    echo "${_output}" >&2
+    exit 1
+}
+
+_verify_temporal_namespaces() {
+    local _output
+    local _missing=()
+    local _namespace
+
+    if ! _output="$(kubectl exec -n temporal deploy/temporal-admintools -- \
+        sh -c "temporal operator namespace list --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>&1)"; then
+        echo "ERROR: failed to list Temporal namespaces" >&2
+        echo "${_output}" >&2
+        exit 1
+    fi
+
+    for _namespace in "$@"; do
+        if ! printf "%s" "${_output}" | grep -Eq "(^|[^[:alnum:]_-])${_namespace}([^[:alnum:]_-]|$)"; then
+            _missing+=("${_namespace}")
+        fi
+    done
+
+    if [[ ${#_missing[@]} -gt 0 ]]; then
+        echo "ERROR: missing Temporal namespace(s): ${_missing[*]}" >&2
+        echo "${_output}" >&2
+        exit 1
+    fi
+
+    echo "Verified Temporal namespaces: $*"
+}
+
+_wait_for_temporal
+_create_temporal_namespace cloud
+_create_temporal_namespace site
 # flow Temporal namespace — required by NICo Flow workers; pod panics on startup if absent.
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n flow --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+_create_temporal_namespace flow
+_verify_temporal_namespaces cloud site flow
 echo "Temporal namespaces ready"

 _SETUP_PHASE="[7g/7] NICo REST helm chart"
@@ -710,8 +779,13 @@ fi
 # All of this is wired via --set flags so nico-rest.yaml stays registry-agnostic.
 NICO_SITE_AGENT_CHART="${NICO_REST_HELM_DIR}/nico-rest-site-agent"

-# Stable placeholder UUID for this site (must be a valid UUID).
-NICO_SITE_UUID="${NICO_SITE_UUID:-a1b2c3d4-e5f6-4000-8000-000000000001}"
+if [[ -z "${NICO_SITE_UUID:-}" ]]; then
+    if ! command -v python3 &>/dev/null; then
+        echo "ERROR: NICO_SITE_UUID is unset and python3 is not available" >&2
+        exit 1
+    fi
+    NICO_SITE_UUID="$(python3 -c 'import uuid; print(uuid.uuid4())')"
+fi

 NICO_SITE_AGENT_ARGS=(
     --namespace nico-rest
@@ -762,8 +836,8 @@ _TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/t
   --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
   --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
   --tls-server-name interservice.server.temporal.local"
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n '${NICO_SITE_UUID}' --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+_create_temporal_namespace "${NICO_SITE_UUID}"
+_verify_temporal_namespaces "${NICO_SITE_UUID}"
 echo "Temporal namespace ready"

 # FLOW_GRPC_ENABLED toggles the site-agent's Flow gRPC client (see
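
`_verify_temporal_namespaces` matches each name as a whole token, so `site` is not satisfied by a longer name such as `site-staging`. The sketch below isolates that matching rule. The `namespace_listed` wrapper and the sample listing text are hypothetical; the real `temporal operator namespace list` output format is not shown in this commit, and only the `grep -Eq` pattern is taken from the diff.

```bash
# Hypothetical wrapper around the boundary-aware pattern used in the diff.
# Reads a listing on stdin; succeeds only if NAME appears as a whole token,
# i.e. not directly adjacent to a letter, digit, underscore, or hyphen.
namespace_listed() {
    local name="$1"
    grep -Eq "(^|[^[:alnum:]_-])${name}([^[:alnum:]_-]|$)"
}

# Made-up sample listing, for illustration only.
sample_listing='Name: cloud
Name: site-staging'

if printf '%s\n' "${sample_listing}" | namespace_listed cloud; then
    echo "cloud: listed"
fi
if ! printf '%s\n' "${sample_listing}" | namespace_listed site; then
    echo "site: missing (site-staging does not count)"
fi
```

The name is interpolated into the pattern unescaped. That is harmless for `cloud`, `site`, `flow`, and UUIDs, but a name containing regular-expression metacharacters would need escaping first.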

helm-prereqs/values/nico-rest.yaml

Lines changed: 0 additions & 42 deletions
@@ -38,45 +38,3 @@ nico-rest-workflow:
 replicaCount: 3
 siteWorker:
 replicaCount: 3
-
-# Site-agent config — v1.0.4 binary reads DB config from env vars.
-# NICo postgres uses the 'nico' user and 'elektratest' database.
-# CLUSTER_ID and TEMPORAL_SUBSCRIBE_* are set via --set in setup.sh
-# using the NICO_SITE_UUID variable (default: a1b2c3d4-e5f6-4000-8000-000000000001).
-nico-rest-site-agent:
-replicaCount: 3
-bootstrap:
-enabled: true
-siteManager:
-address: "nico-rest-site-manager.nico-rest:8100"
-certificate:
-# Service identifier must match "elektra-site-agent" for nico-api's SiteAgent RBAC role.
-# The base path /nico-system/sa/ is one of nico-api's recognized spiffe_service_base_paths.
-uris:
-- "spiffe://nico.local/nico-system/sa/elektra-site-agent"
-envConfig:
-# DEV ONLY — these values match the dev postgres instance deployed by setup.sh.
-# DB_USER and DB_PASSWORD are injected from the db-creds Secret (secrets.dbCreds).
-DB_ADDR: "postgres.postgres.svc.cluster.local"
-DB_DATABASE: "elektratest"
-DB_PORT: "5432"
-ESA_PORT: "8080"
-METRICS_PORT: "2112"
-DEV_MODE: "true"
-ENABLE_DEBUG: "true"
-ENABLE_TLS: "true"
-# mTLS to nico-api (NICO_SEC_OPT=2). Cert issued from vault-nico-issuer
-# so nico-api trusts it (same Vault PKI CA as nico-api's own cert).
-NICO_ADDRESS: "nico-api.nico-system.svc.cluster.local:1079"
-NICO_SEC_OPT: "2"
-CLUSTER_ID: "a1b2c3d4-e5f6-4000-8000-000000000001"
-TEMPORAL_HOST: "temporal-frontend.temporal"
-TEMPORAL_PORT: "7233"
-TEMPORAL_SERVER: "interservice.server.temporal.local"
-TEMPORAL_PUBLISH_NAMESPACE: "site"
-TEMPORAL_PUBLISH_QUEUE: "site"
-TEMPORAL_SUBSCRIBE_NAMESPACE: "a1b2c3d4-e5f6-4000-8000-000000000001"
-TEMPORAL_SUBSCRIBE_QUEUE: "site"
-TEMPORAL_INVENTORY_SCHEDULE: "@every 3m"
-TEMPORAL_CERT_PATH: "/etc/temporal-certs"
-TEMPORAL_CERT: "temporal-client-site-agent-certs"