Merged
2 changes: 1 addition & 1 deletion book/src/configuration/configurability.md
@@ -547,7 +547,7 @@ values files.
| `REGISTRY_PULL_SECRET` | Raw registry API key | **Raw key string** (e.g. `nvapi-...`). Not a file path. Not a JSON dockerconfig. |
| `REGISTRY_PULL_USERNAME` | Registry username | Defaults to `$oauthtoken` (correct for `nvcr.io`) |
| `KUBECONFIG` | Cluster kubeconfig | Filesystem path |
-| `NICO_SITE_UUID` | Stable UUID for this site | UUIDv4. Defaults to a fixed dev UUID — override per real site. |
+| `NICO_SITE_UUID` | Stable UUID for this site | UUIDv4. If unset, `setup.sh` generates a random UUID each run. |
| `PREFLIGHT_CHECK_IMAGE` | Image for per-node preflight checks | Defaults to `busybox:1.36`. Override for air-gapped clusters. |
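
The `NICO_SITE_UUID` row above expects a UUIDv4, and nothing in this diff shows `setup.sh` validating an operator-supplied value. As a minimal sketch, a format check could look like the following. The helper name `_is_uuid_v4` is hypothetical and not part of the script.

```bash
# Hypothetical helper, not part of setup.sh: returns 0 when $1 has the
# RFC 4122 version-4 layout (version nibble 4, variant nibble 8, 9, a, or b).
_is_uuid_v4() {
    local _re='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$'
    [[ "$1" =~ ${_re} ]]
}

# Example: fail early when an explicitly set value is malformed.
if [[ -n "${NICO_SITE_UUID:-}" ]] && ! _is_uuid_v4 "${NICO_SITE_UUID}"; then
    echo "ERROR: NICO_SITE_UUID is not a valid UUID v4: ${NICO_SITE_UUID}" >&2
    exit 1
fi
```

The check only covers the textual format; it says nothing about whether the value matches a site that is already registered.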

Inside the cluster, `nico-api` discovers Vault, Postgres, and SPIFFE settings
4 changes: 2 additions & 2 deletions docs/getting-started/quick-start.md
@@ -89,7 +89,7 @@ Obtain an NGC API key at [ngc.nvidia.com](https://ngc.nvidia.com) → **API Keys
| `NICO_CORE_IMAGE_TAG` | **Yes** | NICo Core image tag (e.g. `v2025.12.30`). |
| `NICO_REST_IMAGE_TAG` | **Yes** | NICo REST image tag (e.g. `v1.0.4`). |
| `KUBECONFIG` | **Yes** | Path to your cluster kubeconfig. |
-| `NICO_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. |
+| `NICO_SITE_UUID` | No | Stable UUID for this site. If unset, `setup.sh` generates a random UUID each run. |

### 3b. Set your Site Name

@@ -245,7 +245,7 @@ All IPs must be within the `IPAddressPool` ranges you defined in `values/metallb

### 3i. (Optional) Set a Stable Site UUID

-If you want a specific site UUID instead of the default placeholder, set the `NICO_SITE_UUID` environment variable:
+If you want a specific site UUID instead of a random UUID generated by `setup.sh`, set the `NICO_SITE_UUID` environment variable:

```bash
export NICO_SITE_UUID=<your-uuid> # must be a valid UUID v4
2 changes: 1 addition & 1 deletion helm-prereqs/README.md
@@ -101,7 +101,7 @@ The tables below summarize the keys that must be set per site.
| `NICO_IMAGE_REGISTRY` | Yes, unless `--skip-core --skip-rest` | Base image registry for all NICo images (e.g. `my-registry.example.com/nico`) |
| `NICO_CORE_IMAGE_TAG` | Yes, unless `--skip-core` | NICo Core image tag (e.g. `v2025.12.30-rc1`) |
| `NICO_REST_IMAGE_TAG` | Yes, unless `--skip-rest` | NICo REST image tag (e.g. `v1.0.4`) |
-| `NICO_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. |
+| `NICO_SITE_UUID` | No | Stable UUID for this site. If unset, `setup.sh` generates a random UUID each run. |
| `NICO_MANAGE_DEFAULT_STORAGE_CLASS` | No | Whether `setup.sh` marks `local-path` as the default StorageClass. Defaults to `true`. Set to `false` when the cluster already has an operator-managed default StorageClass. |
| `NICO_STORAGE_CLASS` | No | StorageClass used by Vault data/audit PVCs. Defaults to `local-path-persistent`. |
| `PREFLIGHT_CHECK_IMAGE` | No | Image used for preflight per-node checks. Defaults to `busybox:1.36`; set to a local mirror for air-gapped clusters. |
98 changes: 86 additions & 12 deletions helm-prereqs/setup.sh
@@ -37,8 +37,8 @@
# preloaded, or use existing imagePullSecrets.
# REGISTRY_PULL_USERNAME Username for generated pull secrets.
# Default: $oauthtoken
-# NICO_SITE_UUID Stable REST site UUID. Used only when REST is
-# deployed. Default is a dev placeholder.
+# NICO_SITE_UUID REST site UUID. Used only when REST is deployed.
+# If unset, setup generates a random UUID each run.
# NICO_MANAGE_DEFAULT_STORAGE_CLASS
# Whether setup annotates local-path as the default
# StorageClass. Default: true.
@@ -629,13 +629,82 @@ _TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/t
--tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
--tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
--tls-server-name interservice.server.temporal.local"
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+_wait_for_temporal() {
+    local _output=""
+
+    echo "Waiting for Temporal frontend and admin tools..."
+    kubectl rollout status deploy/temporal-frontend -n temporal --timeout=120s
+    kubectl rollout status deploy/temporal-admintools -n temporal --timeout=120s
+
+    for _i in $(seq 1 24); do
+        if _output="$(kubectl exec -n temporal deploy/temporal-admintools -- \
+            sh -c "temporal operator namespace list --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>&1)"; then
+            echo "Temporal frontend ready"
+            return
+        fi
+        echo " Waiting for Temporal API (${_i}/24)..."
+        sleep 5
+    done
+
+    echo "ERROR: Temporal frontend is not ready for namespace operations" >&2
+    echo "${_output}" >&2
+    exit 1
+}
+
+_create_temporal_namespace() {
+    local _namespace="$1"
+    local _output
+
+    if _output="$(kubectl exec -n temporal deploy/temporal-admintools -- \
+        sh -c "temporal operator namespace create -n \"\$1\" --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" \
+        sh "${_namespace}" 2>&1)"; then
+        echo "Temporal namespace ${_namespace} ready"
+        return
+    fi
+
+    if printf "%s" "${_output}" | grep -qi "already exists"; then
+        echo "Temporal namespace ${_namespace} already exists"
+        return
+    fi
+
+    echo "ERROR: failed to create Temporal namespace ${_namespace}" >&2
+    echo "${_output}" >&2
+    exit 1
+}
+
+_verify_temporal_namespaces() {
+    local _output
+    local _missing=()
+    local _namespace
+
+    if ! _output="$(kubectl exec -n temporal deploy/temporal-admintools -- \
+        sh -c "temporal operator namespace list --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>&1)"; then
+        echo "ERROR: failed to list Temporal namespaces" >&2
+        echo "${_output}" >&2
+        exit 1
+    fi
+
+    for _namespace in "$@"; do
+        if ! printf "%s" "${_output}" | grep -Eq "(^|[^[:alnum:]_-])${_namespace}([^[:alnum:]_-]|$)"; then
+            _missing+=("${_namespace}")
+        fi
+    done
+
+    if [[ ${#_missing[@]} -gt 0 ]]; then
+        echo "ERROR: missing Temporal namespace(s): ${_missing[*]}" >&2
+        echo "${_output}" >&2
+        exit 1
+    fi
+
+    echo "Verified Temporal namespaces: $*"
+}
+
+_wait_for_temporal
+_create_temporal_namespace cloud
+_create_temporal_namespace site
 # flow Temporal namespace — required by NICo Flow workers; pod panics on startup if absent.
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+_create_temporal_namespace flow
+_verify_temporal_namespaces cloud site flow
 echo "Temporal namespaces ready"

_SETUP_PHASE="[7g/7] NICo REST helm chart"
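
The whole-token match used by `_verify_temporal_namespaces` can be tried in isolation. The sketch below reuses the same `grep -E` pattern in a hypothetical helper, `_has_namespace`, and runs it against a made-up listing; real `temporal operator namespace list` output is laid out differently. It assumes namespace names contain no regex metacharacters, which holds for `cloud`, `site`, `flow`, and UUIDs.

```bash
set -euo pipefail

# Hypothetical helper: returns 0 when $2 appears in $1 as a whole token, that
# is, not embedded in a longer run of letters, digits, underscores, or hyphens.
_has_namespace() {
    local _listing="$1" _namespace="$2"
    printf '%s\n' "${_listing}" |
        grep -Eq "(^|[^[:alnum:]_-])${_namespace}([^[:alnum:]_-]|$)"
}

# Invented sample listing, for illustration only.
_listing=$'NamespaceInfo.Name  site-staging\nNamespaceInfo.Name  cloud'

_has_namespace "${_listing}" cloud && echo "cloud: found"
_has_namespace "${_listing}" site || echo "site: missing"
```

With this sample, `cloud` is reported as found, while `site` is reported as missing because `site-staging` does not count as a match. A plain substring `grep` would accept it.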
@@ -710,8 +779,13 @@ fi
# All of this is wired via --set flags so nico-rest.yaml stays registry-agnostic.
NICO_SITE_AGENT_CHART="${NICO_REST_HELM_DIR}/nico-rest-site-agent"

-# Stable placeholder UUID for this site (must be a valid UUID).
-NICO_SITE_UUID="${NICO_SITE_UUID:-a1b2c3d4-e5f6-4000-8000-000000000001}"
+if [[ -z "${NICO_SITE_UUID:-}" ]]; then
+    if ! command -v python3 &>/dev/null; then
+        echo "ERROR: NICO_SITE_UUID is unset and python3 is not available" >&2
+        exit 1
+    fi
+    NICO_SITE_UUID="$(python3 -c 'import uuid; print(uuid.uuid4())')"
+fi
Comment on lines +782 to +788

Contributor

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Persist the generated site UUID across reruns.

When NICO_SITE_UUID is unset, this branch mints a fresh UUID on every setup.sh invocation. That value is later wired into the site-agent as both CLUSTER_ID and TEMPORAL_SUBSCRIBE_NAMESPACE, so a routine rerun changes the site's identity, leaves the old Temporal namespace orphaned, and forces site-agent re-registration even though no rotation was requested. Please reuse the existing deployed UUID on reruns and only generate a new one on first install.

As per path instructions, helm-prereqs/**: review prerequisite Helm resources and scripts for install ordering, cluster-scope permissions, secret handling, idempotency, and clear failure messages.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 782 - 788, The NICO_SITE_UUID generation
in setup.sh is not idempotent and changes the site identity on reruns. Update
the logic around the NICO_SITE_UUID branch so setup.sh reuses the previously
deployed UUID instead of minting a new one every time, and only generates a UUID
on first install; locate the existing assignment near the python3 fallback and
make it read from the persisted deployment source before falling back to
uuid.uuid4(). Ensure the site-agent continues to receive the same CLUSTER_ID and
TEMPORAL_SUBSCRIBE_NAMESPACE values across reruns unless an explicit rotation is
intended.

Source: Path instructions
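
One way to read the suggestion above is a three-step precedence: an explicit `NICO_SITE_UUID`, then a previously persisted value, then a freshly generated UUID. The sketch below illustrates that order only. It persists to a local file (`SITE_UUID_FILE`, a hypothetical name) as a stand-in for whatever deployed source the script would actually consult, and the helpers `_generate_uuid` and `_resolve_site_uuid` are likewise assumptions, not part of `setup.sh`.

```bash
set -euo pipefail

# Hypothetical state file. A real fix might instead read the value back from
# the deployed site-agent release or another in-cluster source.
SITE_UUID_FILE="${SITE_UUID_FILE:-${TMPDIR:-/tmp}/nico-site-uuid}"

# Generate a random UUID with whichever tool is available.
_generate_uuid() {
    if command -v python3 &>/dev/null; then
        python3 -c 'import uuid; print(uuid.uuid4())'
    elif command -v uuidgen &>/dev/null; then
        uuidgen | tr '[:upper:]' '[:lower:]'
    elif [[ -r /proc/sys/kernel/random/uuid ]]; then
        cat /proc/sys/kernel/random/uuid
    else
        echo "ERROR: no tool available to generate a site UUID" >&2
        return 1
    fi
}

# Precedence: explicit NICO_SITE_UUID, then the persisted value, then a new
# UUID that is written to SITE_UUID_FILE so later runs reuse it.
_resolve_site_uuid() {
    local _uuid
    if [[ -n "${NICO_SITE_UUID:-}" ]]; then
        printf '%s\n' "${NICO_SITE_UUID}"
    elif [[ -s "${SITE_UUID_FILE}" ]]; then
        cat "${SITE_UUID_FILE}"
    else
        _uuid="$(_generate_uuid)" || return 1
        printf '%s\n' "${_uuid}" > "${SITE_UUID_FILE}"
        printf '%s\n' "${_uuid}"
    fi
}
```

A caller would then use `NICO_SITE_UUID="$(_resolve_site_uuid)"`. In this sketch an explicitly supplied value is honored but not written to the state file; whether an override should also be persisted is a design choice the comment leaves open.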


NICO_SITE_AGENT_ARGS=(
--namespace nico-rest
@@ -762,8 +836,8 @@ _TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/t
--tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
--tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
--tls-server-name interservice.server.temporal.local"
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n '${NICO_SITE_UUID}' --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+_create_temporal_namespace "${NICO_SITE_UUID}"
+_verify_temporal_namespaces "${NICO_SITE_UUID}"
echo "Temporal namespace ready"

# FLOW_GRPC_ENABLED toggles the site-agent's Flow gRPC client (see
42 changes: 0 additions & 42 deletions helm-prereqs/values/nico-rest.yaml
@@ -38,45 +38,3 @@ nico-rest-workflow:
replicaCount: 3
siteWorker:
replicaCount: 3

-# Site-agent config — v1.0.4 binary reads DB config from env vars.
-# NICo postgres uses the 'nico' user and 'elektratest' database.
-# CLUSTER_ID and TEMPORAL_SUBSCRIBE_* are set via --set in setup.sh
-# using the NICO_SITE_UUID variable (default: a1b2c3d4-e5f6-4000-8000-000000000001).
-nico-rest-site-agent:

Contributor

Why is the site agent config being removed?

Contributor Author

I couldn't find any usage of or reference to it.
We use `values/nico-site-agent.yaml` directly, which contains the same values.

```bash
    --namespace nico-rest
    -f "${SCRIPT_DIR}/values/nico-site-agent.yaml"
    --set global.image.repository="${NICO_IMAGE_REGISTRY}"
    --set global.image.tag="${NICO_REST_IMAGE_TAG}"
)
```

-replicaCount: 3
-bootstrap:
-enabled: true
-siteManager:
-address: "nico-rest-site-manager.nico-rest:8100"
-certificate:
-# Service identifier must match "elektra-site-agent" for nico-api's SiteAgent RBAC role.
-# The base path /nico-system/sa/ is one of nico-api's recognized spiffe_service_base_paths.
-uris:
-- "spiffe://nico.local/nico-system/sa/elektra-site-agent"
-envConfig:
-# DEV ONLY — these values match the dev postgres instance deployed by setup.sh.
-# DB_USER and DB_PASSWORD are injected from the db-creds Secret (secrets.dbCreds).
-DB_ADDR: "postgres.postgres.svc.cluster.local"
-DB_DATABASE: "elektratest"
-DB_PORT: "5432"
-ESA_PORT: "8080"
-METRICS_PORT: "2112"
-DEV_MODE: "true"
-ENABLE_DEBUG: "true"
-ENABLE_TLS: "true"
-# mTLS to nico-api (NICO_SEC_OPT=2). Cert issued from vault-nico-issuer
-# so nico-api trusts it (same Vault PKI CA as nico-api's own cert).
-NICO_ADDRESS: "nico-api.nico-system.svc.cluster.local:1079"
-NICO_SEC_OPT: "2"
-CLUSTER_ID: "a1b2c3d4-e5f6-4000-8000-000000000001"
-TEMPORAL_HOST: "temporal-frontend.temporal"
-TEMPORAL_PORT: "7233"
-TEMPORAL_SERVER: "interservice.server.temporal.local"
-TEMPORAL_PUBLISH_NAMESPACE: "site"
-TEMPORAL_PUBLISH_QUEUE: "site"
-TEMPORAL_SUBSCRIBE_NAMESPACE: "a1b2c3d4-e5f6-4000-8000-000000000001"
-TEMPORAL_SUBSCRIBE_QUEUE: "site"
-TEMPORAL_INVENTORY_SCHEDULE: "@every 3m"
-TEMPORAL_CERT_PATH: "/etc/temporal-certs"
-TEMPORAL_CERT: "temporal-client-site-agent-certs"