Skip to content

fix(helm): Harden setup.sh Temporal namespace creation, generate Site UUID if not specified#2546

Merged
parmani-nv merged 1 commit into
NVIDIA:mainfrom
parmani-nv:bug/site_uuid_helm
Jun 26, 2026
Merged

fix(helm): Harden setup.sh Temporal namespace creation, generate Site UUID if not specified#2546
parmani-nv merged 1 commit into
NVIDIA:mainfrom
parmani-nv:bug/site_uuid_helm

Conversation

@parmani-nv

@parmani-nv parmani-nv commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Description

Hardens helm-prereqs/setup.sh so Temporal namespace provisioning fails loudly instead of silently continuing into worker CrashLoops:

  • Replaces the temporal operator namespace create ... 2>/dev/null || true calls with helpers that capture output, treat already exists as success, and exit with the real error otherwise.
  • Waits for the Temporal frontend/admin tools to be ready before creating namespaces.
  • Verifies the expected namespaces (cloud, site, flow, and the per-site UUID) actually exist after creation.

Also replaces the fixed default REST site UUID with a generated UUID (python3 -c 'import uuid; print(uuid.uuid4())') when NICO_SITE_UUID is unset, and removes the unused nico-rest-site-agent values block from helm-prereqs/values/nico-rest.yaml (the nico-rest umbrella chart has no nico-rest-site-agent dependency; the site-agent is deployed from its standalone chart with nico-site-agent.yaml). Docs updated to match.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Will test by running helm-prereqs/setup.sh against the dev cluster and confirming:

  • Temporal namespaces cloud, site, flow, and the generated site UUID are all created/verified.
  • A missing/failed namespace create now aborts setup with a clear error instead of continuing.
  • nico-rest and nico-rest-site-agent install cleanly with the generated NICO_SITE_UUID.

https://nvbugspro.nvidia.com/bug/6311869

@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

The PR updates NICO_SITE_UUID documentation and runtime generation, changes Temporal namespace bootstrap logic in setup.sh, and removes the nico-rest-site-agent block.

Changes

Site UUID and Temporal bootstrap

Layer / File(s) Summary
NICO_SITE_UUID contract
book/src/configuration/configurability.md, docs/getting-started/quick-start.md, helm-prereqs/README.md, helm-prereqs/setup.sh
NICO_SITE_UUID is documented as an optional override, and setup.sh generates a random UUIDv4 with python3 when it is unset.
Temporal namespace bootstrap
helm-prereqs/setup.sh
setup.sh adds readiness checks, namespace creation with already exists handling, and namespace verification for the Temporal bootstrap and site-agent phases.
Site-agent values removal
helm-prereqs/values/nico-rest.yaml
The nico-rest-site-agent block is removed from the Helm values file.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2675: Also changes helm-prereqs/setup.sh around temporal operator namespace create, which is directly adjacent to the namespace bootstrap flow here.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description matches the implemented changes and documents the namespace handling, UUID behavior, and docs updates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately summarizes the main changes: hardened Temporal namespace creation and random Site UUID generation when unset.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@parmani-nv parmani-nv force-pushed the bug/site_uuid_helm branch 2 times, most recently from eac3b6c to c6cc37b Compare June 13, 2026 00:34
# NICo postgres uses the 'nico' user and 'elektratest' database.
# CLUSTER_ID and TEMPORAL_SUBSCRIBE_* are set via --set in setup.sh
# using the NICO_SITE_UUID variable (default: a1b2c3d4-e5f6-4000-8000-000000000001).
nico-rest-site-agent:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the site agent config is being removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a usage/reference for it.
We're directly using the values/nico-site-agent.yaml which contains the same values.

    --namespace nico-rest
    -f "${SCRIPT_DIR}/values/nico-site-agent.yaml"
    --set global.image.repository="${NICO_IMAGE_REGISTRY}"
    --set global.image.tag="${NICO_REST_IMAGE_TAG}"
)```

@thossain-nv

Copy link
Copy Markdown
Contributor

@parmani-nv Is this ready? Can we move it out of draft?

@parmani-nv parmani-nv force-pushed the bug/site_uuid_helm branch 3 times, most recently from 52eba16 to a51ca41 Compare June 25, 2026 19:25
@parmani-nv parmani-nv marked this pull request as ready for review June 25, 2026 19:26
@parmani-nv parmani-nv requested review from a team as code owners June 25, 2026 19:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 782-788: The NICO_SITE_UUID generation in setup.sh is not
idempotent and changes the site identity on reruns. Update the logic around the
NICO_SITE_UUID branch so setup.sh reuses the previously deployed UUID instead of
minting a new one every time, and only generates a UUID on first install; locate
the existing assignment near the python3 fallback and make it read from the
persisted deployment source before falling back to uuid.uuid4(). Ensure the
site-agent continues to receive the same CLUSTER_ID and
TEMPORAL_SUBSCRIBE_NAMESPACE values across reruns unless an explicit rotation is
intended.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2807f2f-cdd9-4cdd-a2d0-0b4202785dda

📥 Commits

Reviewing files that changed from the base of the PR and between 2650ef8 and a51ca41.

📒 Files selected for processing (5)
  • book/src/configuration/configurability.md
  • docs/getting-started/quick-start.md
  • helm-prereqs/README.md
  • helm-prereqs/setup.sh
  • helm-prereqs/values/nico-rest.yaml
💤 Files with no reviewable changes (1)
  • helm-prereqs/values/nico-rest.yaml

Comment thread helm-prereqs/setup.sh
Comment on lines +782 to +788
if [[ -z "${NICO_SITE_UUID:-}" ]]; then
if ! command -v python3 &>/dev/null; then
echo "ERROR: NICO_SITE_UUID is unset and python3 is not available" >&2
exit 1
fi
NICO_SITE_UUID="$(python3 -c 'import uuid; print(uuid.uuid4())')"
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Persist the generated site UUID across reruns.

When NICO_SITE_UUID is unset, this branch mints a fresh UUID on every setup.sh invocation. That value is later wired into the site-agent as both CLUSTER_ID and TEMPORAL_SUBSCRIBE_NAMESPACE, so a routine rerun changes the site's identity, leaves the old Temporal namespace orphaned, and forces site-agent re-registration even though no rotation was requested. Please reuse the existing deployed UUID on reruns and only generate a new one on first install.

As per path instructions, helm-prereqs/**: review prerequisite Helm resources and scripts for install ordering, cluster-scope permissions, secret handling, idempotency, and clear failure messages.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 782 - 788, The NICO_SITE_UUID generation
in setup.sh is not idempotent and changes the site identity on reruns. Update
the logic around the NICO_SITE_UUID branch so setup.sh reuses the previously
deployed UUID instead of minting a new one every time, and only generates a UUID
on first install; locate the existing assignment near the python3 fallback and
make it read from the persisted deployment source before falling back to
uuid.uuid4(). Ensure the site-agent continues to receive the same CLUSTER_ID and
TEMPORAL_SUBSCRIBE_NAMESPACE values across reruns unless an explicit rotation is
intended.

Source: Path instructions

@parmani-nv parmani-nv force-pushed the bug/site_uuid_helm branch from a51ca41 to 11ab3eb Compare June 25, 2026 23:26
…UUID

setup.sh hid all Temporal namespace errors (`2>/dev/null || true`), so a
not-yet-ready frontend silently skipped namespace creation and left
REST/Flow workers CrashLooping. Add _wait_for_temporal,
_create_temporal_namespace, and _verify_temporal_namespaces to wait,
tolerate "already exists", and fail fast on any real error.

Generate a random NICO_SITE_UUID (python3) when unset instead of a shared
dev placeholder, so sites no longer collide on one default UUID.

Drop the unused nico-rest-site-agent block from values/nico-rest.yaml; it
is not a nico-rest subchart, and the site-agent ships from its own chart
via values/nico-site-agent.yaml.

Update the configurability, quick-start, and helm-prereqs docs for the
generated-UUID behavior.

Closes NVIDIA#2245

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

- [ ] **This PR contains breaking changes**

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Signed-off-by: Parham Armani <parmani@nvidia.com>
@parmani-nv parmani-nv force-pushed the bug/site_uuid_helm branch from 11ab3eb to 242c124 Compare June 25, 2026 23:28
@github-actions

Copy link
Copy Markdown

@parmani-nv parmani-nv enabled auto-merge (squash) June 25, 2026 23:28
@thossain-nv thossain-nv changed the title fix(helm):harden setup.sh to properly handle temporal namespace creation + generate random siteUUID if not supplied fix(helm): Harden setup.sh Temporal namespace creation, generate Site UUID if not specified Jun 25, 2026

@thossain-nv thossain-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @parmani-nv

@parmani-nv parmani-nv merged commit 2acb7c3 into NVIDIA:main Jun 26, 2026
56 checks passed
@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 283 6 24 98 7 148
machine-validation-runner 744 32 188 267 36 221
machine_validation 744 32 188 267 36 221
machine_validation-aarch64 744 32 188 267 36 221
nvmetal-carbide 744 32 188 267 36 221
TOTAL 3265 134 776 1172 151 1032

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants