fix(helm): Harden setup.sh Temporal namespace creation, generate Site UUID if not specified#2546
Conversation
WalkthroughThe PR updates ChangesSite UUID and Temporal bootstrap
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
eac3b6c to
c6cc37b
Compare
| # NICo postgres uses the 'nico' user and 'elektratest' database. | ||
| # CLUSTER_ID and TEMPORAL_SUBSCRIBE_* are set via --set in setup.sh | ||
| # using the NICO_SITE_UUID variable (default: a1b2c3d4-e5f6-4000-8000-000000000001). | ||
| nico-rest-site-agent: |
There was a problem hiding this comment.
why is the site agent config is being removed?
There was a problem hiding this comment.
I couldn't find a usage/reference for it.
We're directly using the values/nico-site-agent.yaml which contains the same values.
--namespace nico-rest
-f "${SCRIPT_DIR}/values/nico-site-agent.yaml"
--set global.image.repository="${NICO_IMAGE_REGISTRY}"
--set global.image.tag="${NICO_REST_IMAGE_TAG}"
)```
|
@parmani-nv Is this ready? Can we move it out of draft? |
52eba16 to
a51ca41
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 782-788: The NICO_SITE_UUID generation in setup.sh is not
idempotent and changes the site identity on reruns. Update the logic around the
NICO_SITE_UUID branch so setup.sh reuses the previously deployed UUID instead of
minting a new one every time, and only generates a UUID on first install; locate
the existing assignment near the python3 fallback and make it read from the
persisted deployment source before falling back to uuid.uuid4(). Ensure the
site-agent continues to receive the same CLUSTER_ID and
TEMPORAL_SUBSCRIBE_NAMESPACE values across reruns unless an explicit rotation is
intended.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d2807f2f-cdd9-4cdd-a2d0-0b4202785dda
📒 Files selected for processing (5)
book/src/configuration/configurability.mddocs/getting-started/quick-start.mdhelm-prereqs/README.mdhelm-prereqs/setup.shhelm-prereqs/values/nico-rest.yaml
💤 Files with no reviewable changes (1)
- helm-prereqs/values/nico-rest.yaml
| if [[ -z "${NICO_SITE_UUID:-}" ]]; then | ||
| if ! command -v python3 &>/dev/null; then | ||
| echo "ERROR: NICO_SITE_UUID is unset and python3 is not available" >&2 | ||
| exit 1 | ||
| fi | ||
| NICO_SITE_UUID="$(python3 -c 'import uuid; print(uuid.uuid4())')" | ||
| fi |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Persist the generated site UUID across reruns.
When NICO_SITE_UUID is unset, this branch mints a fresh UUID on every setup.sh invocation. That value is later wired into the site-agent as both CLUSTER_ID and TEMPORAL_SUBSCRIBE_NAMESPACE, so a routine rerun changes the site's identity, leaves the old Temporal namespace orphaned, and forces site-agent re-registration even though no rotation was requested. Please reuse the existing deployed UUID on reruns and only generate a new one on first install.
As per path instructions, helm-prereqs/**: review prerequisite Helm resources and scripts for install ordering, cluster-scope permissions, secret handling, idempotency, and clear failure messages.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm-prereqs/setup.sh` around lines 782 - 788, The NICO_SITE_UUID generation
in setup.sh is not idempotent and changes the site identity on reruns. Update
the logic around the NICO_SITE_UUID branch so setup.sh reuses the previously
deployed UUID instead of minting a new one every time, and only generates a UUID
on first install; locate the existing assignment near the python3 fallback and
make it read from the persisted deployment source before falling back to
uuid.uuid4(). Ensure the site-agent continues to receive the same CLUSTER_ID and
TEMPORAL_SUBSCRIBE_NAMESPACE values across reruns unless an explicit rotation is
intended.
Source: Path instructions
a51ca41 to
11ab3eb
Compare
…UUID setup.sh hid all Temporal namespace errors (`2>/dev/null || true`), so a not-yet-ready frontend silently skipped namespace creation and left REST/Flow workers CrashLooping. Add _wait_for_temporal, _create_temporal_namespace, and _verify_temporal_namespaces to wait, tolerate "already exists", and fail fast on any real error. Generate a random NICO_SITE_UUID (python3) when unset instead of a shared dev placeholder, so sites no longer collide on one default UUID. Drop the unused nico-rest-site-agent block from values/nico-rest.yaml; it is not a nico-rest subchart, and the site-agent ships from its own chart via values/nico-site-agent.yaml. Update the configurability, quick-start, and helm-prereqs docs for the generated-UUID behavior. Closes NVIDIA#2245 - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) - [ ] **This PR contains breaking changes** - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) Signed-off-by: Parham Armani <parmani@nvidia.com>
11ab3eb to
242c124
Compare
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2546.docs.buildwithfern.com/infra-controller |
thossain-nv
left a comment
There was a problem hiding this comment.
Looks good, thanks @parmani-nv
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Description
Hardens
helm-prereqs/setup.shso Temporal namespace provisioning fails loudly instead of silently continuing into worker CrashLoops:temporal operator namespace create ... 2>/dev/null || truecalls with helpers that capture output, treatalready existsas success, and exit with the real error otherwise.cloud,site,flow, and the per-site UUID) actually exist after creation.Also replaces the fixed default REST site UUID with a generated UUID (
python3 -c 'import uuid; print(uuid.uuid4())') whenNICO_SITE_UUIDis unset, and removes the unusednico-rest-site-agentvalues block fromhelm-prereqs/values/nico-rest.yaml(thenico-restumbrella chart has nonico-rest-site-agentdependency; the site-agent is deployed from its standalone chart withnico-site-agent.yaml). Docs updated to match.Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Will test by running
helm-prereqs/setup.shagainst the dev cluster and confirming:cloud,site,flow, and the generated site UUID are all created/verified.nico-restandnico-rest-site-agentinstall cleanly with the generatedNICO_SITE_UUID.https://nvbugspro.nvidia.com/bug/6311869