Skip to content

fix(site-agent): use mTLS for Flow gRPC and protect temporal-certs secret#2942

Open
shayan1995 wants to merge 1 commit into
NVIDIA:mainfrom
shayan1995:fix/flowgrpc-mtls-client-cert
Open

fix(site-agent): use mTLS for Flow gRPC and protect temporal-certs secret#2942
shayan1995 wants to merge 1 commit into
NVIDIA:mainfrom
shayan1995:fix/flowgrpc-mtls-client-cert

Conversation

@shayan1995

Copy link
Copy Markdown
Contributor

Three fixes for site-agent → Flow gRPC connectivity:

  • flow_client.go: switch FlowMutualTLS from tls.Config{Certificates} to GetClientCertificate — Go silently drops the cert when the issuer doesn't match the server's CA list, causing tls: certificate required
  • values.yaml: add FLOW_GRPC_SEC_OPT: "2" as chart default (was unset, defaulted to server-TLS only)
  • temporal-certs-secret.yaml: add helm.sh/resource-policy: keep — template was overwriting bootstrap-written certs on every helm upgrade, causing bootstrap to retry forever and block Flow gRPC startup

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Verified on dev6: site-agent pods reach 1/1 Running with 0 restarts and logs show FlowGrpcClient: Successfully connected to server. helm upgrade no longer overwrites temporal-client-site-agent-certs.

Additional Notes

@shayan1995 shayan1995 requested review from a team as code owners June 27, 2026 00:13
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features
    • Added support for additional Flow gRPC environment settings in the deployment configuration.
  • Bug Fixes
    • Improved TLS client certificate handling for Flow gRPC connections.
    • Preserved placeholder certificate secrets across Helm upgrades to avoid overwriting certificates during deployments.

Walkthrough

The Helm chart now preserves the Temporal client TLS placeholder secret across upgrades and adds Flow gRPC environment values. The Flow gRPC client now returns its client certificate through tls.Config.GetClientCertificate while retaining the existing root CAs.

Changes

Flow deployment and startup

Layer / File(s) Summary
Flow phase, certs, and chart bootstrap
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml, helm/rest/nico-rest-site-agent/values.yaml
The Temporal client TLS placeholder secret gains a helm.sh/resource-policy: keep annotation and updated comment, and envConfig adds FLOW_GRPC_ENABLED and FLOW_GRPC_SEC_OPT.
Flow client certificate selection
rest-api/site-workflow/pkg/grpc/client/flow_client.go
The Flow gRPC client now supplies its mutual TLS certificate through GetClientCertificate instead of Certificates, while keeping RootCAs unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: Flow gRPC mTLS and preserving the Temporal cert secret.
Description check ✅ Passed The description is directly aligned with the code changes and clearly explains the three fixes in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-27 00:16:55 UTC | Commit: 22fb1d3

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/grpc/client/flow_client.go (1)

169-173: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Optional hardening: consider pinning MinVersion. Static analysis notes the absence of MinVersion. As a client, Go already negotiates a TLS 1.2 floor, so this is not a defect — but explicitly setting MinVersion: tls.VersionTLS13 would make the security posture intentional rather than implicit, provided the Flow server supports it.

🔒 Optional diff
 		mutualTLSConfig := &tls.Config{
+			MinVersion: tls.VersionTLS13,
 			GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
 				return &clientCert, nil
 			},
 			RootCAs: capool,
 		}

Please confirm the Flow gRPC server negotiates TLS 1.3 before pinning, to avoid handshake regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/grpc/client/flow_client.go` around lines 169 -
173, The mutual TLS setup in flow client construction should explicitly set a
minimum TLS version if the Flow gRPC server supports it. Review the tls.Config
used with mutualTLSConfig and, before pinning, confirm server-side TLS 1.3
negotiation is available; if it is, add MinVersion to the tls.Config to make the
security posture explicit without regressing handshakes.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 807-816: The Docker config JSON in the registry pull secret setup
is being built with printf in a way that can emit invalid JSON when
REGISTRY_PULL_USERNAME or REGISTRY_PULL_SECRET contains special characters.
Update the shell logic in the registry secret construction block to use a proper
JSON encoder or equivalent escaping step before base64 encoding, keeping the
existing NICO_FLOW_ARGS and image-pull-secret wiring intact.

In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 4-6: Clarify the keep-policy behavior in the Secret template
comments for temporal-certs-secret.yaml: `helm.sh/resource-policy: keep`
preserves the Secret on uninstall, so a reinstall will reuse the existing object
unless it is manually deleted first. Update the explanatory text near the kept
Secret to reflect this behavior accurately, using the Secret template and its
bootstrap placeholder context as the reference point.

---

Nitpick comments:
In `@rest-api/site-workflow/pkg/grpc/client/flow_client.go`:
- Around line 169-173: The mutual TLS setup in flow client construction should
explicitly set a minimum TLS version if the Flow gRPC server supports it. Review
the tls.Config used with mutualTLSConfig and, before pinning, confirm
server-side TLS 1.3 negotiation is available; if it is, add MinVersion to the
tls.Config to make the security posture explicit without regressing handshakes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b4eeeb77-48f7-4aa0-836d-db000ecd1635

📥 Commits

Reviewing files that changed from the base of the PR and between 45cb7a1 and 22fb1d3.

📒 Files selected for processing (6)
  • helm-prereqs/setup.sh
  • helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
  • helm/rest/nico-rest-site-agent/values.yaml
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
  • rest-api/site-workflow/pkg/grpc/client/flow_client.go

Comment thread helm-prereqs/setup.sh
Comment thread helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
…cret

Three related issues caused the site-agent to fail connecting to Flow gRPC:

1. FLOW_GRPC_SEC_OPT was missing from the site-agent Helm chart values, so
   the Flow gRPC client defaulted to FlowServerTLS (1). Flow requires mTLS
   (RequireAndVerifyClientCert); the server rejected the connection with
   "tls: certificate required". Fix: add FLOW_GRPC_SEC_OPT: "2" as the
   chart default so FlowMutualTLS is used by default.

2. The FlowMutualTLS case in NewFlowGrpcClient used tls.Config{Certificates: ...}
   to present the client cert. Go's TLS stack only selects a cert from that
   field when the cert's issuer matches the server's CertificateRequest acceptable
   CA list; when there is no match it silently sends no cert, and the server
   again rejects with "certificate required". Fix: use GetClientCertificate
   instead, which unconditionally returns the cert — the same approach used in
   rest-api/flow/pkg/certs/certs.go TLSConfig().

3. temporal-certs-secret.yaml created the secret with empty placeholder values
   on every helm upgrade, overwriting certs written by the bootstrap process.
   On the next pod restart the site-agent found the OTP gone, re-attempted
   bootstrap (which fails because the OTP was already consumed), and blocked
   forever — preventing FlowGrpc().Start() from running. Fix: add
   helm.sh/resource-policy: keep so upgrades never touch this secret after
   initial creation.
@shayan1995 shayan1995 force-pushed the fix/flowgrpc-mtls-client-cert branch from 22fb1d3 to 2f521ef Compare June 27, 2026 00:23
@github-actions

github-actions Bot commented Jun 27, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 285 6 26 102 7 144
machine-validation-runner 744 32 188 267 36 221
machine_validation 744 32 188 267 36 221
machine_validation-aarch64 744 32 188 267 36 221
nvmetal-carbide 744 32 188 267 36 221
TOTAL 3267 134 778 1176 151 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml (1)

4-6: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Correct the reinstall behavior note for the kept Secret.

Line 6 still implies a release delete/reinstall creates a fresh placeholder. With helm.sh/resource-policy: keep, Helm skips deletion when an operation would remove the resource, so the kept Secret can survive and be reused unless the Secret itself is deleted first. (v3-1-0.helm.sh)

As per path instructions, helm/**: “Review Helm charts for template correctness, values compatibility, Kubernetes API compatibility, security context, RBAC, probes, and upgrade behavior.”

Suggested wording
-# bootstrap process; deleting and re-installing the release creates a fresh placeholder.
+# bootstrap process. The kept Secret also survives uninstall; delete the Secret
+# explicitly before reinstalling if a fresh placeholder is required.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml` around
lines 4 - 6, The comment in temporal-certs-secret.yaml overstates reinstall
behavior for the kept Secret. Update the explanatory text around the
helm.sh/resource-policy: keep note to say that Helm preserves the Secret across
delete/reinstall flows unless the Secret itself is removed first, and avoid
claiming a fresh placeholder is automatically created. Use the existing Temporal
client TLS certs / bootstrap placeholder wording so the guidance matches the
template’s actual behavior.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 4-6: The comment in temporal-certs-secret.yaml overstates
reinstall behavior for the kept Secret. Update the explanatory text around the
helm.sh/resource-policy: keep note to say that Helm preserves the Secret across
delete/reinstall flows unless the Secret itself is removed first, and avoid
claiming a fresh placeholder is automatically created. Use the existing Temporal
client TLS certs / bootstrap placeholder wording so the guidance matches the
template’s actual behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 43ce878f-8cda-407c-9e1f-e31238e9e4e8

📥 Commits

Reviewing files that changed from the base of the PR and between 22fb1d3 and 2f521ef.

📒 Files selected for processing (3)
  • helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
  • helm/rest/nico-rest-site-agent/values.yaml
  • rest-api/site-workflow/pkg/grpc/client/flow_client.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • helm/rest/nico-rest-site-agent/values.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant