Design: Declarative Keycloak configuration by dcmcand · Pull Request #154 · nebari-dev/nebari-infrastructure-core

dcmcand · 2026-03-12T17:16:25Z

Summary

Design doc proposing a declarative Keycloak configuration system to replace the current imperative realm-setup-job.yaml shell script.

Uses keycloak-config-cli to apply YAML config via Keycloak Admin API
Sensible defaults (nebari realm, roles, admin user) overridable via config.yaml
Backup/restore round-trip: export current state, restore to fresh instance
Safe coexistence with nebari-operator via managed resource prefixes

See docs/superpowers/specs/2026-03-12-declarative-keycloak-config-design.md for the full design.

Looking for feedback on the approach, config schema, and open questions at the bottom of the doc.

Proposes replacing the imperative realm-setup-job shell script with keycloak-config-cli for declarative Keycloak configuration. Covers config schema, defaults/merge strategy, backup/restore, and operator coexistence.

viniciusdc · 2026-03-23T14:51:34Z

Overall, I like this approach. As discussed, the main advantage is the declarative model compared to what we have today.

That said, I have two suggestions:

For handling secrets and credentials declaratively, I do not think we should depend directly on plain environment variables. Instead, NIC should create Kubernetes Secrets for those values. The declarative config can still reference env vars, but they should be injected into the pod from Kubernetes Secrets so the sidecar container can perform its work correctly.
I think the backup and restore discussion should be separated from this proposal. The current document does not directly depend on those decisions, and backup/restore is honestly a different can of worms.

viniciusdc · 2026-04-10T12:56:27Z

Hey @dcmcand, I was looking over this design proposal again, and one question popped into my mind: should the realm configuration really live in a new keycloak section in NIC's config.yaml?

Keycloak realm config (roles, identity providers, auth flows, users) feels like application-level policy rather than infrastructure. Pulling it into config.yaml means NIC has to own the deep-merge logic, ${env:*} scanning, Secret generation, and a schema that tracks keycloak-config-cli's input format — all for something that, IMO, sits outside NIC's natural domain (cloud resources, clusters, platform service deployment).

An alternative: NIC deploys the Keycloak Argo app with sensible defaults baked into the templates (same defaults the current shell script creates), and users customize by patching the gitops repo directly (Kustomize overlays, Helm values, or editing the ConfigMap). This would:

Keep NIC scoped to infrastructure — no new structs, no merge logic, no version coupling with keycloak-config-cli's input format
Let users configure any keycloak-config-cli feature without waiting for NIC to expose it
Let secrets be handled by whatever the cluster already uses (Sealed Secrets, External Secrets Operator, SOPS) instead of a NIC-specific env var scanning mechanism
Follow the same pattern as other Argo-managed workloads in the platform

The trade-off is losing the "single config.yaml controls everything" convenience, but users who need to customize Keycloak are likely already comfortable working with the gitops repo.

What do you think? Was there a specific reason for routing this through config.yaml rather than keeping it at the Argo app template level?

viniciusdc

Hey @dcmcand, did a deep read of the proposal alongside the current code (pkg/argocd/foundational.go, realm-setup-job.yaml, pkg/argocd/writer.go) and also went digging into how kcc actually works under the hood. Leaving a batch of inline comments — treating these as discussion-starters since the ADR is WIP, not a "fix this" pass.

The two big ones are Comment 1 (operator coexistence, kcc has a simpler mechanism than the prefix model described in the doc, we can lean on import.remote-state.enabled=true instead) and Comment 4 (the proposed NIC-side deep-merge engine probably isn't needed if we use kcc's remote-state for reconciliation). Most other comments build on those.

Several findings end up simplifying the design rather than asking for more spec — that's the spirit here. Open to being wrong on any of these.

viniciusdc · 2026-05-08T19:14:20Z

+
+### Operator Coexistence
+
+keycloak-config-cli supports "managed resources" with a configurable prefix. Resources created by this system are tagged with a `nebari:` prefix. The operator's dynamically created OAuth2 clients are untagged and left untouched. This is the boundary:


I went digging through kcc's docs and source to figure out which env vars back this prefix mechanism, and I think we can lean on something simpler. kcc doesn't actually have a name-prefix policy for managed resources: I checked application.properties, MANAGED.md, and the issue tracker. The closest equivalents are type-level: omit a resource type from the input entirely, or set IMPORT_MANAGED_<type>=no-delete.

There's a cleaner pattern that IMO gives us better operator-coexistence guarantees with less coupling:

Don't include clients: in the input config at all. Per adorsys/keycloak-config-cli#1306, kcc treats "key missing" as "leave this resource type alone" (different from clients: [] which means "delete all"). Operator owns clients; NIC owns realm-level config.

IMPORT_REMOTESTATE_ENABLED=true (default in 6.x): kcc tracks resources it itself created in a Keycloak realm attribute. Anything the operator creates at runtime is invisible to kcc's delete logic, even if clients: were in the input. Stored as de.adorsys.keycloak.config.state-* inside Keycloak's DB, no extra PVC.

IMPORT_MANAGED_CLIENT=no-delete: belt-and-braces. Even if a future change accidentally listed clients in the input, kcc still wouldn't delete unlisted ones.

Three independent layers protecting operator-created clients, and we don't have to retrofit a nebari: prefix onto existing clients (which would be a migration concern in itself, since today's operator-created clients don't have one).

Worth rewriting this section around those primitives? Happy to draft something.

You were right to dig into this - the prefix mechanism doesn't exist in kcc, and your three-layer alternative is meaningfully better. Doc rewrite (just pushed) leans on exactly what you proposed: omit clients: from the input entirely, rely on import.remote-state.enabled=true (the v6.x default), set import.managed.client=no-delete for belt-and-braces. That also kills the migration concern about retrofitting prefixes onto existing operator-created clients.

Verified against kcc v6.5.0's application.properties and MANAGED.md before adopting. The doc cites both in the rewrite.

viniciusdc · 2026-05-08T19:14:20Z

+
+### Secrets Handling
+
+Secrets never go in `config.yaml`. The config references environment variable names using keycloak-config-cli's `${env:VAR}` substitution syntax. These references are written literally into the ConfigMap - no substitution happens in NIC.


Two things to flag here.

Syntax: kcc 6.x uses $(env:VAR) with parens, not ${env:VAR} with braces: the prefix changed at v4.0 to avoid colliding with Keycloak's own builtin variables like ${role_uma_authorization}. Default-value form $(env:VAR:fallback) works too. Worth fixing in the schema example below so users don't copy-paste a syntax that kcc won't substitute.

Substitution is also off by default (import.var-substitution.enabled=false), so we'll need IMPORT_VARSUBSTITUTION_ENABLED=true explicitly in the Job manifest. With IMPORT_VARSUBSTITUTION_UNDEFINEDISERROR=true (default) we get fail-fast on missing env vars — which is the behavior described here.

Bigger thing: secret lifecycle. Reading this alongside pkg/argocd/foundational.go:222-292, I think we want to be explicit about which pipeline owns which secrets. Today NIC creates four Secrets directly (keycloak-admin-credentials, nebari-realm-admin-credentials, etc.) with no-op-if-exists semantics that let operators rotate in-cluster without nic deploy overwriting them.

I think both pipelines should coexist, scoped clearly:

NIC-generated secrets (admin password, realm admin, DB) → keep current pipeline. Not user-supplied; rotation is by re-generation.

User-supplied via $(env:VAR) (IdP client secrets, custom credentials) → new pipeline, separate Secret name, overwrite-on-deploy (since .env is the source of truth for these).

Could the doc spec the new Secret's name/namespace and confirm nebari-realm-admin-credentials keeps its current handling? Reframing my earlier thread comment

Both points addressed in the rewrite. Every ${env:VAR} is now $(env:VAR) and the doc calls out that import.var-substitution.enabled=true has to be set explicitly on the Job (kcc's default is off).

For the secret pipeline: split as you suggested into NIC-generated (admin, DB, realm-admin - existing pipeline at foundational.go:222-292 unchanged with its no-op-if-exists semantics) and user-supplied (new Secret keycloak-config-user-secrets, overwrite-on-deploy because .env is the source of truth). Both Secrets mount to the kcc Job. nebari-realm-admin-credentials keeps its current handling exactly.

viniciusdc · 2026-05-08T19:14:20Z

+          enabled: true
+          config:
+            clientId: "${env:GITHUB_CLIENT_ID}"
+            clientSecret: "${env:GITHUB_CLIENT_SECRET}"


Quick follow-up to the syntax note above: once we settle on $(env:...), this example block needs the parens too:

clientSecret: "$(env:GITHUB_CLIENT_SECRET)" # ... value: "$(env:REALM_ADMIN_PASSWORD)"

Default-value form is also useful for IdP configs where missing env vars should fall back to disabled-by-default rather than failing the deploy:

clientId: "$(env:GITHUB_CLIENT_ID:disabled)"

Optional: depends on whether we want missing IdP secrets to fail loudly or be silently disabled. IMO fail-loud for the realm admin password and other load-bearing values, fall-back-to-disabled for optional IdPs. Worth a paragraph on the philosophy.

Fixed throughout the examples. On the fail-loud vs fall-back-to-disabled question: the rewrite takes a stronger stance than just philosophy. NIC's pre-deploy scanner will block deploys where a sensitive field (identityProviders[].config.clientSecret, etc.) references an env var that resolves to empty or unset. The reasoning: a silent fallback for IdP secrets means GitHub login silently breaks on next deploy, which is a much worse failure mode than "deploy errored, fix your env." For non-secret fields where a missing-equals-disabled flow makes sense, kcc's own substitution handles it.

viniciusdc · 2026-05-08T19:14:20Z

+
+- **Scalars**: config.yaml value replaces default
+- **Lists** (users, roles, identityProviders): merged by key field (`username`, `name`, `alias`). Matching key overrides that entry, new keys are appended
+- **Removal**: set `enabled: false` on a default entry


I think this section can simplify a lot once we lean on kcc's remote-state behavior. Reading it now, NIC takes on:

A deep-merge engine over a deeply nested upstream schema (Keycloak realm export)

List-merge-by-key rules that we'd need to spec for ~10 list-shaped fields beyond the three named here: clientScopes, authenticationFlows[].authenticationExecutions[], recursive groups, requiredActions, etc.

A "removal" semantic via enabled: false (which in Keycloak actually means disable, not delete: the alias/username still occupies its namespace)

If we set IMPORT_REMOTESTATE_ENABLED=true (default) and let kcc do the reconciliation, NIC's job shrinks to:

Render baked-in defaults (current realm-setup-job's resources) as a base config

Render the user's keycloak: section over the base: shallow merge by top-level key, kcc handles the deep diff downstream

Write the result as a ConfigMap

What this gives us:

First run: kcc creates everything declared.

Subsequent runs: kcc only touches resources it itself created (tracked in de.adorsys.keycloak.config.state-* realm attribute). Direct-in-Keycloak admin edits to unmanaged resources are preserved: which IMO matches the intent of "don't wipe data important to the user state of the cluster."

User-facing: users only declare changes, not the whole realm. A user adding a GitHub IdP just declares the IdP; defaults stay defaulted.

The "removal via enabled: false" thing goes away entirely (kcc manages deletions through its own state), and we don't have to spec merge keys for every nested list shape.

The trade-off: kcc reconciles back to input on next run for resources it does manage. That's by design and matches what users expect from a declarative tool, but worth being explicit in the doc — "edits to NIC-managed resources via Keycloak UI will be reverted on next sync; edits to admin-created resources outside the input are preserved."

Is there a reason we wanted NIC to own merge logic that I'm missing? My read is it adds surface without buying anything kcc doesn't already give us.

This was the highest-leverage finding in the review. The doc now has no NIC-side merging at all. keycloak.yaml is a separate file in kcc's native input format, written through to a cluster Secret verbatim. kcc's remote-state model does the reconciliation. The "removal via enabled: false" semantic is gone (it was also wrong in Keycloak's actual model: enabled: false disables, doesn't delete).

Also added the explicit caveat you suggested: "edits to NIC-managed resources via Keycloak UI will revert on next sync; edits to admin-created resources outside the input are preserved."

viniciusdc · 2026-05-08T19:14:20Z

+
+      # Authentication flows, client scopes, etc.
+      authenticationFlows: []
+      clientScopes: []


Reading this as illustrative rather than the actual default set, but flagging for the implementation pass: pkg/argocd/templates/manifests/keycloak/realm-setup-job.yaml:83-108 creates a groups client scope with an oidc-group-membership-mapper and registers it as a realm default-default. That scope is what puts groups claims into tokens, which is what makes NebariApplication.allowedGroups work in the operator.

Two safety nets are already in our favor:

kcc's IMPORT_MANAGED_CLIENT_SCOPE=no-delete default (set this way precisely because Keycloak auto-creates default scopes that kcc can't reliably distinguish from user-created ones)

IMPORT_REMOTESTATE_ENABLED=true (kcc only deletes scopes it itself created)

So "groups scope silently dropped from new defaults" is probably safe-by-omission, but only by accident. Worth walking the current shell script line-by-line during implementation and confirming each resource is represented in the new defaults, and adding an integration test that asserts post-migration token claims match today's setup. Catches it explicitly rather than relying on safety nets.

Important catch, and worse than the comment suggests. When I checked the realm-setup-job on this branch before the rewrite, the groups scope had been silently dropped (file was 83 lines vs main's 110, no oidc-group-membership-mapper). The recent merge from main brought back the file, so this branch is probably fine again, but it was a real regression for a moment.

The rewrite makes the groups client scope an explicit baked-in default and adds a migration integration test that asserts a token issued for a user in a group contains the groups claim. That contract is what NebariApplication.allowedGroups depends on, so a regression here would silently break operator-driven auth across the platform.

viniciusdc · 2026-05-08T19:14:20Z

+
+## Phasing
+
+**Phase 1**: Config merge + Job replacement - replace the shell script with keycloak-config-cli, add `keycloak` section to config.yaml, implement merge logic


Wanted to float a finer-grained split for Phase 1 to keep future doors open:

1a replace the shell script with kcc, using only baked-in defaults rendered into a static ConfigMap. No user keycloak: section yet, no $(env:*) scanning, no merge logic. Answers the narrow question: "does kcc replace the shell script cleanly, including the groups scope and operator-coexistence?"

1b add user overrides via the keycloak: section. With remote-state doing the heavy lifting (Comment 4), this is a much smaller surface than originally proposed.

1c add $(env:*) scanning and the user-secrets pipeline.

Each step is independently testable, and 1a alone is forward-compatible with both the current config.yaml direction and a hypothetical "skip 1b, do gitops overlays instead" alternative. Bundling all three commits to the user-facing schema in a way that's harder to walk back if we change our minds.

Practical case for 1a-first: it isolates the "kcc replaces kcadm" risk from the "is our schema right?" question. If 1a uncovers something weird about kcc managed-resource policies (which I don't think it will, given how well-documented the remote-state model is, but still), we discover that before any user-visible schema work.

Doesn't have to be three PRs: could be three commits in one PR with clear boundaries. The point is the order of validation, not the literal git topology.

Adopted as written. Phase 1a is kcc replaces kcadm with baked-in defaults only, 1b adds the user keycloak.yaml parsing path, 1c adds env substitution and the user-secrets Secret. Phase 1a will resolve the kcc remote-state behavior question in isolation before any user-facing schema work lands.

viniciusdc · 2026-05-08T19:14:20Z

+            clientSecret: "${env:GITHUB_CLIENT_SECRET}"
+```
+
+## Backup & Restore


Restating my earlier comment in my thread now that the rest of the design is more concrete: I think this whole section should be a follow-up doc.

The reasons get clearer the more I read:

Different code paths (CLI commands vs render-time pipeline)

Different operational concerns: nic backup keycloak needs cluster access to read keycloak-admin-credentials, then port-forward or exec to reach the Keycloak service. That's a new pattern not used elsewhere in NIC today.

Doesn't load-bear for Phase 1

The "restore is full replace, not selective merge" semantic conflicts with the rest of the proposal (which, after Comment 4's simplification, is more like "additive overlays on defaults"). After a restore, the user's config.yaml no longer reflects reality.

Splitting this out also lets us consider backup-via-Velero-PVC-snapshot vs. backup-via-realm-export-API as separate options, without entangling them with the realm-config pipeline.

Moved to a follow-up doc. The reasoning you gave (different code path, different operational pattern, the "full replace" semantic conflicts with the remote-state model) all stand. The rewrite has a short stub explaining the split.

Worth noting: with the file-based model, the manual backup story becomes "kc.sh export and sanitize the result" rather than a custom CLI. The Phase 2 work is mostly about automating that sanitization, not building a parallel snapshot system.

viniciusdc · 2026-05-08T19:14:20Z

+
+## Open Questions
+
+1. **keycloak-config-cli version pinning strategy** - keycloak-config-cli has strict version coupling with Keycloak (6.x targets Keycloak 25.x/26.x). How do we want to manage this? Pin in the Job manifest and document the compatibility mapping?


For the immediate question (current Keycloak 24.0): kcc v6.5.0 is the current stable, and 24.0.5 is in their CI matrix at .github/workflows/ci.yaml. The verified-compatible image is quay.io/adorsys/keycloak-config-cli:6.5.0-24.0.5. v6.5.0 also landed several idempotency fixes worth having (release notes).

For the longer-term version pinning question, I think there's a deeper one underneath — user-facing schema compatibility, not just runtime. If our keycloak: section mirrors kcc's input which mirrors Keycloak's realm export, a Keycloak major-version upgrade can rename/add/remove fields, and existing user config.yaml files break.

Three stances we could take:

A — Pin Keycloak aggressively. Treat upgrades as opt-in events with migration tooling.

B — NIC-owned subset schema. Translate to upstream at render time (the "no translation layer" claim in the doc would have to go).

C — Accept upstream as-is. Document that user configs may break across Keycloak upgrades.

Currently the doc implicitly picks C. Worth being explicit. (For "don't lock future paths" — A and C are reversible, B locks NIC into maintaining a translator forever.)

Took stance C explicitly. The rewrite says user keycloak.yaml files may need updates across Keycloak major-version upgrades and that we'll document the path in release notes for each Keycloak bump. C is consistent with the rest of the doc's "no NIC-side schema, pass-through to kcc" philosophy, and unlike B it doesn't lock NIC into a translator forever.

Pinning the verified-compatible image (quay.io/adorsys/keycloak-config-cli:6.5.0-24.0.5 for our current Keycloak 24.0) is now an explicit dependency in the doc.

viniciusdc · 2026-05-08T19:14:20Z

+
+1. **keycloak-config-cli version pinning strategy** - keycloak-config-cli has strict version coupling with Keycloak (6.x targets Keycloak 25.x/26.x). How do we want to manage this? Pin in the Job manifest and document the compatibility mapping?
+
+2. **Config validation** - Should `nic validate` check the merged Keycloak config against a schema before deploy, or is runtime failure from the Job acceptable initially?


With Comment 4's simplification (kcc owns reconciliation, NIC does shallow render-time merge), validation gets cheaper to spec. Phase 1a needs no validation because there's no user input. Phase 1b just needs:

Top-level shape (the keycloak: block parses as expected)

$(env:*) references resolve (we already have the env, just check before deploy)

Optional: basic field-name validation against a JSON schema generated from kcc's input format

I'd lean toward shipping at least the env-var resolution check at 1b time — failure mode "Job fails in ArgoCD with cryptic logs" is a much worse UX than "nic validate errors with missing env var: GITHUB_CLIENT_SECRET."

Punting "to later" usually means it doesn't land. If we agree on a minimal scope now we won't have to revisit.

Promoted from open question to Phase 1b deliverable. nic validate will run the literal-secret scanner, reject unresolved env refs in sensitive fields, and verify schema shape. Same logic runs in CI on PRs against keycloak.yaml. Agree with the framing: the "cryptic PostSync failure hours after merge" UX is the worst possible outcome and cheap to prevent.

viniciusdc · 2026-05-08T19:14:20Z

+
+**Phase 2**: Backup & restore - `nic backup keycloak`, `nic restore keycloak`, optional CronJob
+
+Phase 2 depends on Phase 1 being deployed and validated.


Suggesting an "Alternatives Considered" section. The doc reads as if config.yaml ownership is the only path, which may end up being right, but worth capturing explicitly because the trade-offs aren't obvious from the body of the proposal.

The main alternative to name: NIC ships a default kcc input as a static ConfigMap, users customize via the gitops repo directly (Kustomize overlays, raw ConfigMap edits, Helm values). NIC stays scoped to infrastructure; the keycloak-config-cli schema lives in the gitops repo, where it's already user-editable.

What we'd gain: NIC doesn't take ownership of a deeply nested, version-coupled application-level schema. The "schema coupling" question (Open Q #1) becomes less urgent because users own their overlay format. Validation surface area shrinks.

The trade-off is losing a single pane of glass via config.yaml. Users who want to customize Keycloak now have to learn Kustomize/gitops patterns, not just edit one YAML file.

I don't think we have to pick now, but capturing the alternative in the doc means the trade-off is on record, and Phase 1a (Comment 6) is forward-compatible with either outcome. We can defer the bigger decision until we have more user feedback on which path they prefer.

Added the section. Captured three alternatives: (A) the original embedded-in-config.yaml deep-merge design, (B) your gitops-overlay proposal, and (C) a Sealed-Secret-based variant for higher-trust environments. Each has rationale for rejection.

On B specifically: I think you were pointing at the right concern (don't couple NIC to a deeply nested upstream schema), but the answer that fell out was a separate keycloak.yaml file alongside config.yaml, applied via cluster Secrets rather than the gitops repo. NIC stays scoped to "read file, validate, plumb into cluster Secret" with no schema knowledge. The gitops repo never sees realm content. So we ended up taking the spirit of B without losing reproducibility-from-nic deploy-alone.

Major restructure addressing review feedback on PR #154: - Move from "ConfigMap in gitops repo" to "in-cluster Secrets, gitops repo holds only the Job manifest" so secrets never land in the gitops repo even as placeholders. - Replace the embedded keycloak: config.yaml section with a separate keycloak.yaml file in kcc's native input format, pass-through with no NIC-side merging. - Drop the (non-existent) managed-resource-prefix mechanism in favor of kcc's three-layer model: omit clients:, remote-state.enabled=true, managed.client=no-delete. - Fix kcc 6.x variable substitution syntax: $(env:VAR) not ${env:VAR}, and require import.var-substitution.enabled=true. - Spec the literal-secret scanner with field list, --allow-literal-secrets and --allow-credential-hashes escape hatches. - Add the groups client scope explicitly to the baked-in defaults plus migration test contract. - Move backup/restore to a follow-up doc. - Split Phase 1 into 1a/1b/1c (kcc replaces kcadm, then user file, then env substitution). - Add Alternatives Considered, CI Integration, Blue/Green sections. - Take stance C (accept upstream as-is) on Keycloak version coupling.

dcmcand · 2026-05-12T12:34:51Z

Pushed a rewrite (669458a) addressing the full review. Most of your structural calls landed in the doc as written: three-layer operator-coexistence model, drop the deep-merge engine, separate backup/restore, phase 1a/1b/1c split, explicit alternatives, stance C on version coupling. A couple of places (separate keycloak.yaml file rather than embedded section, in-cluster Secrets rather than ConfigMap with placeholders) the architecture went further in your direction than the original review suggested - context for those is in the individual thread replies. Verified the kcc claims against v6.5.0's application.properties and MANAGED.md before adopting.

dcmcand · 2026-05-12T12:35:45Z

@viniciusdc - replying to your March 23 thread comment: both addressed in the rewrite (669458a). Secrets now go through Kubernetes Secrets exclusively (no env vars from .env ever reach the gitops repo, and keycloak-config-user-secrets is populated from local env at deploy time). Backup/restore is split into a separate follow-up doc.

dcmcand · 2026-05-12T12:35:49Z

@viniciusdc - replying to your April 10 thread comment: real question, and it pushed the design in a better direction. The rewrite acknowledges that config.yaml was the wrong home for the realm definition: realm configs are too big, they leak into the wrong layer of abstraction, and they couple NIC to kcc's input format.

The compromise we landed on: keycloak.yaml is a separate file (kcc native format, no NIC-side schema). It lives alongside config.yaml in the infra repo. config.yaml just references it via keycloak.config_file:. NIC stays scoped to "read, validate, plumb into cluster Secret" - no merging, no schema knowledge.

Why not your full proposal (defaults shipped to gitops repo, customize via overlays): the source of truth ends up split between the infra repo and the gitops repo. We've decided to treat the gitops repo as untrusted for realm content at all (encrypted or otherwise), which makes "users customize via gitops" architecturally awkward. The file-based approach gives you the same "NIC stays scoped" property without that split.

…write Aligns Phase 1a with the agreed design in PR #154's rewrite (commit 669458a on design/declarative-keycloak-config). Previous structure put the kcc realm input in a ConfigMap rendered into the gitops repo. The rewrite's "Why no ConfigMap in the gitops repo" section explicitly rejects that pattern: - The gitops repo is treated as untrusted. Placeholder-only contracts ("file in repo only contains $(env:...), never literals") create a footgun where one operator mistake leaks a secret into git history. - Realm content beyond raw secrets is still sensitive (IdP names, redirect URI patterns, credential hashes from kc.sh export). New structure: - pkg/argocd/keycloak_defaults.yaml (new) — kcc realm input in native format, embedded into the NIC binary via //go:embed. Same content as the previous gitops ConfigMap data field, no ConfigMap wrapping. - pkg/argocd/keycloak_defaults.go (new) — RenderKeycloakDefaults() helper renders the template with TemplateData (Domain substitutes into the argocd redirect URI). - pkg/argocd/foundational.go — InstallFoundationalServices now renders the embedded defaults with cfg.Domain and writes the result to a new in-cluster Secret keycloak-config-import (keycloak namespace). createKeycloakImportSecret added as a sibling of the existing Secret creators; uses createSecret's no-op-if-exists semantics so re-deploys preserve any value already present in the cluster. - templates/manifests/keycloak/realm-setup-job.yaml — volume changed from a ConfigMap reference (keycloak-realm-config, deleted) to a Secret reference (keycloak-config-import). Everything else stays: same kcc image, same env-var sourcing, same hook annotations. - templates/manifests/keycloak/realm-config-cm.yaml (deleted) — the gitops repo no longer carries realm content. Only the Job manifest remains there. Phase 1a scope is unchanged. Phase 1b will introduce a user-supplied keycloak.yaml file path that overrides these embedded defaults; the Secret-vs-ConfigMap structural decision now makes that path direct. Test: - keycloak_defaults_test.go (new) asserts the load-bearing fields of the rendered kcc input — realm settings, scope lists, groups, argocd client, the oidc-group-membership-mapper, the full defaultDefaultClientScopes list, and that no unresolved Go-template markers leak into the output. - writer_test.go TestKeycloakRealmConfigCM removed (it tested the deleted ConfigMap template).

dcmcand · 2026-06-09T12:12:08Z

closed as accepted

Phase 1a of the declarative Keycloak configuration design proposed in #154. Replaces the imperative shell script in realm-setup-job.yaml with a declarative YAML applied by keycloak-config-cli (kcc) at PostSync. Changes: - Add `realm-config-cm.yaml`: ConfigMap carrying the kcc-format YAML for the nebari realm. Parity with the previous shell setup, including the `groups` client scope + group-membership mapper (load-bearing for operator-managed app authorization), the `argocd` OIDC client (PR #234), the `argocd-admins`/`argocd-viewers` groups, and the admin user's `argocd-admins` membership. - Rewrite `realm-setup-job.yaml`: pulls `quay.io/adorsys/keycloak-config-cli:6.5.0-26.5.5`, mounts the ConfigMap, and applies it against Keycloak. Secrets are mounted from the existing `nebari-realm-admin-credentials` and `argocd-oidc-client-secret` Secrets and substituted into the realm YAML by kcc at apply time via `$(env:VAR)` placeholders. Operator coexistence: - `IMPORT_REMOTESTATE_ENABLED=true` (kcc default since v6.x): kcc tracks resources it created in a Keycloak realm attribute. Re-runs only reconcile what kcc itself owns; operator-managed OAuth2 clients created at runtime via NebariApplication CRDs are invisible to kcc's delete logic. - `IMPORT_MANAGED_CLIENT=no-delete`: belt-and-braces. Even if a future input change accidentally removes the argocd client from `clients:`, kcc will not delete unlisted clients. The strict "omit `clients:` entirely" pattern from my #154 review was too restrictive: NIC owns the `argocd` client at bootstrap, so it has to be in the input. The two-layer protection above is the practical shape of the coexistence story. Phase 1a scope (per #154 review): only replaces the existing realm setup. No `keycloak:` section in `config.yaml`, no NIC-side env-var scanning, no merge engine. Defaults are static. Future phases add the user-facing schema (1b) and the user-secrets pipeline (1c). Migration: on existing deployments, kcc detects existing resources by their natural keys (realm name, role name, username, scope name) and updates them in place, adding them to remote state. First run on an already-configured cluster is a no-op for existing managed values; the pre-generated argocd OIDC client secret (same value on both sides) is re-applied verbatim. Refs #154.

…write Aligns Phase 1a with the agreed design in PR #154's rewrite (commit 669458a on design/declarative-keycloak-config). Previous structure put the kcc realm input in a ConfigMap rendered into the gitops repo. The rewrite's "Why no ConfigMap in the gitops repo" section explicitly rejects that pattern: - The gitops repo is treated as untrusted. Placeholder-only contracts ("file in repo only contains $(env:...), never literals") create a footgun where one operator mistake leaks a secret into git history. - Realm content beyond raw secrets is still sensitive (IdP names, redirect URI patterns, credential hashes from kc.sh export). New structure: - pkg/argocd/keycloak_defaults.yaml (new) — kcc realm input in native format, embedded into the NIC binary via //go:embed. Same content as the previous gitops ConfigMap data field, no ConfigMap wrapping. - pkg/argocd/keycloak_defaults.go (new) — RenderKeycloakDefaults() helper renders the template with TemplateData (Domain substitutes into the argocd redirect URI). - pkg/argocd/foundational.go — InstallFoundationalServices now renders the embedded defaults with cfg.Domain and writes the result to a new in-cluster Secret keycloak-config-import (keycloak namespace). createKeycloakImportSecret added as a sibling of the existing Secret creators; uses createSecret's no-op-if-exists semantics so re-deploys preserve any value already present in the cluster. - templates/manifests/keycloak/realm-setup-job.yaml — volume changed from a ConfigMap reference (keycloak-realm-config, deleted) to a Secret reference (keycloak-config-import). Everything else stays: same kcc image, same env-var sourcing, same hook annotations. - templates/manifests/keycloak/realm-config-cm.yaml (deleted) — the gitops repo no longer carries realm content. Only the Job manifest remains there. Phase 1a scope is unchanged. Phase 1b will introduce a user-supplied keycloak.yaml file path that overrides these embedded defaults; the Secret-vs-ConfigMap structural decision now makes that path direct. Test: - keycloak_defaults_test.go (new) asserts the load-bearing fields of the rendered kcc input — realm settings, scope lists, groups, argocd client, the oidc-group-membership-mapper, the full defaultDefaultClientScopes list, and that no unresolved Go-template markers leak into the output. - writer_test.go TestKeycloakRealmConfigCM removed (it tested the deleted ConfigMap template).

- Extract labelPartOf/labelManagedBy/labelManagedByValue constants so the import-secret labels stop tripping goconst (4th occurrence of the label keys). Fixes the lint failure blocking CI. - Instrument RenderKeycloakDefaults with an OTel span and thread ctx through from InstallFoundationalServices, per the pkg/ instrumentation convention. - Render with missingkey=error so a stray template field fails loudly instead of emitting "<no value>" into the realm config; add a guard assertion. - Expand TestRenderKeycloakDefaults to cover the security-relevant scalars (sslRequired, registrationAllowed, bruteForceProtected), the realm roles, the multivalued groups-mapper flag, the argocd client flags, and webOrigins. - Drop the dead doc path from keycloak_defaults.yaml; point at #154.

Add design doc: declarative Keycloak configuration

1b555fa

Proposes replacing the imperative realm-setup-job shell script with keycloak-config-cli for declarative Keycloak configuration. Covers config schema, defaults/merge strategy, backup/restore, and operator coexistence.

dcmcand added the needs: discussion 💬 Needs discussion with the rest of the team label Mar 17, 2026

Merge branch 'main' into design/declarative-keycloak-config

c4ba73a

dcmcand marked this pull request as ready for review March 30, 2026 14:50

dcmcand requested review from Adam-D-Lewis, aktech, dharhas, marcelovilla, tylerpotts and viniciusdc as code owners March 30, 2026 14:50

Merge branch 'main' into design/declarative-keycloak-config

d6106e0

viniciusdc mentioned this pull request May 8, 2026

Optional wait-for-realm (or wait-for-condition) on the platform profile nebari-dev/action-nebari-sandbox#27

Open

viniciusdc reviewed May 8, 2026

View reviewed changes

viniciusdc mentioned this pull request May 12, 2026

feat(keycloak): replace Keycloak realm setup with keycloak-config-cli #289

Draft

dcmcand added 2 commits May 12, 2026 14:21

Merge remote-tracking branch 'origin/main' into pr-154-review

28ee13d

dcmcand requested a review from viniciusdc May 12, 2026 13:03

dcmcand closed this Jun 9, 2026


		### Operator Coexistence

		keycloak-config-cli supports "managed resources" with a configurable prefix. Resources created by this system are tagged with a `nebari:` prefix. The operator's dynamically created OAuth2 clients are untagged and left untouched. This is the boundary:


		### Secrets Handling

		Secrets never go in `config.yaml`. The config references environment variable names using keycloak-config-cli's `${env:VAR}` substitution syntax. These references are written literally into the ConfigMap - no substitution happens in NIC.


		## Phasing

		Phase 1: Config merge + Job replacement - replace the shell script with keycloak-config-cli, add `keycloak` section to config.yaml, implement merge logic


		## Open Questions

		1. keycloak-config-cli version pinning strategy - keycloak-config-cli has strict version coupling with Keycloak (6.x targets Keycloak 25.x/26.x). How do we want to manage this? Pin in the Job manifest and document the compatibility mapping?


		1. keycloak-config-cli version pinning strategy - keycloak-config-cli has strict version coupling with Keycloak (6.x targets Keycloak 25.x/26.x). How do we want to manage this? Pin in the Job manifest and document the compatibility mapping?

		2. Config validation - Should `nic validate` check the merged Keycloak config against a schema before deploy, or is runtime failure from the Job acceptable initially?


		Phase 2: Backup & restore - `nic backup keycloak`, `nic restore keycloak`, optional CronJob

		Phase 2 depends on Phase 1 being deployed and validated.

Uh oh!

Conversation

dcmcand commented Mar 12, 2026

Summary

Uh oh!

viniciusdc commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

viniciusdc commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

viniciusdc left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dcmcand May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dcmcand commented May 12, 2026

Uh oh!

dcmcand commented May 12, 2026

Uh oh!

dcmcand commented May 12, 2026

Uh oh!

dcmcand commented Jun 9, 2026

viniciusdc commented Mar 23, 2026 •

edited

Loading

viniciusdc commented Apr 10, 2026 •

edited

Loading

viniciusdc left a comment •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

dcmcand May 12, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading

viniciusdc May 8, 2026 •

edited

Loading