Skip to content

Design: Declarative Keycloak configuration#154

Closed
dcmcand wants to merge 5 commits into
mainfrom
design/declarative-keycloak-config
Closed

Design: Declarative Keycloak configuration#154
dcmcand wants to merge 5 commits into
mainfrom
design/declarative-keycloak-config

Conversation

@dcmcand

@dcmcand dcmcand commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Design doc proposing a declarative Keycloak configuration system to replace the current imperative realm-setup-job.yaml shell script.

  • Uses keycloak-config-cli to apply YAML config via Keycloak Admin API
  • Sensible defaults (nebari realm, roles, admin user) overridable via config.yaml
  • Backup/restore round-trip: export current state, restore to fresh instance
  • Safe coexistence with nebari-operator via managed resource prefixes

See docs/superpowers/specs/2026-03-12-declarative-keycloak-config-design.md for the full design.

Looking for feedback on the approach, config schema, and open questions at the bottom of the doc.

Proposes replacing the imperative realm-setup-job shell script with
keycloak-config-cli for declarative Keycloak configuration. Covers
config schema, defaults/merge strategy, backup/restore, and operator
coexistence.
@dcmcand dcmcand added the needs: discussion 💬 Needs discussion with the rest of the team label Mar 17, 2026
@viniciusdc

viniciusdc commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Overall, I like this approach. As discussed, the main advantage is the declarative model compared to what we have today.

That said, I have two suggestions:

  • For handling secrets and credentials declaratively, I do not think we should depend directly on plain environment variables. Instead, NIC should create Kubernetes Secrets for those values. The declarative config can still reference env vars, but they should be injected into the pod from Kubernetes Secrets so the sidecar container can perform its work correctly.

  • I think the backup and restore discussion should be separated from this proposal. The current document does not directly depend on those decisions, and backup/restore is honestly a different can of worms.

@dcmcand dcmcand marked this pull request as ready for review March 30, 2026 14:50
@viniciusdc

viniciusdc commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Hey @dcmcand, I was looking over this design proposal again, and one question popped into my mind: should the realm configuration really live in a new keycloak section in NIC's config.yaml?

Keycloak realm config (roles, identity providers, auth flows, users) feels like application-level policy rather than infrastructure. Pulling it into config.yaml means NIC has to own the deep-merge logic, ${env:*} scanning, Secret generation, and a schema that tracks keycloak-config-cli's input format — all for something that, IMO, sits outside NIC's natural domain (cloud resources, clusters, platform service deployment).

An alternative: NIC deploys the Keycloak Argo app with sensible defaults baked into the templates (same defaults the current shell script creates), and users customize by patching the gitops repo directly (Kustomize overlays, Helm values, or editing the ConfigMap). This would:

  • Keep NIC scoped to infrastructure — no new structs, no merge logic, no version coupling with keycloak-config-cli's input format
  • Let users configure any keycloak-config-cli feature without waiting for NIC to expose it
  • Let secrets be handled by whatever the cluster already uses (Sealed Secrets, External Secrets Operator, SOPS) instead of a NIC-specific env var scanning mechanism
  • Follow the same pattern as other Argo-managed workloads in the platform

The trade-off is losing the "single config.yaml controls everything" convenience, but users who need to customize Keycloak are likely already comfortable working with the gitops repo.

What do you think? Was there a specific reason for routing this through config.yaml rather than keeping it at the Argo app template level?

@viniciusdc viniciusdc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dcmcand, did a deep read of the proposal alongside the current code (pkg/argocd/foundational.go, realm-setup-job.yaml, pkg/argocd/writer.go) and also went digging into how kcc actually works under the hood. Leaving a batch of inline comments — treating these as discussion-starters since the ADR is WIP, not a "fix this" pass.

The two big ones are Comment 1 (operator coexistence, kcc has a simpler mechanism than the prefix model described in the doc, we can lean on import.remote-state.enabled=true instead) and Comment 4 (the proposed NIC-side deep-merge engine probably isn't needed if we use kcc's remote-state for reconciliation). Most other comments build on those.

Several findings end up simplifying the design rather than asking for more spec — that's the spirit here. Open to being wrong on any of these.


### Operator Coexistence

keycloak-config-cli supports "managed resources" with a configurable prefix. Resources created by this system are tagged with a `nebari:` prefix. The operator's dynamically created OAuth2 clients are untagged and left untouched. This is the boundary:

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went digging through kcc's docs and source to figure out which env vars back this prefix mechanism, and I think we can lean on something simpler. kcc doesn't actually have a name-prefix policy for managed resources: I checked application.properties, MANAGED.md, and the issue tracker. The closest equivalents are type-level: omit a resource type from the input entirely, or set IMPORT_MANAGED_<type>=no-delete.

There's a cleaner pattern that IMO gives us better operator-coexistence guarantees with less coupling:

  1. Don't include clients: in the input config at all. Per adorsys/keycloak-config-cli#1306, kcc treats "key missing" as "leave this resource type alone" (different from clients: [] which means "delete all"). Operator owns clients; NIC owns realm-level config.
  2. IMPORT_REMOTESTATE_ENABLED=true (default in 6.x): kcc tracks resources it itself created in a Keycloak realm attribute. Anything the operator creates at runtime is invisible to kcc's delete logic, even if clients: were in the input. Stored as de.adorsys.keycloak.config.state-* inside Keycloak's DB, no extra PVC.
  3. IMPORT_MANAGED_CLIENT=no-delete: belt-and-braces. Even if a future change accidentally listed clients in the input, kcc still wouldn't delete unlisted ones.

Three independent layers protecting operator-created clients, and we don't have to retrofit a nebari: prefix onto existing clients (which would be a migration concern in itself, since today's operator-created clients don't have one).

Worth rewriting this section around those primitives? Happy to draft something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right to dig into this - the prefix mechanism doesn't exist in kcc, and your three-layer alternative is meaningfully better. Doc rewrite (just pushed) leans on exactly what you proposed: omit clients: from the input entirely, rely on import.remote-state.enabled=true (the v6.x default), set import.managed.client=no-delete for belt-and-braces. That also kills the migration concern about retrofitting prefixes onto existing operator-created clients.

Verified against kcc v6.5.0's application.properties and MANAGED.md before adopting. The doc cites both in the rewrite.


### Secrets Handling

Secrets never go in `config.yaml`. The config references environment variable names using keycloak-config-cli's `${env:VAR}` substitution syntax. These references are written literally into the ConfigMap - no substitution happens in NIC.

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things to flag here.

Syntax: kcc 6.x uses $(env:VAR) with parens, not ${env:VAR} with braces: the prefix changed at v4.0 to avoid colliding with Keycloak's own builtin variables like ${role_uma_authorization}. Default-value form $(env:VAR:fallback) works too. Worth fixing in the schema example below so users don't copy-paste a syntax that kcc won't substitute.

Substitution is also off by default (import.var-substitution.enabled=false), so we'll need IMPORT_VARSUBSTITUTION_ENABLED=true explicitly in the Job manifest. With IMPORT_VARSUBSTITUTION_UNDEFINEDISERROR=true (default) we get fail-fast on missing env vars — which is the behavior described here.

Bigger thing: secret lifecycle. Reading this alongside pkg/argocd/foundational.go:222-292, I think we want to be explicit about which pipeline owns which secrets. Today NIC creates four Secrets directly (keycloak-admin-credentials, nebari-realm-admin-credentials, etc.) with no-op-if-exists semantics that let operators rotate in-cluster without nic deploy overwriting them.

I think both pipelines should coexist, scoped clearly:

  • NIC-generated secrets (admin password, realm admin, DB) → keep current pipeline. Not user-supplied; rotation is by re-generation.
  • User-supplied via $(env:VAR) (IdP client secrets, custom credentials) → new pipeline, separate Secret name, overwrite-on-deploy (since .env is the source of truth for these).

Could the doc spec the new Secret's name/namespace and confirm nebari-realm-admin-credentials keeps its current handling? Reframing my earlier thread comment

@dcmcand dcmcand May 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both points addressed in the rewrite. Every ${env:VAR} is now $(env:VAR) and the doc calls out that import.var-substitution.enabled=true has to be set explicitly on the Job (kcc's default is off).

For the secret pipeline: split as you suggested into NIC-generated (admin, DB, realm-admin - existing pipeline at foundational.go:222-292 unchanged with its no-op-if-exists semantics) and user-supplied (new Secret keycloak-config-user-secrets, overwrite-on-deploy because .env is the source of truth). Both Secrets mount to the kcc Job. nebari-realm-admin-credentials keeps its current handling exactly.

enabled: true
config:
clientId: "${env:GITHUB_CLIENT_ID}"
clientSecret: "${env:GITHUB_CLIENT_SECRET}"

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick follow-up to the syntax note above: once we settle on $(env:...), this example block needs the parens too:

clientSecret: "$(env:GITHUB_CLIENT_SECRET)"
# ...
value: "$(env:REALM_ADMIN_PASSWORD)"

Default-value form is also useful for IdP configs where missing env vars should fall back to disabled-by-default rather than failing the deploy:

clientId: "$(env:GITHUB_CLIENT_ID:disabled)"

Optional: depends on whether we want missing IdP secrets to fail loudly or be silently disabled. IMO fail-loud for the realm admin password and other load-bearing values, fall-back-to-disabled for optional IdPs. Worth a paragraph on the philosophy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed throughout the examples. On the fail-loud vs fall-back-to-disabled question: the rewrite takes a stronger stance than just philosophy. NIC's pre-deploy scanner will block deploys where a sensitive field (identityProviders[].config.clientSecret, etc.) references an env var that resolves to empty or unset. The reasoning: a silent fallback for IdP secrets means GitHub login silently breaks on next deploy, which is a much worse failure mode than "deploy errored, fix your env." For non-secret fields where a missing-equals-disabled flow makes sense, kcc's own substitution handles it.


- **Scalars**: config.yaml value replaces default
- **Lists** (users, roles, identityProviders): merged by key field (`username`, `name`, `alias`). Matching key overrides that entry, new keys are appended
- **Removal**: set `enabled: false` on a default entry

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section can simplify a lot once we lean on kcc's remote-state behavior. Reading it now, NIC takes on:

  • A deep-merge engine over a deeply nested upstream schema (Keycloak realm export)
  • List-merge-by-key rules that we'd need to spec for ~10 list-shaped fields beyond the three named here: clientScopes, authenticationFlows[].authenticationExecutions[], recursive groups, requiredActions, etc.
  • A "removal" semantic via enabled: false (which in Keycloak actually means disable, not delete: the alias/username still occupies its namespace)

If we set IMPORT_REMOTESTATE_ENABLED=true (default) and let kcc do the reconciliation, NIC's job shrinks to:

  1. Render baked-in defaults (current realm-setup-job's resources) as a base config
  2. Render the user's keycloak: section over the base: shallow merge by top-level key, kcc handles the deep diff downstream
  3. Write the result as a ConfigMap

What this gives us:

  • First run: kcc creates everything declared.
  • Subsequent runs: kcc only touches resources it itself created (tracked in de.adorsys.keycloak.config.state-* realm attribute). Direct-in-Keycloak admin edits to unmanaged resources are preserved: which IMO matches the intent of "don't wipe data important to the user state of the cluster."
  • User-facing: users only declare changes, not the whole realm. A user adding a GitHub IdP just declares the IdP; defaults stay defaulted.

The "removal via enabled: false" thing goes away entirely (kcc manages deletions through its own state), and we don't have to spec merge keys for every nested list shape.

The trade-off: kcc reconciles back to input on next run for resources it does manage. That's by design and matches what users expect from a declarative tool, but worth being explicit in the doc — "edits to NIC-managed resources via Keycloak UI will be reverted on next sync; edits to admin-created resources outside the input are preserved."

Is there a reason we wanted NIC to own merge logic that I'm missing? My read is it adds surface without buying anything kcc doesn't already give us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the highest-leverage finding in the review. The doc now has no NIC-side merging at all. keycloak.yaml is a separate file in kcc's native input format, written through to a cluster Secret verbatim. kcc's remote-state model does the reconciliation. The "removal via enabled: false" semantic is gone (it was also wrong in Keycloak's actual model: enabled: false disables, doesn't delete).

Also added the explicit caveat you suggested: "edits to NIC-managed resources via Keycloak UI will revert on next sync; edits to admin-created resources outside the input are preserved."


# Authentication flows, client scopes, etc.
authenticationFlows: []
clientScopes: []

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this as illustrative rather than the actual default set, but flagging for the implementation pass: pkg/argocd/templates/manifests/keycloak/realm-setup-job.yaml:83-108 creates a groups client scope with an oidc-group-membership-mapper and registers it as a realm default-default. That scope is what puts groups claims into tokens, which is what makes NebariApplication.allowedGroups work in the operator.

Two safety nets are already in our favor:

  • kcc's IMPORT_MANAGED_CLIENT_SCOPE=no-delete default (set this way precisely because Keycloak auto-creates default scopes that kcc can't reliably distinguish from user-created ones)
  • IMPORT_REMOTESTATE_ENABLED=true (kcc only deletes scopes it itself created)

So "groups scope silently dropped from new defaults" is probably safe-by-omission, but only by accident. Worth walking the current shell script line-by-line during implementation and confirming each resource is represented in the new defaults, and adding an integration test that asserts post-migration token claims match today's setup. Catches it explicitly rather than relying on safety nets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important catch, and worse than the comment suggests. When I checked the realm-setup-job on this branch before the rewrite, the groups scope had been silently dropped (file was 83 lines vs main's 110, no oidc-group-membership-mapper). The recent merge from main brought back the file, so this branch is probably fine again, but it was a real regression for a moment.

The rewrite makes the groups client scope an explicit baked-in default and adds a migration integration test that asserts a token issued for a user in a group contains the groups claim. That contract is what NebariApplication.allowedGroups depends on, so a regression here would silently break operator-driven auth across the platform.


## Phasing

**Phase 1**: Config merge + Job replacement - replace the shell script with keycloak-config-cli, add `keycloak` section to config.yaml, implement merge logic

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to float a finer-grained split for Phase 1 to keep future doors open:

  • 1a replace the shell script with kcc, using only baked-in defaults rendered into a static ConfigMap. No user keycloak: section yet, no $(env:*) scanning, no merge logic. Answers the narrow question: "does kcc replace the shell script cleanly, including the groups scope and operator-coexistence?"
  • 1b add user overrides via the keycloak: section. With remote-state doing the heavy lifting (Comment 4), this is a much smaller surface than originally proposed.
  • 1c add $(env:*) scanning and the user-secrets pipeline.

Each step is independently testable, and 1a alone is forward-compatible with both the current config.yaml direction and a hypothetical "skip 1b, do gitops overlays instead" alternative. Bundling all three commits to the user-facing schema in a way that's harder to walk back if we change our minds.

Practical case for 1a-first: it isolates the "kcc replaces kcadm" risk from the "is our schema right?" question. If 1a uncovers something weird about kcc managed-resource policies (which I don't think it will, given how well-documented the remote-state model is, but still), we discover that before any user-visible schema work.

Doesn't have to be three PRs: could be three commits in one PR with clear boundaries. The point is the order of validation, not the literal git topology.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted as written. Phase 1a is kcc replaces kcadm with baked-in defaults only, 1b adds the user keycloak.yaml parsing path, 1c adds env substitution and the user-secrets Secret. Phase 1a will resolve the kcc remote-state behavior question in isolation before any user-facing schema work lands.

clientSecret: "${env:GITHUB_CLIENT_SECRET}"
```

## Backup & Restore

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restating my earlier comment in my thread now that the rest of the design is more concrete: I think this whole section should be a follow-up doc.

The reasons get clearer the more I read:

  • Different code paths (CLI commands vs render-time pipeline)
  • Different operational concerns: nic backup keycloak needs cluster access to read keycloak-admin-credentials, then port-forward or exec to reach the Keycloak service. That's a new pattern not used elsewhere in NIC today.
  • Doesn't load-bear for Phase 1
  • The "restore is full replace, not selective merge" semantic conflicts with the rest of the proposal (which, after Comment 4's simplification, is more like "additive overlays on defaults"). After a restore, the user's config.yaml no longer reflects reality.

Splitting this out also lets us consider backup-via-Velero-PVC-snapshot vs. backup-via-realm-export-API as separate options, without entangling them with the realm-config pipeline.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to a follow-up doc. The reasoning you gave (different code path, different operational pattern, the "full replace" semantic conflicts with the remote-state model) all stand. The rewrite has a short stub explaining the split.

Worth noting: with the file-based model, the manual backup story becomes "kc.sh export and sanitize the result" rather than a custom CLI. The Phase 2 work is mostly about automating that sanitization, not building a parallel snapshot system.


## Open Questions

1. **keycloak-config-cli version pinning strategy** - keycloak-config-cli has strict version coupling with Keycloak (6.x targets Keycloak 25.x/26.x). How do we want to manage this? Pin in the Job manifest and document the compatibility mapping?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the immediate question (current Keycloak 24.0): kcc v6.5.0 is the current stable, and 24.0.5 is in their CI matrix at .github/workflows/ci.yaml. The verified-compatible image is quay.io/adorsys/keycloak-config-cli:6.5.0-24.0.5. v6.5.0 also landed several idempotency fixes worth having (release notes).

For the longer-term version pinning question, I think there's a deeper one underneath — user-facing schema compatibility, not just runtime. If our keycloak: section mirrors kcc's input which mirrors Keycloak's realm export, a Keycloak major-version upgrade can rename/add/remove fields, and existing user config.yaml files break.

Three stances we could take:

  • A — Pin Keycloak aggressively. Treat upgrades as opt-in events with migration tooling.
  • B — NIC-owned subset schema. Translate to upstream at render time (the "no translation layer" claim in the doc would have to go).
  • C — Accept upstream as-is. Document that user configs may break across Keycloak upgrades.

Currently the doc implicitly picks C. Worth being explicit. (For "don't lock future paths" — A and C are reversible, B locks NIC into maintaining a translator forever.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took stance C explicitly. The rewrite says user keycloak.yaml files may need updates across Keycloak major-version upgrades and that we'll document the path in release notes for each Keycloak bump. C is consistent with the rest of the doc's "no NIC-side schema, pass-through to kcc" philosophy, and unlike B it doesn't lock NIC into a translator forever.

Pinning the verified-compatible image (quay.io/adorsys/keycloak-config-cli:6.5.0-24.0.5 for our current Keycloak 24.0) is now an explicit dependency in the doc.


1. **keycloak-config-cli version pinning strategy** - keycloak-config-cli has strict version coupling with Keycloak (6.x targets Keycloak 25.x/26.x). How do we want to manage this? Pin in the Job manifest and document the compatibility mapping?

2. **Config validation** - Should `nic validate` check the merged Keycloak config against a schema before deploy, or is runtime failure from the Job acceptable initially?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Comment 4's simplification (kcc owns reconciliation, NIC does shallow render-time merge), validation gets cheaper to spec. Phase 1a needs no validation because there's no user input. Phase 1b just needs:

  • Top-level shape (the keycloak: block parses as expected)
  • $(env:*) references resolve (we already have the env, just check before deploy)
  • Optional: basic field-name validation against a JSON schema generated from kcc's input format

I'd lean toward shipping at least the env-var resolution check at 1b time — failure mode "Job fails in ArgoCD with cryptic logs" is a much worse UX than "nic validate errors with missing env var: GITHUB_CLIENT_SECRET."

Punting "to later" usually means it doesn't land. If we agree on a minimal scope now we won't have to revisit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promoted from open question to Phase 1b deliverable. nic validate will run the literal-secret scanner, reject unresolved env refs in sensitive fields, and verify schema shape. Same logic runs in CI on PRs against keycloak.yaml. Agree with the framing: the "cryptic PostSync failure hours after merge" UX is the worst possible outcome and cheap to prevent.


**Phase 2**: Backup & restore - `nic backup keycloak`, `nic restore keycloak`, optional CronJob

Phase 2 depends on Phase 1 being deployed and validated.

@viniciusdc viniciusdc May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggesting an "Alternatives Considered" section. The doc reads as if config.yaml ownership is the only path, which may end up being right, but worth capturing explicitly because the trade-offs aren't obvious from the body of the proposal.

The main alternative to name: NIC ships a default kcc input as a static ConfigMap, users customize via the gitops repo directly (Kustomize overlays, raw ConfigMap edits, Helm values). NIC stays scoped to infrastructure; the keycloak-config-cli schema lives in the gitops repo, where it's already user-editable.

What we'd gain: NIC doesn't take ownership of a deeply nested, version-coupled application-level schema. The "schema coupling" question (Open Q #1) becomes less urgent because users own their overlay format. Validation surface area shrinks.

The trade-off is losing a single pane of glass via config.yaml. Users who want to customize Keycloak now have to learn Kustomize/gitops patterns, not just edit one YAML file.

I don't think we have to pick now, but capturing the alternative in the doc means the trade-off is on record, and Phase 1a (Comment 6) is forward-compatible with either outcome. We can defer the bigger decision until we have more user feedback on which path they prefer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the section. Captured three alternatives: (A) the original embedded-in-config.yaml deep-merge design, (B) your gitops-overlay proposal, and (C) a Sealed-Secret-based variant for higher-trust environments. Each has rationale for rejection.

On B specifically: I think you were pointing at the right concern (don't couple NIC to a deeply nested upstream schema), but the answer that fell out was a separate keycloak.yaml file alongside config.yaml, applied via cluster Secrets rather than the gitops repo. NIC stays scoped to "read file, validate, plumb into cluster Secret" with no schema knowledge. The gitops repo never sees realm content. So we ended up taking the spirit of B without losing reproducibility-from-nic deploy-alone.

dcmcand added 2 commits May 12, 2026 14:21
Major restructure addressing review feedback on PR #154:

- Move from "ConfigMap in gitops repo" to "in-cluster Secrets, gitops repo
  holds only the Job manifest" so secrets never land in the gitops repo
  even as placeholders.
- Replace the embedded keycloak: config.yaml section with a separate
  keycloak.yaml file in kcc's native input format, pass-through with no
  NIC-side merging.
- Drop the (non-existent) managed-resource-prefix mechanism in favor of
  kcc's three-layer model: omit clients:, remote-state.enabled=true,
  managed.client=no-delete.
- Fix kcc 6.x variable substitution syntax: $(env:VAR) not ${env:VAR},
  and require import.var-substitution.enabled=true.
- Spec the literal-secret scanner with field list, --allow-literal-secrets
  and --allow-credential-hashes escape hatches.
- Add the groups client scope explicitly to the baked-in defaults plus
  migration test contract.
- Move backup/restore to a follow-up doc.
- Split Phase 1 into 1a/1b/1c (kcc replaces kcadm, then user file, then
  env substitution).
- Add Alternatives Considered, CI Integration, Blue/Green sections.
- Take stance C (accept upstream as-is) on Keycloak version coupling.
@dcmcand

dcmcand commented May 12, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a rewrite (669458a) addressing the full review. Most of your structural calls landed in the doc as written: three-layer operator-coexistence model, drop the deep-merge engine, separate backup/restore, phase 1a/1b/1c split, explicit alternatives, stance C on version coupling. A couple of places (separate keycloak.yaml file rather than embedded section, in-cluster Secrets rather than ConfigMap with placeholders) the architecture went further in your direction than the original review suggested - context for those is in the individual thread replies. Verified the kcc claims against v6.5.0's application.properties and MANAGED.md before adopting.

@dcmcand

dcmcand commented May 12, 2026

Copy link
Copy Markdown
Contributor Author

@viniciusdc - replying to your March 23 thread comment: both addressed in the rewrite (669458a). Secrets now go through Kubernetes Secrets exclusively (no env vars from .env ever reach the gitops repo, and keycloak-config-user-secrets is populated from local env at deploy time). Backup/restore is split into a separate follow-up doc.

@dcmcand

dcmcand commented May 12, 2026

Copy link
Copy Markdown
Contributor Author

@viniciusdc - replying to your April 10 thread comment: real question, and it pushed the design in a better direction. The rewrite acknowledges that config.yaml was the wrong home for the realm definition: realm configs are too big, they leak into the wrong layer of abstraction, and they couple NIC to kcc's input format.

The compromise we landed on: keycloak.yaml is a separate file (kcc native format, no NIC-side schema). It lives alongside config.yaml in the infra repo. config.yaml just references it via keycloak.config_file:. NIC stays scoped to "read, validate, plumb into cluster Secret" - no merging, no schema knowledge.

Why not your full proposal (defaults shipped to gitops repo, customize via overlays): the source of truth ends up split between the infra repo and the gitops repo. We've decided to treat the gitops repo as untrusted for realm content at all (encrypted or otherwise), which makes "users customize via gitops" architecturally awkward. The file-based approach gives you the same "NIC stays scoped" property without that split.

@dcmcand dcmcand requested a review from viniciusdc May 12, 2026 13:03
viniciusdc added a commit that referenced this pull request May 12, 2026
…write

Aligns Phase 1a with the agreed design in PR #154's rewrite (commit
669458a on design/declarative-keycloak-config). Previous structure put
the kcc realm input in a ConfigMap rendered into the gitops repo. The
rewrite's "Why no ConfigMap in the gitops repo" section explicitly
rejects that pattern:

- The gitops repo is treated as untrusted. Placeholder-only contracts
  ("file in repo only contains $(env:...), never literals") create a
  footgun where one operator mistake leaks a secret into git history.
- Realm content beyond raw secrets is still sensitive (IdP names,
  redirect URI patterns, credential hashes from kc.sh export).

New structure:

- pkg/argocd/keycloak_defaults.yaml (new) — kcc realm input in native
  format, embedded into the NIC binary via //go:embed. Same content
  as the previous gitops ConfigMap data field, no ConfigMap wrapping.
- pkg/argocd/keycloak_defaults.go (new) — RenderKeycloakDefaults()
  helper renders the template with TemplateData (Domain substitutes
  into the argocd redirect URI).
- pkg/argocd/foundational.go — InstallFoundationalServices now renders
  the embedded defaults with cfg.Domain and writes the result to a
  new in-cluster Secret keycloak-config-import (keycloak namespace).
  createKeycloakImportSecret added as a sibling of the existing
  Secret creators; uses createSecret's no-op-if-exists semantics so
  re-deploys preserve any value already present in the cluster.
- templates/manifests/keycloak/realm-setup-job.yaml — volume changed
  from a ConfigMap reference (keycloak-realm-config, deleted) to a
  Secret reference (keycloak-config-import). Everything else stays:
  same kcc image, same env-var sourcing, same hook annotations.
- templates/manifests/keycloak/realm-config-cm.yaml (deleted) — the
  gitops repo no longer carries realm content. Only the Job manifest
  remains there.

Phase 1a scope is unchanged. Phase 1b will introduce a user-supplied
keycloak.yaml file path that overrides these embedded defaults; the
Secret-vs-ConfigMap structural decision now makes that path direct.

Test:

- keycloak_defaults_test.go (new) asserts the load-bearing fields of
  the rendered kcc input — realm settings, scope lists, groups,
  argocd client, the oidc-group-membership-mapper, the full
  defaultDefaultClientScopes list, and that no unresolved Go-template
  markers leak into the output.
- writer_test.go TestKeycloakRealmConfigCM removed (it tested the
  deleted ConfigMap template).
@dcmcand

dcmcand commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

closed as accepted

@dcmcand dcmcand closed this Jun 9, 2026
viniciusdc added a commit that referenced this pull request Jun 19, 2026
Phase 1a of the declarative Keycloak configuration design proposed in
#154. Replaces the imperative shell script in realm-setup-job.yaml with
a declarative YAML applied by keycloak-config-cli (kcc) at PostSync.

Changes:

- Add `realm-config-cm.yaml`: ConfigMap carrying the kcc-format YAML for
  the nebari realm. Parity with the previous shell setup, including the
  `groups` client scope + group-membership mapper (load-bearing for
  operator-managed app authorization), the `argocd` OIDC client (PR
  #234), the `argocd-admins`/`argocd-viewers` groups, and the admin
  user's `argocd-admins` membership.
- Rewrite `realm-setup-job.yaml`: pulls
  `quay.io/adorsys/keycloak-config-cli:6.5.0-26.5.5`, mounts the
  ConfigMap, and applies it against Keycloak. Secrets are mounted from
  the existing `nebari-realm-admin-credentials` and
  `argocd-oidc-client-secret` Secrets and substituted into the realm
  YAML by kcc at apply time via `$(env:VAR)` placeholders.

Operator coexistence:

- `IMPORT_REMOTESTATE_ENABLED=true` (kcc default since v6.x): kcc tracks
  resources it created in a Keycloak realm attribute. Re-runs only
  reconcile what kcc itself owns; operator-managed OAuth2 clients
  created at runtime via NebariApplication CRDs are invisible to kcc's
  delete logic.
- `IMPORT_MANAGED_CLIENT=no-delete`: belt-and-braces. Even if a future
  input change accidentally removes the argocd client from `clients:`,
  kcc will not delete unlisted clients.

The strict "omit `clients:` entirely" pattern from my #154 review was
too restrictive: NIC owns the `argocd` client at bootstrap, so it has
to be in the input. The two-layer protection above is the practical
shape of the coexistence story.

Phase 1a scope (per #154 review): only replaces the existing realm
setup. No `keycloak:` section in `config.yaml`, no NIC-side env-var
scanning, no merge engine. Defaults are static. Future phases add the
user-facing schema (1b) and the user-secrets pipeline (1c).

Migration: on existing deployments, kcc detects existing resources by
their natural keys (realm name, role name, username, scope name) and
updates them in place, adding them to remote state. First run on an
already-configured cluster is a no-op for existing managed values; the
pre-generated argocd OIDC client secret (same value on both sides) is
re-applied verbatim.

Refs #154.
viniciusdc added a commit that referenced this pull request Jun 19, 2026
…write

Aligns Phase 1a with the agreed design in PR #154's rewrite (commit
669458a on design/declarative-keycloak-config). Previous structure put
the kcc realm input in a ConfigMap rendered into the gitops repo. The
rewrite's "Why no ConfigMap in the gitops repo" section explicitly
rejects that pattern:

- The gitops repo is treated as untrusted. Placeholder-only contracts
  ("file in repo only contains $(env:...), never literals") create a
  footgun where one operator mistake leaks a secret into git history.
- Realm content beyond raw secrets is still sensitive (IdP names,
  redirect URI patterns, credential hashes from kc.sh export).

New structure:

- pkg/argocd/keycloak_defaults.yaml (new) — kcc realm input in native
  format, embedded into the NIC binary via //go:embed. Same content
  as the previous gitops ConfigMap data field, no ConfigMap wrapping.
- pkg/argocd/keycloak_defaults.go (new) — RenderKeycloakDefaults()
  helper renders the template with TemplateData (Domain substitutes
  into the argocd redirect URI).
- pkg/argocd/foundational.go — InstallFoundationalServices now renders
  the embedded defaults with cfg.Domain and writes the result to a
  new in-cluster Secret keycloak-config-import (keycloak namespace).
  createKeycloakImportSecret added as a sibling of the existing
  Secret creators; uses createSecret's no-op-if-exists semantics so
  re-deploys preserve any value already present in the cluster.
- templates/manifests/keycloak/realm-setup-job.yaml — volume changed
  from a ConfigMap reference (keycloak-realm-config, deleted) to a
  Secret reference (keycloak-config-import). Everything else stays:
  same kcc image, same env-var sourcing, same hook annotations.
- templates/manifests/keycloak/realm-config-cm.yaml (deleted) — the
  gitops repo no longer carries realm content. Only the Job manifest
  remains there.

Phase 1a scope is unchanged. Phase 1b will introduce a user-supplied
keycloak.yaml file path that overrides these embedded defaults; the
Secret-vs-ConfigMap structural decision now makes that path direct.

Test:

- keycloak_defaults_test.go (new) asserts the load-bearing fields of
  the rendered kcc input — realm settings, scope lists, groups,
  argocd client, the oidc-group-membership-mapper, the full
  defaultDefaultClientScopes list, and that no unresolved Go-template
  markers leak into the output.
- writer_test.go TestKeycloakRealmConfigCM removed (it tested the
  deleted ConfigMap template).
viniciusdc added a commit that referenced this pull request Jun 19, 2026
- Extract labelPartOf/labelManagedBy/labelManagedByValue constants so the
  import-secret labels stop tripping goconst (4th occurrence of the label
  keys). Fixes the lint failure blocking CI.
- Instrument RenderKeycloakDefaults with an OTel span and thread ctx through
  from InstallFoundationalServices, per the pkg/ instrumentation convention.
- Render with missingkey=error so a stray template field fails loudly instead
  of emitting "<no value>" into the realm config; add a guard assertion.
- Expand TestRenderKeycloakDefaults to cover the security-relevant scalars
  (sslRequired, registrationAllowed, bruteForceProtected), the realm roles,
  the multivalued groups-mapper flag, the argocd client flags, and webOrigins.
- Drop the dead doc path from keycloak_defaults.yaml; point at #154.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs: discussion 💬 Needs discussion with the rest of the team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants