Problem
The design docs under docs/design-doc/ have drifted substantially from the codebase. A full audit of all 18 files against actual code found that most files contain meaningfully inaccurate claims: invented CLI commands, wrong package layouts, fictional code samples, wrong YAML config schemas, and a foundational software stack (LGTM) that is not deployed.
Code is the source of truth; docs need to match it.
High-level drift
Provider story (cuts across most docs):
- Docs assume a uniform OpenTofu-driven multi-cloud implementation. Reality: only AWS uses tofu. Hetzner uses the `hetzner-k3s` binary, `existing` is a no-op, `local` is a Kind stub via `make localkind-up`, and `gcp`/`azure` are stubs that return "not yet implemented". The Hetzner and `existing` providers are not mentioned in the design docs at all.
- ADR-0004 (out-of-tree provider plugins, Proposed 2026-04-15) is not cross-referenced anywhere in the design doc.
Package layout:
- Many docs reference a root-level `terraform/` directory with `modules/{aws,gcp,azure,local,kubernetes,argocd,foundational-apps}/`. None of that exists. AWS templates live at `pkg/provider/aws/templates/`. There are no GCP/Azure/local terraform modules at all.
- `pkg/operator`, `pkg/tofu/executor.go`, `pkg/tofu/workspace.go`, `pkg/tofu/outputs.go`, `pkg/kubernetes/`, and `api/v1alpha1/` are referenced but do not exist.
CLI commands:
- Docs reference `nic status`, `nic plan`, `nic state list/show/rm/mv`, `nic unlock`, `nic init-backend`, `nic health check`, `nic stack ...`, `nic init`, and `nic marketplace`. None of these exist. Real verbs: `deploy`, `destroy`, `validate`, `kubeconfig`, `version`.
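A check like the one below could flag invented commands during the rewrite. This is a minimal sketch, not existing tooling: `unknownVerbs`, `realVerbs`, and the regular expression are hypothetical names introduced here, and the verb set is copied from the list above rather than read from the CLI source.

```go
package main

import (
	"regexp"
	"sort"
)

// realVerbs is the set of verbs this issue reports as actually existing.
// Assumption: hard-coded here; a real check would derive it from the CLI code.
var realVerbs = map[string]bool{
	"deploy": true, "destroy": true, "validate": true,
	"kubeconfig": true, "version": true,
}

// nicCommand matches "nic <verb>", where the verb is a lowercase word
// that may contain hyphens (for example "init-backend").
var nicCommand = regexp.MustCompile(`\bnic ([a-z][a-z-]*)`)

// unknownVerbs returns the sorted, de-duplicated verbs mentioned in doc
// that are not in realVerbs.
func unknownVerbs(doc string) []string {
	seen := map[string]bool{}
	for _, m := range nicCommand.FindAllStringSubmatch(doc, -1) {
		if verb := m[1]; !realVerbs[verb] {
			seen[verb] = true
		}
	}
	out := make([]string, 0, len(seen))
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}
```

Running `unknownVerbs` over the text of each design doc would list the verbs that need to be removed or replaced. Only the first word after `nic` is inspected, so `nic state list` is reported as `state`.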
Foundational software:
- Docs claim the LGTM stack (Grafana, Loki, Mimir, Tempo, Promtail) is deployed. It is not. Only `opentelemetry-collector` is shipped from that family. Actual apps under `pkg/argocd/templates/apps/`: cert-manager, cluster-issuers, certificates, envoy-gateway, gateway-config, httproutes, keycloak, metallb, metallb-config, nebari-landingpage, nebari-operator, opentelemetry-collector, postgresql, root.
Config schema:
- The configuration reference documents a top-level `provider:` field with sibling `amazon_web_services:` / `google_cloud_platform:` / `azure:` / `hetzner_cloud:` / `local:` keys. The real schema (per `pkg/config/config.go` and all `examples/*.yaml`) uses a `cluster.<provider-name>:` discriminator and has no top-level `provider:` field. Only the Hetzner section currently matches reality.
- The reference doc is missing the `certificate:`, `git_repository:`, and `existing` provider sections entirely.
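The discriminator shape can be illustrated with a short Go sketch. It is not the code in `pkg/config/config.go`: the function name, the provider key spellings, and the "exactly one provider key" rule are assumptions made for illustration, and the input map stands in for an already-decoded `cluster:` mapping.

```go
package main

import (
	"fmt"
	"sort"
)

// knownProviders lists the providers named in this issue.
// Assumption: the exact key spellings accepted by the real schema may differ.
var knownProviders = map[string]bool{
	"aws": true, "gcp": true, "azure": true,
	"hetzner": true, "existing": true, "local": true,
}

// providerFromCluster infers the provider from the keys under `cluster:`.
// The provider is identified by which known key is present, so no
// top-level `provider:` field is needed.
func providerFromCluster(cluster map[string]any) (string, error) {
	found := []string{}
	for key := range cluster {
		if knownProviders[key] {
			found = append(found, key)
		}
	}
	sort.Strings(found) // deterministic order for error messages
	if len(found) != 1 {
		return "", fmt.Errorf("cluster must set exactly one provider key, got %v", found)
	}
	return found[0], nil
}
```

Under these assumptions, a config containing `cluster: {hetzner: {...}}` resolves to `hetzner`, while an empty `cluster:` block or one that sets two providers is rejected.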
CRD name:
- `nic-summary.md`, `17-appendix.md`, `11-nebari-operator.md`, `12-testing-strategy.md`, and `13-milestones.md` all reference `NicApp` / `NebariApplication`. The real CRD (from the upstream `nebari-operator` repo) is `NebariApp`.
Misframed:
- `11-nebari-operator.md` describes the operator as if it were implemented in this repo. The operator is an external project at `github.com/nebari-dev/nebari-operator`. NIC only deploys it via ArgoCD.
State management (05-state-management.md):
- Documents DynamoDB-based locking. The real backend (`pkg/provider/aws/templates/backend.tf`) uses S3-native `use_lockfile = true`. DynamoDB is not used anywhere.
- Documents a `state_backend:` config block. No such block exists; bucket naming is deterministic per `pkg/provider/aws/state.go`.
Testing/CI (12-testing-strategy.md, 13-milestones.md):
- Mocking libraries listed (`moto`, `fake-gcs-server`, `azurite`) are not used. Real test infra: LocalStack via `docker-compose.test.yml` + `make test-integration-local`.
- Documented CI YAML doesn't match `.github/workflows/ci.yml` (wrong Go version, wrong test command, fictional jobs).
- Milestones doc marks GCP/Azure providers, multi-cloud CI, LGTM stack, Grafana dashboards, and v1.0.0 release as ✅ done; none are.
Proposed work
Full rewrite of the heavily-drifted docs against current code:
- `architecture/02-system-overview.md`, `04-key-decisions.md`, `05-state-management.md`
- `implementation/06-opentofu-module-architecture.md`, `07-configuration-design.md`, `08-terraform-exec-integration.md`, `10-foundational-software.md`, `11-nebari-operator.md`
- `appendix/16-configuration-reference.md`
- `operations/12-testing-strategy.md`, `13-milestones.md`
Surgical edits for the docs that are directionally correct:
- `architecture/01-introduction.md`, `03-goals-and-non-goals.md`
- `implementation/09-dns-provider-architecture.md`
- `appendix/14-open-questions.md`, `15-future-enhancements.md`, `17-appendix.md`
- `operations/longhorn-node-maintenance.md`
- `nic-summary.md`
Definition of done
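- `terraform/modules/...` paths updated to the actual `pkg/provider/<name>/templates/` layout
- CRD references renamed from `NebariApplication` / `NicApp` to `NebariApp`
- Hetzner and `existing` providers documented
- Configuration reference documents the `cluster.<name>:` / `dns.<name>:` schema, with sections for `certificate:`, `git_repository:`, and the `existing` provider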