Audit and rewrite design docs to match current code #300

@dcmcand

Problem

The design docs under docs/design-doc/ have drifted substantially from the codebase. A full audit of all 18 files against actual code found that most files contain meaningfully inaccurate claims: invented CLI commands, wrong package layouts, fictional code samples, wrong YAML config schemas, and a foundational software stack (LGTM) that is not deployed.

Code is the source of truth; docs need to match it.

High-level drift

Provider story (cuts across most docs):

  • Docs assume a uniform OpenTofu-driven multi-cloud implementation. Reality: only AWS uses OpenTofu. Hetzner uses the hetzner-k3s binary, the existing provider is a no-op, the local provider is a Kind stub driven by make localkind-up, and the gcp and azure providers are stubs that return "not yet implemented". The Hetzner and existing providers are not mentioned in the design docs at all.
  • ADR-0004 (out-of-tree provider plugins, Proposed 2026-04-15) is not cross-referenced anywhere in the design docs.

Package layout:

  • Many docs reference a root-level terraform/ directory with modules/{aws,gcp,azure,local,kubernetes,argocd,foundational-apps}/. None of that exists. AWS templates live at pkg/provider/aws/templates/. There are no GCP/Azure/local terraform modules at all.
  • pkg/operator, pkg/tofu/executor.go, pkg/tofu/workspace.go, pkg/tofu/outputs.go, pkg/kubernetes/, api/v1alpha1/ are referenced but do not exist.
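
As a rough aid for this part of the audit, the path claims above can be checked mechanically. The sketch below is only an illustration: the path list is copied from this issue, while REFERENCED_PATHS and missing_paths are hypothetical names that do not exist in the repository.

```python
from pathlib import Path

# Paths taken from this issue. REFERENCED_PATHS and missing_paths are
# illustrative names for this sketch, not part of the repository.
REFERENCED_PATHS = [
    "terraform/modules",
    "pkg/operator",
    "pkg/tofu/executor.go",
    "pkg/tofu/workspace.go",
    "pkg/tofu/outputs.go",
    "pkg/kubernetes",
    "api/v1alpha1",
    "pkg/provider/aws/templates",
]


def missing_paths(repo_root, paths=REFERENCED_PATHS):
    """Return the entries of `paths` that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [p for p in paths if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the repository root to list doc-referenced paths that are absent.
    for path in missing_paths("."):
        print(f"missing: {path}")
```

Run from the repository root, this should list every path except pkg/provider/aws/templates if the audit findings above hold.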

CLI commands:

  • Docs reference nic status, nic plan, nic state list/show/rm/mv, nic unlock, nic init-backend, nic health check, nic stack ..., nic init, nic marketplace. None of these exist. Real verbs: deploy, destroy, validate, kubeconfig, version.
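
A minimal sketch of how stale command references could be flagged during the rewrite, assuming the five verbs listed above are the complete set. REAL_VERBS, NIC_COMMAND, and unknown_nic_verbs are hypothetical names for this sketch. The pattern is deliberately naive, so ordinary prose such as "nic is ..." would produce false positives; it is best applied to code spans or command listings only.

```python
import re

# Verbs this issue identifies as real; everything else is treated as stale.
REAL_VERBS = {"deploy", "destroy", "validate", "kubeconfig", "version"}

# Matches "nic <verb>", where the verb is lowercase letters and hyphens.
NIC_COMMAND = re.compile(r"\bnic\s+([a-z][a-z-]*)")


def unknown_nic_verbs(text):
    """Return the sorted set of `nic` verbs in `text` that are not real verbs."""
    found = set(NIC_COMMAND.findall(text))
    return sorted(found - REAL_VERBS)
```

For example, unknown_nic_verbs("Run nic deploy, then nic status.") reports only status.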

Foundational software:

  • Docs claim the LGTM stack (Grafana, Loki, Mimir, Tempo, Promtail) is deployed. It is not. Only opentelemetry-collector is shipped from that family. Actual apps under pkg/argocd/templates/apps/: cert-manager, cluster-issuers, certificates, envoy-gateway, gateway-config, httproutes, keycloak, metallb, metallb-config, nebari-landingpage, nebari-operator, opentelemetry-collector, postgresql, root.

Config schema:

  • The configuration reference documents a top-level provider: field with sibling amazon_web_services: / google_cloud_platform: / azure: / hetzner_cloud: / local: keys. The real schema (per pkg/config/config.go and all examples/*.yaml) has no top-level provider: field; the provider is selected by a cluster.<provider-name>: key that acts as the discriminator. Only the Hetzner section currently matches reality.
  • The reference doc is missing the certificate:, git_repository:, and existing provider sections entirely.
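
To make the schema difference concrete, here is a hedged sketch of a shape check over an already-parsed config mapping. The provider key names in ASSUMED_PROVIDER_KEYS are guesses based on the provider names mentioned in this issue; the authoritative names live in pkg/config/config.go, and config_shape_errors is a hypothetical helper.

```python
# ASSUMPTION: these key names are guesses derived from the providers named in
# this issue; check pkg/config/config.go for the real ones.
ASSUMED_PROVIDER_KEYS = {"aws", "hetzner", "existing", "local", "gcp", "azure"}


def config_shape_errors(config):
    """Return a list of problems with the overall shape of a parsed config."""
    errors = []
    if "provider" in config:
        errors.append("top-level 'provider' field is not part of the schema")
    cluster = config.get("cluster")
    if not isinstance(cluster, dict):
        errors.append("'cluster' must be a mapping")
        return errors
    chosen = sorted(ASSUMED_PROVIDER_KEYS & set(cluster))
    if len(chosen) != 1:
        errors.append(f"'cluster' must name exactly one provider, found {chosen}")
    return errors
```

A config shaped like the old reference doc (top-level provider: plus amazon_web_services:) fails both checks, while a mapping with a single provider key under cluster passes.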

CRD name:

  • nic-summary.md, 17-appendix.md, 11-nebari-operator.md, 12-testing-strategy.md, 13-milestones.md all reference NicApp / NebariApplication. The real CRD (from the upstream nebari-operator repo) is NebariApp.
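
The rename can be sketched as a whole-word substitution. Word boundaries matter here because NebariApp is a prefix of NebariApplication, so a plain string replace would corrupt the result. STALE_CRD_NAME and fix_crd_names are hypothetical names, and compound identifiers such as a NebariApplicationSpec type, if any exist, are intentionally left for manual review.

```python
import re

# Whole-word match for the two stale names, optionally followed by a plural "s".
STALE_CRD_NAME = re.compile(r"\b(?:NebariApplication|NicApp)(?=s?\b)")


def fix_crd_names(text):
    """Replace stale CRD names with NebariApp, leaving correct uses untouched."""
    return STALE_CRD_NAME.sub("NebariApp", text)
```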

Misframed:

  • 11-nebari-operator.md describes the operator as if implemented in this repo. The operator is an external project at github.com/nebari-dev/nebari-operator. NIC only deploys it via ArgoCD.

State management (05-state-management.md):

  • Documents DynamoDB-based locking. Real backend (pkg/provider/aws/templates/backend.tf) uses S3-native use_lockfile = true. DynamoDB is not used anywhere.
  • Documents a state_backend: config block. No such block exists; bucket naming is deterministic per pkg/provider/aws/state.go.

Testing/CI (12-testing-strategy.md, 13-milestones.md):

  • Mocking libraries listed (moto, fake-gcs-server, azurite) are not used. Real test infra: LocalStack via docker-compose.test.yml + make test-integration-local.
  • Documented CI YAML doesn't match .github/workflows/ci.yml (wrong Go version, wrong test command, fictional jobs).
  • Milestones doc marks GCP/Azure providers, multi-cloud CI, LGTM stack, Grafana dashboards, and v1.0.0 release as ✅ done; none are.

Proposed work

Full rewrite of the heavily-drifted docs against current code:

  • architecture/02-system-overview.md, 04-key-decisions.md, 05-state-management.md
  • implementation/06-opentofu-module-architecture.md, 07-configuration-design.md, 08-terraform-exec-integration.md, 10-foundational-software.md, 11-nebari-operator.md
  • appendix/16-configuration-reference.md
  • operations/12-testing-strategy.md, 13-milestones.md

Surgical edits for the docs that are directionally correct:

  • architecture/01-introduction.md, 03-goals-and-non-goals.md
  • implementation/09-dns-provider-architecture.md
  • appendix/14-open-questions.md, 15-future-enhancements.md, 17-appendix.md
  • operations/longhorn-node-maintenance.md
  • nic-summary.md

Definition of done

  • Every claim in each doc verified against current code or removed
  • All references to terraform/modules/... paths updated to the actual pkg/provider/<name>/templates/ layout
  • All references to nonexistent CLI commands removed or marked clearly as future work
  • LGTM stack references either removed or moved to a "Future / Not Yet Implemented" section
  • CRD references corrected from NebariApplication / NicApp to NebariApp
  • Hetzner and existing providers documented
  • ADR-0004 cross-referenced where relevant
  • Configuration reference rewritten to match the real cluster.<name>: / dns.<name>: schema, with sections for certificate:, git_repository:, and the existing provider
  • Tests pass; lint passes
