This note records the operator-side contract for the self-serve Helm path.
It is the server-side view of the engine contract in
`durable-workflow/workflow#docs/deployment/helm.md`:
the library doc names what the chart can promise without weakening any
existing engine guarantee; this doc names the operator topology, validation
harness, distribution path, and recovery-packet evidence required to claim
the self-serve Helm contract on the standalone Durable Workflow Server image.
The chart itself lives at `k8s/helm/durable-workflow/`, with
install/upgrade docs and a chart-version upgrading guide.
Promote Helm from "support-led" to a self-serve contract that is layered on the existing raw-Kubernetes contract.
The first public Helm shape:

- Renders the same workloads as the raw manifests at `k8s/`: bootstrap Job, API Deployment + Service + PodDisruptionBudget, worker Deployment, scheduler/maintenance CronJob.
- Encodes the engine invariants (singleton scheduler, readiness on `/api/ready`, externals-first persistence, bootstrap-before-workloads ordering) as defaults that cannot be silently subverted via values.
- Distributes versioned chart releases via OCI (`oci://ghcr.io/durable-workflow/charts/durable-workflow`) and an HTTPS index hosted on the docs site (`https://durable-workflow.github.io/charts/`).
- Has chart CI: `helm lint`, `helm template` against multiple values fixtures, `kubeconform` against every Kubernetes version in the support matrix, and `chart-testing` (`ct lint-and-install`) against a kind cluster on every PR that touches `k8s/helm/**`.
- Carries a chart-side semver and an upgrading guide. The chart's `version` is independent of the server image's `appVersion`; both follow semver, and breaking chart changes require a chart-MAJOR bump and a documented migration step (see the `Chart.yaml` sketch below).
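A minimal sketch of how the two versions sit side by side in `Chart.yaml`; the field names are standard Helm, the concrete version numbers are illustrative only.

```yaml
# Illustrative Chart.yaml excerpt: chart version and server image version move independently.
apiVersion: v2
name: durable-workflow
version: 1.4.0        # chart semver; a breaking chart change bumps the MAJOR here
appVersion: "2.9.1"   # server image version; tracked separately, also semver
```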
Active/active multi-region, hands-free regional failover, duplicate scheduler runners as a steady-state topology, and provider-specific managed-Kubernetes validation remain support-led — the same boundary that applies to the raw-manifest contract. Helm is a packaging and rollout contract, not a new engine topology.
Three observations make a narrow self-serve Helm contract possible:

- The engine's existing single-region HA contract is independent of how manifests are produced. It cares about the resulting workloads (readiness probes, singleton scheduler, shared substrate), not whether they were rendered by `kustomize`, `helm template`, or copy-pasted YAML. The chart's job is to ensure those rendered workloads keep matching the contract; nothing in the chart relaxes any engine guarantee.
- External persistence is already the contract. The raw manifests have never bundled MySQL or Redis; the chart inherits that. This avoids the common Helm-chart pitfall of shipping a "convenient" embedded database that becomes load-bearing in production and can't be rolled forward safely.
- Helm hooks plus GitOps-friendly annotations cover both rollout models. Pre-install/pre-upgrade hooks keep the bootstrap Job ahead of workloads under `helm install`/`helm upgrade`. Argo CD sync-wave and Flux depends-on annotations achieve the same ordering for controllers that don't honour Helm hooks, so operators don't have to re-design rollout per orchestrator (the annotation sketch below shows the combined shape).
| Invariant | Where the chart enforces it | Why it can't be relaxed |
|---|---|---|
| Singleton scheduler/maintenance runner. | `scheduler.concurrencyPolicy` is locked to `Forbid` (values schema rejects anything else); the chart never renders a parallel scheduler Deployment or StatefulSet. | Duplicate scheduler runners as a steady-state topology fall outside the single-region HA contract. |
| Readiness checked on `/api/ready` (DB + Redis usability), not `/api/health`. | `server.readinessProbe` defaults to `/api/ready`; example values and the README call out the consequence of changing it. | A load balancer routing on `/api/health` would not drain a node whose database had failed over. |
| Bootstrap runs before workloads. | The migration Job carries Helm hooks (`pre-install`, `pre-upgrade`, `hook-weight=-5`, `before-hook-creation`, `hook-succeeded`) and Argo CD sync-wave / Flux depends-on annotations. | Workload pods that boot against an unmigrated database fail readiness, then page at random. |
| External persistence is the contract. | The chart fails during render when `externalDatabase.host` or `externalRedis.host` is empty; no in-cluster database or Redis template exists in the chart. | Bundling the database silently moves it into the chart's lifecycle and breaks the recovery-packet contract from the Operator Operating Envelope. |
| RollingUpdate defaults to `maxUnavailable: 0`. | `server.strategy.rollingUpdate.maxUnavailable` defaults to `0`. | Matches the rolling-upgrade contract; operators who haven't met that contract should set `strategy.type: Recreate` instead. |
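A sketch of the values defaults that carry these invariants; the key paths follow the names used in the table above, but the nesting is illustrative rather than a verbatim copy of the chart's `values.yaml`.

```yaml
# Illustrative values.yaml excerpt: the invariant-bearing defaults.
scheduler:
  concurrencyPolicy: Forbid   # values schema rejects any other value
server:
  readinessProbe:
    httpGet:
      path: /api/ready        # DB + Redis usability, not /api/health
      port: http
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # rolling-upgrade contract; use Recreate if that contract isn't met
externalDatabase:
  host: ""                    # must be set; render fails when left empty
externalRedis:
  host: ""                    # must be set; render fails when left empty
```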
The chart exposes, and CI exercises, the knobs the deployment matrix calls out as table-stakes for a serious Helm path (a values sketch follows the list):

- Liveness/readiness probes are first-class values (`server.livenessProbe`, `server.readinessProbe`, `server.startupProbe`, `worker.livenessProbe`).
- Deployment strategy is explicit (`server.strategy`, `worker.strategy`) and defaults to the rolling-upgrade contract.
- Rollout pacing lives in `server.minReadySeconds`, `progressDeadlineSeconds`, `revisionHistoryLimit`, `topologySpreadConstraints`, and HPA settings.
- PodDisruptionBudget is a values toggle for both server and worker (`*.pdb.enabled`, `*.pdb.minAvailable`, `*.pdb.maxUnavailable`).
- Resource knobs (`*.resources.requests`, `*.resources.limits`) and pod-level scheduling controls (`nodeSelector`, `tolerations`, `affinity`, `priorityClassName`) are exposed for every workload.
- Secret-management paths support the existing-secret pattern for the app secret, the database secret, and the Redis secret. The example `examples/values-external-secrets-operator.yaml` shows the External Secrets Operator companion shape; equivalent patterns work for Vault Secret Operator and the Secrets Store CSI Driver.
- External persistence is the contract: no bundled databases, no bundled Redis, no in-cluster default the operator ends up depending on.
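A sketch of how the pacing, spread, and disruption knobs might look in an operator's values file; the key names follow the list above, while the nesting under `server`/`worker` and the concrete numbers are illustrative, not confirmed chart defaults.

```yaml
# Illustrative operator values: rollout pacing, spread, disruption budgets, resources.
server:
  minReadySeconds: 10
  progressDeadlineSeconds: 600
  revisionHistoryLimit: 5
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: server
  pdb:
    enabled: true
    minAvailable: 1
  resources:
    requests: { cpu: 500m, memory: 512Mi }
    limits: { memory: 1Gi }
  nodeSelector:
    kubernetes.io/arch: amd64
worker:
  strategy:
    type: RollingUpdate
  pdb:
    enabled: true
    maxUnavailable: 1
  resources:
    requests: { cpu: 250m, memory: 256Mi }
```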
`./.github/workflows/helm-chart-validation.yml` runs on every PR that
touches `k8s/helm/**` or the chart smoke script. The harness has two jobs:

- `lint-and-template`: runs `helm lint` over the chart and over every CI fixture, renders each fixture with `helm template`, and runs `kubeconform` over the rendered output against every Kubernetes version in `KUBE_VERSIONS` (currently 1.27 through 1.30). The core steps are sketched below.
- `ct-lint-install`: runs chart-testing's `ct lint`, then provisions a kind cluster and runs `scripts/helm-chart-kind-smoke.sh`, which:
  - builds the server image and loads it into kind;
  - provisions disposable in-cluster MySQL + Redis fixtures;
  - `helm install`s the chart with inline secrets;
  - waits for the bootstrap Job, the server Deployment, and the worker Deployment to roll out;
  - asserts `/api/ready` returns 200 through the chart-rendered Service;
  - registers a worker against `/api/worker/register` to confirm the end-to-end engine surface;
  - runs `helm test`;
  - runs `helm upgrade` in place and re-checks readiness;
  - runs `helm uninstall` and asserts no chart-owned resources remain.
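A sketch of the core of the `lint-and-template` job as shell steps; the chart path, fixture directory, and `KUBE_VERSIONS` list come from this section, while the exact flags are a plausible invocation rather than a copy of the workflow file.

```bash
#!/usr/bin/env bash
# Illustrative lint-and-template steps (not the literal workflow).
set -euo pipefail

CHART=k8s/helm/durable-workflow
KUBE_VERSIONS=("1.27.0" "1.28.0" "1.29.0" "1.30.0")

helm lint "$CHART"

for fixture in "$CHART"/ci/*-values.yaml; do
  helm lint "$CHART" --values "$fixture"
  for kver in "${KUBE_VERSIONS[@]}"; do
    # Render the fixture for this Kubernetes version and schema-check the output.
    helm template durable-workflow "$CHART" --values "$fixture" \
      --kube-version "$kver" |
      kubeconform -strict -summary -kubernetes-version "$kver" -
  done
done
```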
The CI fixtures intentionally cover three shapes:

| Fixture | Purpose |
|---|---|
| `ci/inline-secrets-values.yaml` | Smallest renderable chart; exercises secret rendering. |
| `ci/existing-secrets-values.yaml` | GitOps / externally-managed-secret path. The chart renders no Secret resources of its own. |
| `ci/ingress-and-hpa-values.yaml` | Optional templates: Ingress, HPA, NetworkPolicy, server + worker PDB. |
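A sketch of what the externally-managed-secret fixture might contain; the fixture path and intent come from the table above, while the `existingSecret` key names and hosts are illustrative assumptions.

```yaml
# Illustrative ci/existing-secrets-values.yaml: the chart renders no Secret resources of its own.
app:
  existingSecret: durable-workflow-app        # managed by the GitOps / secret-operator pipeline
externalDatabase:
  host: mysql.ci.svc.cluster.local
  existingSecret: durable-workflow-db
externalRedis:
  host: redis.ci.svc.cluster.local
  existingSecret: durable-workflow-redis
```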
A chart change that affects rendered manifests must keep all three fixtures green, and the kind smoke must end with a clean uninstall.
Releases are cut from the `version` in `Chart.yaml`. Each release publishes:

- an OCI artifact at `oci://ghcr.io/durable-workflow/charts/durable-workflow`, with the chart's version as the OCI tag;
- an HTTPS index update at `https://durable-workflow.github.io/charts/`, with the packaged tarball + `index.yaml` entry.
The release pipeline is the existing chart-release flow that already runs for the docs site; the chart-publishing job is added behind the same gate as the chart-validation workflow above so that a chart cannot be published without first passing lint, schema validation, and the kind smoke.
The "install/upgrade from released charts, not only from a checkout" line
on the deployment guide is satisfied by both distribution paths; the
checkout install (helm install ./k8s/helm/durable-workflow/) remains
supported for chart development and air-gapped review.
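A sketch of the three install paths named above; the chart locations are the ones this doc records, while the release name, repo alias, `--version`, and values file are illustrative.

```bash
# Illustrative install commands for the three supported chart sources.

# 1. OCI registry (released charts).
helm install durable-workflow \
  oci://ghcr.io/durable-workflow/charts/durable-workflow \
  --version 1.4.0 --values my-values.yaml

# 2. HTTPS index (released charts).
helm repo add durable-workflow https://durable-workflow.github.io/charts/
helm repo update
helm install durable-workflow durable-workflow/durable-workflow \
  --version 1.4.0 --values my-values.yaml

# 3. Checkout install (chart development, air-gapped review).
helm install durable-workflow ./k8s/helm/durable-workflow/ --values my-values.yaml
```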
A deployment claiming the self-serve Helm contract MUST extend its recovery packet (per the Operator Operating Envelope) with two pieces of evidence on top of the raw-manifest packet:
- the chart version + values revision that produced the running manifests (`helm get values <release> --revision <n>`), so a recovery rebuild reproduces the same manifests;
- the chart-upgrade rehearsal for the most recent chart MINOR or MAJOR boundary the deployment crossed, including a successful `helm rollback` from the new version to the previous version, with `/api/ready` returning 200 at both ends (a command sketch follows).
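A sketch of how that evidence might be captured. `helm get values --revision`, `helm history`, `helm upgrade`, and `helm rollback` are standard Helm commands; the release name, namespace, Service name/port, output path, and version numbers are illustrative.

```bash
# Illustrative recovery-packet capture for a release named "durable-workflow".
RELEASE=durable-workflow
NS=durable-workflow

# 1. Record the chart version + values revision behind the running manifests.
helm history "$RELEASE" -n "$NS"                 # note the current revision number
REV=3                                            # illustrative: the revision helm history reported
helm get values "$RELEASE" -n "$NS" --revision "$REV" \
  > recovery-packet/helm-values-rev-"$REV".yaml

# 2. Rehearse the chart boundary: upgrade, verify readiness, roll back, verify again.
helm upgrade "$RELEASE" oci://ghcr.io/durable-workflow/charts/durable-workflow \
  -n "$NS" --version 2.0.0 --reuse-values        # illustrative new chart version
kubectl -n "$NS" port-forward svc/durable-workflow 8080:80 &
curl -fsS http://localhost:8080/api/ready        # expect 200 on the new chart
helm rollback "$RELEASE" "$REV" -n "$NS"
# (re-establish the port-forward if the rollback recycled the pods behind it)
curl -fsS http://localhost:8080/api/ready        # expect 200 after rollback too
```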
A deployment that has not run that chart-upgrade rehearsal is not yet self-serve under this contract; it remains support-led until the rehearsal evidence is recorded and refreshed on the cadence the Operator Operating Envelope publishes.
The Helm contract is the same boundary as the raw-manifest contract, narrowed to the rollout/packaging surface the chart owns. The following remain outside it and continue to require a support-led design pass:
- active/active multi-writer database topologies;
- automatic or hands-free regional failover (active/passive with operator-driven failover stays the next contract over);
- duplicate scheduler/maintenance runners as a steady-state topology;
- engine-enforced region-pinned task queues;
- provider-specific managed-Kubernetes validation (EKS, GKE, AKS, OpenShift);
- broad "five-nines" or "zero-downtime" SLA promises beyond the bounded recovery times in the single-region HA failover contract.
The contract is self-serve install, upgrade, and rollback of a Durable Workflow deployment using published Helm charts, not an uptime promise that depends on the operator's database, network, or orchestrator choices. Marketing or SLA language for self-hosted Helm deployments MUST NOT cross that line without dedicated validation.