docs/self-hosting/govern/high-availability.md
45 additions & 44 deletions
@@ -3,7 +3,7 @@ title: High Availability Deployment
description: How to deploy Plane Enterprise on Kubernetes with high availability using the plane-enterprise Helm chart.
---

-# High Availability on Kubernetes
+# High Availability on Kubernetes

This guide covers what high availability means, how the `plane-enterprise` Helm chart workloads behave under failure, and exactly what to configure so your deployment survives the loss of a single availability zone or node without manual recovery. The setup is cloud-agnostic. If you're deploying on AWS with Karpenter, there's a dedicated section for you.

@@ -31,13 +31,13 @@ Run at least `replicas: 2` per service. Use `replicas >= 2` for `api`, `worker`,

These do scheduled or coordinator work. **Do not scale any of them past `replicas: 1`** - running two copies doubles job execution.

-| Workload | Kind | Why it stays at 1 |
-|---|---|---|
-|`monitor`| StatefulSet | Coordinator role; owns a `ReadWriteOnce` PVC |
| AWS | NLB or ALB with cross-zone load balancing enabled |
+| GCP | Default global LB |
+| Azure | Standard Load Balancer with zones `[1,2,3]` |
+| On-prem | MetalLB in BGP mode, or an external LB |

**4. A working `IngressClass`.** The chart supports `traefik` (default) or `nginx`. Deploy the ingress controller with `replicas >= 2` spread across AZs.

@@ -141,13 +141,13 @@ Tier-1 pods spread across AZs. All Tier-3 state lives in managed services that h

The chart supports pointing each stateful component at a remote managed service. Use these value keys.
The chart labels every workload with `app.name` set to <code v-pre>{{ .Release.Namespace }}-{{ .Release.Name }}-&lt;svc&gt;</code>. For a release named `plane` in namespace `plane`, that's `plane-plane-api` for the API.

:::warning
-**Watch for this**
+**Watch for this**
The hard hostname anti-affinity rule requires at least as many schedulable nodes as the workload's replica count. Three `api` replicas need three nodes available, or pods sit `Pending`. If you can't guarantee that (small cluster, dedicated taints), relax the hostname rule to `preferredDuringSchedulingIgnoredDuringExecution`.
:::

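As a rough illustration of the relaxed rule described in the warning above, the sketch below shows a soft hostname anti-affinity term in standard Kubernetes `podAntiAffinity` syntax. The `plane-plane-api` label value comes from the labeling example in this guide; the `weight` of 100 and the place where this block nests in your values file are assumptions, so adapt them to the chart's actual structure.

```yaml
# Sketch only: soft (preferred) hostname anti-affinity for the api pods.
# The scheduler tries to place replicas on different nodes, but still
# schedules them when fewer nodes than replicas are available.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                      # assumed weight; valid range is 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - { key: app.name, operator: In, values: [plane-plane-api] }
          topologyKey: kubernetes.io/hostname
```

With the preferred form, a three-replica workload on a two-node cluster ends up with two pods sharing a node instead of one pod staying `Pending`.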
@@ -328,7 +328,7 @@ Add similar PDBs for `pi`, `pi_worker`, `outbox_poller`, `automation_consumer`,

## HorizontalPodAutoscalers

-:::info
+:::info
Native HPA rendering is planned for a future release. Apply the manifests below yourself until then.
- The `live` service uses WebSockets. Make sure your ingress controller and LB don't have idle-timeout values that drop long-lived connections. The default AWS NLB idle timeout is 350s - that's usually fine. ALB defaults to 60s and needs raising for WebSocket connections.

- The chart configures request-body size limits via `ingress.traefik.maxRequestBodyBytes` (Traefik) and `nginx.ingress.kubernetes.io/proxy-body-size` (nginx). Tune these to your expected file upload size.
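
For the WebSocket idle-timeout and body-size points above, a hedged sketch of controller annotations follows. The annotation keys belong to the AWS Load Balancer Controller and ingress-nginx; the values (3600 seconds, `100m`) are illustrative assumptions, and how annotations are passed through the chart is not shown in this guide, so treat the placement as hypothetical.

```yaml
# Sketch only: values are illustrative assumptions, not chart defaults.
# Keep only the annotations that match your ingress controller.
metadata:
  annotations:
    # AWS Load Balancer Controller (ALB): raise the 60s default idle timeout
    # so long-lived WebSocket connections to `live` are not dropped.
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=3600

    # ingress-nginx: allow larger uploads and keep proxied connections open longer.
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```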
@@ -578,14 +579,14 @@ services:

HA protects against AZ and node failure. Backups protect against logical corruption, accidental deletion, and ransomware. You need both.
| No native `topologySpreadConstraints` in `plane.podScheduling` | Use `podAntiAffinity` as shown in the spreading section - functionally equivalent for AZ spread |
-| No PDBs rendered by the chart | Apply the PDB manifests from the PodDisruptionBudgets section |
-| No HPAs rendered by the chart | Apply the HPA manifests from the HorizontalPodAutoscalers section |
-| In-chart Tier-3 StatefulSets are single-replica, RWO | Set `local_setup: false` and use managed services |
-| `monitor` is a singleton StatefulSet | Accept the 60–120s reschedule window on AZ failure - it's internal and non-user-facing |
+| No PDBs rendered by the chart | Apply the PDB manifests from the PodDisruptionBudgets section |
+| No HPAs rendered by the chart | Apply the HPA manifests from the HorizontalPodAutoscalers section |
+| In-chart Tier-3 StatefulSets are single-replica, RWO | Set `local_setup: false` and use managed services |
+| `monitor` is a singleton StatefulSet | Accept the 60–120s reschedule window on AZ failure - it's internal and non-user-facing |

## Reference values.yaml for HA

@@ -690,15 +691,15 @@ services:
- { key: app.name, operator: In, values: [plane-plane-api] }
topologyKey: topology.kubernetes.io/zone

-web: { replicas: 3 }
-space: { replicas: 2 }
-admin: { replicas: 2 }
-live: { replicas: 3 }
+web: { replicas: 3 }
+space: { replicas: 2 }
+admin: { replicas: 2 }
+live: { replicas: 3 }
worker: { replicas: 4 }
-silo: { enabled: true, replicas: 2 }
+silo: { enabled: true, replicas: 2 }

-beatworker: { replicas: 1 } # singleton - do not scale
-pi_beat_worker: { replicas: 1 } # singleton - do not scale
+beatworker: { replicas: 1 } # singleton - do not scale
+pi_beat_worker: { replicas: 1 } # singleton - do not scale
```
Repeat the `affinity` block (varying the pod label) for every Tier-1 service. YAML anchors (`&spread-api` / `*spread-api`) help avoid repetition.