Commit 2f8bfcb

formatting fixes
1 parent ca6b1cf commit 2f8bfcb

2 files changed

Lines changed: 49 additions & 49 deletions

docs/.vitepress/config.mts

Lines changed: 4 additions & 5 deletions
@@ -252,12 +252,11 @@ export default withMermaid(
 { text: "Download config files", link: "/self-hosting/methods/download-config" },
 ],
 },
-{
-text: "Kubernetes", link: "/self-hosting/methods/kubernetes",
+{
+text: "Kubernetes",
+link: "/self-hosting/methods/kubernetes",
 collapsed: true,
-items: [
-{ text: "High availability", link: "/self-hosting/govern/high-availability" }
-],
+items: [{ text: "High availability", link: "/self-hosting/govern/high-availability" }],
 },
 { text: "Podman Quadlets", link: "/self-hosting/methods/podman-quadlets" },
 {

docs/self-hosting/govern/high-availability.md

Lines changed: 45 additions & 44 deletions
@@ -3,7 +3,7 @@ title: High Availability Deployment
 description: How to deploy Plane Enterprise on Kubernetes with high availability using the plane-enterprise Helm chart.
 ---

-# High Availability on Kubernetes
+# High Availability on Kubernetes

 This guide covers what high availability means, how the `plane-enterprise` Helm chart workloads behave under failure, and exactly what to configure so your deployment survives the loss of a single availability zone or node without manual recovery. The setup is cloud-agnostic. If you're deploying on AWS with Karpenter, there's a dedicated section for you.

@@ -31,13 +31,13 @@ Run at least `replicas: 2` per service. Use `replicas >= 2` for `api`, `worker`,

 These do scheduled or coordinator work. **Do not scale any of them past `replicas: 1`** - running two copies doubles job execution.

-| Workload | Kind | Why it stays at 1 |
-|---|---|---|
-| `monitor` | StatefulSet | Coordinator role; owns a `ReadWriteOnce` PVC |
-| `beatworker` | Deployment | Celery beat - schedules periodic Plane jobs |
-| `pi_beat_worker` | Deployment | PI beat - schedules periodic PI jobs |
-| `migrator` | Job | DB migration; runs once per release |
-| `pi-migrator` | Job | PI DB migration; runs once per release |
+| Workload | Kind | Why it stays at 1 |
+| ---------------- | ----------- | -------------------------------------------- |
+| `monitor` | StatefulSet | Coordinator role; owns a `ReadWriteOnce` PVC |
+| `beatworker` | Deployment | Celery beat - schedules periodic Plane jobs |
+| `pi_beat_worker` | Deployment | PI beat - schedules periodic PI jobs |
+| `migrator` | Job | DB migration; runs once per release |
+| `pi-migrator` | Job | PI DB migration; runs once per release |

 The stateless singletons (`beatworker`, `pi_beat_worker`) reschedule onto a healthy node within seconds when their node fails.

@@ -88,12 +88,12 @@ env:

 **3. A cross-zone load balancer.** Traffic must reach pods in any AZ.

-| Cloud | Recommendation |
-|---|---|
-| AWS | NLB or ALB with cross-zone load balancing enabled |
-| GCP | Default global LB |
-| Azure | Standard Load Balancer with zones `[1,2,3]` |
-| On-prem | MetalLB in BGP mode, or an external LB |
+| Cloud | Recommendation |
+| ------- | ------------------------------------------------- |
+| AWS | NLB or ALB with cross-zone load balancing enabled |
+| GCP | Default global LB |
+| Azure | Standard Load Balancer with zones `[1,2,3]` |
+| On-prem | MetalLB in BGP mode, or an external LB |

 **4. A working `IngressClass`.** The chart supports `traefik` (default) or `nginx`. Deploy the ingress controller with `replicas >= 2` spread across AZs.

@@ -141,13 +141,13 @@ Tier-1 pods spread across AZs. All Tier-3 state lives in managed services that h

 The chart supports pointing each stateful component at a remote managed service. Use these value keys.

-| Component | Disable local | External URL / credentials |
-|---|---|---|
-| Postgres | `services.postgres.local_setup: false` | `env.pgdb_remote_url`, `env.pg_pi_db_remote_url`; optional read replica via `services.postgres.read_replica.enabled` + `services.postgres.read_replica.remote_url` |
-| Redis | `services.redis.local_setup: false` | `env.remote_redis_url` |
-| RabbitMQ | `services.rabbitmq.local_setup: false` | `services.rabbitmq.external_rabbitmq_url` |
-| OpenSearch | `services.opensearch.local_setup: false` | `env.opensearch_remote_url`, `env.opensearch_remote_username`, `env.opensearch_remote_password`; optional `env.opensearch_index_prefix` for multi-tenant clusters |
-| Object store | `services.minio.local_setup: false` | `env.aws_access_key`, `env.aws_secret_access_key`, `env.aws_region`, `env.aws_s3_endpoint_url`, `env.docstore_bucket` |
+| Component | Disable local | External URL / credentials |
+| ------------ | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Postgres | `services.postgres.local_setup: false` | `env.pgdb_remote_url`, `env.pg_pi_db_remote_url`; optional read replica via `services.postgres.read_replica.enabled` + `services.postgres.read_replica.remote_url` |
+| Redis | `services.redis.local_setup: false` | `env.remote_redis_url` |
+| RabbitMQ | `services.rabbitmq.local_setup: false` | `services.rabbitmq.external_rabbitmq_url` |
+| OpenSearch | `services.opensearch.local_setup: false` | `env.opensearch_remote_url`, `env.opensearch_remote_username`, `env.opensearch_remote_password`; optional `env.opensearch_index_prefix` for multi-tenant clusters |
+| Object store | `services.minio.local_setup: false` | `env.aws_access_key`, `env.aws_secret_access_key`, `env.aws_region`, `env.aws_s3_endpoint_url`, `env.docstore_bucket` |

 ### What HA looks like for each service

@@ -208,7 +208,7 @@ services:
 The chart labels every workload with `app.name` set to <code v-pre>{{ .Release.Namespace }}-{{ .Release.Name }}-&lt;svc&gt;</code>. For a release named `plane` in namespace `plane`, that's `plane-plane-api` for the API.

 :::warning
-**Watch for this**
+**Watch for this**
 The hard hostname anti-affinity rule requires at least as many schedulable nodes as the workload's replica count. Three `api` replicas need three nodes available, or pods sit `Pending`. If you can't guarantee that (small cluster, dedicated taints), relax the hostname rule to `preferredDuringSchedulingIgnoredDuringExecution`.
 :::

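The labelling rule and the node-count warning in the hunk above can be sketched in a few lines of Python. This is only an illustration: `app_name_label` and `pending_pods` are hypothetical helper names, not part of the chart, and the arithmetic assumes a required hostname anti-affinity rule that allows at most one pod of a workload per node.

```python
def app_name_label(namespace: str, release: str, service: str) -> str:
    """Build the app.name label value: <namespace>-<release>-<svc>."""
    return f"{namespace}-{release}-{service}"


def pending_pods(replicas: int, schedulable_nodes: int) -> int:
    """Pods left Pending when every replica must land on its own node."""
    if replicas < 0 or schedulable_nodes < 0:
        raise ValueError("counts must be non-negative")
    return max(0, replicas - schedulable_nodes)


if __name__ == "__main__":
    # Release "plane" in namespace "plane", API service.
    print(app_name_label("plane", "plane", "api"))
    # Three api replicas on a two-node cluster: one pod cannot be placed.
    print(pending_pods(replicas=3, schedulable_nodes=2))
```

If `pending_pods` returns a non-zero value for a planned replica count, that is the case where the warning suggests relaxing the hostname rule to `preferredDuringSchedulingIgnoredDuringExecution`.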
@@ -328,7 +328,7 @@ Add similar PDBs for `pi`, `pi_worker`, `outbox_poller`, `automation_consumer`,

 ## HorizontalPodAutoscalers

-:::info
+:::info
 Native HPA rendering is planned for a future release. Apply the manifests below yourself until then.
 :::

@@ -570,6 +570,7 @@ services:
 ```yaml
 service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
 ```
+
 - The `live` service uses WebSockets. Make sure your ingress controller and LB don't have idle-timeout values that drop long-lived connections. The default AWS NLB idle timeout is 350s - that's usually fine. ALB defaults to 60s and needs raising for WebSocket connections.

 - The chart configures request-body size limits via `ingress.traefik.maxRequestBodyBytes` (Traefik) and `nginx.ingress.kubernetes.io/proxy-body-size` (nginx). Tune these to your expected file upload size.
@@ -578,14 +579,14 @@ services:

 HA protects against AZ and node failure. Backups protect against logical corruption, accidental deletion, and ransomware. You need both.

-| Component | Backup mechanism | Recommended retention |
-|---|---|---|
-| Postgres | Managed-service automated backups + PITR | 30 days, PITR ≥ 7 days |
-| Object storage | Bucket versioning + lifecycle to a different bucket/region | 90 days |
-| OpenSearch | Snapshots to object storage | 7 days |
-| Redis | Optional; treat as cache + queue. Document what your team loses on a full Redis failure (sessions, in-flight Celery tasks). | - |
-| RabbitMQ | Definitions export (users, queues, bindings) on a schedule; messages are transient | - |
-| Kubernetes objects | Velero, namespace-scoped, daily | 30 days |
+| Component | Backup mechanism | Recommended retention |
+| ------------------ | --------------------------------------------------------------------------------------------------------------------------- | ---------------------- |
+| Postgres | Managed-service automated backups + PITR | 30 days, PITR ≥ 7 days |
+| Object storage | Bucket versioning + lifecycle to a different bucket/region | 90 days |
+| OpenSearch | Snapshots to object storage | 7 days |
+| Redis | Optional; treat as cache + queue. Document what your team loses on a full Redis failure (sessions, in-flight Celery tasks). | - |
+| RabbitMQ | Definitions export (users, queues, bindings) on a schedule; messages are transient | - |
+| Kubernetes objects | Velero, namespace-scoped, daily | 30 days |

 **Run a restore drill** before go-live and at least once per quarter. A backup that's never been restored is an assumption, not a guarantee.

@@ -617,13 +618,13 @@ Work through every item before sending real traffic.

 The following capabilities aren't natively provided by the chart and need to be applied separately.

-| Gap | Workaround |
-|---|---|
+| Gap | Workaround |
+| -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
 | No native `topologySpreadConstraints` in `plane.podScheduling` | Use `podAntiAffinity` as shown in the spreading section - functionally equivalent for AZ spread |
-| No PDBs rendered by the chart | Apply the PDB manifests from the PodDisruptionBudgets section |
-| No HPAs rendered by the chart | Apply the HPA manifests from the HorizontalPodAutoscalers section |
-| In-chart Tier-3 StatefulSets are single-replica, RWO | Set `local_setup: false` and use managed services |
-| `monitor` is a singleton StatefulSet | Accept the 60–120s reschedule window on AZ failure - it's internal and non-user-facing |
+| No PDBs rendered by the chart | Apply the PDB manifests from the PodDisruptionBudgets section |
+| No HPAs rendered by the chart | Apply the HPA manifests from the HorizontalPodAutoscalers section |
+| In-chart Tier-3 StatefulSets are single-replica, RWO | Set `local_setup: false` and use managed services |
+| `monitor` is a singleton StatefulSet | Accept the 60–120s reschedule window on AZ failure - it's internal and non-user-facing |

 ## Reference values.yaml for HA

@@ -690,15 +691,15 @@ services:
 - { key: app.name, operator: In, values: [plane-plane-api] }
 topologyKey: topology.kubernetes.io/zone

-web: { replicas: 3 }
-space: { replicas: 2 }
-admin: { replicas: 2 }
-live: { replicas: 3 }
+web: { replicas: 3 }
+space: { replicas: 2 }
+admin: { replicas: 2 }
+live: { replicas: 3 }
 worker: { replicas: 4 }
-silo: { enabled: true, replicas: 2 }
+silo: { enabled: true, replicas: 2 }

-beatworker: { replicas: 1 } # singleton - do not scale
-pi_beat_worker: { replicas: 1 } # singleton - do not scale
+beatworker: { replicas: 1 } # singleton - do not scale
+pi_beat_worker: { replicas: 1 } # singleton - do not scale
 ```

 Repeat the `affinity` block (varying the pod label) for every Tier-1 service. YAML anchors (`&spread-api` / `*spread-api`) help avoid repetition.
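
Because the pod label differs per service, one way to produce the repeated `affinity` blocks is to generate them. The Python sketch below is a hedged illustration, not the chart's tooling. The service list is taken from the reference values visible in this hunk, and it is an assumption that every one of them is Tier-1. The soft (`preferred...`) rule type and the weight of 100 are also assumptions, since the full block is not shown here, and `zone_spread_affinity` is a hypothetical helper name. It prints JSON, which is also valid YAML.

```python
import json

# Assumed service list, taken from the reference values shown in the diff.
SERVICES = ["api", "web", "space", "admin", "live", "worker", "silo"]


def zone_spread_affinity(namespace: str, release: str, service: str) -> dict:
    """Return a podAntiAffinity block that spreads one service across zones."""
    label = f"{namespace}-{release}-{service}"  # e.g. plane-plane-api
    term = {
        "labelSelector": {
            "matchExpressions": [
                {"key": "app.name", "operator": "In", "values": [label]}
            ]
        },
        "topologyKey": "topology.kubernetes.io/zone",
    }
    return {
        "podAntiAffinity": {
            # Soft rule chosen for the sketch; the guide may use a hard rule.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term}
            ]
        }
    }


if __name__ == "__main__":
    blocks = {svc: zone_spread_affinity("plane", "plane", svc) for svc in SERVICES}
    print(json.dumps(blocks, indent=2))
```

Where each generated block nests inside `values.yaml` depends on the chart's schema, which this hunk does not show, so paste the output under the key the chart documents for that service.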
