Commit bbedb60

Neurostepampagent
andcommitted
docs: document maintenance mode feature
Update documentation across all relevant files:
- README: add Maintenance Mode Integration to features list
- API reference: add MaintenanceSpec type, MaintenanceMode condition, StartMaintenance/EndMaintenance upgrade phases
- Architecture: add Maintenance Jobs to diagram and reconciliation strategy, add maintenance_jobs.go to project structure
- Safe upgrade runbook: add Maintenance Mode section with YAML examples, update upgrade order and phases table

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
1 parent a0f019a commit bbedb60

4 files changed

Lines changed: 71 additions & 6 deletions

README.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -12,6 +12,7 @@ It provides declarative cluster management through custom resources, enabling us
1212
- **Multiple Worker Pools**: Different sizing, node selectors, and tolerations per workload profile
1313
- **Scale-to-Zero**: Snapshot workers can scale to zero when no initial loads are running
1414
- **Automatic Lifecycle Management**: OwnerReferences enable automatic garbage collection on CR deletion
15+
- **Maintenance Mode Integration**: Gracefully pauses mirrors before upgrades and resumes them after via PeerDB's maintenance workflows
1516

1617
## Getting Started
1718

docs/api-reference/v1alpha1.md

Lines changed: 15 additions & 1 deletion
@@ -32,6 +32,7 @@ This document describes all Custom Resource Definitions (CRDs) managed by the Pe
3232
| `paused` | `bool` | No | `false` | When true, the operator stops reconciling this cluster. |
3333
| `upgradePolicy` | [`UpgradePolicy`](#upgradepolicy) | No | `Automatic` | Controls how version upgrades are applied. Enum: `Automatic`, `Manual`. |
3434
| `maintenanceWindow` | [`MaintenanceWindow`](#maintenancewindow) | No || Time window for automatic upgrades. Only used when `upgradePolicy` is `Automatic`. |
35+
| `maintenance` | [`MaintenanceSpec`](#maintenancespec) | No || Configures PeerDB maintenance mode for graceful upgrades. When set, the operator pauses mirrors before upgrading and resumes them after. |
3536

3637
### PeerDBClusterStatus
3738

@@ -205,6 +206,16 @@ Defines a time window during which automatic upgrades may be applied.
205206
| `end` | `string` | **Yes** || End time in 24-hour `HH:MM` format. |
206207
| `timeZone` | `*string` | No | `UTC` | IANA timezone name (e.g., `America/New_York`). |
207208

209+
### MaintenanceSpec
210+
211+
Configuration for PeerDB maintenance mode during upgrades. When configured, the operator runs maintenance Jobs (`ghcr.io/peerdb-io/flow-maintenance`) to gracefully pause all mirrors before upgrading and resume them after.
212+
213+
| Field | Type | Required | Default | Description |
214+
|-------|------|----------|---------|-------------|
215+
| `image` | `*string` | No | `ghcr.io/peerdb-io/flow-maintenance:stable-{version}` | Container image override for the maintenance Job. |
216+
| `backoffLimit` | `*int32` | No | `4` | Number of retries before marking the maintenance Job as failed (min: 0). |
217+
| `resources` | `*ResourceRequirements` | No || CPU/memory resource requests and limits for the maintenance Job container. |
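
The defaults in the table above can be sketched in Go as follows. This is a minimal illustration, not the operator's actual code: the struct is a simplified stand-in for `MaintenanceSpec` (the `resources` field is left out to avoid a Kubernetes dependency), the helper names `resolveImage` and `resolveBackoffLimit` are hypothetical, and it assumes that `{version}` in the default image tag is replaced with `spec.version` verbatim.

```go
package main

import "fmt"

// MaintenanceSpec is a simplified stand-in for the documented type.
// The documented `resources` field is omitted to keep this dependency-free.
type MaintenanceSpec struct {
	Image        *string // optional image override
	BackoffLimit *int32  // optional retry count; documented default is 4
}

const (
	defaultMaintenanceRepo       = "ghcr.io/peerdb-io/flow-maintenance"
	defaultBackoffLimit    int32 = 4
)

// resolveImage returns the override when set, otherwise the documented
// default pattern "stable-{version}".
// Assumption: {version} is the cluster's spec.version, used as-is.
func resolveImage(spec MaintenanceSpec, version string) string {
	if spec.Image != nil && *spec.Image != "" {
		return *spec.Image
	}
	return fmt.Sprintf("%s:stable-%s", defaultMaintenanceRepo, version)
}

// resolveBackoffLimit returns the override when set, otherwise the default.
func resolveBackoffLimit(spec MaintenanceSpec) int32 {
	if spec.BackoffLimit != nil {
		return *spec.BackoffLimit
	}
	return defaultBackoffLimit
}

func main() {
	custom := "custom-registry/flow-maintenance:v1.0.0"
	fmt.Println(resolveImage(MaintenanceSpec{}, "v0.37.0"))
	fmt.Println(resolveImage(MaintenanceSpec{Image: &custom}, "v0.37.0"))
	fmt.Println(resolveBackoffLimit(MaintenanceSpec{}))
}
```

Pointer fields make "unset" distinguishable from a zero value, which matters for `backoffLimit` because `0` is a valid setting (min: 0).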
218+
208219
### UpgradePolicy
209220

210221
`string` enum controlling how version upgrades are applied.
@@ -232,7 +243,7 @@ Tracks the state of a rolling version upgrade.
232243
|-------|------|-------------|
233244
| `fromVersion` | `string` | The version being upgraded from. |
234245
| `toVersion` | `string` | The version being upgraded to. |
235-
| `phase` | `UpgradePhase` | Current upgrade phase. Values: `Complete`, `Waiting`, `Blocked`, `Config`, `InitJobs`, `FlowAPI`, `PeerDBServer`, `UI`. |
246+
| `phase` | `UpgradePhase` | Current upgrade phase. Values: `Complete`, `Waiting`, `Blocked`, `StartMaintenance`, `Config`, `InitJobs`, `FlowAPI`, `PeerDBServer`, `UI`, `EndMaintenance`. |
236247
| `startedAt` | `*metav1.Time` | Timestamp when the upgrade started. |
237248
| `message` | `string` | Human-readable message about the upgrade state. |
238249

@@ -361,6 +372,7 @@ The following condition types are used in `PeerDBCluster` status:
361372
| `Degraded` | Set to `True` when one or more components are unhealthy but the cluster is partially operational. |
362373
| `UpgradeInProgress` | Set to `True` when a version upgrade is in progress. |
363374
| `BackupSafe` | Whether it is safe to take a backup. `True` when no upgrade or rolling restart is in progress. `False` with reason `BackupInProgress` when the `peerdb.io/backup-in-progress` annotation is set, or `BackupUnsafe` when an upgrade/rollout is active. |
375+
| `MaintenanceMode` | Set to `True` when PeerDB maintenance mode is active (mirrors are paused for an upgrade). Set to `False` with reason `MaintenanceComplete` after mirrors are resumed. |
364376

365377
### Annotations
366378

@@ -383,9 +395,11 @@ The `UpgradeStatus.phase` field tracks progress through a rolling upgrade:
383395
|-------|-------------|
384396
| `Waiting` | Upgrade is pending (e.g., waiting for a maintenance window). |
385397
| `Blocked` | Upgrade is blocked (e.g., manual policy requires acknowledgement). |
398+
| `StartMaintenance` | Running the StartMaintenance Job to pause mirrors before upgrade. |
386399
| `Config` | Updating shared ConfigMap and configuration. |
387400
| `InitJobs` | Re-running init jobs if needed. |
388401
| `FlowAPI` | Rolling out the Flow API Deployment. |
389402
| `PeerDBServer` | Rolling out the PeerDB Server Deployment. |
390403
| `UI` | Rolling out the PeerDB UI Deployment. |
404+
| `EndMaintenance` | Running the EndMaintenance Job to resume mirrors after upgrade. |
391405
| `Complete` | Upgrade finished successfully. |
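
The table above lists the phases but not how the optional maintenance phases slot into the sequence. The Go sketch below shows one way to derive the order, assuming the sequence from the runbook's rollout diagram (`StartMaintenance` first, `EndMaintenance` after `UI`). The function name `rolloutPhases` and the constant names are hypothetical, not taken from the operator source. `Waiting` and `Blocked` are left out because they are holding states, not rollout steps.

```go
package main

import "fmt"

// UpgradePhase mirrors the documented string enum.
type UpgradePhase string

const (
	PhaseStartMaintenance UpgradePhase = "StartMaintenance"
	PhaseConfig           UpgradePhase = "Config"
	PhaseInitJobs         UpgradePhase = "InitJobs"
	PhaseFlowAPI          UpgradePhase = "FlowAPI"
	PhasePeerDBServer     UpgradePhase = "PeerDBServer"
	PhaseUI               UpgradePhase = "UI"
	PhaseEndMaintenance   UpgradePhase = "EndMaintenance"
	PhaseComplete         UpgradePhase = "Complete"
)

// rolloutPhases returns the ordered phases of one upgrade.
// The maintenance phases are included only when spec.maintenance is set.
func rolloutPhases(maintenanceConfigured bool) []UpgradePhase {
	core := []UpgradePhase{
		PhaseConfig, PhaseInitJobs, PhaseFlowAPI, PhasePeerDBServer, PhaseUI,
	}
	if !maintenanceConfigured {
		return append(core, PhaseComplete)
	}
	phases := []UpgradePhase{PhaseStartMaintenance}
	phases = append(phases, core...)
	return append(phases, PhaseEndMaintenance, PhaseComplete)
}

func main() {
	fmt.Println(rolloutPhases(true))
	fmt.Println(rolloutPhases(false))
}
```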

docs/architecture.md

Lines changed: 9 additions & 4 deletions
@@ -37,6 +37,7 @@ flowchart TB
3737
UISvc["PeerDB UI\nService :3000"]
3838
NSJob["Temporal NS\nRegister Job"]
3939
SAJob["Search Attr\nJob"]
40+
MaintJob["Maintenance\nJobs"]
4041
end
4142
4243
subgraph ManagedByWorker["Owned by PeerDBWorkerPool"]
@@ -58,6 +59,7 @@ flowchart TB
5859
CC --> UISvc
5960
CC --> NSJob
6061
CC --> SAJob
62+
CC --> MaintJob
6163
6264
WC -->|"reads cluster config"| PeerDBCluster
6365
WC --> WorkerDep
@@ -121,9 +123,11 @@ A single CRD would force all scaling decisions through one reconciler and one sp
121123

122124
1. **Dependency validation** — Check catalog password Secret exists before proceeding
123125
2. **Shared infrastructure** — ServiceAccount → ConfigMap (connection config)
124-
3. **Init jobs** — Idempotent Temporal setup jobs; cluster waits for completion
125-
4. **Components** — Flow API → PeerDB Server → UI (Deployments + Services)
126-
5. **Status rollup** — Individual conditions aggregate into overall `Ready` condition
126+
3. **Maintenance mode** — If `spec.maintenance` is set, run StartMaintenance Job to pause mirrors (upgrade only)
127+
4. **Init jobs** — Idempotent Temporal setup jobs; cluster waits for completion
128+
5. **Components** — Flow API → PeerDB Server → UI (Deployments + Services)
129+
6. **End maintenance** — If `spec.maintenance` is set, run EndMaintenance Job to resume mirrors (upgrade only)
130+
7. **Status rollup** — Individual conditions aggregate into overall `Ready` condition
127131

128132
All managed resources have **OwnerReferences** set to the parent CR, enabling automatic garbage collection on deletion without custom finalizers.
129133

@@ -154,7 +158,8 @@ internal/
154158
├── ui.go # PeerDB UI Deployment + Service
155159
├── flow_worker.go # Flow Worker Deployment
156160
├── snapshot_worker.go # Snapshot Worker StatefulSet + headless Service
157-
└── init_jobs.go # Temporal init Jobs
161+
├── init_jobs.go # Temporal init Jobs
162+
└── maintenance_jobs.go # Maintenance mode Jobs
158163
159164
config/
160165
├── crd/bases/ # Generated CRD manifests

docs/runbooks/safe-upgrade.md

Lines changed: 46 additions & 1 deletion
@@ -70,10 +70,11 @@ For more control, use the manual upgrade policy:
7070
The controller enforces a specific rollout order to minimize disruption:
7171

7272
```
73-
ConfigMap/Secrets → Init Jobs → Flow API → PeerDB Server → UI
73+
[StartMaintenance →] ConfigMap/Secrets → Init Jobs → Flow API → PeerDB Server → UI [→ EndMaintenance]
7474
```
7575

7676
Each step must complete successfully before the next begins. This ensures:
77+
- Mirrors are gracefully paused before any component restarts (when `spec.maintenance` is configured).
7778
- Configuration is propagated before any component restarts.
7879
- The Flow API (gRPC backend) is ready before the Server and UI that depend on it.
7980
- The UI is upgraded last since it's the least critical component.
@@ -102,6 +103,48 @@ spec:
102103
- Remove or omit `maintenanceWindow` to allow upgrades at any time.
103104
- If `timeZone` is not specified, it defaults to UTC.
104105

106+
## Maintenance Mode
107+
108+
PeerDB has a built-in maintenance mode that gracefully pauses all running mirrors before an upgrade and resumes them after. The operator integrates this via Kubernetes Jobs:
109+
110+
```yaml
111+
apiVersion: peerdb.peerdb.io/v1alpha1
112+
kind: PeerDBCluster
113+
metadata:
114+
name: peerdb
115+
spec:
116+
version: "v0.37.0"
117+
maintenance: {}
118+
# ... rest of spec
119+
```
120+
121+
When `spec.maintenance` is set, the upgrade flow becomes:
122+
123+
1. **StartMaintenance** — A Job runs using the `flow-maintenance` image with the `start` command. This triggers PeerDB's `StartMaintenance` Temporal workflow, which waits for running snapshots to finish, enables maintenance mode (`PEERDB_MAINTENANCE_MODE_ENABLED`), and pauses all running mirrors.
124+
2. **Normal upgrade** — Config, init jobs, Flow API, Server, and UI are rolled out in order.
125+
3. **EndMaintenance** — A Job runs with the `end` command, resuming all previously paused mirrors and disabling maintenance mode.
126+
127+
While maintenance mode is active, mirrors cannot be created or mutated through PeerDB.
128+
129+
### Customizing the Maintenance Job
130+
131+
```yaml
132+
spec:
133+
maintenance:
134+
image: "custom-registry/flow-maintenance:v1.0.0" # Override image
135+
backoffLimit: 6 # Retry count
136+
resources:
137+
requests:
138+
cpu: "100m"
139+
memory: "128Mi"
140+
```
141+
142+
If a maintenance Job fails, the operator deletes it and retries automatically, and sets a `Degraded` condition on the cluster. To check whether maintenance mode is currently active, inspect the `MaintenanceMode` condition:
143+
144+
```bash
145+
kubectl get peerdbcluster <name> -o json | jq '.status.conditions[] | select(.type=="MaintenanceMode")'
146+
```
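
The same lookup can be done programmatically, for example in a health check. The Go sketch below decodes a conditions array and picks out one condition by type. It is illustrative only: the `Condition` struct covers just the common `type`, `status`, `reason`, and `message` fields, `findCondition` is a hypothetical helper, and the sample JSON is made up to match the documented `MaintenanceMode` condition and its `MaintenanceComplete` reason.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition holds the subset of status condition fields used here.
type Condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
}

// findCondition decodes a JSON array of conditions and returns the first
// entry whose type matches condType, or nil if none matches.
func findCondition(raw []byte, condType string) (*Condition, error) {
	var conds []Condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return nil, fmt.Errorf("decode conditions: %w", err)
	}
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i], nil
		}
	}
	return nil, nil
}

func main() {
	// Made-up sample shaped like .status.conditions of a PeerDBCluster.
	sample := []byte(`[
		{"type": "Degraded", "status": "False"},
		{"type": "MaintenanceMode", "status": "False", "reason": "MaintenanceComplete"}
	]`)

	cond, err := findCondition(sample, "MaintenanceMode")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if cond == nil {
		fmt.Println("MaintenanceMode condition not present")
		return
	}
	fmt.Printf("MaintenanceMode=%s reason=%s\n", cond.Status, cond.Reason)
}
```

A missing condition is reported separately from `False`, since a cluster that has never run a maintenance Job may not carry the condition at all.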
147+
105148
## Monitoring Upgrade Progress
106149

107150
### Quick Status
@@ -140,8 +183,10 @@ Example output:
140183
| `FlowAPI` | Rolling out Flow API Deployment |
141184
| `PeerDBServer` | Rolling out PeerDB Server Deployment |
142185
| `UI` | Rolling out UI Deployment |
186+
| `EndMaintenance` | Running EndMaintenance Job (resuming mirrors) |
143187
| `Complete` | Upgrade finished successfully |
144188
| `Blocked` | Upgrade blocked — dependencies are unhealthy |
189+
| `StartMaintenance` | Running StartMaintenance Job (pausing mirrors) |
145190

146191
### Watch Upgrade Events
147192
