@@ -164,6 +164,40 @@ status:
   completionTimestamp: "2026-03-02T08:32:00Z"
 ```
 
+### CRD Field Defaults and Validation
+
+**Backup CR defaults:**
+```yaml
+spec:
+  oadp:
+    enabled: true  # OADP integration enabled by default
+    snapshotMoveData: true  # Default: copy to S3 (production-safe)
+    ttl: 720h  # Default: 30 days retention
+    namespace: openshift-adp  # Default OADP namespace
+  storage:
+    pvc: openstack-backup-storage  # Default PVC name
+    size: 10Gi  # Default PVC size (if controller creates it)
+    storageClass: ""  # Default: use cluster default StorageClass
+```
+
+**Restore CR defaults:**
+```yaml
+spec:
+  oadp:
+    enabled: true  # OADP integration enabled by default
+    namespace: openshift-adp  # Default OADP namespace
+  storage:
+    pvc: openstack-backup-storage  # Default PVC name (must match backup)
+  validation:
+    enabled: false  # Default: no automatic validation
+```
+
+**Validation rules:**
+- `oadp.ttl` must be >= 1h
+- `storage.size` must be >= 1Gi
+- `storage.pvc` required (can be defaulted)
+- `backupName` required for restore
+
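The Backup CR defaults and the first three validation rules above can be sketched as follows. This is an illustrative Python sketch only, not the operator's implementation (which would more likely express these rules in the CRD schema or in Go). The function names and the simplified duration and quantity parsers are assumptions made for the example.

```python
import re

# Defaults taken from the Backup CR defaults listed above.
DEFAULT_TTL = "720h"
DEFAULT_SIZE = "10Gi"
DEFAULT_PVC = "openstack-backup-storage"

_DURATION_UNITS = {"h": 3600, "m": 60, "s": 1}
_BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_duration_seconds(value: str) -> int:
    """Parse a simple duration such as '720h' or '1h30m' (hypothetical helper)."""
    parts = re.findall(r"(\d+)([hms])", value)
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(int(n) * _DURATION_UNITS[u] for n, u in parts)


def parse_quantity_bytes(value: str) -> int:
    """Parse a binary-suffixed quantity such as '10Gi' (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", value)
    if match is None:
        raise ValueError(f"unsupported quantity: {value!r}")
    return int(match.group(1)) * _BINARY_UNITS[match.group(2)]


def validate_backup_spec(spec: dict) -> list[str]:
    """Return the violated rules for a backup spec (empty list means valid)."""
    errors = []
    oadp = spec.get("oadp", {})
    storage = spec.get("storage", {})
    # Missing fields fall back to the documented defaults before validation.
    if parse_duration_seconds(oadp.get("ttl", DEFAULT_TTL)) < 3600:
        errors.append("oadp.ttl must be >= 1h")
    if parse_quantity_bytes(storage.get("size", DEFAULT_SIZE)) < 2**30:
        errors.append("storage.size must be >= 1Gi")
    if not storage.get("pvc", DEFAULT_PVC):
        errors.append("storage.pvc is required")
    return errors
```

An empty spec passes because every checked field is defaulted; a spec with `ttl: 30m` or `size: 512Mi` is rejected with the matching rule. The `backupName` rule for restore is not covered by this sketch.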
 ### Controller Implementation
 
 **Location:** openstack-operator
@@ -199,7 +233,7 @@ Each controller:
    - `docs/dev/playbooks/restore-openstack-ctlplane.yaml`
    - `docs/dev/playbooks/restore-openstack-dataplane.yaml`
 4. Updates CR status based on Job progress
-5. Handles cleanup (Job retention, old backup deletion based on retention policy)
+5. Handles Job cleanup (retention of completed ansible-runner Jobs)
 
 ### Playbook Override Mechanism
 
@@ -275,14 +309,87 @@ spec:
   credentialsSecret: s3-creds
 ```
 
+### Concurrent Backup Handling
+
+**Question:** What if hourly and daily backups run simultaneously?
+
+**Answer:** PVC with ReadWriteMany (RWX) supports concurrent access, but:
+
+**Archive file conflict:**
+- Hourly writes: `/backup/openstack-ctlplane-backup-TIMESTAMP1.tar.gz`
+- Daily writes: `/backup/openstack-ctlplane-backup-TIMESTAMP2.tar.gz`
+- Different timestamps = different files = no conflict
+
+**OADP backup naming:**
+- Each backup CR creates unique OADP backup:
+  - Hourly: `openstack-volumes-TIMESTAMP1`
+  - Daily: `openstack-volumes-TIMESTAMP2`
+- No OADP conflicts
+
+**Galera backup jobs:**
+- Each backup triggers separate Galera backup job
+- Job names include timestamp: `backup-openstack-TIMESTAMP`
+- No job conflicts
+
+**Best practices:**
+- **Stagger schedules** to avoid concurrent execution (recommended)
+- If concurrent execution occurs, it works but creates duplicate snapshots at same point-in-time
+- OADP can handle concurrent snapshot operations on different PVCs
+
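To make the "stagger schedules" recommendation concrete, the sketch below checks whether two cron schedules start at the same minute. It is an illustrative helper, not part of the proposal. It assumes an hourly schedule of `0 * * * *` (the document only shows the daily `0 2 * * *`), supports only `*` or a single integer in the minute and hour fields, and compares start times only, so it does not account for a backup that is still running when the next one starts.

```python
def fire_times(schedule: str) -> set[tuple[int, int]]:
    """Return the (hour, minute) pairs at which a simple cron schedule fires.

    Only '*' or a single integer is supported for the minute and hour fields,
    and the day-of-month, month and day-of-week fields must all be '*'.
    """
    minute, hour, dom, month, dow = schedule.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("only schedules that run every day are supported")
    minutes = range(60) if minute == "*" else [int(minute)]
    hours = range(24) if hour == "*" else [int(hour)]
    return {(h, m) for h in hours for m in minutes}


def colliding_times(schedule_a: str, schedule_b: str) -> list[tuple[int, int]]:
    """Return the sorted (hour, minute) start times shared by both schedules."""
    return sorted(fire_times(schedule_a) & fire_times(schedule_b))
```

With these assumptions, `colliding_times("0 * * * *", "0 2 * * *")` reports a shared start at 02:00, while moving the daily run to `30 2 * * *` removes the overlap.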
+### CronJob Lifecycle
+
+**CronJob management:**
+
+**Creation:**
+- BackupController creates CronJob when backup CR has `schedule` field
+- CronJob name: `<backup-cr-name>-cronjob`
+- Example: `ctlplane-backup-daily-cronjob`
+
+**CronJob spec:**
+```yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: ctlplane-backup-daily-cronjob
+  namespace: openstack
+  ownerReferences:
+  - apiVersion: infra.openstack.org/v1beta1
+    kind: OpenStackControlPlaneBackup
+    name: ctlplane-backup-daily
+spec:
+  schedule: "0 2 * * *"
+  jobTemplate:
+    spec:
+      template:
+        metadata:
+          generateName: ctlplane-backup-daily-
+        # Creates backup CR instance for each run
+```
+
+**Schedule changes:**
+- Controller watches backup CR for changes
+- If `schedule` field updated, controller updates CronJob schedule
+- No recreation needed, just patch CronJob spec
+
+**Deletion:**
+- CronJob has ownerReference to backup CR
+- Deleting backup CR automatically deletes CronJob
+- Kubernetes garbage collection handles cleanup
+
+**Job instances:**
+- Each CronJob execution creates a backup CR instance
+- Instance name: `<backup-cr-name>-TIMESTAMP`
+- Example: `ctlplane-backup-daily-20260302-020000`
+- Instances are independent CRs (can be listed, monitored separately)
+
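The naming conventions above can be sketched as two small helpers. This is an illustrative Python sketch, not the controller code. The function names are hypothetical, and the timestamp format `YYYYMMDD-HHMMSS` in UTC is an assumption inferred from the single example `ctlplane-backup-daily-20260302-020000`.

```python
from datetime import datetime, timezone


def cronjob_name(backup_cr_name: str) -> str:
    """CronJob name derived from the backup CR name: <backup-cr-name>-cronjob."""
    return f"{backup_cr_name}-cronjob"


def instance_name(backup_cr_name: str, run_time: datetime) -> str:
    """Per-run backup CR instance name: <backup-cr-name>-TIMESTAMP.

    The timestamp layout is assumed from the documented example; the run time
    is converted to UTC before formatting.
    """
    stamp = run_time.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{backup_cr_name}-{stamp}"
```

For a run at 02:00:00 UTC on 2026-03-02, `instance_name("ctlplane-backup-daily", ...)` yields the example name shown in the list above.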
 ## Related Components
 
 These backup/restore capabilities complement but remain separate from:
 - **GaleraBackup/GaleraRestore** - In mariadb-operator, database-specific dumps
 - **OVNBackup/OVNRestore** - To be implemented, OVN database-specific
 - **test-operator** - Used for post-restore validation (Tempest tests)
 
-The generic backup/restore controller in openstack-operator orchestrates the full backup/restore workflow using playbooks, while these components handle specific subsystems.
+The backup and restore controllers in openstack-operator orchestrate the full backup/restore workflow using playbooks, while these components handle specific subsystems.
 
 ## User Workflow
 
@@ -809,27 +916,40 @@ oc create namespace openstack-backup-inspection
 # 2. Restore archive PVC to temp namespace (see above)
 
 # 3. Create helper pod to inspect
-oc run -n openstack-backup-inspection archive-inspector \
-  --image=registry.redhat.io/ubi9/ubi:latest \
-  --command -- sleep infinity
+cat <<EOF | oc apply -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: archive-inspector
+  namespace: openstack-backup-inspection
+spec:
+  containers:
+  - name: inspector
+    image: registry.redhat.io/ubi9/ubi:latest
+    command: ["/bin/bash", "-c", "sleep infinity"]
+    volumeMounts:
+    - name: backup
+      mountPath: /backup
+  volumes:
+  - name: backup
+    persistentVolumeClaim:
+      claimName: openstack-backup-storage
+EOF
 
-oc set volume -n openstack-backup-inspection \
-  deployment/archive-inspector \
-  --add --mount-path=/backup \
-  --name=backup-pvc \
-  --claim-name=openstack-backup-storage
+# 4. Wait for pod to be ready
+oc wait -n openstack-backup-inspection pod/archive-inspector --for=condition=Ready --timeout=60s
 
-# 4. Inspect archive
+# 5. Inspect archive
 oc exec -n openstack-backup-inspection archive-inspector -- \
   tar -tzf /backup/openstack-ctlplane-backup-20260302.tar.gz
 
 oc exec -n openstack-backup-inspection archive-inspector -- \
   cat /backup/openstack-ctlplane-backup-20260302/operator-versions.txt
 
-# 5. If good, cleanup temp and do real restore
+# 6. If good, cleanup temp and do real restore
 oc delete namespace openstack-backup-inspection
 
-# 6. Full restore to production namespace
+# 7. Full restore to production namespace
 # (see disaster recovery workflow below)
 ```
 
@@ -851,6 +971,18 @@ spec:
 
 **Decision:** No dedicated metadata PVC needed. Use OADP backup list + backup archive contents.
 
+**OADP restore workflow:**
+
+The restore process has two stages:
+1. **Manual OADP restore** - Restore PVCs from S3 snapshots (prerequisite)
+2. **OpenStackControlPlaneRestore CR** - Restore CRs, databases, resume deployment
+
+**Why manual OADP restore first?**
+- RestoreController needs backup archive from PVC to validate operator versions
+- Cannot trigger OADP restore automatically without reading metadata first
+- Chicken-and-egg: need PVC to read metadata, need metadata to know which OADP backup to restore
+- Solution: User manually restores PVCs, then RestoreController takes over
+
 **Disaster recovery workflow (fresh cluster):**
 
 ```bash
@@ -943,13 +1075,13 @@ Controller reads `operator-versions.txt` from restored archive and validates:
    - **Mitigation:** Validate playbook syntax before execution, provide clear error messages
 
 2. **Risk:** Backup job failure leaves cluster in unknown state
-   - **Mitigation:** Transactional approach, status tracking, rollback capability
+   - **Mitigation:** Transactional approach, status tracking, clear failure status
 
-3. **Risk:** Retention policy deletes backups too aggressively
-   - **Mitigation:** Clear defaults, user confirmation for manual deletes
+3. **Risk:** Incorrect OADP TTL configuration deletes backups too quickly
+   - **Mitigation:** Sensible defaults (30 days), clear documentation, validate TTL values
 
 4. **Risk:** Controller complexity (managing Jobs, CronJobs, OADP CRs)
-   - **Mitigation:** Start simple, iterate based on feedback
+   - **Mitigation:** Start simple, iterate based on feedback, maximum logic in playbooks
 
 ## Implementation Plan
 
@@ -970,8 +1102,9 @@ Controller reads `operator-versions.txt` from restored archive and validates:
 - Controller-managed CronJob lifecycle
 
 **Phase 4: DataPlane backup/restore**
-- Implement OpenStackDataPlaneBackup/Restore controllers
-- Follow same pattern as ControlPlane
+- Add DataPlane CR support to existing controllers
+- DataPlane playbooks (already exist)
+- Follow same pattern as ControlPlane (same controllers handle both)
 
 **Phase 5: Advanced features**
 - Playbook override support