Commit cc14d1a

stuggi and claude committed
Fix remaining issues in backup/restore CRD design
Changes:

1. Fixed retention policy references
   - Changed "old backup deletion based on retention policy" to "Job cleanup (ansible-runner Jobs)"
   - OADP TTL handles backup retention, not controller
2. Fixed controller reference
   - "generic controller" → "backup and restore controllers"
3. Updated risks section
   - Changed retention policy risk to OADP TTL configuration risk
   - Added mitigation: sensible defaults, validation
4. Clarified Implementation Plan Phase 4
   - DataPlane support added to existing controllers
   - Not separate controllers (same BackupController/RestoreController)
5. Added CRD field defaults and validation
   - oadp.snapshotMoveData: true (default)
   - oadp.ttl: 720h (default)
   - storage.size: 10Gi (default)
   - Validation rules documented
6. Clarified OADP restore workflow
   - Manual OADP restore required first (restore PVCs)
   - Then OpenStackControlPlaneRestore CR (restore CRs/DB)
   - Explained chicken-and-egg problem
7. Fixed Pod vs Deployment confusion
   - Changed oc run + oc set volume to proper Pod manifest
   - Added wait for pod ready step
8. Added concurrent backup handling section
   - RWX PVC supports concurrent access
   - Different timestamps avoid file conflicts
   - Recommendation: stagger schedules
9. Added CronJob lifecycle section
   - CronJob naming convention
   - Schedule update handling
   - Automatic cleanup via ownerReferences
   - Job instance naming

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent a8b8b4b commit cc14d1a

1 file changed

Lines changed: 152 additions & 19 deletions

docs/dev/backup-restore-crd-design.md

@@ -164,6 +164,40 @@ status:
164164
completionTimestamp: "2026-03-02T08:32:00Z"
165165
```
166166
167+
### CRD Field Defaults and Validation
168+
169+
**Backup CR defaults:**
170+
```yaml
171+
spec:
172+
oadp:
173+
enabled: true # OADP integration enabled by default
174+
snapshotMoveData: true # Default: copy to S3 (production-safe)
175+
ttl: 720h # Default: 30 days retention
176+
namespace: openshift-adp # Default OADP namespace
177+
storage:
178+
pvc: openstack-backup-storage # Default PVC name
179+
size: 10Gi # Default PVC size (if controller creates it)
180+
storageClass: "" # Default: use cluster default StorageClass
181+
```
182+
183+
**Restore CR defaults:**
184+
```yaml
185+
spec:
186+
oadp:
187+
enabled: true # OADP integration enabled by default
188+
namespace: openshift-adp # Default OADP namespace
189+
storage:
190+
pvc: openstack-backup-storage # Default PVC name (must match backup)
191+
validation:
192+
enabled: false # Default: no automatic validation
193+
```
194+
195+
**Validation rules:**
196+
- `oadp.ttl` must be >= 1h
197+
- `storage.size` must be >= 1Gi
198+
- `storage.pvc` required (can be defaulted)
199+
- `backupName` required for restore
200+
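The validation rules above can be sketched as a small pre-flight check. This is an illustration only, not the operator's implementation: the function names `validate_ttl`, `validate_size`, and `require_nonempty` are hypothetical, and the sketch assumes `oadp.ttl` is written in whole hours (for example `720h`) and `storage.size` in whole Gi (for example `10Gi`). The real CRD may accept other duration and quantity formats.

```bash
# Hypothetical pre-flight check mirroring the documented validation rules.
# Assumption: ttl looks like "<N>h" and size looks like "<N>Gi".

# Succeeds when $1 is "<N>h" with N >= 1 (rule: oadp.ttl must be >= 1h).
validate_ttl() {
  _hours="${1%h}"
  [ "$_hours" != "$1" ] || return 1                 # missing "h" suffix
  case "$_hours" in ''|*[!0-9]*) return 1 ;; esac   # not a plain integer
  [ "$_hours" -ge 1 ]
}

# Succeeds when $1 is "<N>Gi" with N >= 1 (rule: storage.size must be >= 1Gi).
validate_size() {
  _gib="${1%Gi}"
  [ "$_gib" != "$1" ] || return 1                   # missing "Gi" suffix
  case "$_gib" in ''|*[!0-9]*) return 1 ;; esac     # not a plain integer
  [ "$_gib" -ge 1 ]
}

# Succeeds when $1 is non-empty (rules: storage.pvc and backupName are required).
require_nonempty() {
  [ -n "$1" ]
}
```

With the documented defaults, `validate_ttl 720h` and `validate_size 10Gi` both succeed, while `validate_ttl 30m` and `validate_size 512Mi` are rejected by this simplified parser even though they might be well-formed values in the real API.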
167201
### Controller Implementation
168202

169203
**Location:** openstack-operator
@@ -199,7 +233,7 @@ Each controller:
199233
- `docs/dev/playbooks/restore-openstack-ctlplane.yaml`
200234
- `docs/dev/playbooks/restore-openstack-dataplane.yaml`
201235
4. Updates CR status based on Job progress
202-
5. Handles cleanup (Job retention, old backup deletion based on retention policy)
236+
5. Handles Job cleanup (retention of completed ansible-runner Jobs)
203237

204238
### Playbook Override Mechanism
205239

@@ -275,14 +309,87 @@ spec:
275309
credentialsSecret: s3-creds
276310
```
277311

312+
### Concurrent Backup Handling
313+
314+
**Question:** What if hourly and daily backups run simultaneously?
315+
316+
**Answer:** A PVC with the ReadWriteMany (RWX) access mode supports concurrent access, and each run uses timestamped names, so simultaneous backups do not collide:
317+
318+
**Archive file conflict:**
319+
- Hourly writes: `/backup/openstack-ctlplane-backup-TIMESTAMP1.tar.gz`
320+
- Daily writes: `/backup/openstack-ctlplane-backup-TIMESTAMP2.tar.gz`
321+
- Different timestamps = different files = no conflict
322+
323+
**OADP backup naming:**
324+
- Each backup CR creates unique OADP backup:
325+
- Hourly: `openstack-volumes-TIMESTAMP1`
326+
- Daily: `openstack-volumes-TIMESTAMP2`
327+
- No OADP conflicts
328+
329+
**Galera backup jobs:**
330+
- Each backup triggers separate Galera backup job
331+
- Job names include timestamp: `backup-openstack-TIMESTAMP`
332+
- No job conflicts
333+
334+
**Best practices:**
335+
- **Stagger schedules** to avoid concurrent execution (recommended)
336+
- Concurrent execution works, but it produces duplicate snapshots of nearly the same point in time
337+
- OADP can handle concurrent snapshot operations on different PVCs
338+
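The no-conflict argument above rests on every artifact name embedding the run's timestamp. A minimal sketch of that naming scheme follows. The name patterns are taken from this section, but the helper names `backup_timestamp` and `backup_names` are hypothetical and the timestamp format (`YYYYMMDD-HHMMSS`, UTC) is an assumption.

```bash
# Sketch: derive all per-run names from a single timestamp.
# Assumption: timestamps use second resolution in UTC.

# Prints the current UTC time as YYYYMMDD-HHMMSS.
backup_timestamp() {
  date -u +%Y%m%d-%H%M%S
}

# Prints the archive path, OADP backup name, and Galera job name for timestamp $1.
backup_names() {
  printf '%s\n' \
    "/backup/openstack-ctlplane-backup-$1.tar.gz" \
    "openstack-volumes-$1" \
    "backup-openstack-$1"
}
```

Calling `backup_names "$(backup_timestamp)"` from two runs yields disjoint names only when the timestamps differ. Under the second-resolution assumption, two runs that start within the same second would still collide, which is one more reason to stagger schedules.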
339+
### CronJob Lifecycle
340+
341+
**CronJob management:**
342+
343+
**Creation:**
344+
- BackupController creates CronJob when backup CR has `schedule` field
345+
- CronJob name: `<backup-cr-name>-cronjob`
346+
- Example: `ctlplane-backup-daily-cronjob`
347+
348+
**CronJob spec:**
349+
```yaml
350+
apiVersion: batch/v1
351+
kind: CronJob
352+
metadata:
353+
name: ctlplane-backup-daily-cronjob
354+
namespace: openstack
355+
ownerReferences:
356+
- apiVersion: infra.openstack.org/v1beta1
357+
kind: OpenStackControlPlaneBackup
358+
name: ctlplane-backup-daily
359+
spec:
360+
schedule: "0 2 * * *"
361+
jobTemplate:
362+
spec:
363+
template:
364+
metadata:
365+
generateName: ctlplane-backup-daily-
366+
# Creates backup CR instance for each run
367+
```
368+
369+
**Schedule changes:**
370+
- Controller watches backup CR for changes
371+
- If `schedule` field updated, controller updates CronJob schedule
372+
- No recreation needed, just patch CronJob spec
373+
374+
**Deletion:**
375+
- CronJob has ownerReference to backup CR
376+
- Deleting backup CR automatically deletes CronJob
377+
- Kubernetes garbage collection handles cleanup
378+
379+
**Job instances:**
380+
- Each CronJob execution creates a backup CR instance
381+
- Instance name: `<backup-cr-name>-TIMESTAMP`
382+
- Example: `ctlplane-backup-daily-20260302-020000`
383+
- Instances are independent CRs (can be listed, monitored separately)
384+
278385
## Related Components
279386

280387
These backup/restore capabilities complement but remain separate from:
281388
- **GaleraBackup/GaleraRestore** - In mariadb-operator, database-specific dumps
282389
- **OVNBackup/OVNRestore** - To be implemented, OVN database-specific
283390
- **test-operator** - Used for post-restore validation (Tempest tests)
284391

285-
The generic backup/restore controller in openstack-operator orchestrates the full backup/restore workflow using playbooks, while these components handle specific subsystems.
392+
The backup and restore controllers in openstack-operator orchestrate the full backup/restore workflow using playbooks, while these components handle specific subsystems.
286393

287394
## User Workflow
288395

@@ -809,27 +916,40 @@ oc create namespace openstack-backup-inspection
809916
# 2. Restore archive PVC to temp namespace (see above)
810917
811918
# 3. Create helper pod to inspect
812-
oc run -n openstack-backup-inspection archive-inspector \
813-
--image=registry.redhat.io/ubi9/ubi:latest \
814-
--command -- sleep infinity
919+
cat <<EOF | oc apply -f -
920+
apiVersion: v1
921+
kind: Pod
922+
metadata:
923+
name: archive-inspector
924+
namespace: openstack-backup-inspection
925+
spec:
926+
containers:
927+
- name: inspector
928+
image: registry.redhat.io/ubi9/ubi:latest
929+
command: ["/bin/bash", "-c", "sleep infinity"]
930+
volumeMounts:
931+
- name: backup
932+
mountPath: /backup
933+
volumes:
934+
- name: backup
935+
persistentVolumeClaim:
936+
claimName: openstack-backup-storage
937+
EOF
815938
816-
oc set volume -n openstack-backup-inspection \
817-
deployment/archive-inspector \
818-
--add --mount-path=/backup \
819-
--name=backup-pvc \
820-
--claim-name=openstack-backup-storage
939+
# 4. Wait for pod to be ready
940+
oc wait -n openstack-backup-inspection pod/archive-inspector --for=condition=Ready --timeout=60s
821941
822-
# 4. Inspect archive
942+
# 5. Inspect archive
823943
oc exec -n openstack-backup-inspection archive-inspector -- \
824944
tar -tzf /backup/openstack-ctlplane-backup-20260302.tar.gz
825945
826946
oc exec -n openstack-backup-inspection archive-inspector -- \
827947
cat /backup/openstack-ctlplane-backup-20260302/operator-versions.txt
828948
829-
# 5. If good, cleanup temp and do real restore
949+
# 6. If the archive looks good, clean up the temp namespace and do the real restore
830950
oc delete namespace openstack-backup-inspection
831951
832-
# 6. Full restore to production namespace
952+
# 7. Full restore to production namespace
833953
# (see disaster recovery workflow below)
834954
```
835955

@@ -851,6 +971,18 @@ spec:
851971

852972
**Decision:** No dedicated metadata PVC needed. Use OADP backup list + backup archive contents.
853973

974+
**OADP restore workflow:**
975+
976+
The restore process has two stages:
977+
1. **Manual OADP restore** - Restore PVCs from S3 snapshots (prerequisite)
978+
2. **OpenStackControlPlaneRestore CR** - Restore CRs, databases, resume deployment
979+
980+
**Why manual OADP restore first?**
981+
- RestoreController needs backup archive from PVC to validate operator versions
982+
- Cannot trigger OADP restore automatically without reading metadata first
983+
- Chicken-and-egg: need PVC to read metadata, need metadata to know which OADP backup to restore
984+
- Solution: User manually restores PVCs, then RestoreController takes over
985+
854986
**Disaster recovery workflow (fresh cluster):**
855987

856988
```bash
@@ -943,13 +1075,13 @@ Controller reads `operator-versions.txt` from restored archive and validates:
9431075
- **Mitigation:** Validate playbook syntax before execution, provide clear error messages
9441076

9451077
2. **Risk:** Backup job failure leaves cluster in unknown state
946-
- **Mitigation:** Transactional approach, status tracking, rollback capability
1078+
- **Mitigation:** Transactional approach, status tracking, clear failure status
9471079

948-
3. **Risk:** Retention policy deletes backups too aggressively
949-
- **Mitigation:** Clear defaults, user confirmation for manual deletes
1080+
3. **Risk:** Incorrect OADP TTL configuration deletes backups too quickly
1081+
- **Mitigation:** Sensible defaults (30 days), clear documentation, validate TTL values
9501082

9511083
4. **Risk:** Controller complexity (managing Jobs, CronJobs, OADP CRs)
952-
- **Mitigation:** Start simple, iterate based on feedback
1084+
- **Mitigation:** Start simple, iterate based on feedback, keep as much logic as possible in playbooks
9531085

9541086
## Implementation Plan
9551087

@@ -970,8 +1102,9 @@ Controller reads `operator-versions.txt` from restored archive and validates:
9701102
- Controller-managed CronJob lifecycle
9711103

9721104
**Phase 4: DataPlane backup/restore**
973-
- Implement OpenStackDataPlaneBackup/Restore controllers
974-
- Follow same pattern as ControlPlane
1105+
- Add DataPlane CR support to existing controllers
1106+
- DataPlane playbooks (already exist)
1107+
- Follow same pattern as ControlPlane (same controllers handle both)
9751108

9761109
**Phase 5: Advanced features**
9771110
- Playbook override support
