Skip to content

Commit df5aff1

Browse files
stuggiclaude
andcommitted
Fix inconsistencies in backup/restore CRD design
Changes: 1. Controller architecture: Two controllers (Backup and Restore) - BackupController handles both ControlPlane/DataPlane backups - RestoreController handles both ControlPlane/DataPlane restores - Different responsibilities justify separation 2. PVC management: BackupController creates PVC - Removes user prerequisite to pre-create PVC - Controller ensures PVC exists with correct labels - User can optionally specify size and storageClass 3. Fixed operator location references - Changed "infra-operator" to "openstack-operator" - Consistent with Decision #1 4. Removed retentionPolicy from CRD specs - OADP TTL handles retention (per Decision openstack-k8s-operators#3) - Added oadp.ttl and oadp.snapshotMoveData to specs 5. Updated "Separation of Concerns" to "Related Components" - Reflects two-controller design - Clarifies relationship with GaleraBackup, test-operator 6. Updated implementation plan Phase 3 - Changed "retention policy" to "OADP TTL configuration" 7. Updated discussion topics to open implementation questions - Removed answered questions - Added relevant implementation details Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent ba0e0d7 commit df5aff1

1 file changed

Lines changed: 116 additions & 59 deletions

File tree

docs/dev/backup-restore-crd-design.md

Lines changed: 116 additions & 59 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ Design for implementing CRD-based backup/restore automation for OpenStack contro
3232

3333
### CRDs
3434

35-
Four new CRDs in infra-operator (or new backup-operator):
35+
Four new CRDs in openstack-operator:
3636

3737
```yaml
3838
# 1. OpenStackControlPlaneBackup
@@ -56,15 +56,13 @@ spec:
5656
oadp:
5757
enabled: true
5858
namespace: openshift-adp
59+
snapshotMoveData: true # Copy snapshot data to S3 (default: true)
60+
ttl: 720h # Backup retention (default: 30 days)
5961

6062
# Backup storage location
6163
storage:
6264
pvc: openstack-backup-storage # PVC to store backup archives
6365

64-
# Retention policy
65-
retentionPolicy:
66-
keep: 7 # Keep last 7 backups
67-
6866
status:
6967
phase: Completed # Pending, Running, Completed, Failed
7068
backupName: openstack-ctlplane-backup-20260302-020000
@@ -89,10 +87,13 @@ spec:
8987
namespace: openstack
9088
schedule: "0 3 * * *"
9189
playbookOverride: my-custom-dataplane-backup
90+
oadp:
91+
enabled: true
92+
namespace: openshift-adp
93+
snapshotMoveData: true
94+
ttl: 720h
9295
storage:
9396
pvc: openstack-backup-storage
94-
retentionPolicy:
95-
keep: 7
9697

9798
status:
9899
phase: Completed
@@ -165,13 +166,27 @@ status:
165166
166167
### Controller Implementation
167168
168-
**Location:** infra-operator (or new backup-operator if it grows large)
169+
**Location:** openstack-operator
170+
171+
**Two Controllers:**
169172
170-
**Controllers:**
171-
1. `OpenStackControlPlaneBackup` controller
172-
2. `OpenStackDataPlaneBackup` controller
173-
3. `OpenStackControlPlaneRestore` controller
174-
4. `OpenStackDataPlaneRestore` controller
173+
1. **BackupController** - Handles backup operations
174+
- Watches: `OpenStackControlPlaneBackup` and `OpenStackDataPlaneBackup` CRs
175+
- Responsibilities:
176+
- Create/manage backup storage PVC
177+
- Create ansible-runner Job with backup playbook
178+
- Trigger OADP backup
179+
- Update backup CR status
180+
- Manage CronJobs for scheduled backups
181+
182+
2. **RestoreController** - Handles restore operations
183+
- Watches: `OpenStackControlPlaneRestore` and `OpenStackDataPlaneRestore` CRs
184+
- Responsibilities:
185+
- Validate backup exists and is compatible
186+
- Create ansible-runner Job with restore playbook
187+
- Monitor multi-stage restore progress
188+
- Update restore CR status
189+
- Trigger post-restore validation (optional)
175190

176191
**Execution Model:** Ansible Runner (like OpenStackDataPlaneDeployment)
177192

@@ -233,9 +248,21 @@ data:
233248
spec:
234249
storage:
235250
pvc: openstack-backup-storage
251+
size: 10Gi # Optional, default: 10Gi
252+
storageClass: lvms-vg1 # Optional, uses default if not specified
236253
```
237254

238-
Controller mounts this PVC into ansible-runner Job. Backup archives written to `/backup/`.
255+
**PVC Management:**
256+
- **BackupController creates PVC** if it doesn't exist
257+
- PVC labeled with `openstack.org/backup=true` (included in OADP backup)
258+
- Access mode: `ReadWriteMany` (for concurrent backup jobs)
259+
- Controller mounts PVC into ansible-runner Job
260+
- Backup archives written to `/backup/`
261+
262+
**PVC lifecycle:**
263+
- Created on first backup
264+
- Persists across backups (archives overwritten)
265+
- User can pre-create PVC with specific settings if needed
239266

240267
**Future:** Could support external storage (S3, NFS) directly:
241268

@@ -248,19 +275,14 @@ spec:
248275
credentialsSecret: s3-creds
249276
```
250277

251-
## Separation of Concerns
278+
## Related Components
252279

253-
Keep controllers separate initially:
254-
- **GaleraBackup/GaleraRestore** - Already in mariadb-operator, database-specific
280+
These backup/restore capabilities complement but remain separate from:
281+
- **GaleraBackup/GaleraRestore** - In mariadb-operator, database-specific dumps
255282
- **OVNBackup/OVNRestore** - To be implemented, OVN database-specific
256-
- **OpenStackControlPlaneBackup/Restore** - New, orchestrates full control plane backup
257-
- **OpenStackDataPlaneBackup/Restore** - New, handles data plane backup
283+
- **test-operator** - Used for post-restore validation (Tempest tests)
258284

259-
**Rationale:**
260-
- Each has different concerns (database dump vs CR backup vs network topology)
261-
- Different playbooks and logic
262-
- Different status tracking needs
263-
- Can evaluate generic backup controller later if patterns emerge
285+
The generic backup/restore controller in openstack-operator orchestrates the full backup/restore workflow using playbooks, while these components handle specific subsystems.
264286

265287
## User Workflow
266288

@@ -306,10 +328,10 @@ spec:
306328
oadp:
307329
enabled: true
308330
namespace: openshift-adp
331+
snapshotMoveData: true
332+
ttl: 720h # 30 days
309333
storage:
310334
pvc: openstack-backup-storage
311-
retentionPolicy:
312-
keep: 7
313335
EOF
314336
315337
# Controller creates CronJob, which creates backup CR instances daily
@@ -646,24 +668,37 @@ status:
646668

647669
**Future enhancement:** Support custom validation workflows (user-provided tests).
648670

649-
### 6. Generic vs specialized controllers: **Start with generic, separate if needed**
671+
### 6. Controller architecture: **Two controllers (Backup and Restore)**
650672

651-
**Decision:** Start with one generic controller. Keep logic in playbooks for flexibility.
673+
**Decision:** Two controllers - BackupController and RestoreController.
652674

653675
**Rationale:**
654676

677+
**Different responsibilities:**
678+
- **Backup:** Resource creation (PVCs, Jobs, OADP backups, CronJobs)
679+
- **Restore:** Resource consumption, multi-stage orchestration, validation
680+
- Different reconciliation patterns justify separate controllers
681+
682+
**BackupController responsibilities (minimal Go logic):**
683+
- Create/manage backup storage PVC (if doesn't exist)
684+
- Watch ControlPlane and DataPlane Backup CRs
685+
- Create ansible-runner Job with backup playbook
686+
- Trigger OADP backup via playbook
687+
- Create CronJob for scheduled backups
688+
- Update backup CR status
689+
690+
**RestoreController responsibilities (minimal Go logic):**
691+
- Watch ControlPlane and DataPlane Restore CRs
692+
- Validate backup compatibility (operator versions)
693+
- Create ansible-runner Job with restore playbook
694+
- Monitor multi-stage restore (via playbook status)
695+
- Update restore CR status with stage tracking
696+
655697
**Key operational requirement: Customizability**
656698
- If customer hits bug in production → needs immediate fix without operator release
657699
- Playbook override allows emergency fixes, skipping broken steps, environment-specific logic
658700
- Same pattern as EDPM (proven in production)
659701

660-
**Controller responsibility (minimal Go logic):**
661-
- Watch for Backup/Restore CRs (any type)
662-
- Create ansible-runner Job with correct playbook + variables
663-
- Monitor Job completion
664-
- Update CR status from Job output
665-
- Basic error handling and retries
666-
667702
**Playbook responsibility (maximum logic, customizable):**
668703
- All backup/restore steps
669704
- OADP Backup/Restore CR creation (`oc apply`)
@@ -672,36 +707,57 @@ status:
672707
- Deployment resumption (`oc annotate`)
673708
- Validation (trigger Tempest via `oc apply`)
674709

675-
**Generic controller approach:**
710+
**Controller pseudo-code:**
676711
```go
677-
// Simplified pseudo-code
678-
func Reconcile(cr BackupOrRestoreCR) {
712+
// BackupController
713+
func (r *BackupController) Reconcile(backup *Backup) {
714+
// Ensure PVC exists
715+
ensurePVCExists(backup.Spec.Storage.PVC)
716+
679717
// Determine playbook based on CR kind
680-
playbook := getPlaybook(cr.Kind)
718+
playbook := getBackupPlaybook(backup.Kind)
681719
// backup-openstack-ctlplane.yaml
682720
// backup-openstack-dataplane.yaml
721+
722+
// Create ansible-runner Job
723+
job := createAnsibleRunnerJob(playbook, backup.Spec)
724+
725+
// Update status
726+
updateStatusFromJob(backup, job)
727+
728+
// Create CronJob if schedule specified
729+
if backup.Spec.Schedule != "" {
730+
ensureCronJobExists(backup)
731+
}
732+
}
733+
734+
// RestoreController
735+
func (r *RestoreController) Reconcile(restore *Restore) {
736+
// Validate backup
737+
if err := validateBackupCompatibility(restore); err != nil {
738+
return err
739+
}
740+
741+
// Determine playbook
742+
playbook := getRestorePlaybook(restore.Kind)
683743
// restore-openstack-ctlplane.yaml
684744
// restore-openstack-dataplane.yaml
685745
686746
// Create ansible-runner Job
687-
job := createAnsibleRunnerJob(playbook, cr.Spec)
747+
job := createAnsibleRunnerJob(playbook, restore.Spec)
688748
689-
// Monitor and update status
690-
updateStatusFromJob(cr, job)
749+
// Update status with stage tracking
750+
updateStatusFromJob(restore, job)
691751
}
692752
```
693753

694754
**Benefits:**
695-
- ✅ Simple controller implementation
755+
- ✅ Clear separation: backup creates, restore consumes
756+
- ✅ BackupController manages PVC lifecycle
757+
- ✅ Different reconciliation logic per controller
696758
- ✅ Maximum flexibility via playbook override
697759
- ✅ Customer can fix bugs immediately
698-
- ✅ Less Go code to maintain
699-
- ✅ Shared logic for all backup/restore types
700-
701-
**When to separate:**
702-
If we find significant Go logic is needed per-type (complex OADP orchestration, different reconciliation patterns), we can split into separate controllers later.
703-
704-
**Start simple, separate if complexity demands it.**
760+
- ✅ Each controller handles both ControlPlane and DataPlane (similar logic)
705761

706762
### 7. OADP backup strategy: **One atomic backup for all PVCs**
707763

@@ -908,10 +964,10 @@ Controller reads `operator-versions.txt` from restored archive and validates:
908964
- Execute existing restore playbook
909965
- Multi-stage status tracking
910966

911-
**Phase 3: Scheduling and retention**
967+
**Phase 3: Scheduling**
912968
- Add schedule support (CronJob-based)
913-
- Implement retention policy
914-
- Auto-cleanup old backups
969+
- Configure OADP TTL from CR spec
970+
- Controller-managed CronJob lifecycle
915971

916972
**Phase 4: DataPlane backup/restore**
917973
- Implement OpenStackDataPlaneBackup/Restore controllers
@@ -930,10 +986,11 @@ Controller reads `operator-versions.txt` from restored archive and validates:
930986
- OpenStackDataPlaneDeployment: `dataplane-operator` repository
931987
- OADP Backup/Restore: Velero CRDs
932988

933-
## Discussion Topics
989+
## Open Implementation Questions
934990

935-
1. Should we start with ControlPlane backup/restore or do Galera/OVN first?
936-
2. Where should these controllers live (which operator)?
937-
3. Do we need a separate backup-operator or can it live in infra-operator?
938-
4. Should playbook override be in Phase 1 or later?
939-
5. How should we handle OADP backup retention separately from archive retention?
991+
1. Should playbook override be in Phase 1 or later phases?
992+
2. How to handle ansible-runner Job cleanup (retention of completed jobs)?
993+
3. Status update frequency during long-running operations?
994+
4. Error handling strategy: retries, backoff, manual intervention?
995+
5. RBAC model: cluster-admin or limited permissions?
996+
6. Should we support backing up to multiple OADP backends simultaneously?

0 commit comments

Comments
 (0)