Commit 23672aa

stuggi and claude committed
[b/r] Add Phase 4 backup/restore controller design and OADP docs link
Design doc updates:
- Phase 4: generic template-based backup/restore controllers using Secret templates (no backup tool imports, unstructured.Unstructured)
- Scheduled backups via schedule field on OpenStackBackup CR
- Add certified backup storage providers link to OADP Integration section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f24e747 commit 23672aa

1 file changed

Lines changed: 192 additions & 13 deletions

docs/dev/backup-restore/backup-restore-controller-design.md

@@ -320,6 +320,11 @@ func getBackupLabels(annotations map[string]string) map[string]string {

 ### OADP Integration

+OADP (OpenShift API for Data Protection) is Red Hat's distribution of Velero for
+OpenShift. It requires an S3-compatible object storage backend for backup storage.
+See [Certified backup storage providers](https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/backup_and_restore/oadp-application-backup-and-restore#oadp-certified-backup-storage-providers_about-installing-oadp)
+for supported backends (AWS S3, MCG, ODF, Ceph RGW, MinIO, etc.).
+
 #### Split Backup: CRs and PVCs

 The backup uses two OADP Backup CRs, one for all namespace resources (CRs, Secrets, ConfigMaps, etc.) and one for PVC snapshots:
@@ -1133,35 +1138,208 @@ Use cases:
 - Interactive pauses between steps (skippable with `auto_ack=true`)
 - See `docs/dev/backup-restore/restore/README.md` for usage

-### Phase 4: Golang Restore Controller (Future)
+### Phase 4: Golang Backup/Restore Controllers (Future)
+
+**Goal**: Replace the Ansible playbooks with Golang controllers that orchestrate
+backup and restore, all driven by CRs. The controllers are **backup-tool-agnostic**:
+they use raw templates from a Secret to create backup/restore CRs for whatever
+tool is configured (OADP/Velero, Kasten, etc.), with no tool-specific imports.
+
+#### Generic Template Approach
+
+Each stage references a key in a Secret containing the raw YAML template for the
+backup/restore CR to create. The controller renders the template with variables
+(namespace, timestamp, etc.), creates the `unstructured.Unstructured` object, and
+polls a configurable jsonpath condition for completion.
+
+This keeps the controller decoupled from any backup tool: only the Secret
+templates contain tool-specific API references.
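
The rendering step described above could look roughly like the following. This is a minimal sketch using only Go's standard `text/template` package; the `TemplateVars` struct, the `renderStageTemplate` helper, and the `openshift-adp` namespace value are assumptions made for illustration and are not part of the design doc. Decoding the rendered YAML into an `unstructured.Unstructured` object needs a YAML library and is left out here.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TemplateVars is a hypothetical container for the variables the design
// mentions (.Namespace, .Timestamp, .OADPNamespace).
type TemplateVars struct {
	Namespace     string
	Timestamp     string
	OADPNamespace string
}

// renderStageTemplate renders the raw template text stored under one Secret key.
// A template that references a field TemplateVars does not have fails at
// execution time, so typos in a template surface as errors.
func renderStageTemplate(name, raw string, vars TemplateVars) (string, error) {
	tmpl, err := template.New(name).Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse template %q: %w", name, err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, vars); err != nil {
		return "", fmt.Errorf("render template %q: %w", name, err)
	}
	return buf.String(), nil
}

func main() {
	// Shortened version of the backup-pvcs template from the design doc.
	raw := `apiVersion: velero.io/v1
kind: Backup
metadata:
  name: openstack-backup-pvcs-{{ .Timestamp }}
  namespace: {{ .OADPNamespace }}
spec:
  includedNamespaces:
    - {{ .Namespace }}
`
	out, err := renderStageTemplate("backup-pvcs", raw, TemplateVars{
		Namespace:     "openstack",
		Timestamp:     "20260320-020000",
		OADPNamespace: "openshift-adp", // example value, assumed
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the controller only ever sees template text and a set of variables, nothing in this path depends on Velero or OADP Go types.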

-**Goal**: Replace the Ansible restore playbook with a Golang controller that
-creates OADP Restore CRs, handles database/RabbitMQ restore, and manages the
-staged deployment lifecycle, all driven by a single `OpenStackBackupRestore` CR.
+#### OpenStackBackup Controller
+
+Orchestrates the full backup sequence: trigger Galera DB dumps, then create
+backup CRs from templates for each stage.

 ```yaml
 apiVersion: backup.openstack.org/v1beta1
-kind: OpenStackBackupRestore
+kind: OpenStackBackup
 metadata:
-  name: restore-20260303
+  name: backup-20260320
   namespace: openstack
 spec:
-  backupName: openstack-backup-20260303-120000
-  automated: true
+  stages:
+    - name: galera-dumps
+      type: GaleraBackup # built-in: triggers jobs from GaleraBackup cronjobs
+      timeout: 10m
+    - name: pvc-backup
+      type: Template
+      templateRef:
+        name: openstack-backup-templates # Secret name
+        key: backup-pvcs # Key within the Secret
+      completionCondition:
+        jsonpath: '{.status.phase}'
+        value: Completed
+      timeout: 30m
+    - name: resources-backup
+      type: Template
+      templateRef:
+        name: openstack-backup-templates
+        key: backup-resources
+      completionCondition:
+        jsonpath: '{.status.phase}'
+        value: Completed
+      timeout: 30m
+status:
+  phase: InProgress # Pending, InProgress, Completed, Failed
+  currentStage: pvc-backup
+  conditions:
+    - type: GaleraDumpsComplete
+      status: "True"
+    - type: PvcBackupInProgress
+      status: "True"
+```
+
+The Secret contains the tool-specific templates:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: openstack-backup-templates
+  namespace: openstack
+stringData:
+  backup-pvcs: |
+    apiVersion: velero.io/v1
+    kind: Backup
+    metadata:
+      name: openstack-backup-pvcs-{{ .Timestamp }}
+      namespace: {{ .OADPNamespace }}
+    spec:
+      includedNamespaces:
+        - {{ .Namespace }}
+      labelSelector:
+        matchLabels:
+          backup.openstack.org/backup: "true"
+      snapshotMoveData: true
+      storageLocation: velero-1
+  backup-resources: |
+    apiVersion: velero.io/v1
+    kind: Backup
+    metadata:
+      name: openstack-backup-resources-{{ .Timestamp }}
+      namespace: {{ .OADPNamespace }}
+    spec:
+      includedNamespaces:
+        - {{ .Namespace }}
+      labelSelector:
+        matchLabels:
+          backup.openstack.org/restore: "true"
+      snapshotVolumes: false
+      storageLocation: velero-1
+```
+
+#### OpenStackRestore Controller
+
+Orchestrates the full restore sequence using the same template-based approach,
+plus built-in stages for database restore, RabbitMQ credential restore, and
+staged deployment lifecycle.
+
+```yaml
+apiVersion: backup.openstack.org/v1beta1
+kind: OpenStackRestore
+metadata:
+  name: restore-20260320
+  namespace: openstack
+spec:
+  backupTimestamp: "20260320-110200"
+  templateRef:
+    name: openstack-restore-templates # Secret with all restore stage templates
   automatedDatabaseRestore: true
   automatedRabbitMQRestore: true
 status:
   phase: InProgress # Pending, InProgress, Completed, Failed
-  currentRestoreOrder: 20
+  currentStage: order-20-infra
   conditions:
-    - type: Order00Complete
+    - type: Order00PvcsComplete
       status: "True"
-    - type: Order10Complete
+    - type: Order10FoundationComplete
       status: "True"
-    - type: Order20InProgress
+    - type: Order20InfraInProgress
       status: "True"
 ```

+#### Scheduled Backups
+
+The `OpenStackBackup` CR supports an optional `schedule` field for recurring
+backups. When set, the controller creates a CronJob that produces new
+`OpenStackBackup` instances on schedule.
+
+```yaml
+apiVersion: backup.openstack.org/v1beta1
+kind: OpenStackBackup
+metadata:
+  name: daily-backup
+  namespace: openstack
+spec:
+  schedule: "0 2 * * *" # daily at 2am
+  retention: 720h # auto-cleanup backups older than 30 days
+  templateRef:
+    name: openstack-backup-templates
+  stages:
+    - name: galera-dumps
+      type: GaleraBackup
+      timeout: 10m
+    - name: pvc-backup
+      type: Template
+      templateRef:
+        name: openstack-backup-templates
+        key: backup-pvcs
+      completionCondition:
+        jsonpath: '{.status.phase}'
+        value: Completed
+      timeout: 30m
+    - name: resources-backup
+      type: Template
+      templateRef:
+        name: openstack-backup-templates
+        key: backup-resources
+      completionCondition:
+        jsonpath: '{.status.phase}'
+        value: Completed
+      timeout: 30m
+```
+
+The flow:
+
+1. **Controller sees `schedule` field** → creates a CronJob
+2. **CronJob fires on schedule** → creates a new `OpenStackBackup` CR
+   (e.g., `daily-backup-20260320-020000`) without a `schedule` field
+3. **Controller sees new `OpenStackBackup` CR** → orchestrates the stages:
+   - Triggers Galera dump jobs (built-in `GaleraBackup` type)
+   - Renders backup templates from Secret → creates backup tool CRs
+     (e.g., Velero `Backup`) as `unstructured.Unstructured` objects
+   - Polls `completionCondition` on each created CR until done
+4. **Controller updates status** on the `OpenStackBackup` CR
+5. **Retention**: Controller garbage-collects `OpenStackBackup` CRs older
+   than `retention` period
+
+This follows the same pattern as `GaleraBackup` (which also creates CronJobs
+from a CR spec). Without a `schedule` field, the CR triggers a one-shot backup.
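
Steps 2 and 5 of the flow (naming the per-run CR and selecting expired backups) can be sketched with the standard `time` package. This is an illustrative sketch only: `backupRecord`, `childBackupName`, `expiredBackups`, and `mustTime` are hypothetical names, and formatting the timestamp in UTC is an assumption the design doc does not state. The `720h` value from the example parses directly with `time.ParseDuration`.

```go
package main

import (
	"fmt"
	"time"
)

// timestampLayout reproduces the "20260320-020000" suffix used in the example name.
const timestampLayout = "20060102-150405"

// childBackupName builds the name of the one-shot CR created for a scheduled run.
// Using UTC here is an assumption.
func childBackupName(parent string, firedAt time.Time) string {
	return fmt.Sprintf("%s-%s", parent, firedAt.UTC().Format(timestampLayout))
}

// backupRecord is a hypothetical stand-in for the OpenStackBackup fields
// that retention handling needs.
type backupRecord struct {
	Name      string
	CreatedAt time.Time
}

// expiredBackups returns the names of records older than the retention period.
func expiredBackups(records []backupRecord, retention time.Duration, now time.Time) []string {
	var expired []string
	for _, r := range records {
		if now.Sub(r.CreatedAt) > retention {
			expired = append(expired, r.Name)
		}
	}
	return expired
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input (example helper).
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	retention, err := time.ParseDuration("720h") // same value as the `retention` field
	if err != nil {
		panic(err)
	}
	now := mustTime("2026-04-20T02:00:00Z")
	old := mustTime("2026-03-20T02:00:00Z")
	recent := mustTime("2026-04-19T02:00:00Z")
	records := []backupRecord{
		{Name: childBackupName("daily-backup", old), CreatedAt: old},
		{Name: childBackupName("daily-backup", recent), CreatedAt: recent},
	}
	fmt.Println(expiredBackups(records, retention, now))
}
```

In a real controller the records would come from listing `OpenStackBackup` CRs and reading their creation timestamps; whether cleanup also removes the backup-tool CRs is not specified in the design and is not modeled here.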
1328+
1329+
#### Design Principles
1330+
1331+
- **No backup tool imports**: Controller uses `unstructured.Unstructured` to
1332+
create CRs from templates — no Velero/OADP Go dependencies
1333+
- **Single Secret per workflow**: All stage templates in one Secret, referenced
1334+
by key (`templateRef.name` + `templateRef.key`)
1335+
- **Built-in stages**: `GaleraBackup` (trigger DB dump jobs), `GaleraRestore`
1336+
(create restore CRs, exec restore), `RabbitMQRestore` (credential restore)
1337+
are built-in since they use our own CRDs
1338+
- **Template variables**: Controller provides `.Namespace`, `.Timestamp`,
1339+
`.OADPNamespace`, `.BackupName`, etc. for template rendering
1340+
- **Configurable completion**: Each template stage has a `completionCondition`
1341+
(jsonpath + expected value) so the controller can poll any CR type
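
The "configurable completion" principle can be illustrated with a small check over a decoded object. This is a hedged sketch, not the planned implementation: `CompletionCondition`, `lookupPath`, and `isComplete` are hypothetical names, and the path handling is a deliberately simplified stand-in for real JSONPath that supports only plain dotted field access such as `{.status.phase}`. The input is a `map[string]interface{}`, which is the same shape that `unstructured.Unstructured` stores in its `Object` field.

```go
package main

import (
	"fmt"
	"strings"
)

// CompletionCondition mirrors the `completionCondition` stanza (jsonpath + expected value).
type CompletionCondition struct {
	JSONPath string
	Value    string
}

// lookupPath resolves a simple dotted path such as "{.status.phase}".
// Only plain field access is handled; filters and array indexes are not.
func lookupPath(obj map[string]interface{}, path string) (string, bool) {
	trimmed := strings.TrimPrefix(strings.Trim(path, "{}"), ".")
	if trimmed == "" {
		return "", false
	}
	var current interface{} = obj
	for _, key := range strings.Split(trimmed, ".") {
		m, ok := current.(map[string]interface{})
		if !ok {
			return "", false
		}
		current, ok = m[key]
		if !ok {
			return "", false
		}
	}
	s, ok := current.(string)
	return s, ok
}

// isComplete reports whether the object currently satisfies the condition.
// A missing field counts as "not complete yet" rather than as an error.
func isComplete(obj map[string]interface{}, cond CompletionCondition) bool {
	got, ok := lookupPath(obj, cond.JSONPath)
	return ok && got == cond.Value
}

func main() {
	cond := CompletionCondition{JSONPath: "{.status.phase}", Value: "Completed"}
	inProgress := map[string]interface{}{
		"status": map[string]interface{}{"phase": "InProgress"},
	}
	done := map[string]interface{}{
		"status": map[string]interface{}{"phase": "Completed"},
	}
	fmt.Println(isComplete(inProgress, cond), isComplete(done, cond))
}
```

A controller would call such a check on each reconcile until it returns true or the stage `timeout` elapses. Treating failure phases as terminal would need an additional condition that the current spec does not define.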
+
 ## Benefits

 ### Compared to Current Ansible Approach
@@ -1222,6 +1400,7 @@ This means:
 1. **Restore Order Conflicts**: What if two CRDs have the same restore order?
    - Currently: restored in parallel within the same Velero Restore CR (works fine for independent resources)

-2. **Phase 4 Restore Controller**: Should the controller exec into pods for database restore or delegate to Jobs?
+2. **Phase 4 Controllers**: Should database restore exec into pods or delegate to Jobs?
    - Current Ansible approach: execs into GaleraRestore pods
    - Controller approach: could create Jobs or use the same exec pattern
+   - Template Secret: should a default Secret be created by the operator, or provided by the user?
