|
| 1 | +# HCPEtcdBackup Integration |
| 2 | + |
| 3 | +## Table of Contents |
| 4 | +- [Overview](#overview) |
| 5 | +- [Architecture](#architecture) |
| 6 | +- [Backup Flow](#backup-flow) |
| 7 | +- [Restore Flow](#restore-flow) |
| 8 | +- [Configuration](#configuration) |
| 9 | +- [Storage Layout](#storage-layout) |
| 10 | +- [Credential Handling](#credential-handling) |
| 11 | +- [Implementation Details](#implementation-details) |
| 12 | +- [Dependencies](#dependencies) |
| 13 | +- [Troubleshooting](#troubleshooting) |
| 14 | + |
| 15 | +## Overview |
| 16 | + |
| 17 | +The HCPEtcdBackup integration adds an alternative etcd backup method to the OADP plugin. Instead of relying on CSI VolumeSnapshots or filesystem-level backups of etcd data volumes, it leverages the HyperShift Operator's `HCPEtcdBackup` controller to perform native etcd snapshots and upload them to object storage. |
| 18 | + |
| 19 | +### Backup Methods |
| 20 | + |
| 21 | +The plugin supports two mutually exclusive etcd backup methods, controlled by the `etcdBackupMethod` configuration key: |
| 22 | + |
| 23 | +| Method | Value | Description | |
| 24 | +|---|---|---| |
| 25 | +| **Volume Snapshot** | `volumeSnapshot` (default) | Uses CSI VolumeSnapshots or FSBackup to capture etcd PVCs. This is the legacy behavior. | |
| 26 | +| **Etcd Snapshot** | `etcdSnapshot` | Creates an `HCPEtcdBackup` CR that triggers a native `etcdctl snapshot save`, then uploads the snapshot to the same object store used by Velero. | |
| 27 | + |
| 28 | +### Key Benefits of etcdSnapshot |
| 29 | + |
| 30 | +- Produces a portable, self-contained etcd snapshot (`.db` file) |
| 31 | +- No dependency on CSI drivers or storage-class-specific snapshot mechanisms |
| 32 | +- Snapshot is stored alongside the Velero backup data in the BSL |
| 33 | +- The snapshot URL is persisted in the HostedCluster status, surviving CR retention policies |
| 34 | + |
| 35 | +## Architecture |
| 36 | + |
| 37 | +### Components |
| 38 | + |
| 39 | +``` |
| 40 | +                OADP Plugin (BackupPlugin) |
| 41 | +                             │ |
| 42 | +          ┌──────────────────┼──────────────────┐ |
| 43 | +          │                  │                  │ |
| 44 | +  createEtcdBackup       Execute()      waitForCompletion |
| 45 | +          │                                     │ |
| 46 | +          ▼                                     ▼ |
| 47 | +┌───────────────────┐                ┌─────────────────────┐ |
| 48 | +│ Orchestrator      │                │ Poll Condition      │ |
| 49 | +│ - fetchBSL        │                │ - VerifyInProgress  │ |
| 50 | +│ - mapBSLToStorage │                │ - WaitForCompletion │ |
| 51 | +│ - copyCredSecret  │                └─────────────────────┘ |
| 52 | +│ - Create CR       │ |
| 53 | +└─────────┬─────────┘ |
| 54 | +          │ |
| 55 | +          ▼ |
| 56 | +┌──────────────────────┐ |
| 57 | +│ HCPEtcdBackup CR     │  (in HCP namespace) |
| 58 | +└─────────┬────────────┘ |
| 59 | +          │ |
| 60 | +          ▼ |
| 61 | +┌──────────────────────┐ |
| 62 | +│ HyperShift Operator  │  (HCPEtcdBackup controller) |
| 63 | +│ - etcdctl snapshot   │ |
| 64 | +│ - Upload to S3/Azure │ |
| 65 | +│ - Update HC status   │ |
| 66 | +└──────────────────────┘ |
| 67 | +``` |
| 68 | + |
| 69 | +### File Layout |
| 70 | + |
| 71 | +| File | Purpose | |
| 72 | +|---|---| |
| 73 | +| `pkg/etcdbackup/orchestrator.go` | Core orchestration: BSL mapping, CR creation, polling, credential copy | |
| 74 | +| `pkg/core/backup.go` | Backup plugin: etcd backup method routing, pod/PVC exclusion | |
| 75 | +| `pkg/core/restore.go` | Restore plugin: snapshotURL injection into HostedCluster spec | |
| 76 | +| `pkg/common/types.go` | Shared constants for backup methods, annotations, volume names | |
| 77 | +| `pkg/common/scheme.go` | Scheme registration including apiextensionsv1 for CRD checks | |
| 78 | + |
| 79 | +## Backup Flow |
| 80 | + |
| 81 | +### Sequence |
| 82 | + |
| 83 | +1. **Plugin initialization** (`NewBackupPlugin`): Reads `etcdBackupMethod` from the ConfigMap, validates the value, and defaults to `volumeSnapshot` when the key is not set. |
| 84 | + |
| 85 | +2. **HCP resolution**: On the first `Execute()` call, the plugin resolves the `HostedControlPlane` from the backup's included namespaces. |
| 86 | + |
| 87 | +3. **HCPEtcdBackup creation** (etcdSnapshot only): Runs once and is idempotent across all `Execute()` calls: |
| 88 | + - Checks that the `HCPEtcdBackup` CRD exists in the cluster (safety net) |
| 89 | + - Fetches the Velero `BackupStorageLocation` (BSL) |
| 90 | + - Maps BSL config to `HCPEtcdBackupStorage` (S3 or Azure Blob) |
| 91 | + - Copies the BSL credential Secret to the HO namespace, remapping the data key from `cloud` to `credentials` |
| 92 | + - Optionally sets encryption fields (KMS key ARN / Azure encryption key URL) from the HostedCluster spec |
| 93 | + - Creates the `HCPEtcdBackup` CR in the HCP namespace with a unique name (`oadp-{backup-name}-{random-4-chars}`) |
| 94 | + - Polls until the controller acknowledges the backup (InProgress or Succeeded) |
| 95 | + |
| 96 | +4. **Wait for completion**: When the `HostedControlPlane` or `HostedCluster` item is processed, the plugin waits for the `HCPEtcdBackup` to reach a terminal state (succeeded or failed). Timeout: 10 minutes. |
| 97 | + |
| 98 | +5. **Cleanup**: After completion, the copied credential Secret is deleted from the HO namespace. |
| 99 | + |
| 100 | +6. **Pod exclusion**: Etcd pods are excluded from the backup entirely (`return nil, nil, nil`) to prevent CSI VolumeSnapshots or FSBackup of their volumes. |
| 101 | + |
| 102 | +7. **PVC exclusion**: Etcd PVCs (names matching `data-etcd-*`) are excluded from the backup to prevent CSI snapshots. |
| 103 | + |
| 104 | +### Ordering Independence |
| 105 | + |
| 106 | +The `Execute()` method is called once per backed-up item, with no guaranteed ordering. The plugin handles this by: |
| 107 | + |
| 108 | +- Creating the `HCPEtcdBackup` CR before the switch statement (after HCP resolution), so it runs regardless of which item arrives first |
| 109 | +- Making creation idempotent: if the orchestrator already created a CR, subsequent calls are no-ops |
| 110 | +- Calling `waitForEtcdBackupCompletion()` in both the HCP and HC cases — the wait is also idempotent (returns immediately after the first successful wait) |
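The ordering-independence pattern above can be sketched in a few lines of Go. This is a minimal illustration, not the plugin's code: `backupTracker`, `ensureCreated`, `ensureWaited`, and `execute` are hypothetical names, and the create and wait callbacks are stubs. A mutex with success flags is used (rather than `sync.Once`) so that only a successful create or wait turns later calls into no-ops.

```go
package main

import (
	"fmt"
	"sync"
)

// backupTracker is a hypothetical holder for state shared by all Execute()
// calls of one backup. It is not the plugin's actual type.
type backupTracker struct {
	mu      sync.Mutex
	created bool // set once the CR has been created successfully
	waited  bool // set once the first wait has succeeded
	creates int  // number of real create calls (demo bookkeeping)
	waits   int  // number of real wait calls (demo bookkeeping)
}

// ensureCreated runs create until it succeeds once; later calls are no-ops.
func (t *backupTracker) ensureCreated(create func() error) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.created {
		return nil
	}
	if err := create(); err != nil {
		return err
	}
	t.created = true
	t.creates++
	return nil
}

// ensureWaited runs wait until it succeeds once; later calls return immediately.
func (t *backupTracker) ensureWaited(wait func() error) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.waited {
		return nil
	}
	if err := wait(); err != nil {
		return err
	}
	t.waited = true
	t.waits++
	return nil
}

// execute mimics one Execute() call for a single backed-up item.
func (t *backupTracker) execute(kind string) error {
	// Creation happens before the switch, whichever item arrives first.
	if err := t.ensureCreated(func() error { return nil }); err != nil {
		return err
	}
	switch kind {
	case "HostedControlPlane", "HostedCluster":
		return t.ensureWaited(func() error { return nil })
	}
	return nil
}

func main() {
	t := &backupTracker{}
	for _, kind := range []string{"Pod", "HostedCluster", "Secret", "HostedControlPlane"} {
		if err := t.execute(kind); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Printf("creates=%d waits=%d\n", t.creates, t.waits)
}
```

Whatever order the items arrive in, the stubbed create and wait each run exactly once.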
| 111 | + |
| 112 | +## Restore Flow |
| 113 | + |
| 114 | +### Sequence |
| 115 | + |
| 116 | +1. When the `HostedCluster` item is processed during restore, the plugin reads `status.lastSuccessfulEtcdBackupURL` from the HC's unstructured content. |
| 117 | + |
| 118 | +2. If the URL is present and the HC has managed etcd (`spec.etcd.managed != nil`), the plugin injects the URL into `spec.etcd.managed.storage.restoreSnapshotURL`. |
| 119 | + |
| 120 | +3. The modified HC is written back to Velero's output, so when the HC is created in the target cluster, the HyperShift Operator uses the snapshot URL to restore etcd from the snapshot. |
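The restore steps above amount to a small transformation on the HostedCluster's unstructured content. The sketch below uses plain `map[string]any` values and the standard library only; `nestedMap` and `injectRestoreSnapshotURL` are hypothetical helper names. It also assumes that `restoreSnapshotURL` is a list of strings, which the text above does not state, so check the HyperShift API before relying on that shape.

```go
package main

import "fmt"

// nestedMap walks obj along path and returns the map found at the end.
func nestedMap(obj map[string]any, path ...string) (map[string]any, bool) {
	cur := obj
	for _, key := range path {
		next, ok := cur[key].(map[string]any)
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, true
}

// injectRestoreSnapshotURL copies status.lastSuccessfulEtcdBackupURL into
// spec.etcd.managed.storage.restoreSnapshotURL. It reports whether the
// object was modified.
func injectRestoreSnapshotURL(hc map[string]any) bool {
	status, ok := nestedMap(hc, "status")
	if !ok {
		return false
	}
	url, ok := status["lastSuccessfulEtcdBackupURL"].(string)
	if !ok || url == "" {
		return false
	}
	// Only managed etcd is eligible (spec.etcd.managed must be set).
	managed, ok := nestedMap(hc, "spec", "etcd", "managed")
	if !ok {
		return false
	}
	storage, ok := managed["storage"].(map[string]any)
	if !ok {
		storage = map[string]any{}
		managed["storage"] = storage
	}
	// Assumption: the field holds a list of URLs.
	storage["restoreSnapshotURL"] = []any{url}
	return true
}

func main() {
	hc := map[string]any{
		"spec": map[string]any{
			"etcd": map[string]any{"managed": map[string]any{}},
		},
		"status": map[string]any{
			"lastSuccessfulEtcdBackupURL": "s3://my-oadp-bucket/backup-objects/backups/hcp-aws-backup/etcd-backup/1775575637.db",
		},
	}
	fmt.Println(injectRestoreSnapshotURL(hc))
	fmt.Println(hc["spec"])
}
```

A HostedCluster without `spec.etcd.managed` or without a recorded URL is left untouched.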
| 121 | + |
| 122 | +### No Bidirectional Tracking |
| 123 | + |
| 124 | +The previous approach required tracking both the `HCPEtcdBackup` CR and the `HostedCluster` item arrival order. With `lastSuccessfulEtcdBackupURL` persisted in the HC status by the HCPEtcdBackup controller, the restore flow is stateless — everything needed is in the HC object itself. |
| 125 | + |
| 126 | +> **Note**: The `lastSuccessfulEtcdBackupURL` field is read via unstructured map access until the vendored HyperShift API is updated to include it (tracked in CNTRLPLANE-3173). |
| 127 | + |
| 128 | +## Configuration |
| 129 | + |
| 130 | +### Plugin ConfigMap |
| 131 | + |
| 132 | +The plugin reads its configuration from a ConfigMap named `hypershift-oadp-plugin-config` in the OADP namespace (typically `openshift-adp`). |
| 133 | + |
| 134 | +```yaml |
| 135 | +apiVersion: v1 |
| 136 | +kind: ConfigMap |
| 137 | +metadata: |
| 138 | + name: hypershift-oadp-plugin-config |
| 139 | + namespace: openshift-adp |
| 140 | +data: |
| 141 | + etcdBackupMethod: "etcdSnapshot" # or "volumeSnapshot" (default) |
| 142 | + hoNamespace: "hypershift" # HyperShift Operator namespace (default) |
| 143 | + migration: "true" # Enable migration mode (optional) |
| 144 | +``` |
| 145 | + |
| 146 | +### Configuration Keys |
| 147 | + |
| 148 | +| Key | Values | Default | Description | |
| 149 | +|---|---|---|---| |
| 150 | +| `etcdBackupMethod` | `volumeSnapshot`, `etcdSnapshot` | `volumeSnapshot` | Controls which etcd backup strategy is used | |
| 151 | +| `hoNamespace` | any namespace name | `hypershift` | Namespace where the HyperShift Operator runs | |
| 152 | +| `migration` | `true`, `false` | `false` | Enables migration-specific behavior (e.g., Agent platform PreserveOnDelete) | |
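As an illustration of the keys and defaults in the table above, the following sketch parses a ConfigMap's `data` map into a typed struct. `pluginConfig` and `parsePluginConfig` are hypothetical names, and rejecting an invalid `migration` value is an assumption; the plugin's real handling may differ.

```go
package main

import "fmt"

const (
	methodVolumeSnapshot = "volumeSnapshot"
	methodEtcdSnapshot   = "etcdSnapshot"
	defaultHONamespace   = "hypershift"
)

// pluginConfig is a hypothetical typed view of the three documented keys.
type pluginConfig struct {
	EtcdBackupMethod string
	HONamespace      string
	Migration        bool
}

// parsePluginConfig applies the documented defaults and rejects values
// outside the documented sets.
func parsePluginConfig(data map[string]string) (pluginConfig, error) {
	cfg := pluginConfig{
		EtcdBackupMethod: methodVolumeSnapshot,
		HONamespace:      defaultHONamespace,
	}
	if v := data["etcdBackupMethod"]; v != "" {
		switch v {
		case methodVolumeSnapshot, methodEtcdSnapshot:
			cfg.EtcdBackupMethod = v
		default:
			return pluginConfig{}, fmt.Errorf("invalid etcdBackupMethod %q", v)
		}
	}
	if v := data["hoNamespace"]; v != "" {
		cfg.HONamespace = v
	}
	switch v := data["migration"]; v {
	case "", "false":
		cfg.Migration = false
	case "true":
		cfg.Migration = true
	default:
		// Assumption: invalid values are rejected rather than ignored.
		return pluginConfig{}, fmt.Errorf("invalid migration value %q", v)
	}
	return cfg, nil
}

func main() {
	cfg, err := parsePluginConfig(map[string]string{"etcdBackupMethod": "etcdSnapshot"})
	fmt.Printf("%+v %v\n", cfg, err)
}
```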
| 153 | + |
| 154 | +### Generating the ConfigMap |
| 155 | + |
| 156 | +A helper script is available in the project's documentation directory: |
| 157 | + |
| 158 | +```bash |
| 159 | +# Set etcdSnapshot method |
| 160 | +./generate-plugin-config.sh -e etcdSnapshot |
| 161 | + |
| 162 | +# Dry-run to review |
| 163 | +./generate-plugin-config.sh -e etcdSnapshot -d |
| 164 | + |
| 165 | +# Override all defaults |
| 166 | +./generate-plugin-config.sh -n my-adp-ns -e etcdSnapshot -o my-ho-ns -m true |
| 167 | +``` |
| 168 | + |
| 169 | +### Backup Manifest |
| 170 | + |
| 171 | +When using `etcdSnapshot`, the Velero Backup manifest should disable volume-level backups since etcd data is handled by the HCPEtcdBackup controller: |
| 172 | + |
| 173 | +```yaml |
| 174 | +apiVersion: velero.io/v1 |
| 175 | +kind: Backup |
| 176 | +metadata: |
| 177 | + name: hcp-aws-backup |
| 178 | + namespace: openshift-adp |
| 179 | +spec: |
| 180 | + storageLocation: default |
| 181 | + includedNamespaces: |
| 182 | + - clusters |
| 183 | + - clusters-<hosted-cluster-name> |
| 184 | + includedResources: |
| 185 | + - sa |
| 186 | + - role |
| 187 | + - rolebinding |
| 188 | + - pod |
| 189 | + - pvc |
| 190 | + - pv |
| 191 | + - configmap |
| 192 | + - secrets |
| 193 | + - services |
| 194 | + - deployments |
| 195 | + - statefulsets |
| 196 | + - hostedcluster |
| 197 | + - nodepool |
| 198 | + - hostedcontrolplane |
| 199 | + - cluster |
| 200 | + - awscluster |
| 201 | + - awsmachinetemplate |
| 202 | + - awsmachine |
| 203 | + - machinedeployment |
| 204 | + - machineset |
| 205 | + - machine |
| 206 | + - route |
| 207 | + - clusterdeployment |
| 208 | + - namespace |
| 209 | + snapshotMoveData: false |
| 210 | + defaultVolumesToFsBackup: false |
| 211 | + snapshotVolumes: false |
| 212 | +``` |
| 213 | + |
| 214 | +## Storage Layout |
| 215 | + |
| 216 | +The etcd snapshot is stored alongside the Velero backup data in the BSL, following Velero's directory convention: |
| 217 | + |
| 218 | +``` |
| 219 | +s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db |
| 220 | +``` |
| 221 | + |
| 222 | +Example: |
| 223 | +``` |
| 224 | +s3://my-oadp-bucket/backup-objects/backups/hcp-aws-backup/etcd-backup/1775575637.db |
| 225 | +``` |
| 226 | + |
| 227 | +This ensures: |
| 228 | +- The snapshot is co-located with the rest of the backup |
| 229 | +- Velero does not flag `etcd-backup` as an invalid top-level directory (which would make the BSL unavailable) |
| 230 | +- Backup retention policies applied to the Velero backup directory also cover the etcd snapshot |
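The layout above can be expressed as a small key builder. This is a sketch under stated assumptions: `etcdSnapshotKey` is a hypothetical name, and the `<timestamp>` component is treated as Unix seconds, which is inferred from the example path rather than stated in the text.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// etcdSnapshotKey builds the object key for an etcd snapshot inside the
// Velero backup directory: <bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db.
// An empty prefix is dropped by path.Join.
func etcdSnapshotKey(bslPrefix, backupName string, unixSeconds int64) string {
	file := strconv.FormatInt(unixSeconds, 10) + ".db"
	return path.Join(bslPrefix, "backups", backupName, "etcd-backup", file)
}

func main() {
	key := etcdSnapshotKey("backup-objects", "hcp-aws-backup", 1775575637)
	fmt.Println("s3://my-oadp-bucket/" + key)
}
```

Keeping the snapshot under `backups/<backup-name>/` is what prevents Velero from seeing `etcd-backup` as an unexpected top-level directory.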
| 231 | + |
| 232 | +## Credential Handling |
| 233 | + |
| 234 | +### BSL to HCPEtcdBackup Credential Flow |
| 235 | + |
| 236 | +The HCPEtcdBackup controller needs credentials to upload the snapshot to object storage. The OADP plugin bridges the gap between Velero's BSL credentials and the controller's expectations: |
| 237 | + |
| 238 | +1. **Source**: The BSL references a Secret via `spec.credential` (a `SecretKeySelector` with `name` and `key`, typically key = `cloud`) |
| 239 | + |
| 240 | +2. **Copy**: The plugin copies the credential data to a new Secret in the HO namespace with: |
| 241 | + - Name: `etcd-backup-creds-<backup-name>` |
| 242 | + - Label: `hypershift.openshift.io/etcd-backup: "true"` |
| 243 | + - Key remapping: BSL key (e.g., `cloud`) is remapped to `credentials` (expected by the controller) |
| 244 | + |
| 245 | +3. **Reuse**: If the destination Secret already exists, it is reused (STS credentials contain an IAM Role ARN that does not rotate) |
| 246 | + |
| 247 | +4. **Cleanup**: After the backup completes (or fails), the copied Secret is deleted |
| 248 | + |
| 249 | +### Key Remapping |
| 250 | + |
| 251 | +The controller mounts the credential Secret as a volume at `/etc/etcd-backup-creds/` and reads the file `credentials`. Velero BSL Secrets typically store credentials under the key `cloud`. The plugin extracts only the referenced key and writes it as `credentials` in the destination Secret. |
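The remapping described above reduces to copying one entry of the source Secret's data under a new key. The sketch below works on plain `map[string][]byte` values, as Secret data is represented in Go; `remapCredential` is a hypothetical helper name, not the plugin's function.

```go
package main

import "fmt"

// destinationKey is the file name the controller reads, per the text above.
const destinationKey = "credentials"

// remapCredential extracts only the referenced key from the BSL Secret data
// and returns new Secret data that stores it under "credentials".
func remapCredential(src map[string][]byte, srcKey string) (map[string][]byte, error) {
	val, ok := src[srcKey]
	if !ok {
		return nil, fmt.Errorf("key %q not found in source secret", srcKey)
	}
	// Copy the bytes so the destination does not alias the source slice.
	return map[string][]byte{destinationKey: append([]byte(nil), val...)}, nil
}

func main() {
	src := map[string][]byte{
		"cloud": []byte("[default]\nrole_arn = arn:aws:iam::123456789012:role/example\n"),
		"other": []byte("not copied"),
	}
	dst, err := remapCredential(src, "cloud")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d key(s): %q\n", len(dst), dst[destinationKey])
}
```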
| 252 | + |
| 253 | +## Implementation Details |
| 254 | + |
| 255 | +### CRD Existence Check |
| 256 | + |
| 257 | +Before creating an `HCPEtcdBackup` CR, the plugin verifies that the CRD exists in the cluster. This is a safety net: if `etcdBackupMethod` is `etcdSnapshot` but the CRD is missing, the backup fails with a clear error rather than silently falling back. |
| 258 | + |
| 259 | +The check requires `apiextensionsv1` to be registered in the client scheme (`pkg/common/scheme.go`). |
| 260 | + |
| 261 | +### Polling |
| 262 | + |
| 263 | +The plugin uses `wait.PollUntilContextTimeout` from `k8s.io/apimachinery/pkg/util/wait` to poll the `HCPEtcdBackup` status: |
| 264 | + |
| 265 | +- **VerifyInProgress**: 30-second timeout, 5-second interval. Checks that the controller acknowledged the backup. |
| 266 | +- **WaitForCompletion**: 10-minute timeout, 5-second interval. Waits for terminal state (succeeded or failed). |
| 267 | + |
| 268 | +Both check the `BackupCompleted` condition on the CR. |
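To make the polling behavior concrete, here is a standard-library stand-in for an interval-plus-timeout poll loop. It is not the `wait.PollUntilContextTimeout` implementation; `pollUntil` and `demoPoll` are hypothetical names, and the condition function is a stub in place of reading the `BackupCompleted` condition. The demo uses millisecond intervals so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil calls cond immediately and then once per interval until it
// reports done, returns an error, or the timeout expires.
func pollUntil(ctx context.Context, interval, timeout time.Duration,
	cond func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll simulates a backup that reaches a terminal state on the n-th
// status check and returns how many checks were made.
func demoPoll(n int, timeoutMillis int) (int, error) {
	checks := 0
	err := pollUntil(context.Background(), 5*time.Millisecond,
		time.Duration(timeoutMillis)*time.Millisecond,
		func(context.Context) (bool, error) {
			checks++
			return checks >= n, nil
		})
	return checks, err
}

func main() {
	checks, err := demoPoll(3, 1000)
	fmt.Println("checks:", checks, "err:", err)

	_, err = demoPoll(1000, 20)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

With the documented values, the same loop shape would run with a 5-second interval and a 30-second or 10-minute timeout.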
| 269 | + |
| 270 | +### Unique CR Naming |
| 271 | + |
| 272 | +Each backup creates an `HCPEtcdBackup` CR with a unique name: `oadp-<backup-name>-<4-char-random-suffix>`. This uses `k8s.io/apimachinery/pkg/util/rand.String(4)` and prevents collisions with previous backup runs. |
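The naming scheme can be sketched with the standard library alone. This stand-in uses `math/rand` and an illustrative lowercase alphanumeric alphabet instead of the apimachinery `rand.String` helper named above; `randomSuffix` and `etcdBackupName` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// suffixAlphabet is an illustrative character set for the random suffix.
const suffixAlphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// randomSuffix returns n random characters drawn from suffixAlphabet.
func randomSuffix(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = suffixAlphabet[rand.Intn(len(suffixAlphabet))]
	}
	return string(b)
}

// etcdBackupName builds a CR name of the form oadp-<backup-name>-<4 chars>.
func etcdBackupName(backupName string) string {
	return fmt.Sprintf("oadp-%s-%s", backupName, randomSuffix(4))
}

func main() {
	fmt.Println(etcdBackupName("hcp-aws-backup"))
	fmt.Println(etcdBackupName("hcp-aws-backup"))
}
```

A four-character suffix makes a clash with an earlier run unlikely, but it does not rule one out.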
| 273 | + |
| 274 | +## Dependencies |
| 275 | + |
| 276 | +### HyperShift PRs |
| 277 | + |
| 278 | +This feature depends on changes in the openshift/hypershift repository: |
| 279 | + |
| 280 | +| PR | Description | Jira | Status | |
| 281 | +|---|---|---|---| |
| 282 | +| [#8139](https://github.com/openshift/hypershift/pull/8139) | HCPEtcdBackup controller | [CNTRLPLANE-2678](https://issues.redhat.com/browse/CNTRLPLANE-2678) | Pending merge | |
| 283 | +| CNTRLPLANE-3173 | `LastSuccessfulEtcdBackupURL` field in HostedClusterStatus | [CNTRLPLANE-3173](https://issues.redhat.com/browse/CNTRLPLANE-3173) | Pending merge | |
| 284 | + |
| 285 | +#### PR #8139 Dependency Chain (all merged) |
| 286 | + |
| 287 | +The HCPEtcdBackup controller (PR #8139) depends on the following merged PRs: |
| 288 | + |
| 289 | +| PR | Description | |
| 290 | +|---|---| |
| 291 | +| [#8010](https://github.com/openshift/hypershift/pull/8010) | `fetch-etcd-certs` CPO subcommand | |
| 292 | +| [#8017](https://github.com/openshift/hypershift/pull/8017) | `etcd-upload` CPO subcommand | |
| 293 | +| [#8040](https://github.com/openshift/hypershift/pull/8040) | `etcd-backup` CPO subcommand | |
| 294 | +| [#8114](https://github.com/openshift/hypershift/pull/8114) | Transfer Manager upgrade | |
| 295 | + |
| 296 | +### OADP Plugin PR |
| 297 | + |
| 298 | +| PR | Description | Jira | |
| 299 | +|---|---|---| |
| 300 | +| This PR | Integrate HCPEtcdBackup lifecycle into OADP backup/restore flow | [CNTRLPLANE-2685](https://issues.redhat.com/browse/CNTRLPLANE-2685) | |
| 301 | + |
| 302 | +### Enhancement |
| 303 | + |
| 304 | +The overall design is defined in [Enhancement PR #1945](https://github.com/openshift/enhancements/pull/1945). |
| 305 | + |
| 306 | +### Post-Merge Vendor Update |
| 307 | + |
| 308 | +Once both HyperShift PRs are merged, the vendored HyperShift API must be updated so that the plugin can: |
| 309 | + |
| 310 | +1. Replace `getLastSuccessfulEtcdBackupURL()` unstructured helper in `pkg/core/restore.go` with direct field access: `hc.Status.LastSuccessfulEtcdBackupURL` |
| 311 | +2. Remove local constants (`BackupInProgressReason`, `BackupRejectedReason`, `EtcdBackupSucceeded`) in `pkg/common/types.go` in favor of the API-defined constants |
| 312 | + |
| 313 | +## Troubleshooting |
| 314 | + |
| 315 | +### BSL Unavailable After Backup |
| 316 | + |
| 317 | +**Symptom**: `BackupStorageLocation "default" is unavailable: Backup store contains invalid top-level directories` |
| 318 | + |
| 319 | +**Cause**: An older version of the plugin stored the etcd snapshot at `<prefix>/etcd-backup/` instead of inside the backup directory. |
| 320 | + |
| 321 | +**Fix**: Delete the orphaned directory from the bucket: |
| 322 | +```bash |
| 323 | +aws s3 rm s3://<bucket>/<prefix>/etcd-backup/ --recursive |
| 324 | +``` |
| 325 | + |
| 326 | +### Credential Errors (IMDS / No Credentials Found) |
| 327 | + |
| 328 | +**Symptom**: The etcd backup Job fails with `no EC2 IMDS role found` or similar credential errors. |
| 329 | + |
| 330 | +**Cause**: The credential Secret was not remapped correctly, or an old Secret (with key `cloud` instead of `credentials`) is being reused. |
| 331 | + |
| 332 | +**Fix**: Delete the stale credential Secret and retry: |
| 333 | +```bash |
| 334 | +oc delete secret -n hypershift -l hypershift.openshift.io/etcd-backup=true |
| 335 | +``` |
| 336 | + |
| 337 | +### HCPEtcdBackup CRD Not Found |
| 338 | + |
| 339 | +**Symptom**: `etcdBackupMethod is "etcdSnapshot" but HCPEtcdBackup CRD not found in the cluster` |
| 340 | + |
| 341 | +**Cause**: The HyperShift Operator does not have the HCPEtcdBackup controller enabled (requires feature gate `HCPEtcdBackup`). |
| 342 | + |
| 343 | +**Fix**: Enable the feature gate on the HyperShift Operator or switch to the `volumeSnapshot` method. |
| 344 | + |
| 345 | +### Backup Reuses Old HCPEtcdBackup |
| 346 | + |
| 347 | +**Symptom**: The backup completes instantly without creating a new etcd snapshot, reusing a previous `snapshotURL`. |
| 348 | + |
| 349 | +**Cause**: An old `HCPEtcdBackup` CR with a completed status still exists in the HCP namespace. Since v8, CR names include a random suffix to prevent this. |
| 350 | + |
| 351 | +**Fix**: Delete old CRs before running a new backup: |
| 352 | +```bash |
| 353 | +oc delete hcpetcdbackups --all -n <hcp-namespace> |
| 354 | +``` |
| 355 | + |
| 356 | +### Unknown Configuration Key Warning |
| 357 | + |
| 358 | +**Symptom**: Velero logs show `unknown configuration key: etcdBackupMethod with value etcdSnapshot` |
| 359 | + |
| 360 | +**Cause**: Older plugin versions have a validator that does not recognize the key. Newer versions treat `etcdBackupMethod` and `hoNamespace` as known keys that are handled during plugin initialization. |
| 361 | + |
| 362 | +**Fix**: Ensure you are running an updated plugin image that includes this fix. |