Commit 9a0a887

jparrill and claude committed
docs: add HCPEtcdBackup implementation reference
Document the full HCPEtcdBackup integration including architecture, backup/restore flows, configuration, credential handling, storage layout, dependency chain (PRs #8139, #8010, #8017, #8040, #8114, enhancement #1945), and troubleshooting guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
1 parent a9afa54 commit 9a0a887

1 file changed

Lines changed: 362 additions & 0 deletions

@@ -0,0 +1,362 @@
1+
# HCPEtcdBackup Integration
2+
3+
## Table of Contents
4+
- [Overview](#overview)
- [Architecture](#architecture)
- [Backup Flow](#backup-flow)
- [Restore Flow](#restore-flow)
- [Configuration](#configuration)
- [Storage Layout](#storage-layout)
- [Credential Handling](#credential-handling)
- [Implementation Details](#implementation-details)
- [Dependencies](#dependencies)
- [Troubleshooting](#troubleshooting)
14+
15+
## Overview
16+
17+
The HCPEtcdBackup integration adds an alternative etcd backup method to the OADP plugin. Instead of relying on CSI VolumeSnapshots or filesystem-level backups of etcd data volumes, it leverages the HyperShift Operator's `HCPEtcdBackup` controller to perform native etcd snapshots and upload them to object storage.
18+
19+
### Backup Methods
20+
21+
The plugin supports two mutually exclusive etcd backup methods, controlled by the `etcdBackupMethod` configuration key:
22+
23+
| Method | Value | Description |
|---|---|---|
| **Volume Snapshot** | `volumeSnapshot` (default) | Uses CSI VolumeSnapshots or FSBackup to capture etcd PVCs. This is the legacy behavior. |
| **Etcd Snapshot** | `etcdSnapshot` | Creates an `HCPEtcdBackup` CR that triggers a native `etcdctl snapshot save`, then uploads the snapshot to the same object store used by Velero. |
27+
28+
### Key Benefits of etcdSnapshot
29+
30+
- Produces a portable, self-contained etcd snapshot (`.db` file)
31+
- No dependency on CSI drivers or storage-class-specific snapshot mechanisms
32+
- Snapshot is stored alongside the Velero backup data in the BSL
33+
- The snapshot URL is persisted in the HostedCluster status, surviving CR retention policies
34+
35+
## Architecture
36+
37+
### Components
38+
39+
```
              OADP Plugin (BackupPlugin)
                           │
            ┌──────────────┼──────────────┐
            │              │              │
    createEtcdBackup   Execute()  waitForCompletion
            │                             │
            ▼                             ▼
  ┌───────────────────┐        ┌─────────────────────┐
  │ Orchestrator      │        │ Poll Condition      │
  │ - fetchBSL        │        │ - VerifyInProgress  │
  │ - mapBSLToStorage │        │ - WaitForCompletion │
  │ - copyCredSecret  │        └─────────────────────┘
  │ - Create CR       │
  └─────────┬─────────┘
            │
            ▼
  ┌───────────────────┐
  │ HCPEtcdBackup CR  │  (in HCP namespace)
  └─────────┬─────────┘
            │
            ▼
  ┌───────────────────────┐
  │ HyperShift Operator   │  (HCPEtcdBackup controller)
  │ - etcdctl snapshot    │
  │ - Upload to S3/Azure  │
  │ - Update HC status    │
  └───────────────────────┘
```
68+
69+
### File Layout
70+
71+
| File | Purpose |
|---|---|
| `pkg/etcdbackup/orchestrator.go` | Core orchestration: BSL mapping, CR creation, polling, credential copy |
| `pkg/core/backup.go` | Backup plugin: etcd backup method routing, pod/PVC exclusion |
| `pkg/core/restore.go` | Restore plugin: snapshotURL injection into HostedCluster spec |
| `pkg/common/types.go` | Shared constants for backup methods, annotations, volume names |
| `pkg/common/scheme.go` | Scheme registration including apiextensionsv1 for CRD checks |
78+
79+
## Backup Flow
80+
81+
### Sequence
82+
83+
1. **Plugin initialization** (`NewBackupPlugin`): Reads `etcdBackupMethod` from the ConfigMap. Validates the value. Defaults to `volumeSnapshot`.
84+
85+
2. **HCP resolution**: On the first `Execute()` call, the plugin resolves the `HostedControlPlane` from the backup's included namespaces.
86+
87+
3. **HCPEtcdBackup creation** (etcdSnapshot only): Runs once and is idempotent across all `Execute()` calls:
   - Checks that the `HCPEtcdBackup` CRD exists in the cluster (safety net)
   - Fetches the Velero `BackupStorageLocation` (BSL)
   - Maps BSL config to `HCPEtcdBackupStorage` (S3 or Azure Blob)
   - Copies the BSL credential Secret to the HO namespace, remapping the data key from `cloud` to `credentials`
   - Optionally sets encryption fields (KMS key ARN / Azure encryption key URL) from the HostedCluster spec
   - Creates the `HCPEtcdBackup` CR in the HCP namespace with a unique name (`oadp-{backup-name}-{random-4-chars}`)
   - Polls until the controller acknowledges the backup (InProgress or Succeeded)
95+
96+
4. **Wait for completion**: When the `HostedControlPlane` or `HostedCluster` item is processed, the plugin waits for the `HCPEtcdBackup` to reach a terminal state (succeeded or failed). Timeout: 10 minutes.
97+
98+
5. **Cleanup**: After completion, the copied credential Secret is deleted from the HO namespace.
99+
100+
6. **Pod exclusion**: Etcd pods are excluded from the backup entirely (`return nil, nil, nil`) to prevent CSI VolumeSnapshots or FSBackup of their volumes.
101+
102+
7. **PVC exclusion**: Etcd PVCs (names matching `data-etcd-*`) are excluded from the backup to prevent CSI snapshots.
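
The PVC exclusion in step 7 can be sketched as a plain prefix match. This is a minimal illustration, not the plugin's code: `isEtcdPVC` and `exclusionReason` are hypothetical names, and the sketch assumes the `data-etcd-*` pattern means "starts with `data-etcd-`".

```go
package main

import (
	"fmt"
	"strings"
)

// etcdPVCPrefix mirrors the `data-etcd-*` pattern from the backup flow.
const etcdPVCPrefix = "data-etcd-"

// isEtcdPVC is a hypothetical helper: it reports whether a PVC name
// looks like an etcd data volume that should be left out of the backup.
func isEtcdPVC(name string) bool {
	return strings.HasPrefix(name, etcdPVCPrefix)
}

// exclusionReason builds a log-friendly message for an excluded PVC.
func exclusionReason(name string) string {
	return fmt.Sprintf("excluding PVC %q: name matches %s*", name, etcdPVCPrefix)
}

func main() {
	for _, name := range []string{"data-etcd-0", "data-etcd-1", "registry-storage"} {
		if isEtcdPVC(name) {
			fmt.Println(exclusionReason(name))
			continue
		}
		fmt.Printf("keeping PVC %q\n", name)
	}
}
```

In a Velero `Execute()` implementation, an excluded item corresponds to the `return nil, nil, nil` shown in step 6.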
103+
104+
### Ordering Independence
105+
106+
The `Execute()` method is called once per backed-up item, with no guaranteed ordering. The plugin handles this by:
107+
108+
- Creating the `HCPEtcdBackup` CR before the switch statement (after HCP resolution), so it runs regardless of which item arrives first
109+
- Making creation idempotent: if the orchestrator already created a CR, subsequent calls are no-ops
110+
- Calling `waitForEtcdBackupCompletion()` in both the HCP and HC cases — the wait is also idempotent (returns immediately after the first successful wait)
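
The ordering guarantees above can be sketched with `sync.Once`. Everything here is an assumption for illustration: `etcdBackupState`, its fields, and the `execute` method are hypothetical, and unlike the plugin (which returns immediately only after the first successful wait) this sketch caches the first result whether it succeeded or failed.

```go
package main

import (
	"fmt"
	"sync"
)

// etcdBackupState is a hypothetical stand-in for per-backup plugin state.
type etcdBackupState struct {
	create func() error // creates the HCPEtcdBackup CR
	wait   func() error // blocks until the CR reaches a terminal state

	createOnce sync.Once
	createErr  error
	waitOnce   sync.Once
	waitErr    error

	creates, waits int // counters, for illustration only
}

// execute mimics one Execute() call for an item of the given kind.
func (s *etcdBackupState) execute(kind string) error {
	// Creation happens before the switch, so it runs whichever item arrives first.
	s.createOnce.Do(func() {
		s.creates++
		s.createErr = s.create()
	})
	if s.createErr != nil {
		return fmt.Errorf("creating HCPEtcdBackup: %w", s.createErr)
	}

	switch kind {
	case "HostedControlPlane", "HostedCluster":
		// Both cases wait; only the first call does real work.
		s.waitOnce.Do(func() {
			s.waits++
			s.waitErr = s.wait()
		})
		if s.waitErr != nil {
			return fmt.Errorf("waiting for HCPEtcdBackup: %w", s.waitErr)
		}
	}
	return nil
}

func main() {
	s := &etcdBackupState{
		create: func() error { return nil },
		wait:   func() error { return nil },
	}
	for _, kind := range []string{"Secret", "HostedCluster", "Pod", "HostedControlPlane"} {
		if err := s.execute(kind); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Printf("creates=%d waits=%d\n", s.creates, s.waits)
}
```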
111+
112+
## Restore Flow
113+
114+
### Sequence
115+
116+
1. When the `HostedCluster` item is processed during restore, the plugin reads `status.lastSuccessfulEtcdBackupURL` from the HC's unstructured content.
117+
118+
2. If the URL is present and the HC has managed etcd (`spec.etcd.managed != nil`), the plugin injects the URL into `spec.etcd.managed.storage.restoreSnapshotURL`.
119+
120+
3. The modified HC is written back to Velero's output, so when the HC is created in the target cluster, the HyperShift Operator uses the snapshot URL to restore etcd from the snapshot.
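
The three restore steps can be sketched against an unstructured object using only the standard library. The helper names are hypothetical, and the sketch assumes `restoreSnapshotURL` holds a list of URLs; adjust if the vendored API defines it differently.

```go
package main

import "fmt"

// nestedMap walks obj along fields and returns the map found there, if any.
func nestedMap(obj map[string]any, fields ...string) (map[string]any, bool) {
	cur := obj
	for _, f := range fields {
		next, ok := cur[f].(map[string]any)
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, true
}

// injectRestoreSnapshotURL is a hypothetical helper. It copies
// status.lastSuccessfulEtcdBackupURL into
// spec.etcd.managed.storage.restoreSnapshotURL and reports whether it did.
func injectRestoreSnapshotURL(hc map[string]any) (bool, error) {
	status, _ := hc["status"].(map[string]any)
	url, _ := status["lastSuccessfulEtcdBackupURL"].(string)
	if url == "" {
		return false, nil // no snapshot recorded: nothing to inject
	}
	managed, ok := nestedMap(hc, "spec", "etcd", "managed")
	if !ok {
		return false, nil // unmanaged etcd: leave the object untouched
	}
	storage, ok := managed["storage"].(map[string]any)
	if !ok {
		if managed["storage"] != nil {
			return false, fmt.Errorf("spec.etcd.managed.storage has unexpected type %T", managed["storage"])
		}
		storage = map[string]any{}
		managed["storage"] = storage
	}
	// Assumption: the field is a list of snapshot URLs.
	storage["restoreSnapshotURL"] = []any{url}
	return true, nil
}

func main() {
	hc := map[string]any{
		"status": map[string]any{
			"lastSuccessfulEtcdBackupURL": "s3://my-oadp-bucket/backup-objects/backups/hcp-aws-backup/etcd-backup/1775575637.db",
		},
		"spec": map[string]any{
			"etcd": map[string]any{
				"managed": map[string]any{"storage": map[string]any{}},
			},
		},
	}
	injected, err := injectRestoreSnapshotURL(hc)
	fmt.Println("injected:", injected, "err:", err)
}
```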
121+
122+
### No Bidirectional Tracking
123+
124+
The previous approach required tracking both the `HCPEtcdBackup` CR and the `HostedCluster` item arrival order. With `lastSuccessfulEtcdBackupURL` persisted in the HC status by the HCPEtcdBackup controller, the restore flow is stateless — everything needed is in the HC object itself.
125+
126+
> **Note**: The `lastSuccessfulEtcdBackupURL` field is read via unstructured map access until the HyperShift API vendor is updated to include it (tracked in CNTRLPLANE-3173).
127+
128+
## Configuration
129+
130+
### Plugin ConfigMap
131+
132+
The plugin reads its configuration from a ConfigMap named `hypershift-oadp-plugin-config` in the OADP namespace (typically `openshift-adp`).
133+
134+
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hypershift-oadp-plugin-config
  namespace: openshift-adp
data:
  etcdBackupMethod: "etcdSnapshot"  # or "volumeSnapshot" (default)
  hoNamespace: "hypershift"         # HyperShift Operator namespace (default)
  migration: "true"                 # Enable migration mode (optional)
```
145+
146+
### Configuration Keys
147+
148+
| Key | Values | Default | Description |
|---|---|---|---|
| `etcdBackupMethod` | `volumeSnapshot`, `etcdSnapshot` | `volumeSnapshot` | Controls which etcd backup strategy is used |
| `hoNamespace` | any namespace name | `hypershift` | Namespace where the HyperShift Operator runs |
| `migration` | `true`, `false` | `false` | Enables migration-specific behavior (e.g., Agent platform PreserveOnDelete) |
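
A minimal sketch of how these keys and defaults could be parsed from the ConfigMap's `data` map. The `pluginConfig` type and `parsePluginConfig` function are hypothetical names, and the exact error wording is illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
)

// pluginConfig is a hypothetical holder for the three documented keys.
type pluginConfig struct {
	EtcdBackupMethod string
	HONamespace      string
	Migration        bool
}

// parsePluginConfig applies the documented defaults and rejects unknown
// etcdBackupMethod values instead of silently falling back.
func parsePluginConfig(data map[string]string) (pluginConfig, error) {
	cfg := pluginConfig{
		EtcdBackupMethod: "volumeSnapshot",
		HONamespace:      "hypershift",
		Migration:        false,
	}
	if v := data["etcdBackupMethod"]; v != "" {
		switch v {
		case "volumeSnapshot", "etcdSnapshot":
			cfg.EtcdBackupMethod = v
		default:
			return pluginConfig{}, fmt.Errorf("invalid etcdBackupMethod %q", v)
		}
	}
	if v := data["hoNamespace"]; v != "" {
		cfg.HONamespace = v
	}
	if v := data["migration"]; v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return pluginConfig{}, fmt.Errorf("invalid migration value %q: %w", v, err)
		}
		cfg.Migration = b
	}
	return cfg, nil
}

func main() {
	cfg, err := parsePluginConfig(map[string]string{
		"etcdBackupMethod": "etcdSnapshot",
		"migration":        "true",
	})
	fmt.Printf("%+v err=%v\n", cfg, err)
}
```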
153+
154+
### Generating the ConfigMap
155+
156+
A helper script is available in the project's documentation directory:
157+
158+
```bash
# Set etcdSnapshot method
./generate-plugin-config.sh -e etcdSnapshot

# Dry-run to review
./generate-plugin-config.sh -e etcdSnapshot -d

# Override all defaults
./generate-plugin-config.sh -n my-adp-ns -e etcdSnapshot -o my-ho-ns -m true
```
168+
169+
### Backup Manifest
170+
171+
When using `etcdSnapshot`, the Velero Backup manifest should disable volume-level backups since etcd data is handled by the HCPEtcdBackup controller:
172+
173+
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: hcp-aws-backup
  namespace: openshift-adp
spec:
  storageLocation: default
  includedNamespaces:
  - clusters
  - clusters-<hosted-cluster-name>
  includedResources:
  - sa
  - role
  - rolebinding
  - pod
  - pvc
  - pv
  - configmap
  - secrets
  - services
  - deployments
  - statefulsets
  - hostedcluster
  - nodepool
  - hostedcontrolplane
  - cluster
  - awscluster
  - awsmachinetemplate
  - awsmachine
  - machinedeployment
  - machineset
  - machine
  - route
  - clusterdeployment
  - namespace
  snapshotMoveData: false
  defaultVolumesToFsBackup: false
  snapshotVolumes: false
```
213+
214+
## Storage Layout
215+
216+
The etcd snapshot is stored alongside the Velero backup data in the BSL, following Velero's directory convention:
217+
218+
```
s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db
```

Example:

```
s3://my-oadp-bucket/backup-objects/backups/hcp-aws-backup/etcd-backup/1775575637.db
```
226+
227+
This ensures:
228+
- The snapshot is co-located with the rest of the backup
229+
- Velero does not flag `etcd-backup` as an invalid top-level directory (which would make the BSL unavailable)
230+
- Backup retention policies applied to the Velero backup directory also cover the etcd snapshot
231+
232+
## Credential Handling
233+
234+
### BSL to HCPEtcdBackup Credential Flow
235+
236+
The HCPEtcdBackup controller needs credentials to upload the snapshot to object storage. The OADP plugin bridges the gap between Velero's BSL credentials and the controller's expectations:
237+
238+
1. **Source**: The BSL references a Secret via `spec.credential` (a `SecretKeySelector` with `name` and `key`, typically key = `cloud`)
239+
240+
2. **Copy**: The plugin copies the credential data to a new Secret in the HO namespace with:
   - Name: `etcd-backup-creds-<backup-name>`
   - Label: `hypershift.openshift.io/etcd-backup: "true"`
   - Key remapping: BSL key (e.g., `cloud`) is remapped to `credentials` (expected by the controller)
244+
245+
3. **Reuse**: If the destination Secret already exists, it is reused (STS credentials contain an IAM Role ARN that does not rotate)
246+
247+
4. **Cleanup**: After the backup completes (or fails), the copied Secret is deleted
248+
249+
### Key Remapping
250+
251+
The controller mounts the credential Secret as a volume at `/etc/etcd-backup-creds/` and reads the file `credentials`. Velero BSL Secrets typically store credentials under the key `cloud`. The plugin extracts only the referenced key and writes it as `credentials` in the destination Secret.
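
The copy-and-remap step can be sketched with a simplified Secret type. The `secret` struct and `remapCredentialSecret` function are hypothetical stand-ins for the Kubernetes types the plugin would actually use; the name, label, and destination key follow the values listed above.

```go
package main

import "fmt"

// secret is a simplified, hypothetical stand-in for a Kubernetes Secret.
type secret struct {
	Name      string
	Namespace string
	Labels    map[string]string
	Data      map[string][]byte
}

// remapCredentialSecret copies only the key referenced by the BSL and
// stores it under "credentials", the file name the controller reads.
func remapCredentialSecret(src secret, srcKey, backupName, hoNamespace string) (secret, error) {
	payload, ok := src.Data[srcKey]
	if !ok {
		return secret{}, fmt.Errorf("key %q not found in secret %s/%s", srcKey, src.Namespace, src.Name)
	}
	return secret{
		Name:      "etcd-backup-creds-" + backupName,
		Namespace: hoNamespace,
		Labels:    map[string]string{"hypershift.openshift.io/etcd-backup": "true"},
		// Copy the bytes so the destination does not alias the source.
		Data: map[string][]byte{"credentials": append([]byte(nil), payload...)},
	}, nil
}

func main() {
	src := secret{
		Name:      "cloud-credentials",
		Namespace: "openshift-adp",
		Data: map[string][]byte{
			"cloud": []byte("[default]\nrole_arn = <role-arn>\n"),
			"other": []byte("not copied"),
		},
	}
	dst, err := remapCredentialSecret(src, "cloud", "hcp-aws-backup", "hypershift")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s/%s keys=%d\n", dst.Namespace, dst.Name, len(dst.Data))
}
```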
252+
253+
## Implementation Details
254+
255+
### CRD Existence Check
256+
257+
Before creating an `HCPEtcdBackup` CR, the plugin verifies that the CRD exists in the cluster. This is a safety net: if `etcdBackupMethod` is `etcdSnapshot` but the CRD is missing, the backup fails with a clear error rather than silently falling back.
258+
259+
The check requires `apiextensionsv1` to be registered in the client scheme (`pkg/common/scheme.go`).
260+
261+
### Polling
262+
263+
The plugin uses `wait.PollUntilContextTimeout` from `k8s.io/apimachinery/pkg/util/wait` to poll the `HCPEtcdBackup` status:
264+
265+
- **VerifyInProgress**: 30-second timeout, 5-second interval. Checks that the controller acknowledged the backup.
266+
- **WaitForCompletion**: 10-minute timeout, 5-second interval. Waits for terminal state (succeeded or failed).
267+
268+
Both check the `BackupCompleted` condition on the CR.
269+
270+
### Unique CR Naming
271+
272+
Each backup creates an `HCPEtcdBackup` CR with a unique name: `oadp-<backup-name>-<4-char-random-suffix>`. This uses `k8s.io/apimachinery/pkg/util/rand.String(4)` and prevents collisions with previous backup runs.
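
The naming scheme can be illustrated without the apimachinery dependency. This sketch uses `math/rand` from the standard library as a stand-in for `rand.String(4)`; the helper names and the suffix alphabet are assumptions chosen to resemble Kubernetes-style generated suffixes.

```go
package main

import (
	"fmt"
	"math/rand"
)

// suffixAlphabet is an assumption: lowercase consonants and digits, which
// keeps names DNS-safe and avoids accidental words.
const suffixAlphabet = "bcdfghjklmnpqrstvwxz2456789"

// randomSuffix returns n random characters from suffixAlphabet.
func randomSuffix(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = suffixAlphabet[rand.Intn(len(suffixAlphabet))]
	}
	return string(b)
}

// etcdBackupName builds oadp-<backup-name>-<4-char-random-suffix>.
func etcdBackupName(backupName string) string {
	return fmt.Sprintf("oadp-%s-%s", backupName, randomSuffix(4))
}

func main() {
	// Two runs of the same Velero backup name get different CR names.
	fmt.Println(etcdBackupName("hcp-aws-backup"))
	fmt.Println(etcdBackupName("hcp-aws-backup"))
}
```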
273+
274+
## Dependencies
275+
276+
### HyperShift PRs
277+
278+
This feature depends on changes in the openshift/hypershift repository:
279+
280+
| PR | Description | Jira | Status |
|---|---|---|---|
| [#8139](https://github.com/openshift/hypershift/pull/8139) | HCPEtcdBackup controller | [CNTRLPLANE-2678](https://issues.redhat.com/browse/CNTRLPLANE-2678) | Pending merge |
| CNTRLPLANE-3173 | `LastSuccessfulEtcdBackupURL` field in HostedClusterStatus | [CNTRLPLANE-3173](https://issues.redhat.com/browse/CNTRLPLANE-3173) | Pending merge |
284+
285+
#### PR #8139 Dependency Chain (all merged)
286+
287+
The HCPEtcdBackup controller (PR #8139) depends on the following merged PRs:
288+
289+
| PR | Description |
|---|---|
| [#8010](https://github.com/openshift/hypershift/pull/8010) | `fetch-etcd-certs` CPO subcommand |
| [#8017](https://github.com/openshift/hypershift/pull/8017) | `etcd-upload` CPO subcommand |
| [#8040](https://github.com/openshift/hypershift/pull/8040) | `etcd-backup` CPO subcommand |
| [#8114](https://github.com/openshift/hypershift/pull/8114) | Transfer Manager upgrade |
295+
296+
### OADP Plugin PR
297+
298+
| PR | Description | Jira |
|---|---|---|
| This PR | Integrate HCPEtcdBackup lifecycle into OADP backup/restore flow | [CNTRLPLANE-2685](https://issues.redhat.com/browse/CNTRLPLANE-2685) |
301+
302+
### Enhancement
303+
304+
The overall design is defined in [Enhancement PR #1945](https://github.com/openshift/enhancements/pull/1945).
305+
306+
### Post-Merge Vendor Update
307+
308+
Once both HyperShift PRs are merged, the vendor must be updated to:
309+
310+
1. Replace `getLastSuccessfulEtcdBackupURL()` unstructured helper in `pkg/core/restore.go` with direct field access: `hc.Status.LastSuccessfulEtcdBackupURL`
311+
2. Remove local constants (`BackupInProgressReason`, `BackupRejectedReason`, `EtcdBackupSucceeded`) in `pkg/common/types.go` in favor of the API-defined constants
312+
313+
## Troubleshooting
314+
315+
### BSL Unavailable After Backup
316+
317+
**Symptom**: `BackupStorageLocation "default" is unavailable: Backup store contains invalid top-level directories`
318+
319+
**Cause**: An older version of the plugin stored the etcd snapshot at `<prefix>/etcd-backup/` instead of inside the backup directory.
320+
321+
**Fix**: Delete the orphaned directory from the bucket:
322+
```bash
aws s3 rm s3://<bucket>/<prefix>/etcd-backup/ --recursive
```
325+
326+
### Credential Errors (IMDS / No Credentials Found)
327+
328+
**Symptom**: The etcd backup Job fails with `no EC2 IMDS role found` or similar credential errors.
329+
330+
**Cause**: The credential Secret was not remapped correctly, or an old Secret (with key `cloud` instead of `credentials`) is being reused.
331+
332+
**Fix**: Delete the stale credential Secret and retry:
333+
```bash
oc delete secret -n hypershift -l hypershift.openshift.io/etcd-backup=true
```
336+
337+
### HCPEtcdBackup CRD Not Found
338+
339+
**Symptom**: `etcdBackupMethod is "etcdSnapshot" but HCPEtcdBackup CRD not found in the cluster`
340+
341+
**Cause**: The HyperShift Operator does not have the HCPEtcdBackup controller enabled (requires feature gate `HCPEtcdBackup`).
342+
343+
**Fix**: Enable the feature gate on the HyperShift Operator or switch to `volumeSnapshot` method.
344+
345+
### Backup Reuses Old HCPEtcdBackup
346+
347+
**Symptom**: The backup completes instantly without creating a new etcd snapshot, reusing a previous `snapshotURL`.
348+
349+
**Cause**: An old `HCPEtcdBackup` CR with a completed status still exists in the HCP namespace. Since v8, CR names include a random suffix to prevent this.
350+
351+
**Fix**: Delete old CRs before running a new backup:
352+
```bash
oc delete hcpetcdbackups --all -n <hcp-namespace>
```
355+
356+
### Unknown Configuration Key Warning
357+
358+
**Symptom**: Velero logs show `unknown configuration key: etcdBackupMethod with value etcdSnapshot`
359+
360+
**Cause**: The validator in older plugin images does not recognize the key. Updated images treat `etcdBackupMethod` and `hoNamespace` as known keys that are handled during plugin initialization.
361+
362+
**Fix**: Ensure you are running an updated plugin image that includes this fix.
