Commit 46c5cbf

stuggi and claude committed
Move troubleshooting to separate doc and update README
Split troubleshooting content into its own document and updated README to reference both the troubleshooting doc and experimental scenarios.

Changes:
- Created backup-restore-ctlplane-troubleshooting.md (263 lines)
- Moved all troubleshooting content from main doc
- Added troubleshooting doc to README table
- Added experimental scenarios section to README under Future Enhancements
- Main doc reduced from 1,908 to 1,658 lines

Document structure now:
- backup-restore-ctlplane.md: Main tested procedure (1,658 lines)
- backup-restore-ctlplane-experimental.md: Untested scenarios (499 lines)
- backup-restore-ctlplane-troubleshooting.md: Common issues (263 lines)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 8e3d58b commit 46c5cbf

3 files changed

Lines changed: 275 additions & 251 deletions

docs/dev/README.md

Lines changed: 11 additions & 0 deletions
@@ -15,6 +15,7 @@ For a complete OpenStack backup and restore:
 |----------|-------------|
 | [backup-restore-ctlplane.md](backup-restore-ctlplane.md) | **ControlPlane** backup/restore - OpenStackControlPlane CR, secrets, configmaps |
 | [backup-restore-dataplane.md](backup-restore-dataplane.md) | **DataPlane** backup/restore - Compute nodes (NodeSets), network configuration (NetConfig), IP allocations |
+| [backup-restore-ctlplane-troubleshooting.md](backup-restore-ctlplane-troubleshooting.md) | **Troubleshooting** - Common issues and solutions for backup/restore |

 ## Ansible Playbooks

@@ -104,6 +105,16 @@ The current DataPlane backup/restore procedure is designed for **NodeSets with `

 The following features are under consideration for future implementation:

+### Experimental Restore Scenarios (Not Tested)
+
+Additional restore scenarios have been documented but **NOT tested**:
+
+| Document | Description |
+|----------|-------------|
+| [backup-restore-ctlplane-experimental.md](backup-restore-ctlplane-experimental.md) | **Experimental** - Scenario 2 (Different Namespace) and Scenario 3 (Different Cluster) restore procedures. ⚠️ Use at your own risk. |
+
+These scenarios are theoretically possible but require additional testing and validation before production use.
+
 ### Backup/Restore During Partial Updates

 **Current Limitation:**
**Current Limitation:**
Lines changed: 263 additions & 0 deletions
@@ -0,0 +1,263 @@
# OpenStack Control Plane Backup and Restore - Troubleshooting

This document contains troubleshooting guidance for the OpenStack Control Plane backup and restore procedures.

For the main backup/restore documentation, see [backup-restore-ctlplane.md](backup-restore-ctlplane.md).

---

## Troubleshooting

### Issue: Operator Version Mismatch

**Symptoms:**
- Control plane CR fails to reconcile
- Error messages about unknown fields or invalid schema
- CRDs rejected during restore
- Operator logs show validation errors

**Diagnosis:**

```bash
# Compare operator versions
cat operator-versions.txt  # From backup

# vs current versions
oc get deployment openstack-operator-controller-manager -n openstack-operators -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check for CRD version differences
oc get crd openstackcontrolplanes.core.openstack.org -o jsonpath='{.spec.versions[*].name}'
```
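
The comparison above can also be scripted. The sketch below is a minimal helper, not part of the documented procedure: `image_version` and `same_operator_version` are hypothetical names, and it assumes you copy the backed-up image reference out of `operator-versions.txt` by hand, since that file's layout is not described here. It compares only the tag or digest, not the repository path.

```bash
# Print the tag or digest of a container image reference.
image_version() {
  ref=$1
  case $ref in
    *@*)
      # Digest form: repo@sha256:...
      printf '%s\n' "${ref##*@}"
      ;;
    *)
      # Drop the registry/path first so a registry port is not read as a tag
      last=${ref##*/}
      case $last in
        *:*) printf '%s\n' "${last##*:}" ;;
        *)   printf 'latest\n' ;;   # no tag given, so "latest" is implied
      esac
      ;;
  esac
}

# Return 0 when both image references carry the same tag or digest.
same_operator_version() {
  [ "$(image_version "$1")" = "$(image_version "$2")" ]
}

# Example against a live cluster (BACKUP_IMAGE is an assumption: the value
# you copied from operator-versions.txt):
#   CURRENT_IMAGE=$(oc get deployment openstack-operator-controller-manager \
#     -n openstack-operators -o jsonpath='{.spec.template.spec.containers[0].image}')
#   same_operator_version "$BACKUP_IMAGE" "$CURRENT_IMAGE" || echo "operator version mismatch"
```

A non-zero exit status from `same_operator_version` suggests the target cluster runs a different operator build than the one recorded at backup time.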

**Solution:**

```bash
# Option 1: Install matching operator version on target cluster (RECOMMENDED)
# Follow your operator installation procedure to install the specific version
# shown in operator-versions.txt from the backup

# Option 2: If source cluster is still available, upgrade operators then re-backup
# (Only if target has newer operators and you want to move to newer version)

# DO NOT attempt to:
# - Manually edit CRs to match new schema (likely to fail)
# - Force apply with --force flag (risks data corruption)
# - Mix operator versions (openstack-operator vs infra-operator)
```

**Prevention:**
- Always document operator versions during backup
- Test restores in a non-production environment first
- Maintain operator version parity across clusters used for DR

### Issue: RabbitMQ Authentication Failures

**Symptoms:**
- Services fail to start or restart repeatedly
- Error logs show "ACCESS_REFUSED" or authentication failures
- TransportURL CRs show errors
- Service logs contain RabbitMQ connection errors

**Diagnosis:**

```bash
# Check service logs for RabbitMQ auth errors
oc logs -n openstack deployment/nova-api | grep -i rabbit
oc logs -n openstack deployment/neutron-api | grep -i rabbit

# Verify RabbitMQ user exists (should match backed-up credentials)
oc rsh -n openstack rabbitmq-server-0 rabbitmqctl list_users

# Check transport URL secret was automatically created
oc get secret rabbitmq-transport-url-nova-api-transport -n openstack

# Decode and verify transport URL (should reference the restored user)
oc get secret rabbitmq-transport-url-nova-api-transport -n openstack -o jsonpath='{.data.transport_url}' | base64 -d
```
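
To check whether the decoded transport URL really references a user that RabbitMQ knows about, the two outputs above can be compared mechanically. This is a rough sketch under assumptions: `transport_url_user` and `user_in_list` are hypothetical helper names, the URL is assumed to have the usual `scheme://user:password@host:port/` shape, and only the first `user:password@` pair is inspected.

```bash
# Print the username embedded in a transport URL; return 1 if none is present.
transport_url_user() {
  rest=${1#*://}          # strip the scheme (e.g. "rabbit://")
  case $rest in
    *@*) ;;               # credentials present
    *)   return 1 ;;      # no "user:password@" part
  esac
  creds=${rest%%@*}       # text before the first "@"
  printf '%s\n' "${creds%%:*}"
}

# Read `rabbitmqctl list_users` output on stdin and return 0 if the
# given username appears in the first column.
user_in_list() {
  awk -v u="$1" '$1 == u { found = 1 } END { exit !found }'
}

# Example against a live cluster:
#   URL=$(oc get secret rabbitmq-transport-url-nova-api-transport -n openstack \
#     -o jsonpath='{.data.transport_url}' | base64 -d)
#   USER_NAME=$(transport_url_user "$URL")
#   oc rsh -n openstack rabbitmq-server-0 rabbitmqctl list_users \
#     | user_in_list "$USER_NAME" || echo "user $USER_NAME missing in RabbitMQ"
```

If the user is reported missing, the solution below that re-adds it is the likely next step.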

**Solution:**

```bash
# Option 1: Re-run the user restoration from step 8
# Extract credentials from backup and add them again

# Get credentials from backup
RABBITMQ_USER=$(jq -r '.items[] | select(.metadata.name=="rabbitmq-default-user") | .data.username' secrets-all-backup.json | base64 -d)
RABBITMQ_PASS=$(jq -r '.items[] | select(.metadata.name=="rabbitmq-default-user") | .data.password' secrets-all-backup.json | base64 -d)

# Delete the user if it exists (to reset)
oc rsh -n openstack rabbitmq-server-0 rabbitmqctl delete_user "${RABBITMQ_USER}" || echo "User doesn't exist"

# Re-add the user
oc rsh -n openstack rabbitmq-server-0 rabbitmqctl add_user -- "${RABBITMQ_USER}" "${RABBITMQ_PASS}"
oc rsh -n openstack rabbitmq-server-0 rabbitmqctl set_user_tags "${RABBITMQ_USER}" administrator
oc rsh -n openstack rabbitmq-server-0 rabbitmqctl set_permissions -p / "${RABBITMQ_USER}" ".*" ".*" ".*"

# Verify user permissions
oc rsh -n openstack rabbitmq-server-0 rabbitmqctl list_user_permissions "${RABBITMQ_USER}"

# Restart affected services to pick up credentials
oc delete pod -n openstack -l service=nova
```

**Prevention:**
- Always verify RabbitMQ user restoration completed successfully (step 8)
- Check `rabbitmqctl list_users` output includes the backed-up username
- Test one service (e.g., nova-api) before assuming all services will work
- **For EDPM deployments**: Verify compute and network node connectivity immediately after restore
- Monitor data plane node logs during restore to catch issues early

**EDPM-Specific Checks:**

If data plane nodes (compute/network nodes) cannot connect:

```bash
# On a compute node, check nova-compute configuration
ssh compute-node-1
sudo grep -i transport_url /var/lib/config-data/nova/etc/nova/nova.conf
# Verify the username matches what you restored in step 8

# On a network node, check neutron agent configuration
ssh network-node-1
sudo grep -i transport_url /var/lib/config-data/neutron/etc/neutron/neutron.conf
# Verify the username matches what you restored in step 8

# Test direct RabbitMQ connectivity from data plane node
# Get RabbitMQ service IP
oc get svc rabbitmq-cell1 -n openstack -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# From the compute node, test whether the port is reachable (requires telnet)
ssh compute-node-1
telnet <rabbitmq-cell1-ip> 5672
```

If credentials don't match, you can either:
1. Re-run step 8 to restore the correct user credentials in RabbitMQ
2. Update the data plane node configurations (not recommended, since it requires reconfiguring all nodes)
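
Where `telnet` is not installed on a data plane node, the reachability test from the EDPM checks can be approximated with bash's built-in `/dev/tcp` redirection. This is a sketch only: `port_open` is a hypothetical helper, it assumes `bash` and coreutils `timeout` are available on the node, and it confirms only that a TCP connection opens, not that AMQP authentication succeeds.

```bash
# Return 0 if a TCP connection to host:port opens within the timeout.
# Usage: port_open HOST PORT [TIMEOUT_SECONDS]
port_open() {
  host=$1
  port=$2
  wait_s=${3:-3}   # default timeout: 3 seconds
  # bash opens the socket via /dev/tcp; $0 and $1 are the host and port
  # passed after the command string.
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example from a compute node (replace the placeholder with the service IP
# obtained from `oc get svc rabbitmq-cell1`):
#   port_open <rabbitmq-cell1-ip> 5672 && echo "reachable" || echo "unreachable"
```

An "unreachable" result points at networking or the service endpoint, not at credentials, so it is worth ruling out before re-running the user restoration.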

### Issue: RabbitMQ RBAC/Ownership Errors

**Symptoms:**
- RabbitMQ cluster CR shows `status: "False"` with RBAC errors
- RabbitMQ operator logs show "forbidden: cannot set an ownerRef" errors
- RabbitMQ pods fail to reconcile properly

**Diagnosis:**

This happens when RabbitMQ-related secrets or ConfigMaps are accidentally restored from backup instead of being filtered out. Check RabbitMQ cluster status:

```bash
# Check RabbitMQ cluster status
oc get rabbitmqcluster -n openstack
oc describe rabbitmqcluster rabbitmq -n openstack

# Look for errors like these in the status or operator logs:
# For Secrets:
#   secrets "rabbitmq-default-user" is forbidden: cannot set an ownerRef on a resource you can't delete:
#   RBAC: clusterrole.rbac.authorization.k8s.io "rabbitmq-cluster-operator-proxy-role" not found

# For ConfigMaps:
#   configmaps "rabbitmq-plugins-conf" is forbidden: cannot set an ownerRef on a resource you can't delete:
#   RBAC: clusterrole.rbac.authorization.k8s.io "rabbitmq-cluster-operator-proxy-role" not found

# In RabbitMQ cluster CR status:
#   - lastTransitionTime: "2026-01-20T16:32:08Z"
#     message: 'secrets "rabbitmq-default-user" is forbidden: cannot set an ownerRef on a resource you can't delete: RBAC: clusterrole.rbac.authorization.k8s.io "rabbitmq-cluster-operator-proxy-role" not found, <nil>'
#     reason: Error
#     status: "False"
```

**Root Cause:**

When an operator creates resources itself, it owns them from the start and no permission issue arises. When pre-existing resources are restored first, the operator tries to adopt them by setting `ownerReferences`, but Kubernetes only allows that if the caller is also permitted to delete the resource. The operator doesn't have delete permissions on resources it didn't create, which causes the RBAC error.

**Solution:**

Delete the pre-existing RabbitMQ secrets/ConfigMaps and let the operator recreate them:

```bash
# Delete RabbitMQ-related secrets
oc delete secret -n openstack -l app.kubernetes.io/part-of=rabbitmq

# Delete RabbitMQ-related ConfigMaps
oc delete configmap -n openstack -l app.kubernetes.io/part-of=rabbitmq

# Restart the RabbitMQ operator to trigger reconciliation
oc delete pod -n openstack-operators -l control-plane=rabbitmq-cluster-operator

# Wait for RabbitMQ clusters to reconcile and create fresh resources
oc get rabbitmqcluster -n openstack --watch

# After RabbitMQ clusters are ready, restore user credentials (step 11)
# See "RabbitMQ User Management" in the Scope section for details
```

**Prevention:**
- Always use the smart filtering approach in step 4 (Restore Secrets) which excludes RabbitMQ resources
- Never restore secrets/ConfigMaps with `app.kubernetes.io/part-of=rabbitmq` label
- Review the restore script output to ensure RabbitMQ resources were filtered

### Issue: Operator Not Reconciling

```bash
# Check operator is running
oc get pods -n openstack-operators

# Check operator logs
oc logs -n openstack-operators deployment/openstack-operator-controller-manager -f

# Verify CRDs are installed
oc get crd | grep openstack
```

### Issue: Secrets Not Found

```bash
# Verify secret exists
oc get secret <secret-name> -n openstack

# Check secret is referenced correctly in CR
oc get openstackcontrolplane -n openstack -o jsonpath='{.items[0].spec.secret}'
```

### Issue: Services Stuck in Pending

```bash
# Check pod events
oc describe pod <pod-name> -n openstack

# Common issues:
# - Storage class not available
# - Insufficient resources
# - Image pull errors
```

### Issue: Different StorageClass in Target

```bash
# List available storage classes in OpenShift
oc get storageclass

# Check which one is marked as default
oc get storageclass -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'

# Update the control plane CR before applying
# You can edit the JSON file directly or use jq to update it
vi openstackcontrolplane-backup.json
# Change the global storageClass to a class available in the target cluster
# Also check for service-specific storage class overrides in service templates:
# - spec.galera.templates[*].storageClass
# - spec.rabbitmq.templates[*].persistence.storageClassName
# - spec.ovn.template.ovnDBCluster[*].storageClass

# Or use jq to update programmatically:
# jq '.items[0].spec.storageClass = "new-storage-class-name"' openstackcontrolplane-backup.json > openstackcontrolplane-backup.json.tmp
# mv openstackcontrolplane-backup.json.tmp openstackcontrolplane-backup.json

# Common OpenShift storage classes:
# - ocs-storagecluster-ceph-rbd (OpenShift Data Foundation)
# - local-storage
```

---