 InstanceHA is a high-availability service for OpenStack that automatically detects and evacuates instances from failed compute nodes.
 
 **Version**: 2.5
-**Code Size**: 3,645 lines
-**Test Suite**: 656 tests across 17 test suites
 
 ## Table of Contents
 
@@ -134,7 +132,7 @@ _config_map: Dict[str, ConfigItem] = {
     'HASH_INTERVAL': ConfigItem('int', 60, 30, 300),
     'ORCHESTRATED_RESTART': ConfigItem('bool', False),
     'SKIP_SERVERS_WITH_NAME': ConfigItem('list', []),
-    'EVACUATION_RETRIES': ConfigItem('int', 5, 1, 20),
+    'EVACUATION_RETRIES': ConfigItem('int', DEFAULT_EVACUATION_RETRIES, 1, 20),  # DEFAULT_EVACUATION_RETRIES = 5
 }
 ```
 
@@ -297,16 +295,22 @@ def track_host_processing(service: 'InstanceHAService', hostname: str):
 **UDP Socket Management**:
 ```python
 class UDPSocketManager:
-    """Context manager for UDP socket with proper resource cleanup."""
+    """Context manager for UDP socket with proper resource cleanup.
+    Supports IPv4, IPv6, and dual-stack (::) binding."""
     def __init__(self, udp_ip, udp_port, label='UDP'):
         self.udp_ip = udp_ip
         self.udp_port = udp_port
         self.label = label
+        self.socket = None
 
     def __enter__(self):
-        self.socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+        info = socket.getaddrinfo(self.udp_ip or '::', self.udp_port,
+                                  type=socket.SOCK_DGRAM)[0]
+        self.socket = socket.socket(info[0], info[1])
+        if info[0] == socket.AF_INET6:
+            self.socket.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
         self.socket.settimeout(1.0)
-        self.socket.bind((self.udp_ip, self.udp_port))
+        self.socket.bind(info[4])
         return self.socket
 
     def __exit__(self, exc_type, exc_val, exc_tb):
@@ -539,12 +543,12 @@ process_service(failed_service, reserved_hosts, resume, service)
 │     Called for kdump-fenced (not yet disabled)
 │
 ├─ 3. Manage Reserved Hosts
-│  └─ _manage_reserved_hosts(conn, failed_service, reserved_hosts)
+│  └─ _manage_reserved_hosts(conn, failed_service, reserved_hosts, service)
 │     ├─ Match by aggregate or zone
 │     └─ Enable matching reserved host
 │
 ├─ 4. Evacuate Servers
-│  └─ _host_evacuate(connection, failed_service, service)
+│  └─ _host_evacuate(connection, failed_service, service, target_host=None)
 │     ├─ Get evacuable images/flavors
 │     ├─ List servers on host
 │     ├─ Filter evacuable servers (ACTIVE, ERROR, SHUTOFF)
@@ -1082,7 +1086,7 @@ In addition, each event emission site also increments a corresponding Prometheus
 - `POD_NAMESPACE` environment variable set
 - `INSTANCEHA_CR_NAME` environment variable set to the InstanceHa CR name
 - `POD_NAME` environment variable set (used as `reportingInstance`)
-- RBAC: the operator ClusterRole must have `create` and `patch` permissions on `events`
+- RBAC: the pod's namespaced Role must have `create` and `patch` permissions on `events`
 
 ### Event Catalog
 
@@ -1649,6 +1653,12 @@ This ensures orphaned hosts are not left powered off indefinitely after a pod cr
 
 Emits `OrphanedHostRecovered` (Warning) for each recovered host.
 
+### Known Limitations
+
+**Lock reconciliation scope**: The VM unlock sweep at startup only checks servers on hosts that are `forced_down AND state='down'`. This covers the realistic crash scenario (the pod dies mid-evacuation while the host is still fenced). However, if the pod crashes after evacuation succeeds and post-recovery clears `forced_down`, but before the `finally` block unlocks the VMs, the reconciliation would not find the locked VMs on their new (healthy) hosts. This window is extremely narrow: the `finally` block in `_server_evacuate_future` runs per server immediately after each evacuation completes, so the crash would need to occur between the evacuation completing and the `finally` block executing. In practice, a SIGKILL (OOM, node crash) during concurrent evacuation would leave VMs on hosts that are still `forced_down`, which the current reconciliation handles correctly.
+
+A broader scan (checking all servers with `locked_reason == LOCK_REASON_EVACUATION` regardless of host state) would close this theoretical gap, but it requires listing all servers in the cloud at startup, which may be expensive on large deployments.
+
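The broader scan described under Known Limitations could be sketched roughly as follows. This is an illustration, not code from `instanceha.py`: the function name `find_stale_evacuation_locks`, the plain-dict server representation, and the string assigned to `LOCK_REASON_EVACUATION` are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Hypothetical value: the real LOCK_REASON_EVACUATION constant lives in instanceha.py.
LOCK_REASON_EVACUATION = "instanceha-evacuation"


def find_stale_evacuation_locks(servers: Iterable[Dict]) -> List[str]:
    """Return IDs of servers still locked for evacuation, regardless of host state."""
    return [
        server["id"]
        for server in servers
        if server.get("locked")
        and server.get("locked_reason") == LOCK_REASON_EVACUATION
    ]


# Servers are modelled as plain dicts here (an assumption for the sketch).
servers = [
    {"id": "vm-1", "locked": True, "locked_reason": LOCK_REASON_EVACUATION},
    {"id": "vm-2", "locked": True, "locked_reason": "maintenance"},
    {"id": "vm-3", "locked": False, "locked_reason": None},
]
stale = find_stale_evacuation_locks(servers)
```

The filter itself is cheap. The cost noted above comes from having to list every server in the cloud to feed it.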
 ---
 
 ## Graceful Shutdown
@@ -1826,10 +1836,6 @@ with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as executor:
 
 ## Testing
 
-### Test Statistics
-
-- **Total Tests**: 656 across 17 test suites
-
 ### Test Categories
 
 **1. Core Unit Tests** (`test_unit_core.py`):
@@ -1938,6 +1944,17 @@ Core unit tests covering:
 - Branching logic: `_host_evacuate` orchestrated/smart/traditional routing
 - Configuration: `ORCHESTRATED_RESTART` config key validation
 
+**18. Aggregate Threshold Tests** (`test_aggregate_threshold.py`):
+- Per-aggregate failure limit enforcement via `instanceha:max_failures` metadata
+- Multi-aggregate host membership and most-restrictive-limit selection
+- Interaction with global THRESHOLD percentage check
+
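One plausible reading of "most-restrictive-limit selection" is that a host belonging to several aggregates is governed by the smallest `instanceha:max_failures` value among them. The sketch below shows that reading only. The function name, the metadata-as-dict shape, and the choice to skip non-integer values are assumptions, not taken from the InstanceHA source.

```python
from typing import Dict, List, Optional

MAX_FAILURES_KEY = "instanceha:max_failures"


def most_restrictive_limit(aggregate_metadata: List[Dict[str, str]]) -> Optional[int]:
    """Return the smallest max_failures among a host's aggregates, or None if unset."""
    limits = []
    for metadata in aggregate_metadata:
        raw = metadata.get(MAX_FAILURES_KEY)
        if raw is None:
            continue  # this aggregate defines no per-aggregate limit
        try:
            limits.append(int(raw))
        except ValueError:
            continue  # assumption: ignore values that are not integers
    return min(limits) if limits else None


# A host in three aggregates: two with limits, one without.
limit = most_restrictive_limit([
    {MAX_FAILURES_KEY: "3"},
    {MAX_FAILURES_KEY: "1"},
    {},
])
```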
+**19. IPv6 UDP Tests** (`test_ipv6_udp.py`):
+- UDPSocketManager binding to IPv6 loopback, wildcard, and IPv4 addresses
+- Heartbeat reception over IPv6 (single, multiple, native byte order)
+- Kdump reception over IPv6 (single, multiple, invalid magic rejection)
+- Dual-stack (`::`) listener accepting both IPv4 and IPv6 packets
+
 ### Coverage by Component
 
 | Component | Coverage |
@@ -1978,24 +1995,75 @@ with patch('instanceha.time.sleep'):
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: instanceha
+  name: instanceha  # matches the InstanceHa CR .metadata.name
 spec:
   replicas: 1
+  strategy:
+    type: Recreate
   template:
     spec:
+      serviceAccountName: instanceha-instanceha
+      securityContext:
+        fsGroup: 42401
+      terminationGracePeriodSeconds: 30
       containers:
       - name: instanceha
-        image: quay.io/openstack-k8s-operators/instanceha:latest
+        image: <resolved from infra-instanceha-config ConfigMap or RELATED_IMAGE env var>
+        command: ["/usr/bin/python3", "-u", "/var/lib/instanceha/instanceha.py"]
+        securityContext:
+          runAsUser: 42401
+          runAsGroup: 42401
+          runAsNonRoot: true
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop: ["ALL"]
         env:
         - name: OS_CLOUD
-          value: overcloud
+          value: default  # from spec.openStackCloud
+        - name: CONFIG_HASH
+          value: <hash of all input configmaps/secrets>
+        - name: INSTANCEHA_DISABLED
+          value: "False"  # from spec.disabled
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: POD_NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
+        - name: INSTANCEHA_CR_NAME
+          value: instanceha  # the CR name
+        - name: HEARTBEAT_PORT  # only set when > 0
+          value: "7411"
+        ports:
+        - containerPort: 8080
+          protocol: TCP
+          name: metrics
+        - containerPort: 7410
+          protocol: UDP
+          name: kdump
+        - containerPort: 7411
+          protocol: UDP
+          name: heartbeat
         volumeMounts:
-        - name: config
-          mountPath: /var/lib/instanceha
-        - name: clouds
-          mountPath: /home/cloud-admin/.config/openstack
-        - name: fencing
-          mountPath: /secrets
+        - name: openstack-config
+          mountPath: /home/cloud-admin/.config/openstack/clouds.yaml
+          subPath: clouds.yaml
+        - name: openstack-config-secret
+          mountPath: /home/cloud-admin/.config/openstack/secure.yaml
+          subPath: secure.yaml
+        - name: fencing-secret
+          mountPath: /secrets/fencing.yaml
+          subPath: fencing.yaml
+        - name: instanceha-script
+          mountPath: /var/lib/instanceha/instanceha.py
+          subPath: instanceha.py
+          readOnly: true
+        - name: instanceha-config
+          mountPath: /var/lib/instanceha/config.yaml
+          subPath: config.yaml
+          readOnly: true
         livenessProbe:
           httpGet:
             path: /
@@ -2011,15 +2079,24 @@ spec:
           periodSeconds: 10
           timeoutSeconds: 10
       volumes:
-      - name: config
+      - name: openstack-config
         configMap:
-          name: instanceha-config
-      - name: clouds
+          name: openstack-config  # from spec.openStackConfigMap
+      - name: openstack-config-secret
         secret:
-          secretName: clouds-yaml
-      - name: fencing
+          secretName: openstack-config-secret  # from spec.openStackConfigSecret
+          defaultMode: 0440
+      - name: fencing-secret
         secret:
-          secretName: fencing-credentials
+          secretName: fencing-secret  # from spec.fencingSecret
+          defaultMode: 0440
+      - name: instanceha-script
+        configMap:
+          name: instanceha-sh  # auto-generated: <cr-name>-sh
+          defaultMode: 0644
+      - name: instanceha-config
+        configMap:
+          name: instanceha-config  # from spec.instanceHaConfigMap
 ```
 
 ---
@@ -2148,8 +2225,8 @@ config:
 ```
 
 **Notes**:
-- Value of `0` disables threshold checking
-- Value of `100` blocks all evacuations
+- Value of `0` blocks all evacuations (any failure percentage exceeds 0%)
+- Value of `100` disables threshold checking (nothing can exceed 100%)
 - Aggregate-aware calculation uses only evacuable hosts as denominator
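
The corrected notes follow from a strict "greater than" comparison between the failure percentage and the threshold. A minimal sketch of that comparison is shown here. The function name `threshold_exceeded`, its parameters, and the handling of a zero denominator are assumptions for illustration, not the actual InstanceHA code.

```python
def threshold_exceeded(failed_hosts: int, evacuable_hosts: int, threshold: int) -> bool:
    """Return True when the failed share of evacuable hosts is strictly above threshold."""
    if evacuable_hosts <= 0:
        return False  # assumption: with no evacuable hosts there is nothing to compare
    failed_pct = failed_hosts / evacuable_hosts * 100
    return failed_pct > threshold


# threshold=0: a single failed host already exceeds 0%, so evacuation is blocked.
blocked_at_zero = threshold_exceeded(1, 10, 0)
# threshold=100: even a total outage is 100%, which does not exceed 100.
blocked_at_hundred = threshold_exceeded(10, 10, 100)
```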
 
 ---
@@ -3004,8 +3081,8 @@ config:
 
 ## References
 
-- **Code**: `instanceha.py` (3,572 lines)
-- **Tests**: 656 tests across 17 test suites
+- **Code**: `instanceha.py`
+- **Tests**:
   - `test_unit_core.py` (core unit tests)
   - `test_fencing_agents.py` (fencing agent tests)
   - `test_kdump_detection.py` (kdump detection tests)
@@ -3023,6 +3100,8 @@ config:
   - `test_orchestrated_evacuation.py` (orchestrated evacuation tests)
   - `test_heartbeat_detection.py` (heartbeat detection tests)
   - `test_heartbeat_scale.py` (heartbeat scale tests)
+  - `test_aggregate_threshold.py` (per-aggregate failure threshold tests)
+  - `test_ipv6_udp.py` (IPv6 UDP listener tests)
 - **Documentation**:
   - This file (instanceha_architecture.md)
   - [instanceha_guide.md](instanceha_guide.md): Operator deployment and configuration guide