Commit 9b54405

Merge pull request #586 from lmiccini/iha-docs-fixes
Fix InstanceHA documentation inaccuracies and add MetalLB guide
2 parents c8a2505 + fdcfb64 commit 9b54405

5 files changed

Lines changed: 209 additions & 42 deletions

config/samples/instanceha_v1beta1_instanceha.yaml

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ spec:
   #fencingSecret:
   #instanceHaConfigMap:
   #instanceHaKdumpPort:
+  #instanceHaHeartbeatPort:

docs/instanceha_architecture.md

Lines changed: 111 additions & 32 deletions
@@ -5,8 +5,6 @@
 InstanceHA is a high-availability service for OpenStack that automatically detects and evacuates instances from failed compute nodes.
 
 **Version**: 2.5
-**Code Size**: 3,645 lines
-**Test Suite**: 656 tests across 17 test suites
 
 ## Table of Contents
 
@@ -134,7 +132,7 @@ _config_map: Dict[str, ConfigItem] = {
     'HASH_INTERVAL': ConfigItem('int', 60, 30, 300),
     'ORCHESTRATED_RESTART': ConfigItem('bool', False),
     'SKIP_SERVERS_WITH_NAME': ConfigItem('list', []),
-    'EVACUATION_RETRIES': ConfigItem('int', 5, 1, 20),
+    'EVACUATION_RETRIES': ConfigItem('int', DEFAULT_EVACUATION_RETRIES, 1, 20), # DEFAULT_EVACUATION_RETRIES = 5
 }
 ```
 
@@ -297,16 +295,22 @@ def track_host_processing(service: 'InstanceHAService', hostname: str):
 **UDP Socket Management**:
 ```python
 class UDPSocketManager:
-    """Context manager for UDP socket with proper resource cleanup."""
+    """Context manager for UDP socket with proper resource cleanup.
+    Supports IPv4, IPv6, and dual-stack (::) binding."""
     def __init__(self, udp_ip, udp_port, label='UDP'):
         self.udp_ip = udp_ip
         self.udp_port = udp_port
         self.label = label
+        self.socket = None
 
     def __enter__(self):
-        self.socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+        info = socket.getaddrinfo(self.udp_ip or '::', self.udp_port,
+                                  type=socket.SOCK_DGRAM)[0]
+        self.socket = socket.socket(info[0], info[1])
+        if info[0] == socket.AF_INET6:
+            self.socket.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
         self.socket.settimeout(1.0)
-        self.socket.bind((self.udp_ip, self.udp_port))
+        self.socket.bind(info[4])
         return self.socket
 
     def __exit__(self, exc_type, exc_val, exc_tb):
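Note on the hunk above: the address-family selection in the new `__enter__` can be exercised on its own. The following is a minimal standalone sketch, not the project's code; the function name `open_udp_listener` is hypothetical, and it assumes, as the diff does, that the first `getaddrinfo` result is the one to bind.

```python
import socket


def open_udp_listener(host, port, timeout=1.0):
    """Bind a UDP socket whose address family follows the host string.

    Hypothetical helper mirroring the logic in the diff: an empty host
    falls back to '::', and IPv6 sockets clear IPV6_V6ONLY so a '::'
    bind can also receive IPv4 traffic on dual-stack systems.
    """
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples.
    family, socktype, _proto, _canon, sockaddr = socket.getaddrinfo(
        host or '::', port, type=socket.SOCK_DGRAM)[0]
    sock = socket.socket(family, socktype)
    try:
        if family == socket.AF_INET6:
            sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        sock.settimeout(timeout)
        sock.bind(sockaddr)
    except OSError:
        # Avoid leaking the descriptor if setsockopt or bind fails.
        sock.close()
        raise
    return sock
```

Passing `'127.0.0.1'` yields an `AF_INET` socket, while `'::1'` or an empty host yields `AF_INET6`; whether a `::` bind actually accepts IPv4 packets still depends on the host's dual-stack support.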
@@ -539,12 +543,12 @@ process_service(failed_service, reserved_hosts, resume, service)
 │ Called for kdump-fenced (not yet disabled)
 
 ├─ 3. Manage Reserved Hosts
-│ └─ _manage_reserved_hosts(conn, failed_service, reserved_hosts)
+│ └─ _manage_reserved_hosts(conn, failed_service, reserved_hosts, service)
 │ ├─ Match by aggregate or zone
 │ └─ Enable matching reserved host
 
 ├─ 4. Evacuate Servers
-│ └─ _host_evacuate(connection, failed_service, service)
+│ └─ _host_evacuate(connection, failed_service, service, target_host=None)
 │ ├─ Get evacuable images/flavors
 │ ├─ List servers on host
 │ ├─ Filter evacuable servers (ACTIVE, ERROR, SHUTOFF)
@@ -1082,7 +1086,7 @@ In addition, each event emission site also increments a corresponding Prometheus
 - `POD_NAMESPACE` environment variable set
 - `INSTANCEHA_CR_NAME` environment variable set to the InstanceHa CR name
 - `POD_NAME` environment variable set (used as `reportingInstance`)
-- RBAC: the operator ClusterRole must have `create` and `patch` permissions on `events`
+- RBAC: the pod's namespaced Role must have `create` and `patch` permissions on `events`
 
 ### Event Catalog
 
@@ -1649,6 +1653,12 @@ This ensures orphaned hosts are not left powered off indefinitely after a pod cr
 
 Emits `OrphanedHostRecovered` (Warning) for each recovered host.
 
+### Known Limitations
+
+**Lock reconciliation scope**: The VM unlock sweep at startup only checks servers on hosts that are `forced_down AND state='down'`. This covers the realistic crash scenario (pod dies mid-evacuation while the host is still fenced). However, if the pod crashes after evacuation succeeds and post-recovery clears `forced_down`, but before the `finally` block unlocks the VMs, the locked VMs on their new (healthy) hosts would not be found by the reconciliation. This window is extremely narrow — the `finally` block in `_server_evacuate_future` runs per-server immediately after each evacuation completes, so the crash would need to occur between the evacuation completing and the `finally` executing. In practice, a SIGKILL (OOM, node crash) during concurrent evacuation would leave VMs on hosts that are still `forced_down`, which the current reconciliation handles correctly.
+
+A broader scan (checking all servers with `locked_reason == LOCK_REASON_EVACUATION` regardless of host state) would close this theoretical gap but requires listing all servers in the cloud at startup, which may be expensive on large deployments.
+
 ---
 
 ## Graceful Shutdown
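Note on the Known Limitations hunk above: the "broader scan" it describes amounts to filtering every server by lock reason, independent of host state. The sketch below shows that filter only. It assumes each server object exposes a `locked_reason` attribute; the value assigned to `LOCK_REASON_EVACUATION` here is a made-up placeholder, and `find_evacuation_locked` is a hypothetical name, not a function from `instanceha.py`.

```python
from types import SimpleNamespace
from typing import Iterable, List

# Placeholder value for illustration only; the real constant lives in instanceha.py.
LOCK_REASON_EVACUATION = "example-evacuation-lock"


def find_evacuation_locked(servers: Iterable,
                           lock_reason: str = LOCK_REASON_EVACUATION) -> List:
    """Return servers still carrying the evacuation lock, whatever their host state."""
    return [s for s in servers
            if getattr(s, "locked_reason", None) == lock_reason]


# Stand-in server records; a real scan would iterate over every server in the cloud,
# which is the cost the documentation warns about.
demo_servers = [
    SimpleNamespace(id="vm-1", host="compute-0", locked_reason=LOCK_REASON_EVACUATION),
    SimpleNamespace(id="vm-2", host="compute-1", locked_reason=None),
    SimpleNamespace(id="vm-3", host="compute-2", locked_reason="operator maintenance"),
]
stale = find_evacuation_locked(demo_servers)
```

The filter itself is cheap; the expense is in producing the `servers` iterable, since the caller has to list all servers rather than only those on `forced_down` hosts.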
@@ -1826,10 +1836,6 @@ with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as executor:
 
 ## Testing
 
-### Test Statistics
-
-- **Total Tests**: 656 across 17 test suites
-
 ### Test Categories
 
 **1. Core Unit Tests** (`test_unit_core.py`):
@@ -1938,6 +1944,17 @@ Core unit tests covering:
 - Branching logic: `_host_evacuate` orchestrated/smart/traditional routing
 - Configuration: `ORCHESTRATED_RESTART` config key validation
 
+**18. Aggregate Threshold Tests** (`test_aggregate_threshold.py`):
+- Per-aggregate failure limit enforcement via `instanceha:max_failures` metadata
+- Multi-aggregate host membership and most-restrictive-limit selection
+- Interaction with global THRESHOLD percentage check
+
+**19. IPv6 UDP Tests** (`test_ipv6_udp.py`):
+- UDPSocketManager binding to IPv6 loopback, wildcard, and IPv4 addresses
+- Heartbeat reception over IPv6 (single, multiple, native byte order)
+- Kdump reception over IPv6 (single, multiple, invalid magic rejection)
+- Dual-stack (`::`) listener accepting both IPv4 and IPv6 packets
+
 ### Coverage by Component
 
 | Component | Coverage |
@@ -1978,24 +1995,75 @@ with patch('instanceha.time.sleep'):
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: instanceha
+  name: instanceha # matches the InstanceHa CR .metadata.name
 spec:
   replicas: 1
+  strategy:
+    type: Recreate
   template:
     spec:
+      serviceAccountName: instanceha-instanceha
+      securityContext:
+        fsGroup: 42401
+      terminationGracePeriodSeconds: 30
       containers:
       - name: instanceha
-        image: quay.io/openstack-k8s-operators/instanceha:latest
+        image: <resolved from infra-instanceha-config ConfigMap or RELATED_IMAGE env var>
+        command: ["/usr/bin/python3", "-u", "/var/lib/instanceha/instanceha.py"]
+        securityContext:
+          runAsUser: 42401
+          runAsGroup: 42401
+          runAsNonRoot: true
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop: ["ALL"]
         env:
         - name: OS_CLOUD
-          value: overcloud
+          value: default # from spec.openStackCloud
+        - name: CONFIG_HASH
+          value: <hash of all input configmaps/secrets>
+        - name: INSTANCEHA_DISABLED
+          value: "False" # from spec.disabled
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: POD_NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
+        - name: INSTANCEHA_CR_NAME
+          value: instanceha # the CR name
+        - name: HEARTBEAT_PORT # only set when > 0
+          value: "7411"
+        ports:
+        - containerPort: 8080
+          protocol: TCP
+          name: metrics
+        - containerPort: 7410
+          protocol: UDP
+          name: kdump
+        - containerPort: 7411
+          protocol: UDP
+          name: heartbeat
         volumeMounts:
-        - name: config
-          mountPath: /var/lib/instanceha
-        - name: clouds
-          mountPath: /home/cloud-admin/.config/openstack
-        - name: fencing
-          mountPath: /secrets
+        - name: openstack-config
+          mountPath: /home/cloud-admin/.config/openstack/clouds.yaml
+          subPath: clouds.yaml
+        - name: openstack-config-secret
+          mountPath: /home/cloud-admin/.config/openstack/secure.yaml
+          subPath: secure.yaml
+        - name: fencing-secret
+          mountPath: /secrets/fencing.yaml
+          subPath: fencing.yaml
+        - name: instanceha-script
+          mountPath: /var/lib/instanceha/instanceha.py
+          subPath: instanceha.py
+          readOnly: true
+        - name: instanceha-config
+          mountPath: /var/lib/instanceha/config.yaml
+          subPath: config.yaml
+          readOnly: true
         livenessProbe:
           httpGet:
             path: /
@@ -2011,15 +2079,24 @@ spec:
           periodSeconds: 10
           timeoutSeconds: 10
       volumes:
-      - name: config
+      - name: openstack-config
         configMap:
-          name: instanceha-config
-      - name: clouds
+          name: openstack-config # from spec.openStackConfigMap
+      - name: openstack-config-secret
         secret:
-          secretName: clouds-yaml
-      - name: fencing
+          secretName: openstack-config-secret # from spec.openStackConfigSecret
+          defaultMode: 0440
+      - name: fencing-secret
         secret:
-          secretName: fencing-credentials
+          secretName: fencing-secret # from spec.fencingSecret
+          defaultMode: 0440
+      - name: instanceha-script
+        configMap:
+          name: instanceha-sh # auto-generated: <cr-name>-sh
+          defaultMode: 0644
+      - name: instanceha-config
+        configMap:
+          name: instanceha-config # from spec.instanceHaConfigMap
 ```
 
 ---
@@ -2148,8 +2225,8 @@ config:
 ```
 
 **Notes**:
-- Value of `0` disables threshold checking
-- Value of `100` blocks all evacuations
+- Value of `0` blocks all evacuations (any failure percentage exceeds 0%)
+- Value of `100` disables threshold checking (nothing can exceed 100%)
 - Aggregate-aware calculation uses only evacuable hosts as denominator
 
 ---
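Note on the corrected THRESHOLD semantics above: both notes follow from a strict "exceeds" comparison on the failure percentage. The sketch below illustrates that arithmetic. The function name `evacuation_blocked` and the zero-host guard are assumptions for illustration, not taken from `instanceha.py`.

```python
def evacuation_blocked(failed_hosts: int, evacuable_hosts: int, threshold: int) -> bool:
    """Return True when the failed-host percentage strictly exceeds the threshold.

    Only evacuable hosts count toward the denominator, per the notes in the diff.
    """
    if evacuable_hosts <= 0:
        # Assumption: with no evacuable hosts there is no percentage to compare.
        return False
    failed_pct = failed_hosts / evacuable_hosts * 100
    return failed_pct > threshold
```

With `threshold=0`, a single failed host out of ten (10%) already exceeds the limit, so evacuation is blocked. With `threshold=100`, even ten of ten (100%) does not exceed it, so the check never triggers.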
@@ -3004,8 +3081,8 @@ config:
 
 ## References
 
-- **Code**: `instanceha.py` (3,572 lines)
-- **Tests**: 656 tests across 17 test suites
+- **Code**: `instanceha.py`
+- **Tests**:
   - `test_unit_core.py` (core unit tests)
   - `test_fencing_agents.py` (fencing agent tests)
   - `test_kdump_detection.py` (kdump detection tests)
@@ -3023,6 +3100,8 @@ config:
   - `test_orchestrated_evacuation.py` (orchestrated evacuation tests)
   - `test_heartbeat_detection.py` (heartbeat detection tests)
   - `test_heartbeat_scale.py` (heartbeat scale tests)
+  - `test_aggregate_threshold.py` (per-aggregate failure threshold tests)
+  - `test_ipv6_udp.py` (IPv6 UDP listener tests)
 - **Documentation**:
   - This file (instanceha_architecture.md)
   - [instanceha_guide.md](instanceha_guide.md) — Operator deployment and configuration guide
