Commit 0123194

Adding explanation about potential SPIRE agents issues after cluster restart (#129)
* Adding explanation about potential SPIRE server and agents issues in case of cluster longer shutdown
* Fixing lint errors
1 parent 12243a0 commit 0123194

1 file changed

Lines changed: 131 additions & 0 deletions

@@ -0,0 +1,131 @@
# SPIRE Self-Signed CA Expiration on Cluster Restart

## Overview

When the ZTVP pattern is deployed with SPIRE's default self-signed CA (no upstream authority configured), a cluster shutdown that lasts longer than the CA TTL breaks SPIRE agent attestation after restart. As a result, all SPIFFE-dependent workloads (e.g., qtodo) fail to start until an operator intervenes manually.

> **Note:** This issue does not affect production deployments that use an external/upstream CA (e.g., customer PKI, Vault PKI backend, cert-manager). In those configurations, the trust anchor persists across restarts.

## Problem Description

### SPIRE Server Default CA Configuration

The SPIRE server is configured with a self-signed CA with the following TTLs (from the `spire-server` ConfigMap in the `zero-trust-workload-identity-manager` namespace):

| Parameter | Value | Description |
|---|---|---|
| `ca_ttl` | `24h` | Lifetime of the self-signed X509 CA certificate |
| `default_x509_svid_ttl` | `1h` | Lifetime of X509 SVIDs issued to workloads |
| `default_jwt_svid_ttl` | `5m` | Lifetime of JWT SVIDs issued to workloads |

The CA rotates approximately every 12 hours (half of `ca_ttl`), and the `k8sbundle` notifier plugin pushes the current trust bundle to the `spire-bundle` ConfigMap.

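To confirm which TTLs are actually in effect on a given cluster, the relevant settings can be filtered out of the ConfigMap. The following is a minimal sketch: `show_spire_ttls` is a hypothetical helper name, and the filter assumes the setting names appear literally in the ConfigMap data (whether it is stored as HCL or JSON).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the pattern): read a SPIRE server
# configuration on stdin and print only the three TTL settings.
# Each match runs from the setting name up to the next comma or closing brace.
show_spire_ttls() {
  grep -oE '(ca_ttl|default_x509_svid_ttl|default_jwt_svid_ttl)[^,}]*'
}

# Intended use against a live cluster (ConfigMap name and namespace as given above):
#   oc get configmap spire-server -n zero-trust-workload-identity-manager -o yaml | show_spire_ttls
```

If the printed values differ from the table, the timings discussed in the rest of this document shift accordingly.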
### What Happens During Cluster Shutdown

1. The cluster shuts down and all pods (including the SPIRE server and agents) stop
2. The X509 CA certificate continues to age while the cluster is offline
3. If the cluster is offline for longer than `ca_ttl` (24 hours), every CA the server holds has expired by the time it restarts; a shorter outage can have the same effect if the newest CA was already well into its lifetime at shutdown

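The expiry condition is simple arithmetic: a CA is expired on restart when its age at shutdown plus the outage duration exceeds `ca_ttl`. The sketch below illustrates this with whole hours; `ca_expired_on_restart` is a hypothetical helper, and the 24-hour default is taken from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when a CA that was age_h hours old at shutdown
# is expired after outage_h hours offline, for a TTL of ttl_h hours (default 24).
ca_expired_on_restart() {
  local age_h=$1 outage_h=$2 ttl_h=${3:-24}
  (( age_h + outage_h > ttl_h ))
}

ca_expired_on_restart 2 25  && echo "expired"      # 25h outage: expired regardless of age
ca_expired_on_restart 10 16 && echo "expired"      # 16h outage, CA was already 10h old
ca_expired_on_restart 2 16  || echo "still valid"  # 16h outage, CA was only 2h old
```

An outage longer than `ca_ttl` therefore always triggers the problem, while a shorter one may or may not, depending on how recently the CA was rotated.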
### What Happens on Cluster Restart

1. **SPIRE server** starts first, detects that all CA slots are expired, and generates a **new self-signed CA** with a new key pair
2. **SPIRE agents** (DaemonSet) start and attempt to re-attest against the server
3. Agents still have the **old trust bundle** cached, so they do not trust the server's new CA
4. Agent attestation fails with:

```text
transport: authentication handshake failed: x509svid: could not verify leaf certificate:
x509: certificate signed by unknown authority
```

5. Agents enter `CrashLoopBackOff` or `Error` state
6. **All SPIFFE-dependent workloads** (e.g., qtodo) cannot obtain SVIDs and remain stuck in init or fail health checks

### Symptoms

- SPIRE agent pods in `CrashLoopBackOff` or `Error` state across all nodes
- SPIRE server logs showing `X509CA slot unusable` with `error="slot expired"` for all slots
- SPIRE server logs showing that a new CA was prepared and activated
- Workload pods stuck in `Init` state (SPIFFE helper init containers cannot connect to the agent)
- SPIFFE OIDC discovery provider restarting repeatedly

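A quick way to spot the pod-level symptoms is to list every pod in a namespace whose status is neither `Running` nor `Completed`. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper, and it assumes the default `oc get pods` column layout for a single namespace (NAME, READY, STATUS, ...).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `oc get pods` output on stdin, skip the header row,
# and print the name and status of every pod that is not Running or Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Intended use against a live cluster:
#   oc get pods -n zero-trust-workload-identity-manager | unhealthy_pods
#   oc get pods -n qtodo | unhealthy_pods
```

Pods stuck in init containers appear with a status such as `Init:0/1`, so they are reported along with crash-looping agents.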
## Recovery Procedure

### Step 1: Verify the Issue

Check the SPIRE server logs for expired CA slots and new CA generation:

```bash
oc logs spire-server-0 -n zero-trust-workload-identity-manager -c spire-server | grep -E "slot unusable|CA prepared|CA activated"
```

Expected output, showing the expired old slots and a newly activated CA:

```text
level=warning msg="X509CA slot unusable" error="slot expired" ...
level=info msg="X509 CA prepared" ...
level=info msg="X509 CA activated" ...
```

Check SPIRE agent logs for the trust failure:

```bash
oc logs <spire-agent-pod> -n zero-trust-workload-identity-manager | grep "unknown authority"
```

### Step 2: Restart SPIRE Agents

Restart the SPIRE agent DaemonSet so agents pick up the new trust bundle from the `spire-bundle` ConfigMap (which the server's `k8sbundle` notifier has already updated):

```bash
oc rollout restart daemonset/spire-agent -n zero-trust-workload-identity-manager
```

Wait for all agents to become ready:

```bash
oc rollout status daemonset/spire-agent -n zero-trust-workload-identity-manager
```

### Step 3: Restart Affected Workloads

Any workload pods that were stuck in init or crash-looping because they could not obtain an SVID need to be restarted:

```bash
# Example: restart qtodo
oc delete pod -l app=qtodo -n qtodo

# List pods in any namespace that are still stuck
# (matches on status only, so it is not limited to SPIFFE workloads)
oc get pods --all-namespaces | grep -E 'Init|Error|CrashLoop'
```

### Step 4: Verify Recovery

Confirm SPIRE agents are healthy:

```bash
oc get pods -n zero-trust-workload-identity-manager | grep spire-agent
# All should show 1/1 Running with 0 restarts
```

Confirm workloads are running:

```bash
oc get pods -n qtodo
# qtodo pod should show 3/3 Running
```

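The agent check above can also be expressed as a pass/fail test, which is convenient when recovery is scripted. The following is a minimal sketch: `agents_healthy` is a hypothetical helper, and it assumes agent pod names start with `spire-agent` and that `oc get pods` uses its default columns (NAME, READY, STATUS, RESTARTS, AGE).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `oc get pods` output on stdin and exit 0 only if
# at least one spire-agent pod is listed and every one of them shows
# READY 1/1, STATUS Running, and RESTARTS 0.
agents_healthy() {
  awk '
    /^spire-agent/ {
      seen = 1
      if ($2 != "1/1" || $3 != "Running" || $4 != "0") bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Intended use against a live cluster:
#   oc get pods -n zero-trust-workload-identity-manager | agents_healthy \
#     && echo "agents healthy" || echo "agents not yet healthy"
```

Treating an empty listing as a failure avoids reporting success when the DaemonSet has not scheduled any pods yet.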
## Production Consideration

This issue is specific to the **self-signed CA** configuration used in the default ZTVP deployment. In production environments, customers should configure SPIRE with an **UpstreamAuthority plugin** pointing to their organization's PKI. With an upstream CA:

- The trust anchor remains stable across SPIRE server restarts
- SPIRE agents always trust the server's certificate chain because it chains back to the persistent upstream CA
- No manual intervention is required after cluster restart regardless of downtime duration

Supported upstream authority plugins include:

- **Vault PKI** (`UpstreamAuthority "vault"`): integrates with HashiCorp Vault's PKI secrets engine
- **cert-manager** (`UpstreamAuthority "cert-manager"`): uses Kubernetes cert-manager as the CA
- **AWS PCA** (`UpstreamAuthority "aws_pca"`): uses AWS Private Certificate Authority
- **Disk-based** (`UpstreamAuthority "disk"`): uses a CA certificate and key stored on disk
