What happened:
The Manila CSI node plugin crashes immediately on startup if its dependency, the proxied CSI driver socket (e.g., the NFS CSI plugin at /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock), is not yet available. There is no retry logic for transient connection errors, causing a fatal exit instead of a graceful retry.
What you expected to happen:
The Manila CSI node plugin should retry connecting to the proxied CSI driver endpoint when it encounters transient connection errors (codes.Unavailable), similar to how it already retries on timeout errors (codes.DeadlineExceeded).
Root Cause Analysis:
In initProxiedDriver() (pkg/csi/manila/driver.go), the driver creates a 15-second context and calls ProbeForever to probe the proxied CSI plugin socket. However, ProbeForever (from csi-lib-utils/rpc/common.go) retries only when the gRPC status code is DeadlineExceeded. When dialing the socket fails with "connection refused" or "no such file or directory", the gRPC status code is Unavailable, and ProbeForever returns an error immediately:
    // csi-lib-utils/rpc/common.go
    if st.Code() != codes.DeadlineExceeded {
        return fmt.Errorf("CSI driver probe failed: %s", err)
    }
This error propagates up through initProxiedDriver() → SetupNodeService() → main.go, which calls klog.Fatalf and terminates the process. The 15-second retry window is never used because only timeouts trigger a retry; connection failures do not.
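To make the retry policy concrete, here is a minimal stdlib-only sketch of a probe loop that retries on a timeout but returns on any other error. It is a model, not the csi-lib-utils code: errDeadlineExceeded, errUnavailable, probeLoop and runDemo are hypothetical names standing in for the gRPC codes and the real probe.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the two gRPC outcomes discussed in this issue.
var (
	errDeadlineExceeded = errors.New("deadline exceeded")
	errUnavailable      = errors.New("unavailable: connection refused")
)

// probeLoop models the described policy: retry only on a timeout and
// return any other error immediately. It reports how many probes ran.
func probeLoop(ctx context.Context, probe func() error, interval time.Duration) (int, error) {
	attempts := 0
	for {
		attempts++
		err := probe()
		if err == nil {
			return attempts, nil
		}
		if !errors.Is(err, errDeadlineExceeded) {
			return attempts, fmt.Errorf("probe failed: %s", err)
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(interval):
		}
	}
}

// runDemo probes with a fixed error inside a short retry window
// (200ms here instead of the 15s used by the driver).
func runDemo(probeErr error) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return probeLoop(ctx, func() error { return probeErr }, 10*time.Millisecond)
}

func main() {
	n, err := runDemo(errUnavailable)
	fmt.Printf("unavailable: attempts=%d err=%v\n", n, err)

	n, err = runDemo(errDeadlineExceeded)
	fmt.Printf("timeout:     attempts=%d err=%v\n", n, err)
}
```

In this model the Unavailable case gives up after a single attempt while the timeout case keeps probing until the context expires, which matches the behaviour seen in the log.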
The full error chain looks like:
F0511 02:55:43.317504 1 main.go:112] Driver node service initialization failed:
failed to initialize proxied CSI driver: probe failed: CSI driver probe failed:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error
while dialing: dial unix /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock: connect:
connection refused"
How to reproduce it:
- Deploy the Manila CSI driver with a proxied driver (e.g., NFS CSI plugin) on a Kubernetes/OpenShift cluster
- Reboot a node (or delete the proxied driver pod so its socket disappears)
- Both DaemonSet pods restart concurrently on the node
- The Manila CSI node plugin attempts to connect to the NFS socket before it's ready
- The Manila plugin fatally exits within ~1 second
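The underlying dial failure can also be seen without a cluster. The sketch below (assuming a Linux or other Unix host) dials a unix socket path that has not been created yet; the temporary directory and the dialMissingSocket helper are illustrative and not part of the driver.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net"
	"os"
	"path/filepath"
)

// dialMissingSocket dials a unix socket path that does not exist yet,
// as happens when the proxied driver pod has not created its socket.
func dialMissingSocket() error {
	dir, err := os.MkdirTemp("", "csi-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	conn, err := net.Dial("unix", filepath.Join(dir, "csi.sock"))
	if err == nil {
		conn.Close()
	}
	return err
}

func main() {
	err := dialMissingSocket()
	fmt.Println("dial error:", err)
	// Reports whether the failure is a "no such file or directory" error.
	fmt.Println("socket missing:", errors.Is(err, fs.ErrNotExist))
}
```

gRPC reports this kind of dial failure to the caller as codes.Unavailable, which is the code seen in the log above.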
Anything else we need to know?:
A possible fix could be to add retry logic for codes.Unavailable errors in initProxiedDriver(). The fix can be applied locally in cloud-provider-openstack without changes to csi-lib-utils, by wrapping the ProbeForever call in a retry loop. For example:
    func (d *Driver) initProxiedDriver() (csiNodeCapabilitySet, error) {
        conn, err := d.csiClientBuilder.NewConnection(d.fwdEndpoint)
        if err != nil {
            return nil, fmt.Errorf("connecting to %s endpoint failed: %v", d.fwdEndpoint, err)
        }
        defer conn.Close()

        ctx, cancel := context.WithTimeout(context.Background(), time.Second*15)
        defer cancel()

        identityClient := d.csiClientBuilder.NewIdentityServiceClient(conn)

        // Retry probing on transient connection errors (e.g., Unavailable)
        // since the proxied CSI driver may not be ready yet after a node restart.
        for {
            err = identityClient.ProbeForever(ctx, conn, time.Second*5)
            if err == nil {
                break
            }
            // ProbeForever formats the gRPC error with %s (see the snippet quoted
            // above), so the status code is probably not recoverable via the
            // status package. Fall back to matching the rendered message
            // (needs the "strings" import).
            if status.Code(err) == codes.Unavailable || strings.Contains(err.Error(), "code = Unavailable") {
                klog.Infof("Proxied CSI driver not yet available, retrying: %v", err)
                select {
                case <-ctx.Done():
                    return nil, fmt.Errorf("timed out waiting for proxied CSI driver: %v", err)
                case <-time.After(time.Second):
                    continue
                }
            }
            return nil, fmt.Errorf("probe failed: %v", err)
        }
        // ... rest of the function unchanged
Alternatively, an upstream fix to csi-lib-utils ProbeForever could be proposed to also retry on codes.Unavailable, since this is the expected transient error when a co-located CSI driver hasn't started yet.