[manila-csi-plugin] Node plugin fatally exits on startup when proxied CSI driver socket is not yet ready

**What happened**:

The Manila CSI node plugin crashes immediately on startup if its dependency, the proxied CSI driver socket (e.g., the NFS CSI plugin at `/var/lib/kubelet/plugins/csi-nfsplugin/csi.sock`), is not yet available. There is no retry logic for transient connection errors, causing a fatal exit instead of a graceful retry.

**What you expected to happen**:

The Manila CSI node plugin should retry connecting to the proxied CSI driver endpoint when it encounters transient connection errors (`codes.Unavailable`), similar to how it already retries on timeout errors (`codes.DeadlineExceeded`).

**Root Cause Analysis**:

In [`pkg/csi/manila/driver.go` `initProxiedDriver()`](https://github.com/kubernetes/cloud-provider-openstack/blob/46904da2e07ff10b1ec626150fe0b4c96fdbf162/pkg/csi/manila/driver.go#L265-L294), the driver creates a context with a 15-second timeout and calls `ProbeForever` to probe the proxied CSI plugin socket. However, `ProbeForever` (from [`csi-lib-utils/rpc/common.go`](https://github.com/kubernetes-csi/csi-lib-utils/blob/4d601c285c808f9bbe76956ac3fe491d4a774b07/rpc/common.go#L134-L170)) only retries when the gRPC status code is `DeadlineExceeded`. When dialing the socket fails with "connection refused" or "no such file or directory", the gRPC status code is `Unavailable`, and `ProbeForever` immediately returns an error:

```go
// csi-lib-utils/rpc/common.go
if st.Code() != codes.DeadlineExceeded {
    return fmt.Errorf("CSI driver probe failed: %s", err)
}
```

This error propagates up through `initProxiedDriver()` → `SetupNodeService()` → `main.go`, which calls `klog.Fatalf` and terminates the process. The 15-second retry window is never actually used because only timeouts trigger a retry, not connection failures.

The full error chain looks like:

```
F0511 02:55:43.317504  1 main.go:112] Driver node service initialization failed:
  failed to initialize proxied CSI driver: probe failed: CSI driver probe failed:
  rpc error: code = Unavailable desc = connection error: desc = "transport: Error
  while dialing: dial unix /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock: connect:
  connection refused"
```

**How to reproduce it**:

1. Deploy the Manila CSI driver with a proxied driver (e.g., NFS CSI plugin) on a Kubernetes/OpenShift cluster
2. Reboot a node (or delete the proxied driver pod so its socket disappears)
3. Both DaemonSet pods restart concurrently on the node
4. The Manila CSI node plugin attempts to connect to the NFS socket before it's ready
5. The Manila plugin fatally exits within ~1 second


**Anything else we need to know?**:

One possible fix is to add retry logic for `codes.Unavailable` errors in `initProxiedDriver()`. This can be done locally in cloud-provider-openstack, without changes to `csi-lib-utils`, by wrapping the `ProbeForever` call in a retry loop. For example:

```go
func (d *Driver) initProxiedDriver() (csiNodeCapabilitySet, error) {
	conn, err := d.csiClientBuilder.NewConnection(d.fwdEndpoint)
	if err != nil {
		return nil, fmt.Errorf("connecting to %s endpoint failed: %v", d.fwdEndpoint, err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*15)
	defer cancel()

	identityClient := d.csiClientBuilder.NewIdentityServiceClient(conn)

	// Retry probing on transient connection errors (e.g., Unavailable)
	// since the proxied CSI driver may not be ready yet after a node restart.
	for {
		err = identityClient.ProbeForever(ctx, conn, time.Second*5)
		if err == nil {
			break
		}
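		// ProbeForever formats the gRPC error with %s (see the snippet above), so
		// status.FromError may not recover the code. As an assumption, fall back to
		// matching the message text (requires importing "strings").
		st, ok := status.FromError(err)
		if (ok && st.Code() == codes.Unavailable) || strings.Contains(err.Error(), "code = Unavailable") {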
			klog.Infof("Proxied CSI driver not yet available, retrying: %v", err)
			select {
			case <-ctx.Done():
				return nil, fmt.Errorf("timed out waiting for proxied CSI driver: %v", err)
			case <-time.After(time.Second):
				continue
			}
		}
		return nil, fmt.Errorf("probe failed: %v", err)
	}

	// ... rest of the function unchanged
```

Alternatively, an upstream fix to `csi-lib-utils` `ProbeForever` could be proposed to also retry on `codes.Unavailable`, since this is the expected transient error when a co-located CSI driver hasn't started yet.
