[manila-csi-plugin] Node plugin fatally exits on startup when proxied CSI driver socket is not yet ready #3111

@mandre

What happened:

The Manila CSI node plugin crashes immediately on startup if its dependency, the proxied CSI driver socket (e.g., the NFS CSI plugin at /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock), is not yet available. There is no retry logic for transient connection errors, causing a fatal exit instead of a graceful retry.

What you expected to happen:

The Manila CSI node plugin should retry connecting to the proxied CSI driver endpoint when it encounters transient connection errors (codes.Unavailable), similar to how it already retries on timeout errors (codes.DeadlineExceeded).

Root Cause Analysis:

In initProxiedDriver() (pkg/csi/manila/driver.go), the driver creates a 15-second context and calls ProbeForever to probe the proxied CSI plugin socket. However, ProbeForever (from csi-lib-utils/rpc/common.go) retries only when the gRPC status code is DeadlineExceeded. When dialing the socket fails with "connection refused" or "no such file or directory", the gRPC status code is Unavailable, and ProbeForever returns an error immediately:

// csi-lib-utils/rpc/common.go
if st.Code() != codes.DeadlineExceeded {
    return fmt.Errorf("CSI driver probe failed: %s", err)
}

This error propagates up through initProxiedDriver() -> SetupNodeService() -> main.go, which calls klog.Fatalf and terminates the process. The 15-second retry window is never used, because only timeouts trigger a retry, not connection refusals.

The full error chain looks like:

F0511 02:55:43.317504  1 main.go:112] Driver node service initialization failed:
  failed to initialize proxied CSI driver: probe failed: CSI driver probe failed:
  rpc error: code = Unavailable desc = connection error: desc = "transport: Error
  while dialing: dial unix /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock: connect:
  connection refused"

How to reproduce it:

  1. Deploy the Manila CSI driver with a proxied driver (e.g., NFS CSI plugin) on a Kubernetes/OpenShift cluster
  2. Reboot a node (or delete the proxied driver pod so its socket disappears)
  3. Both DaemonSet pods restart concurrently on the node
  4. The Manila CSI node plugin attempts to connect to the NFS socket before it's ready
  5. The Manila plugin fatally exits within ~1 second

Anything else we need to know?:

One possible fix is to add retry logic for codes.Unavailable errors in initProxiedDriver(). It can be applied locally in cloud-provider-openstack, without changes to csi-lib-utils, by wrapping the ProbeForever call in a retry loop. For example:

func (d *Driver) initProxiedDriver() (csiNodeCapabilitySet, error) {
	conn, err := d.csiClientBuilder.NewConnection(d.fwdEndpoint)
	if err != nil {
		return nil, fmt.Errorf("connecting to %s endpoint failed: %v", d.fwdEndpoint, err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*15)
	defer cancel()

	identityClient := d.csiClientBuilder.NewIdentityServiceClient(conn)

	// Retry probing on transient connection errors (e.g., Unavailable)
	// since the proxied CSI driver may not be ready yet after a node restart.
	for {
		err = identityClient.ProbeForever(ctx, conn, time.Second*5)
		if err == nil {
			break
		}
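		// ProbeForever formats the underlying error with %s (see the snippet
		// above), so the gRPC status may not survive status.FromError. The
		// fallback on the status text is an assumption, not existing driver
		// code, and needs the "strings" import.
		st, ok := status.FromError(err)
		if (ok && st.Code() == codes.Unavailable) || strings.Contains(err.Error(), "code = Unavailable") {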
			klog.Infof("Proxied CSI driver not yet available, retrying: %v", err)
			select {
			case <-ctx.Done():
				return nil, fmt.Errorf("timed out waiting for proxied CSI driver: %v", err)
			case <-time.After(time.Second):
				continue
			}
		}
		return nil, fmt.Errorf("probe failed: %v", err)
	}

	// ... rest of the function unchanged

Alternatively, an upstream fix to csi-lib-utils ProbeForever could be proposed to also retry on codes.Unavailable, since this is the expected transient error when a co-located CSI driver hasn't started yet.
