Issues starting cluster when one of the provided nodes is not available [JIRA: CLIENTS-619] #30

@tegioz

Hi Luke,

I've noticed a couple of issues when starting a cluster in which one of the provided nodes is not available. The scenario is a cluster set up with 2 nodes, one of them valid and the other invalid (just a node that might be down when my service restarts, for example).

The first one is related to the minimum number of connections. When the node's connection manager starts, it tries to create MinConnections connections to the node (that's ok, it's the expected behaviour). The problem is that if the node is not available, each attempt may take up to ConnectTimeout to fail (I think the default is 30s). If you have, let's say, 10 min connections, that's 10 x 30s = 5 min of waiting at startup, which isn't nice. Maybe the loop that creates them (shown below) could be broken out of after the first error is detected?

    for i := uint16(0); i < cm.minConnections; i++ {
        conn, err := cm.create()
        if err == nil {
            cm.put(conn)
        } else {
            logErr("[connectionManager]", err)
        }
    }

After that, whether or not the connections were created, the node doesn't seem to be marked as failing (its state is set to running, and in the eyes of the cluster it is just another node ready to process requests). That brings me to the second issue. The cluster is ready to process requests, so I send a couple of them to it.

The first one goes to the healthy node and works ok, but the second one is sent to the node that is down, which blocks again for 30s (ConnectTimeout again, triggered by the connection manager trying to create a new connection because it couldn't get one from the pool).

This second attempt to create a connection triggers the health checker (code below), after which everything works as expected, since nodes seem to bypass execution requests while they are in the nodeHealthChecking state.

riak-go-client/node.go

Lines 159 to 164 in 315755e

    conn, err := n.cm.get()
    if err != nil {
        logErr("[Node]", err)
        n.doHealthCheck()
        return false, err
    }

So it would be great if the health checker could be launched as soon as a connection failure is noticed during cluster initialisation. Combined with a smaller ConnectTimeout, I think everything would be ok.

Thank you very much in advance,
Best.-
