Hi Luke,
I've noticed a couple of issues when starting a cluster in which one of the provided nodes is not available. The scenario is a cluster with 2 nodes, one valid and one invalid (just a node that might be down when my service restarts, for example).
The first issue is related to the minimum number of connections. When the node's connection manager starts, it tries to create MinConnections connections to the node (that's ok, it's the expected behaviour). The problem is that if the node is not available, each attempt may take up to ConnectTimeout to fail (I think the default is 30s). With, say, MinConnections set to 10, that's 10 x 30s = 5 minutes of blocking, which isn't nice. Maybe the check loop (see the link below) could break after the first error is detected?
riak-go-client/connection_manager.go, lines 107 to 114 in 315755e:

    for i := uint16(0); i < cm.minConnections; i++ {
        conn, err := cm.create()
        if err == nil {
            cm.put(conn)
        } else {
            logErr("[connectionManager]", err)
        }
    }
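To make the suggestion concrete, here is a minimal sketch of a warm-up loop that stops at the first failure. It does not use the real riak-go-client types: `pool`, `warmUp`, `errDown` and the `create` field are hypothetical names made up for illustration, and the failing dial is simulated rather than waiting for a real timeout.

```go
package main

import (
	"errors"
	"fmt"
)

// errDown simulates the error a dial returns once ConnectTimeout expires.
var errDown = errors.New("dial tcp: i/o timeout")

// pool is a hypothetical stand-in for the connection manager,
// not the real riak-go-client type.
type pool struct {
	minConnections uint16
	create         func() (string, error) // dials one connection
	conns          []string
}

// warmUp tries to open minConnections connections but returns at the
// first failure, so an unreachable node costs one timeout instead of
// one timeout per connection slot.
func (p *pool) warmUp() (attempts int, err error) {
	for i := uint16(0); i < p.minConnections; i++ {
		attempts++
		conn, cerr := p.create()
		if cerr != nil {
			return attempts, cerr
		}
		p.conns = append(p.conns, conn)
	}
	return attempts, nil
}

func main() {
	down := &pool{
		minConnections: 10,
		create:         func() (string, error) { return "", errDown },
	}
	attempts, err := down.warmUp()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

With a 30s timeout and 10 minimum connections, a loop shaped like this would block for roughly 30s on a dead node rather than 5 minutes. The remaining connections could then be created lazily, or once the node is known to be healthy again.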
After that, whether or not the connections were created, the node doesn't seem to be marked as failing (its state is set to running, and in the eyes of the cluster it is just another node ready to process requests). That takes me to the second issue. The cluster is ready to process requests, so I send a couple of them to it.
The first one goes to the healthy node and works ok, but the second one is sent to the down node, which blocks for another 30s (ConnectTimeout again, triggered by the connection manager trying to create a new connection because it couldn't get one from the pool).
This second failed attempt to create a connection triggers the health checker (link below), after which everything works as expected, since nodes seem to be skipped for request execution while they are in the nodeHealthChecking state.
riak-go-client/node.go, lines 159 to 164 in 315755e:

    conn, err := n.cm.get()
    if err != nil {
        logErr("[Node]", err)
        n.doHealthCheck()
        return false, err
    }
So it would be great if the health checker could be launched as soon as a connection failure is noticed during cluster initialisation. Combined with a smaller ConnectTimeout, I think that would solve both issues.
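As a rough illustration of what I mean, the sketch below models a node that moves straight to a health-checking state when its first dial fails at startup, so later requests skip it instead of paying the timeout again. Everything here is hypothetical (`node`, `start`, `execute`, `firstHealthy`, the two state constants); it is a toy model of the behaviour, not the library's actual code, and the background health check that would eventually return the node to running is left out.

```go
package main

import (
	"errors"
	"fmt"
)

type state int

const (
	running        state = iota // node accepts requests
	healthChecking              // node is skipped until it recovers
)

// errUnreachable simulates a dial that failed after ConnectTimeout.
var errUnreachable = errors.New("node unreachable")

// node is a hypothetical model, not the riak-go-client Node type.
type node struct {
	name string
	st   state
	dial func() error // stands in for creating a connection
}

// start dials once. On failure the node goes directly to healthChecking
// instead of being reported as running.
func (n *node) start() {
	if err := n.dial(); err != nil {
		n.st = healthChecking
		return
	}
	n.st = running
}

// execute reports whether this node handled the request. Nodes that are
// not running are bypassed without dialing.
func (n *node) execute() (bool, error) {
	if n.st != running {
		return false, nil
	}
	if err := n.dial(); err != nil {
		n.st = healthChecking
		return false, err
	}
	return true, nil
}

// firstHealthy returns the name of the first node that handles the
// request, or "" if none does.
func firstHealthy(nodes []*node) string {
	for _, n := range nodes {
		if ok, _ := n.execute(); ok {
			return n.name
		}
	}
	return ""
}

func main() {
	dials := 0
	down := &node{name: "down", dial: func() error { dials++; return errUnreachable }}
	up := &node{name: "up", dial: func() error { return nil }}
	nodes := []*node{down, up}
	for _, n := range nodes {
		n.start()
	}
	fmt.Println("handled by:", firstHealthy(nodes), "dials to down node:", dials)
}
```

In this model the down node is dialed only once, at startup, and no request ever blocks on it afterwards.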
Thank you very much in advance,
Best.-