fix(gossipsub): remove closed outbound streams from the registry#3531
Conversation
A closed outbound stream was left in `streamsOutbound`, which blocked every further outbound message to that peer for the life of the connection. Remove it when it closes so a new stream can be created.
Reconcile outbound streams each heartbeat so a connected peer regains an outbound stream after the previous one closed on a still-open connection.
|
Going to remove the heartbeat reconnect. Feeling its too aggressive. Could be good to have in the future. |
…tbeat" This reverts commit 5711dda.
|
I am curious if there will be any ongoing work related to the scope that was not performed in this PR, "Scope: this restores the stream when the peer reconnects, which is the reported case. Re-establishing a stream that closes while its connection stays open, where the remote never re-opens, is left as a follow-up." I was working on a patch for gossipsub not handling the stream close event and having a dead stream in the list and found your fix for that which i have incorporated. I was also working toward not long handling the error but actively re-establishing the stream or disconnecting the stream. I am not sure waiting for he peer to reconnect should be the only strategy but I am not sure about that. Could you elaborate on any future plans to work on active re-establishment of the failed stream? |
|
Hey @larrype8, I'm not convinced active re-dialing in gossipsub is necessary yet. Most of the problem this PR is solving is a case where a device goes offline, comes back sometime later, and then tries to dial the same gossipsub peer again (so we know the gossipsub peer is already being dialed). But if you experience a problem in implementation, please do create an issue. Thanks |
|
Thanks for the reply. The main problem we experienced was the one you fixed in this PR. I will monitor it and log an issue if we experience a problem getting the peer reconnected with the exact condition because I am also not sure how it will behave with this fix in place. Thanks again. |
Description
When an outbound stream closed (reset, transport drop, remote close), the closed stream was left in
streamsOutbound. ThecreateOutboundStreamguard then short-circuited on the stale entry, so when the peer reconnected no new outbound stream was created and every further outbound message to that peer failed.Remove an outbound stream from the registry when it closes, so a new one is created the next time the peer (re)connects.
Related #2771.
References ChainSafe/js-libp2p-gossipsub#534, which fixes the same issue in the pre-migration repository.
Notes & open questions
Scope: this restores the stream when the peer reconnects, which is the reported case. Re-establishing a stream that closes while its connection stays open, where the remote never re-opens, is left as a follow-up.
Change checklist