CBL-8071: Do not retry replicator connections if WebSocket being closed abnormally#2475
Conversation
…ed abnormally The Abnormal Close occurs after the client is already decided to close the socket but fails to receive response from the server. We don't want to retry replicating on this condition. We fix it by a change in WebSocket, that it now treats a connection closure that is not fully complete only due to a pending final server acknowledgement (i.e. CLOSE frame was already sent).
|
This is a release branch and commits are restricted. Please confirm this PR is one of the following:
|
|
Code Coverage Results:
|
There was a problem hiding this comment.
Pull request overview
Adjusts WebSocket close-status mapping to avoid treating an in-progress client-initiated close (CLOSE sent, no peer ACK) as an “abnormal” close that would trigger replicator retry behavior.
Changes:
- Update
WebSocketImpl::onCloseto reportkCodeNormal(instead ofkCodeAbnormal) when the close was initiated by the client (_closeSent) but not fully acknowledged (!_closeReceived) and the underlying disconnect is otherwise considered clean. - Add a replicator API test covering “peer going away” both before CLOSE is sent and after CLOSE is sent.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Replicator/tests/ReplicatorAPITest.cc | Adds a regression test for peer “going away” behavior before/after a client-initiated close. |
| Networking/WebSockets/WebSocketImpl.cc | Changes close code mapping to avoid reporting abnormal close when CLOSE was already sent by the client. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // if (!expected) then _closeSent => !_closeReceived | ||
| if ( !expected ) status.code = _closeSent ? kCodeNormal : kCodeAbnormal; | ||
| else if ( !_closeMessage ) |
There was a problem hiding this comment.
The inline comment is logically incorrect: !expected does not imply _closeSent => !_closeReceived (it can also mean neither CLOSE was sent nor received). Consider updating the comment and making the intent explicit in code (e.g., branch specifically on _closeSent && !_closeReceived for the "client already initiated close" case, otherwise keep abnormal).
| // if (!expected) then _closeSent => !_closeReceived | |
| if ( !expected ) status.code = _closeSent ? kCodeNormal : kCodeAbnormal; | |
| else if ( !_closeMessage ) | |
| if ( !expected ) { | |
| // Treat only the "client already initiated close, but no CLOSE was received" | |
| // case as normal; all other unexpected close states remain abnormal. | |
| status.code = (_closeSent && !_closeReceived) ? kCodeNormal : kCodeAbnormal; | |
| } else if ( !_closeMessage ) |
There was a problem hiding this comment.
I agreed with the copilot comment that the comment is a bit confusing. I have also mistakenly think that it's an obsolute code comment that got left behind.
| // Give some time for Replicator::_stop to be called, but before timeout in WebSocketImpl | ||
| // to not get Timeout error. | ||
| std::this_thread::sleep_for(1s); | ||
| // WebSocket will treat it as Normal Close | ||
| c4socket_closed(c4socket, {WebSocketDomain, websocket::kCodeGoingAway}); |
There was a problem hiding this comment.
The test relies on a fixed sleep_for(1s) to ensure the CLOSE frame has been sent before calling c4socket_closed. This is timing-dependent and can be flaky on slow/loaded CI machines. Prefer synchronizing on an observable event (e.g., have factory.write signal when it sees the CLOSE write after c4repl_stop, then trigger c4socket_closed), or poll with a bounded timeout until the expected write happens.
There was a problem hiding this comment.
I could agree with this comment. For a quick mitigate, I think we could bump the timeout a little bit. However, if we are not seeing the test to be flaky anywhere, we can leave it as is.
| // Give some time for Replicator::_stop to be called, but before timeout in WebSocketImpl | ||
| // to not get Timeout error. | ||
| std::this_thread::sleep_for(1s); | ||
| // WebSocket will treat it as Normal Close | ||
| c4socket_closed(c4socket, {WebSocketDomain, websocket::kCodeGoingAway}); |
There was a problem hiding this comment.
I could agree with this comment. For a quick mitigate, I think we could bump the timeout a little bit. However, if we are not seeing the test to be flaky anywhere, we can leave it as is.
| // if (!expected) then _closeSent => !_closeReceived | ||
| if ( !expected ) status.code = _closeSent ? kCodeNormal : kCodeAbnormal; | ||
| else if ( !_closeMessage ) |
There was a problem hiding this comment.
I agreed with the copilot comment that the comment is a bit confusing. I have also mistakenly think that it's an obsolute code comment that got left behind.
|
This is a release branch and commits are restricted. Please confirm this PR is one of the following:
|
The Abnormal Close occurs after the client is already decided to close the socket but fails to receive response from the server. We don't want to retry replicating on this condition.
We fix it by a change in WebSocket, that it now treats a connection closure that is not fully complete only due to a pending final server acknowledgement (i.e. CLOSE frame was already sent).