Send multiple pings during large downloads, resetting self._ping_t#923
Conversation
|
After reading the MQTT specification, the part relating to the Keep Alive, this is still in-spec IMO. As you've stated, the spec says "SHOULD", and this is a valid reason to skip this recommendation, I think. I don't feel like I'm the most qualified guy to make the final decision, though. |
PierreF
left a comment
There was a problem hiding this comment.
I've re-read the MQTT spec and agree with your conclusion:
- The client must send Ping request every keep-alive interval if not sending MQTT packet (we already did this, and continue to do this)
- The client should disconnect after reasonable delay after sending pingreq if pingresp isn't received.
- If data is actively receive, I do believe the reasonable delay could be multiple minutes
Maybe a infinite delay is a bit too much (well depend on how slow the download is, but at 1400 bytes every 10 seconds, 255 MiB is 22 days ;)), but I think have a secondary delay (ping non-responded pingresp) and secondary settings (max keep-alive delay during large download) feels like too complex but this corner case.
| on: | ||
| pull_request: | ||
| branches: [master] | ||
| branches: [master, Do_not_self_disconnect_during_large_download] |
There was a problem hiding this comment.
| branches: [master, Do_not_self_disconnect_during_large_download] | |
| branches: [master] |
(apply to other similar changes)
I'm not sure what the intend.
- it's shouldn't work (i.e. shouldn't change CI behavior on this PR), or else there is a security issue with the CI
- it's shouldn't be necessary to run CI jobs on PR. AFAIK they run because this is a PR targeting the branch master (and it's not your very first contribution)
- If this change is really needed, we can't hard-code the same of a branch
There was a problem hiding this comment.
Apologies. I should've reverted that branch name. Is it possible to relax the branches: [master] everywhere, so we can work on other branches on our forks?
|
I think following test should pass but don't... it might a bit a "test" issue, but unsure. Your PR rely on I do think we should review a bit when The test I've try to write (I hope it's correct, but can't guarantee): |
4d2d15b to
6d46856
Compare
Don't disconnect. Send multiple pings during large downloads. Only disconn if _last_mg_in outside keep alive Don't close socket if _last_mg_in was within keep alive (don't rely on _ping_t only) Send out pingreq, do not disconnect, if downloading, or pingreq overdue. REvert loop misc but make it only disconnect on _last_msg_in, not _ping_t Run workflows on this branch This branch, not the other Disconnect if no ping response, but if downloading ping again, resetting _ping_t Run workflows on this branch Only re-ping if we are downloading
6d46856 to
731607f
Compare
Hi everyone, I'm very interested, but am not too invested in this PR - I'm keen to hear everyone's thoughts. All feedback is welcome. First and foremost, I think the intention of the spec to fix this problem (MQTT being a Lightweight messaging protocol) is that clients should pick a large enough keep alive interval, to cover the size of downloads they expect to receive on topics they're subscribed to. However subscribers don't control what others publish, and a broker may require a shorter keepalive than the client requested.
From local testing, on a mini slow network connection (between my dev laptop and a Beaglebone Black over a USB cable in my case), I can't speak for @Uezzi but I've found this PR fixes issue #914.
There are two separate checks that must be passed in
loop_miscto avoid the client disconnecting itself (calling self._sock_close).It looks to me like this second one is a straightforward implementation of "If a Client does not receive a PINGRESP packet within a reasonable amount of time after it has sent a PINGREQ, it SHOULD close the Network Connection to the Server." from the spec.
Within the first check however (you may have already spotted it) is an edge case that forms a grey area - what should the client do if a ping req has gone out, no other packets have gone out since during the keep alive interval, but in that time it's been happily downloading the partial msg packets that form another packet.
The server / broker may well have sent a PINGRESP, but currently on a slow network, for large downloads and short keep alive intervals, it is possible that the client does not read it in time, to avoid either self disconnection mechanism above. The connection may be fine, but the PINGRESP may simply be stuck behind too many unread partial packets of a large payload. About 100MB to 250MB can recreate precisely this situation with a 30 second keep alive.
In this situation, before the large, slow download completes, a Paho.mqtt.Python client will disconnect itself. Regardless of the keepalive values they set and/or received, some users many find this astonishing. The spec (v5) does state:
The essence of this PR is to change the definition of "reasonable amount of time" from the keep alive interval, to "the maximum of the keep alive interval and however much time is needed to complete the current download" The maximum payload size is ~ 255MB, and as long as the network connection can pass traffic at an average speed v > 0, there is a finite upper bound of ~255 / v. for this time. Furthermore, it is likely not an unreasonable amount of time, in relation to the expectations and experience of users of slow networks.
This PR fixes this simply by altering current behaviour, to ping again, even if a prior ping request has not received a ping response. This also resets
self._ping_tavoiding the second check.The two new situations when (if this PR is accepted) we would now ping again, even when no ping response has been received since the most recent ping req: