Skip to content

Send multiple pings during large downloads, resetting self._ping_t#923

Open
JamesParrott wants to merge 1 commit into
eclipse-paho:masterfrom
JamesParrott:Do_not_self_disconnect_during_large_download
Open

Send multiple pings during large downloads, resetting self._ping_t#923
JamesParrott wants to merge 1 commit into
eclipse-paho:masterfrom
JamesParrott:Do_not_self_disconnect_during_large_download

Conversation

@JamesParrott

@JamesParrott JamesParrott commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Hi everyone, I'm very interested, but am not too invested in this PR - I'm keen to hear everyone's thoughts. All feedback is welcome. First and foremost, I think the intention of the spec to fix this problem (MQTT being a Lightweight messaging protocol) is that clients should pick a large enough keep alive interval, to cover the size of downloads they expect to receive on topics they're subscribed to. However subscribers don't control what others publish, and a broker may require a shorter keepalive than the client requested.

From local testing, on a mini slow network connection (between my dev laptop and a Beaglebone Black over a USB cable in my case), I can't speak for @Uezzi but I've found this PR fixes issue #914.

There are two separate checks that must be passed in loop_misc to avoid the client disconnecting itself (calling self._sock_close).

  • 1), if we are connected, but pingreq(s) have gone out and a ping response has not been received after the most recent one left (self._ping_t == 0), and either no other msg packets have been sent within the keep alive, or none have been received during the keep alive interval. If a pingreq has not gone out recently, this check also sends one.
  • 2) if a Ping Response has not been received within the keep alive interval of the last Ping request going out.

It looks to me like this second one is a straightforward implementation of "If a Client does not receive a PINGRESP packet within a reasonable amount of time after it has sent a PINGREQ, it SHOULD close the Network Connection to the Server." from the spec.

Within the first check however (you may have already spotted it) is an edge case that forms a grey area - what should the client do if a ping req has gone out, no other packets have gone out since during the keep alive interval, but in that time it's been happily downloading the partial msg packets that form another packet.

The server / broker may well have sent a PINGRESP, but currently on a slow network, for large downloads and short keep alive intervals, it is possible that the client does not read it in time, to avoid either self disconnection mechanism above. The connection may be fine, but the PINGRESP may simply be stuck behind too many unread partial packets of a large payload. About 100MB to 250MB can recreate precisely this situation with a 30 second keep alive.

In this situation, before the large, slow download completes, a Paho.mqtt.Python client will disconnect itself. Regardless of the keepalive values they set and/or received, some users many find this astonishing. The spec (v5) does state:

If a Client does not receive a PINGRESP packet within a reasonable amount of time after it has sent a PINGREQ, it SHOULD close the Network Connection to the Server.

The essence of this PR is to change the definition of "reasonable amount of time" from the keep alive interval, to "the maximum of the keep alive interval and however much time is needed to complete the current download" The maximum payload size is ~ 255MB, and as long as the network connection can pass traffic at an average speed v > 0, there is a finite upper bound of ~255 / v. for this time. Furthermore, it is likely not an unreasonable amount of time, in relation to the expectations and experience of users of slow networks.

This PR fixes this simply by altering current behaviour, to ping again, even if a prior ping request has not received a ping response. This also resets self._ping_t avoiding the second check.
The two new situations when (if this PR is accepted) we would now ping again, even when no ping response has been received since the most recent ping req:

  • i) If the last message out was too long ago (beyond keep alive), but the Client knows other packets have been received. In this case we ping again to avoid the server kicking the client for inactivity, resetting self._ping_t. The check whether the Client should disconnect due to no ping resp being received for the previous pingreq is left to 2) with the latest value of self._ping_t, once no packets have been received for longer than self._keepalive.
  • ii) if the last packet in was too long ago (beyond keep alive). A ping resp is a normal packet, so has not been received recently enough either. But this case will also be (or would have already been) handled by 2).

@pswsm

pswsm commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

After reading the MQTT specification, the part relating to the Keep Alive, this is still in-spec IMO. As you've stated, the spec says "SHOULD", and this is a valid reason to skip this recommendation, I think. I don't feel like I'm the most qualified guy to make the final decision, though.

@PierreF PierreF left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-read the MQTT spec and agree with your conclusion:

  • The client must send Ping request every keep-alive interval if not sending MQTT packet (we already did this, and continue to do this)
  • The client should disconnect after reasonable delay after sending pingreq if pingresp isn't received.
    • If data is actively receive, I do believe the reasonable delay could be multiple minutes

Maybe a infinite delay is a bit too much (well depend on how slow the download is, but at 1400 bytes every 10 seconds, 255 MiB is 22 days ;)), but I think have a secondary delay (ping non-responded pingresp) and secondary settings (max keep-alive delay during large download) feels like too complex but this corner case.

Comment thread .github/workflows/build.yml Outdated
on:
pull_request:
branches: [master]
branches: [master, Do_not_self_disconnect_during_large_download]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
branches: [master, Do_not_self_disconnect_during_large_download]
branches: [master]

(apply to other similar changes)

I'm not sure what the intend.

  1. it's shouldn't work (i.e. shouldn't change CI behavior on this PR), or else there is a security issue with the CI
  2. it's shouldn't be necessary to run CI jobs on PR. AFAIK they run because this is a PR targeting the branch master (and it's not your very first contribution)
  3. If this change is really needed, we can't hard-code the same of a branch

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies. I should've reverted that branch name. Is it possible to relax the branches: [master] everywhere, so we can work on other branches on our forks?

@PierreF

PierreF commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

I think following test should pass but don't... it might a bit a "test" issue, but unsure.

Your PR rely on _last_msg_in being updated "soon" after any bytes received on the socket, however during large download, it only get updated once 100 retry were done, and its much more likely to reach the BlockingIO error (only way to not reach it, is the network send data fast enough to always fill the socket buffer, but slow enough to hit the download-too-large issue).

I do think we should review a bit when _last_msg_in is updated... and maybe renamed it (or add new tracker) _last_bytes_received. It would be more clear that is means "any bytes received from network". _last_msg_in sound more like "last MQTT packet received" - which is not what we need for this PR.

The test I've try to write (I hope it's correct, but can't guarantee):

class TestLargeDownload:
    def test_no_disconnect_during_large_download(self, fake_broker: FakeBroker) -> None:
        mqttc = client.Client(
            CallbackAPIVersion.VERSION2,
            "test_no_disconnect_during_large_download",
            transport=fake_broker.transport,
        )

        payload_data = b"very Very Big big payload"
        is_connected = threading.Event()
        is_subscribed = threading.Event()
        publish_received = threading.Event()

        def on_connect(mqttc, obj, flags, rc, properties):
            assert rc == 0
            is_connected.set()

        def on_subscribe(cl, userdata, mid, reason_code_list, properties):
            is_subscribed.set()
        
        def on_message(cl, userdata, message):
            assert isinstance(cl, client.Client)
            assert isinstance(message, client.MQTTMessage)
            assert message.payload == payload_data
            publish_received.set()

        mqttc.on_connect = on_connect
        mqttc.on_subscribe = on_subscribe
        mqttc.on_message = on_message

        mqttc.enable_logger()
        mqttc.connect("localhost", fake_broker.port, keepalive=1)
        mqttc.loop_start()

        try:
            fake_broker.start()

            connect_packet = paho_test.gen_connect(
                "test_no_disconnect_during_large_download", keepalive=1)
            fake_broker.expect_packet("connect", connect_packet)

            connack_packet = paho_test.gen_connack(rc=0)
            count = fake_broker.send_packet(connack_packet)
            assert count == len(connack_packet)

            is_connected.wait(timeout=5)
            rc, mid = mqttc.subscribe("topic")
            assert rc == MQTTErrorCode.MQTT_ERR_SUCCESS
            assert mid is not None

            subscribe_packet = paho_test.gen_subscribe(mid, "topic", 0)
            fake_broker.expect_packet("subscribe", subscribe_packet)

            suback_packet = paho_test.gen_suback(2, 0)
            count = fake_broker.send_packet(suback_packet)
            assert count == len(suback_packet)

            is_subscribed.wait(timeout=5)

            publish_packet = paho_test.gen_publish("topic", 0, payload_data, mid=3)
            assert len(publish_packet) >= 32
            publish_packet_1 = publish_packet[:5] # Enough to cover the packet type
            publish_packet_2 = publish_packet[5:10]
            publish_packet_3 = publish_packet[10:15]
            publish_packet_4 = publish_packet[15:20]
            publish_packet_5 = publish_packet[20:25]
            publish_packet_6 = publish_packet[25:]

            count = fake_broker.send_packet(publish_packet_1)
            assert count == len(publish_packet_1)
            time.sleep(0.8)

            count = fake_broker.send_packet(publish_packet_2)
            assert count == len(publish_packet_2)
            time.sleep(0.8)

            # 1.6 seconds elapsed, client MUST have sent pingreq
            ping_req_packet = paho_test.gen_pingreq()
            fake_broker.expect_packet("pingreq", ping_req_packet)

            count = fake_broker.send_packet(publish_packet_3)
            assert count == len(publish_packet_3)
            time.sleep(0.8)

            count = fake_broker.send_packet(publish_packet_4)
            assert count == len(publish_packet_4)
            time.sleep(0.8)

            # Another 1.6 seconds elapsed, client MUST have sent another pingreq
            ping_req_packet = paho_test.gen_pingreq()
            fake_broker.expect_packet("2nd pingreq", ping_req_packet)

            count = fake_broker.send_packet(publish_packet_5)
            assert count == len(publish_packet_5)
            time.sleep(0.8)

            count = fake_broker.send_packet(publish_packet_6)
            assert count == len(publish_packet_6)

            publish_received.wait(timeout=5)
        finally:
            mqttc.loop_stop()
            fake_broker.finish()

@JamesParrott JamesParrott force-pushed the Do_not_self_disconnect_during_large_download branch from 4d2d15b to 6d46856 Compare June 17, 2026 09:09
Don't disconnect.  Send multiple pings during large downloads.  Only disconn if _last_mg_in outside keep alive

Don't close socket if _last_mg_in was within keep alive (don't rely on _ping_t only)

Send out pingreq, do not disconnect, if downloading, or pingreq overdue.

REvert loop misc but make it only disconnect on _last_msg_in, not _ping_t

Run workflows on this branch

This branch, not the other

Disconnect if no ping response, but if downloading ping again, resetting _ping_t

Run workflows on this branch

Only re-ping if we are downloading
@JamesParrott JamesParrott force-pushed the Do_not_self_disconnect_during_large_download branch from 6d46856 to 731607f Compare June 17, 2026 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants