fix(devmgr): auto-reconnect stale LAN MQTT and recover startup race #10767
adele-with-a-b wants to merge 1 commit into
The macOS / Linux / Windows CI failures here aren't caused by the changes in this PR. The diff under The failure is in The umbrella fix is already filed as #10717 ( The changes in this PR are confined to |
Update on the CI failures: the Linux failures ( The same error appears on every open PR's Linux build for the past 2+ weeks — see #10712's CI for an unrelated example. So the full CI picture for this PR's diff is: macOS and Windows red on the Assimp/zlib |
LAN-mode-only printers (no Bambu cloud login) had two related failure modes that left the user manually re-selecting the printer to make Studio talk to it again. This commit fixes both -- they share the same root surface (DeviceManager + TryLoadLastMachine), affect the same users, and reproduce on the same workflow, so they ship together.

== Stale MQTT socket recovery (DeviceManagerRefresher::on_timer) ==

After idling, macOS App Nap, the local network stack, or the printer can silently drop the MQTT-over-TLS TCP session. The next publish_gcode() returns BAMBU_NETWORK_ERR_SEND_MSG_FAILED (-4) and the user has to manually re-select the printer to trigger the disconnect+reconnect path in DeviceManager::set_selected_machine.

on_timer's existing keep_alive() / refresh_connection() calls are gated on is_user_login() and never run for LAN-only users. Add a parallel branch that fires when all of these hold:

- obj->is_lan_mode_printer() && obj->has_access_right()
- obj->is_avaliable() (bind_state == "free")
- !obj->is_in_printing() (don't clobber print UI mid-print)
- !obj->is_connected() (last MQTT push older than 30s)

When the gate passes, re-select the same machine id. That triggers the same-id-LAN branch in set_selected_machine which runs disconnect_printer -> reset -> connect -- the same path the manual workaround takes. Throttled to one attempt per 10s, bumped only on a successful set_selected_machine so a transient false return doesn't delay the next chance to recover.

== Startup-race recovery (TryLoadLastMachine via SSDP) ==

TryLoadLastMachine::InnerLoad fires within milliseconds of app start, before SSDP has announced the printer's current IP. If the cached user_access_dev_ip is stale (slicer_uuid rotated since pairing, or DHCP gave the printer a new IP), bind_detect returns -2 immediately, erases user_access_dev_ip, and bails. The cloud fallback also fails because the LAN printer isn't in the list yet. By the time SSDP populates localMachineList ~1-3s later, no further InnerLoad retry runs and the printer is discovered-but-not-selected (see upstream issue bambulab#9445).

Add GUI_App::try_load_last_machine_on_alive(dev_id) and call it from DeviceManager::on_machine_alive whenever an SSDP packet announces a previously-paired printer. The retry's InnerLoad finds user_access_dev_ip empty (the failed first attempt erased it) and falls through to the dev->get_my_machine non-null branch in GUI_App.cpp, which calls set_selected_machine directly. No second bind_detect is spawned. The method self-filters on dev_id == get_user_last_machine() and no-ops if a machine is already selected, so per-SSDP-packet invocation is cheap.

== Test plan ==

Stale-MQTT: idle the app >30min on macOS with a LAN-only printer. Without patch, Send to Printer returns -4; with patch, the next 1Hz refresher tick auto-reconnects and the send succeeds. Verify the log line "LAN auto-reconnect: stale MQTT socket detected for dev_id=...".

Startup race: with a stale user_access_dev_ip (rotate slicer_uuid in BambuStudio.conf, or power-cycle the router so SSDP is delayed), launch Studio. Without patch the printer is never selected; with patch the SSDP packet triggers a retry and selection succeeds. Verify "try_load_last_machine_on_alive: SSDP-triggered retry for ...".

== Limitations ==

- After studio inactivity (>15min), the first user action after wake may still see one failed send before stale-MQTT auto-reconnect runs on the next refresher tick. A subsequent send within 1s succeeds.
- The 10s throttle is a function-static, not per-dev-id, so in multi-printer households a recent reconnect attempt on printer A can delay the next attempt on printer B by up to 10s.
- TryLoadLastMachine's destructor joins local_bind_thread; an SSDP-triggered InnerLoad firing during app shutdown can stall the join 1-3s waiting for the bind_detect timeout.

Addresses upstream issue bambulab#9445.
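The two gates described in the commit message can be modelled in isolation. The following is a minimal sketch, not the code from the diff: `PrinterState`, `stale_lan_gate`, `ReconnectThrottle`, and `should_retry_on_alive` are hypothetical names, the boolean fields only stand in for the predicates the message lists, and the `reselect` callback stands in for `set_selected_machine`. It assumes the throttle timestamp is bumped only after a successful re-select, as the message states.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>

// Hypothetical snapshot of the printer state that the gate reads. The fields
// mirror the predicates named in the commit message; this is not MachineObject.
struct PrinterState {
    std::string dev_id;
    bool lan_mode   = false;  // is_lan_mode_printer()
    bool has_access = false;  // has_access_right()
    bool available  = false;  // is_avaliable(), i.e. bind_state == "free"
    bool printing   = false;  // is_in_printing()
    bool connected  = false;  // is_connected(): an MQTT push arrived within 30 s
};

// True when every condition listed for the stale-MQTT branch holds.
inline bool stale_lan_gate(const PrinterState& p) {
    return p.lan_mode && p.has_access && p.available && !p.printing && !p.connected;
}

// Allows at most one successful re-select per min_gap.
class ReconnectThrottle {
public:
    using Clock = std::chrono::steady_clock;
    // Stand-in for set_selected_machine(dev_id); returns true on success.
    using Reselect = std::function<bool(const std::string&)>;

    explicit ReconnectThrottle(std::chrono::seconds min_gap) : min_gap_(min_gap) {}

    // Called once per refresher tick. Returns true only when a re-select ran
    // and succeeded.
    bool tick(const PrinterState& p, Clock::time_point now, const Reselect& reselect) {
        if (!stale_lan_gate(p)) return false;
        if (has_last_ && now - last_ < min_gap_) return false;  // still throttled
        if (!reselect(p.dev_id)) return false;  // failed attempt: timestamp not bumped
        last_ = now;
        has_last_ = true;
        return true;
    }

private:
    std::chrono::seconds min_gap_;
    Clock::time_point last_{};
    bool has_last_ = false;
};

// Self-filter for the SSDP-triggered retry: act only on the last-used printer,
// and only while nothing is selected yet.
inline bool should_retry_on_alive(const std::string& announced_dev_id,
                                  const std::string& last_machine,
                                  bool machine_already_selected) {
    return !machine_already_selected && !announced_dev_id.empty() &&
           announced_dev_id == last_machine;
}
```

In this model a caller would hold one `ReconnectThrottle(std::chrono::seconds(10))` and call `tick` from the 1Hz timer. Keeping one throttle per `dev_id` instead of a single shared one is one possible way to avoid the cross-printer delay listed under Limitations; the commit itself uses a single function-static.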
0c065da to 657b271
Hi @adele-with-a-b! The macOS CI failures in this PR are caused by a pre-existing infrastructure issue unrelated to your changes. Root cause: Assimp 5.4.3's bundled zlib defines I've submitted a fix in #10849 that disables the bundled zlib on macOS (system zlib is always available there) while keeping it enabled on Windows. Once that merges, this PR should pass CI cleanly. |
… 15+

Assimp 5.4.3's bundled zlib defines `#define fdopen(fd,mode) NULL` in contrib/zlib/zutil.h:147. On macOS 15+ the SDK's `_stdio.h` uses `__DARWIN_ALIAS(fdopen)`, which expands through that macro and causes a C preprocessor parse error, breaking the "Build Deps" CI step on both arm64 and x86_64.

Fix: only build the bundled zlib on Windows (where it is needed); on macOS and Linux use the system-provided zlib instead.

Fixes the CI failures in bambulab#10767.

Co-Authored-By: Abdel Gomez-Perez <nabdel07@icloud.com>
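The general mechanism, a function-like macro rewriting a later declaration that reuses its name, can be observed with stringification. This is a small sketch under stated assumptions: `demo_fdopen`, `DEMO_NULL`, and `DemoFile` are hypothetical stand-ins chosen so that no real header is affected, and the sketch does not reproduce the macOS SDK's actual `_stdio.h` expansion.

```cpp
#include <cassert>
#include <string>

// Two-level stringification: the outer macro expands its argument first,
// the inner one turns the result into a string literal.
#define STRINGIFY_RAW(x) #x
#define STRINGIFY(x) STRINGIFY_RAW(x)

// Hypothetical stand-in for zutil.h's `#define fdopen(fd,mode) NULL`.
#define demo_fdopen(fd, mode) DEMO_NULL

// A declaration-shaped token sequence: the name followed by a two-element
// parenthesised list is treated as a macro invocation and replaced.
inline std::string expanded_declaration() {
    return STRINGIFY(DemoFile *demo_fdopen(int, const char *));
}

// A bare name without an argument list is left untouched, because
// function-like macros expand only when followed by '('.
inline std::string bare_name() {
    return STRINGIFY(demo_fdopen);
}
```

Once the declarator has been replaced, what is left is no longer a valid function declaration, which is consistent with the build error the commit message describes.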
Summary
LAN-only users (no Bambu cloud account) routinely hit two reconnection failures that leave the printer effectively unusable until the user manually clicks the printer in the Devices tab. This PR fixes both with a single bundled commit, since they share the same code surface (`DeviceManager` + `TryLoadLastMachine`), affect the same users, reproduce on the same workflow, and interact at runtime.

Closes #9445 (still actively reported in May 2026 on macOS; see comment from harmstorf).
Failure 1: Stale MQTT socket after idle

Repro: Open BambuStudio, confirm the printer is connected. Walk away for 30+ minutes. Come back, click Send to Printer. Result: `BAMBU_NETWORK_ERR_SEND_MSG_FAILED` (`-4`).

Root cause: macOS App Nap, the network stack, or the printer-side keepalive can silently tear down the MQTT-over-TLS socket while the app is idle. The next `publish_gcode()` writes to a dead socket. The 1Hz `DeviceManagerRefresher::on_timer` exists to keep MQTT alive, but its `check_pushing()` / `refresh_connection()` calls are gated on `is_user_login()`. For LAN-only users that flag is never set true, so keep-alive never runs.

Fix: Add a parallel branch in `on_timer` that detects the stale-LAN-MQTT condition (`is_lan_mode_printer && has_access_right && is_studio_active && !is_connected`) and, when all four hold, re-selects the same machine ID. That triggers the same-id-LAN branch in `set_selected_machine` (`DevManager.cpp:478-492`), which runs `disconnect_printer → reset → connect`, the exact path the manual workaround takes. Throttled to one attempt per 10 s to avoid hammering on persistent failures (powered-off printer).

Failure 2: Stale `user_access_dev_ip` on cold start

Repro: Restart BambuStudio. Result: printer doesn't show as selected, even though it's on the network and was bound before.

Root cause: `TryLoadLastMachine::InnerLoad` runs at startup (right after `agent->start()`) and tries `bind_detect(decoded_user_access_dev_ip, ...)`. If the cached IP is stale (slicer_uuid was rotated since pairing so the encoded IP can't decode, OR the printer's DHCP IP changed), `bind_detect` fails immediately with `-2`, erases `user_access_dev_ip`, and bails. The cloud fallback then runs but also fails because the LAN printer isn't in `localMachineList` yet. SSDP populates the local list 1-3 seconds later, but no further `InnerLoad` retry happens. The printer sits in `localMachineList` but is never selected.

Fix: Add `GUI_App::try_load_last_machine_on_alive(dev_id)` and call it from `DeviceManager::on_machine_alive` whenever an SSDP packet announces a previously-paired printer. The method self-filters on `dev_id == get_user_last_machine()` and no-ops if a machine is already selected, so calling it on every SSDP packet is cheap. By the time SSDP arrives, the printer is in `localMachineList`, so the second `InnerLoad`'s `set_selected_machine` succeeds via the same-id-LAN branch.

Test plan
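Stale-MQTT recovery:

- Idle the app for 30+ minutes on macOS with a LAN-only printer, then click Send to Printer. Without patch: the send returns `-4`. With patch: the next 1Hz refresher tick auto-reconnects (within ~1 s) and the send succeeds.
- Verify the log line `LAN auto-reconnect: stale MQTT socket detected for dev_id=...`.

Startup-race recovery:

- Rotate `slicer_uuid` in `BambuStudio.conf` to invalidate the encoded `user_access_dev_ip`, OR power-cycle the router so SSDP is delayed at next launch.
- Launch Studio. Without patch: the printer is never selected. With patch: the SSDP packet triggers an `InnerLoad` retry; the printer is selected automatically.
- Verify the log lines `try_load_last_machine_on_alive: SSDP-triggered retry for ...` and `set_selected_machine: select new lan machine ...`.

Limitations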
After studio inactivity (>15 min), the first user action after wake may still see one failed send before stale-MQTT auto-reconnect runs on the next refresher tick. A subsequent send within 1 s succeeds.
Tested on