fix(devmgr): auto-reconnect stale LAN MQTT and recover startup race #10767

Open
adele-with-a-b wants to merge 1 commit into bambulab:master from adele-with-a-b:fix/lan-stale-mqtt

@adele-with-a-b
Contributor

@adele-with-a-b adele-with-a-b commented May 18, 2026

Summary

LAN-only users (no Bambu cloud account) routinely hit two reconnection failures that leave the printer effectively unusable until the user manually clicks the printer in the Devices tab. This PR fixes both with a single bundled commit, since they share the same code surface (DeviceManager + TryLoadLastMachine), affect the same users, reproduce on the same workflow, and interact at runtime.

Closes #9445 (still actively reported in May 2026 on macOS — see comment from harmstorf).

Failure 1 — Stale MQTT socket after idle

Repro: Open BambuStudio, confirm the printer is connected. Walk away for 30+ minutes. Come back, click Send to Printer. Result: BAMBU_NETWORK_ERR_SEND_MSG_FAILED (-4).

Root cause: macOS App Nap, the network stack, or the printer-side keepalive can silently tear down the MQTT-over-TLS socket while the app is idle. The next publish_gcode() writes to a dead socket. The 1 Hz DeviceManagerRefresher::on_timer exists to keep MQTT alive, but its check_pushing() / refresh_connection() calls are gated on is_user_login(). For LAN-only users that flag is never set to true, so keep-alive never runs.

Fix: Add a parallel branch in on_timer that detects the stale-LAN-MQTT condition (is_lan_mode_printer && has_access_right && is_studio_active && !is_connected) and, when all four hold, re-selects the same machine ID. That triggers the same-id-LAN branch in set_selected_machine (DevManager.cpp:478-492), which runs disconnect_printer → reset → connect, the exact path the manual workaround takes. Attempts are throttled to one per 10 s to avoid hammering a printer that stays unreachable (for example, one that is powered off).
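For readers skimming the description, here is a minimal, self-contained sketch of the gate described above. It is not the code in this PR: `LanPrinterState`, `is_stale_lan_mqtt`, and `on_refresher_tick` are hypothetical names, the four booleans stand in for the real accessors, and the throttle is left out to keep the example short.

```cpp
#include <functional>
#include <string>

// Hypothetical snapshot of the conditions the description lists.
// These fields mirror the wording above; they are not the real MachineObject API.
struct LanPrinterState {
    std::string dev_id;
    bool is_lan_mode_printer;
    bool has_access_right;
    bool is_studio_active;
    bool is_connected;
};

// True only when all four conditions from the description hold.
inline bool is_stale_lan_mqtt(const LanPrinterState& s) {
    return s.is_lan_mode_printer && s.has_access_right &&
           s.is_studio_active && !s.is_connected;
}

// One timer tick: if the gate passes, ask the caller-supplied hook to
// re-select the same device id. Returns whether the hook ran and reported success.
inline bool on_refresher_tick(const LanPrinterState& s,
                              const std::function<bool(const std::string&)>& reselect) {
    if (!is_stale_lan_mqtt(s)) {
        return false;  // connected, not LAN, no access, or app inactive: do nothing
    }
    return reselect(s.dev_id);
}
```

Passing the re-select step in as a callback keeps the gate testable without a live printer; in the PR the equivalent step is the call into set_selected_machine.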

Failure 2 — Stale user_access_dev_ip on cold start

Repro: Restart BambuStudio. Result: printer doesn't show as selected, even though it's on the network and was bound before.

Root cause: TryLoadLastMachine::InnerLoad runs at startup (right after agent->start()) and tries bind_detect(decoded_user_access_dev_ip, ...). If the cached IP is stale — slicer_uuid was rotated since pairing so the encoded IP can't decode, OR the printer's DHCP IP changed — bind_detect fails immediately with -2, erases user_access_dev_ip, and bails. The cloud fallback then runs but also fails because the LAN printer isn't in localMachineList yet. SSDP populates the local list 1-3 seconds later, but no further InnerLoad retry happens. The printer sits in localMachineList but is never selected.

Fix: Add GUI_App::try_load_last_machine_on_alive(dev_id) and call it from DeviceManager::on_machine_alive whenever an SSDP packet announces a previously-paired printer. The method self-filters on dev_id == get_user_last_machine() and no-ops if a machine is already selected, so calling it on every SSDP packet is cheap. By the time SSDP arrives, the printer is in localMachineList, so the second InnerLoad's set_selected_machine succeeds via the same-id-LAN branch.
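The self-filtering behaviour can be illustrated with a small sketch. This is an assumption-laden model, not the PR's implementation: `AppState` and `should_retry_on_alive` are hypothetical names, and the two fields stand in for get_user_last_machine() and the "machine already selected" check.

```cpp
#include <string>

// Hypothetical view of the app state the filter consults.
struct AppState {
    std::string last_machine_id;  // stands in for get_user_last_machine()
    bool has_selected_machine;    // whether a machine is already selected
};

// Returns true only when an SSDP announcement should trigger a retry of the
// last-machine load. Cheap enough to evaluate on every SSDP packet.
inline bool should_retry_on_alive(const AppState& app,
                                  const std::string& announced_dev_id) {
    if (announced_dev_id.empty()) {
        return false;  // nothing to match against
    }
    if (app.has_selected_machine) {
        return false;  // no-op: a machine is already selected
    }
    return announced_dev_id == app.last_machine_id;  // only the last-used printer
}
```

Because both checks are plain comparisons, announcements from other printers on the LAN fall through without doing any work.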

Test plan

Stale-MQTT recovery:

  1. Pair a LAN-only printer, confirm connection in Devices tab.
  2. Idle the app for 30+ minutes (lock screen, close laptop lid).
  3. Wake the app. Click Send to Printer.
  4. Without patch: returns -4. With patch: the next 1Hz refresher tick auto-reconnects (within ~1 s) and the send succeeds.
  5. Verify log line LAN auto-reconnect: stale MQTT socket detected for dev_id=....

Startup-race recovery:

  1. Pair a LAN-only printer.
  2. Either rotate slicer_uuid in BambuStudio.conf to invalidate the encoded user_access_dev_ip, OR power-cycle the router so SSDP is delayed at next launch.
  3. Restart BambuStudio.
  4. Without patch: printer is in the device list (post-SSDP) but never selected. With patch: the SSDP packet triggers InnerLoad retry; the printer is selected automatically.
  5. Verify log lines try_load_last_machine_on_alive: SSDP-triggered retry for ... and set_selected_machine: select new lan machine ....

Limitations

After studio inactivity (>15 min), the first user action after wake may still see one failed send before stale-MQTT auto-reconnect runs on the next refresher tick. A subsequent send within 1 s succeeds.

Tested on

  • macOS 26.4.1, Apple Silicon, against a Bambu H2C in LAN-only mode (developer mode enabled).
  • Both fixes confirmed via session log analysis on a real reproduction.

@adele-with-a-b
Contributor Author

The macOS / Linux / Windows CI failures here aren't caused by the changes in this PR. The diff under deps/ is empty:

$ git diff --stat upstream/master..fix/lan-stale-mqtt
 src/slic3r/GUI/DeviceCore/DevManager.cpp | 90 +++++++++++++++++++++++++++++++
 src/slic3r/GUI/GUI_App.cpp               | 26 +++++++++
 src/slic3r/GUI/GUI_App.hpp               | 13 +++++

The failure is in deps/build/.../dep_Assimp-prefix/src/dep_Assimp/contrib/zlib/zutil.c, where zutil.h:147 defines fdopen(fd,mode) → NULL and the substituted NULL collides with macOS SDK's fdopen declaration in <_stdio.h>:318 (__DARWIN_ALIAS_STARTING(...) doesn't tolerate a macro-expanded parameter name). This is reproducible across every open macOS-touching PR — see for example #10712's CI, which fails the exact same way on completely unrelated code.
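The mechanism behind this kind of collision can be reproduced in isolation. The sketch below uses a hypothetical name, `demo_open`, rather than the real fdopen, and substitutes nullptr for NULL; it only demonstrates how a function-like macro rewrites anything that looks like a call or a declaration with a parameter list.

```cpp
// Declared before the macro exists, so this definition is left untouched.
inline const char* demo_open(int fd, const char* mode) {
    (void)fd;
    (void)mode;
    return "real function";
}

// Same shape as the zutil.h definition quoted above, under a hypothetical name.
#define demo_open(fd, mode) nullptr

// From here on, `demo_open` followed by `(` is rewritten by the preprocessor.
// A later declaration such as
//     const char* demo_open(int, const char*);
// would expand to `const char* nullptr;` and fail to parse, which is the
// same class of error as a header declaring the function after the macro.

// Expands to `return nullptr;`.
inline const char* via_macro() { return demo_open(3, "r"); }

// Parenthesising the name suppresses function-like macro expansion,
// so this calls the real function.
inline const char* via_function() { return (demo_open)(3, "r"); }
```

Disabling the bundled zlib, as the referenced fix does, avoids the problem by never bringing the macro into the build at all.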

The umbrella fix is already filed as #10717 (-DASSIMP_BUILD_ZLIB=OFF plus four other related issues). I hit the same compile error locally (Apple Clang 21 + macOS SDK 26) and the ASSIMP_BUILD_ZLIB=OFF workaround landed cleanly — same fix from Karlingen's PR.

The changes in this PR are confined to DeviceManager and TryLoadLastMachine, so I'd expect them to build green once #10717 (or the equivalent) lands and CI is unblocked.

@adele-with-a-b
Contributor Author

Update on the CI failures: the Linux failures (Build BambuStudio step on ubuntu-22.04 / 24.04) are also unrelated to this PR's diff. They fail at AMSMaterialsSetting.cpp:1168 with ‘class Slic3r::MachineObject’ has no member named ‘get_extruder_id_by_ams_id’, the same regression that's covered by item 5 of #10717 and is the focus of #10768. This is easy to confirm from the diff stat: the PR touches 3 files in src/slic3r/GUI/, and AMSMaterialsSetting.cpp is not among them:

$ git diff --stat upstream/master..fix/lan-stale-mqtt
 src/slic3r/GUI/DeviceCore/DevManager.cpp | 90 ++++++++++++++++++++++++++++++++
 src/slic3r/GUI/GUI_App.cpp               | 26 +++++++++
 src/slic3r/GUI/GUI_App.hpp               | 13 +++++

The same error appears on every open PR's Linux build for the past 2+ weeks — see #10712's CI for an unrelated example. So the full CI picture for this PR's diff is: macOS and Windows red on the Assimp/zlib fdopen → NULL collision, Linux red on the get_extruder_id_by_ams_id regression, both upstream-wide. The PR's actual code in DeviceManager and TryLoadLastMachine should compile green once #10717 (or #10768) lands and CI is unblocked.

LAN-mode-only printers (no Bambu cloud login) had two related failure
modes that left the user manually re-selecting the printer to make
Studio talk to it again. This commit fixes both -- they share the same
root surface (DeviceManager + TryLoadLastMachine), affect the same
users, and reproduce on the same workflow, so they ship together.

== Stale MQTT socket recovery (DeviceManagerRefresher::on_timer) ==

After idling, macOS App Nap, the local network stack, or the printer
can silently drop the MQTT-over-TLS TCP session. The next
publish_gcode() returns BAMBU_NETWORK_ERR_SEND_MSG_FAILED (-4) and the
user has to manually re-select the printer to trigger the
disconnect+reconnect path in DeviceManager::set_selected_machine.

on_timer's existing keep_alive() / refresh_connection() calls are
gated on is_user_login() and never run for LAN-only users. Add a
parallel branch that fires when all of these hold:

  - obj->is_lan_mode_printer() && obj->has_access_right()
  - obj->is_avaliable()       (bind_state == "free")
  - !obj->is_in_printing()    (don't clobber print UI mid-print)
  - !obj->is_connected()      (last MQTT push older than 30s)

When the gate passes, re-select the same machine id. That triggers
the same-id-LAN branch in set_selected_machine which runs
disconnect_printer -> reset -> connect -- the same path the manual
workaround takes. Throttled to one attempt per 10s, bumped only on a
successful set_selected_machine so a transient false return doesn't
delay the next chance to recover.
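The "bumped only on success" throttle semantics can be sketched as follows. This is an illustration under assumptions, not the commit's code: `ReconnectThrottle` and `try_run` are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to check.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

using Clock = std::chrono::steady_clock;

// Hypothetical throttle: the timestamp advances only when the attempt
// reports success, so a failed attempt does not close the window.
class ReconnectThrottle {
public:
    explicit ReconnectThrottle(std::chrono::seconds min_interval)
        : min_interval_(min_interval) {
        assert(min_interval.count() >= 0);
    }

    // Runs `attempt` if the window is open. Returns true when the attempt
    // was invoked and returned true.
    bool try_run(Clock::time_point now, const std::function<bool()>& attempt) {
        if (has_last_ && now - last_success_ < min_interval_) {
            return false;  // throttled: a successful attempt ran too recently
        }
        if (!attempt()) {
            return false;  // failed attempt: leave the window open
        }
        last_success_ = now;
        has_last_ = true;
        return true;
    }

private:
    std::chrono::seconds min_interval_;
    Clock::time_point last_success_{};
    bool has_last_ = false;
};
```

With a 10 s interval, a false return at time t leaves the next tick free to try again, while a true return at time t blocks further attempts until t + 10 s.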

== Startup-race recovery (TryLoadLastMachine via SSDP) ==

TryLoadLastMachine::InnerLoad fires within milliseconds of app start,
before SSDP has announced the printer's current IP. If the cached
user_access_dev_ip is stale (slicer_uuid rotated since pairing, or
DHCP gave the printer a new IP), bind_detect returns -2 immediately,
erases user_access_dev_ip, and bails. The cloud fallback also fails
because the LAN printer isn't in the list yet. By the time SSDP
populates localMachineList ~1-3s later, no further InnerLoad retry
runs and the printer is discovered-but-not-selected (see upstream
issue bambulab#9445).

Add GUI_App::try_load_last_machine_on_alive(dev_id) and call it from
DeviceManager::on_machine_alive whenever an SSDP packet announces a
previously-paired printer. The retry's InnerLoad finds
user_access_dev_ip empty (the failed first attempt erased it) and
falls through to the dev->get_my_machine non-null branch in
GUI_App.cpp, which calls set_selected_machine directly. No second
bind_detect is spawned. The method self-filters on dev_id ==
get_user_last_machine() and no-ops if a machine is already selected,
so per-SSDP-packet invocation is cheap.

== Test plan ==

Stale-MQTT: idle the app >30min on macOS with a LAN-only printer.
Without patch, Send to Printer returns -4; with patch, the next 1Hz
refresher tick auto-reconnects and the send succeeds. Verify the log
line "LAN auto-reconnect: stale MQTT socket detected for dev_id=...".

Startup race: with a stale user_access_dev_ip (rotate slicer_uuid in
BambuStudio.conf, or power-cycle the router so SSDP is delayed),
launch Studio. Without patch the printer is never selected; with
patch the SSDP packet triggers a retry and selection succeeds.
Verify "try_load_last_machine_on_alive: SSDP-triggered retry for ...".

== Limitations ==

  - After studio inactivity (>15min), the first user action after
    wake may still see one failed send before stale-MQTT auto-
    reconnect runs on the next refresher tick. A subsequent send
    within 1s succeeds.
  - The 10s throttle is a function-static, not per-dev-id, so in
    multi-printer households a recent reconnect attempt on printer
    A can delay the next attempt on printer B by up to 10s.
  - TryLoadLastMachine's destructor joins local_bind_thread; an
    SSDP-triggered InnerLoad firing during app shutdown can stall
    the join 1-3s waiting for the bind_detect timeout.
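For the second limitation above, one possible variation is to key the throttle by device id instead of using a single function-static timestamp. The sketch below is not part of this commit; `PerDeviceThrottle`, `allowed`, and `record` are hypothetical names, and it assumes all calls happen on the same thread.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical per-device throttle: each dev_id gets its own timestamp,
// so an attempt on printer A does not delay an attempt on printer B.
class PerDeviceThrottle {
public:
    explicit PerDeviceThrottle(std::chrono::seconds min_interval)
        : min_interval_(min_interval) {}

    // True if no attempt was recorded for dev_id within the interval.
    bool allowed(const std::string& dev_id, Clock::time_point now) const {
        auto it = last_attempt_.find(dev_id);
        return it == last_attempt_.end() || now - it->second >= min_interval_;
    }

    // Remember that an attempt for dev_id happened at `now`.
    void record(const std::string& dev_id, Clock::time_point now) {
        last_attempt_[dev_id] = now;
    }

private:
    std::chrono::seconds min_interval_;
    std::unordered_map<std::string, Clock::time_point> last_attempt_;
};
```

The map grows by one entry per printer seen, which is negligible for household-sized setups.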

Addresses upstream issue bambulab#9445.
@BenJule

BenJule commented May 21, 2026

Hi @adele-with-a-b! The macOS CI failures in this PR are caused by a pre-existing infrastructure issue unrelated to your changes.

Root cause: Assimp 5.4.3's bundled zlib defines #define fdopen(fd,mode) NULL in contrib/zlib/zutil.h:147. On macOS 15+ the SDK's _stdio.h expands fdopen through __DARWIN_ALIAS(fdopen), which hits the NULL macro and produces a parse error during the "Build Deps" step.

I've submitted a fix in #10849 that disables the bundled zlib on macOS (system zlib is always available there) while keeping it enabled on Windows. Once that merges, this PR should pass CI cleanly.

BenJule added a commit to BenJule/BambuStudio that referenced this pull request May 21, 2026
… 15+

Assimp 5.4.3's bundled zlib defines `#define fdopen(fd,mode) NULL` in
contrib/zlib/zutil.h:147. On macOS 15+ the SDK's `_stdio.h` uses
`__DARWIN_ALIAS(fdopen)`, which expands through that macro and causes a
C preprocessor parse error, breaking the "Build Deps" CI step on both
arm64 and x86_64.

Fix: only build the bundled zlib on Windows (where it is needed);
on macOS and Linux use the system-provided zlib instead.

Fixes the CI failures in bambulab#10767.

Co-Authored-By: Abdel Gomez-Perez <nabdel07@icloud.com>
Development

Successfully merging this pull request may close these issues.

Bambu Studio completely forgets LAN only printer between sessions
