Panic on backward system-clock adjustment (sleep/wake NTP correction) wedges IDS auth and downstream CloudKit sync #29

@richardpowellus

Summary

CachedKeys::get_stale_time() in src/ids/identity_manager.rs#L45-L49 panics with Time went backwards whenever the system clock is corrected backward (e.g. NTP resync after sleep/wake). Because the function is called from is_valid() (L51-L58) and is_dirty() (L61-L65), which are invoked on every cached-identity check (L283, L305), the panic kills the IDS auth path and cascades into every downstream subsystem — most visibly CloudKit sync wedges into a retry loop until the app is force-killed.

Reproduction

Confirmed on Windows 11 (ARM64), OpenBubbles 1.18.800.0 (Microsoft Store) with rustpush as bundled.

  1. Run OpenBubbles with CloudKit sync enabled (cloudSyncingEnabled: true).
  2. Put the machine into Modern Standby for several hours/days.
  3. Wake the machine. Windows NTP corrects the system clock backward (typically a few seconds; in my case ~6.5s after a 3.5-day sleep).
  4. An OpenBubbles thread immediately panics with PanicException(Time went backwards: SystemTimeError(<N>s)).
  5. The CloudKit sync supervisor retries on a ~4.6s interval, producing a steady stream of failures (in my log: exactly 13 errors/min for ~9 minutes straight) and pegging one CPU core at ~96% indefinitely. The Flutter UI freezes; only force-killing the process recovers.

Evidence

Panic from the app log

2026-06-02T15:38:18.658213Z [ERROR] [BlueBubblesApp] PanicException(Time went backwards: SystemTimeError(4.9310864s))
2026-06-02T15:38:18.731222Z [ERROR] [BlueBubblesApp] PanicException(Time went backwards: SystemTimeError(4.7560443s))
2026-06-02T15:38:19.006534Z [ERROR] [BlueBubblesApp] PanicException(Time went backwards: SystemTimeError(5.7651135s))
2026-06-02T15:38:19.604706Z [ERROR] [BlueBubblesApp] PanicException(Time went backwards: SystemTimeError(3.8677244s))
2026-06-02T15:38:19.633064Z [ERROR] [BlueBubblesApp] PanicException(Time went backwards: SystemTimeError(3.8665786s))
... (panic stack into SimpleDecoder.decode at flutter_rust_bridge/src/codec/base.dart:35)

Correlated Windows Kernel-General clock-adjust events (same machine, same minute)

2026-06-02 08:37:32 (Event 1) — system time set to 2026-06-02T15:37:32Z from 2026-05-29T20:28:02Z (wake from 3.5-day sleep)
2026-06-02 08:38:18 (Event 1) — system time set to 2026-06-02T15:38:18.489Z from 2026-06-02T15:38:25.010Z   ← clock jumped 6.5s BACKWARD
2026-06-02 08:37:38 (Power 507) — exiting Modern Standby

The 6.5 s backward NTP correction at 08:38:18 PDT (= 15:38:18 UTC) is the trigger for the panic burst, which started <200 ms later. The logged panic magnitudes (3.9-5.8 s) all fall below the 6.5 s correction, which is what you would expect for cache timestamps captured shortly before the clock stepped back. After the initial burst, the retry loop maintained a steady 13 errors/min for 9+ minutes before I killed the process.
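The timing claims can be checked directly from the log lines quoted above. A minimal sketch, using only the second and millisecond fields within minute 15:38 UTC (the helper names are illustrative):

```rust
use std::time::Duration;

/// Offset within minute 15:38 UTC, built from the seconds and milliseconds
/// shown in the log lines above.
fn at(sec: u64, millis: u64) -> Duration {
    Duration::from_secs(sec) + Duration::from_millis(millis)
}

/// Size of the backward step: old clock reading (15:38:25.010)
/// minus new clock reading (15:38:18.489).
fn backward_step() -> Duration {
    at(25, 10) - at(18, 489)
}

/// Delay between the clock being set (15:38:18.489) and the
/// first logged panic (15:38:18.658).
fn first_panic_delay() -> Duration {
    at(18, 658) - at(18, 489)
}

/// Largest panic magnitude in the quoted log excerpt (5.7651135 s, truncated to ms).
fn largest_logged_skew() -> Duration {
    Duration::from_millis(5_765)
}
```

This gives a step of 6.521 s and a delay of 169 ms, and the largest logged skew stays below the step size.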

Steady cadence proves a wedged retry loop, not just a one-shot panic

Errors per minute on 2026-06-02:
  17:15  13
  17:14  13
  17:13  13
  17:12  13
  17:11  13
  17:10  13
  17:09  13
  17:08  13
  17:07  13

Root cause

src/ids/identity_manager.rs#L45-L49:

fn get_stale_time(&self) -> Duration {
    SystemTime::now()
        .duration_since(UNIX_EPOCH + Duration::from_millis(self.at_ms))
        .expect("Time went backwards")
}

When SystemTime::now() is earlier than self.at_ms (which happens whenever the wall clock is adjusted backward), duration_since returns Err(SystemTimeError) and .expect() panics. self.at_ms was captured before the NTP correction; SystemTime::now() is read after. The cache thinks its keys are from "the future" relative to the wall clock — which is physically impossible but trivially produced by the OS.
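The failure mode can be reproduced without touching the real system clock. The sketch below mirrors the shape of get_stale_time() but takes "now" as a parameter, which is an assumption made purely so that a backward step can be simulated; the function names are hypothetical and not part of rustpush.

```rust
use std::time::{Duration, SystemTime, SystemTimeError, UNIX_EPOCH};

/// Same computation as `get_stale_time()`, minus the `.expect()`,
/// with "now" injected instead of read from the OS.
fn stale_time_at(now: SystemTime, at_ms: u64) -> Result<Duration, SystemTimeError> {
    now.duration_since(UNIX_EPOCH + Duration::from_millis(at_ms))
}

/// Simulates a key cached at `at_ms` and then read after the wall clock
/// has been stepped back by `step_back`.
fn read_after_backward_step(at_ms: u64, step_back: Duration) -> Result<Duration, SystemTimeError> {
    let cached_at = UNIX_EPOCH + Duration::from_millis(at_ms);
    stale_time_at(cached_at - step_back, at_ms)
}
```

Calling read_after_backward_step with any non-zero step returns Err, and SystemTimeError::duration() reports how far the cached timestamp lies ahead of "now". That value is what appears inside SystemTimeError(...) in the panic messages.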

The panic propagates up through the Tokio task and kills the IDS lookup task, but the supervising CloudKit code keeps retrying. Each retry hits the same panicking code path because the cached at_ms values are still in the (now-relative) future. The loop only breaks when the wall clock catches back up (potentially many seconds, or never, if drift cancels out).

Impact

  • Severity: High. Single-event trigger → unrecoverable application hang requiring force-kill. Reproducible on every wake from a non-trivial sleep on Windows (where backward NTP corrections after Modern Standby are routine).
  • Blast radius: Any code path that gates on CachedKeys::is_valid or is_dirty — i.e. every IDS-authenticated request, including CloudKit (gateway.icloud.com/ckdatabase/...), iMessage delivery checks, etc.
  • User-visible behaviour: App freezes; CPU pegs at one full core; iCloud sync, contact sync, and message sending all silently break until the user notices and force-kills.

Proposed fix

Replace the panic with a safe fallback. Three options, ordered by minimal-change → robust:

1. Minimal — saturate to zero on backward skew:

fn get_stale_time(&self) -> Duration {
    SystemTime::now()
        .duration_since(UNIX_EPOCH + Duration::from_millis(self.at_ms))
        .unwrap_or(Duration::ZERO)
}

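This treats a clock-backward situation as "the key was just refreshed," which is conservative (keys appear fresh until the next genuine staleness check). Safe because is_valid() returning true on fresh keys is the no-op path. Once the wall clock passes at_ms again the normal calculation resumes, so after a single backward step a key's age is under-reported by at most the size of that step.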

2. Better — return Duration::MAX (force re-auth):

.unwrap_or(Duration::MAX)

Forces an immediate refresh, which is the right behaviour if the wall clock genuinely moved by an unexpected amount.
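To make the difference between the first two options concrete, here is a small sketch that feeds both fallbacks through a freshness check. MAX_AGE and looks_fresh are hypothetical stand-ins: the real threshold and logic inside is_valid() are not shown in this issue.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical freshness window, chosen only for illustration.
const MAX_AGE: Duration = Duration::from_secs(3600);

/// Staleness with a caller-chosen fallback for the "clock went backward" case.
/// `now` is injected (an assumption of this sketch) so the case can be simulated.
fn stale_time_or(now: SystemTime, at_ms: u64, fallback: Duration) -> Duration {
    now.duration_since(UNIX_EPOCH + Duration::from_millis(at_ms))
        .unwrap_or(fallback)
}

/// Stand-in for the gate that `is_valid()` presumably applies.
fn looks_fresh(stale: Duration) -> bool {
    stale < MAX_AGE
}
```

With a simulated backward step, Duration::ZERO makes looks_fresh return true (keep using the cached keys), while Duration::MAX makes it return false (refresh now). When the clock has not gone backward, the fallback is never used and both options behave identically.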

3. Best — use a monotonic clock for staleness:
Store Instant::now() alongside at_ms at cache-insertion time and compute staleness against Instant. Monotonic clocks are immune to wall-clock skew. (Drawback: Instant is not persistable, so this only works for in-memory caches — but CachedKeys already looks in-memory based on the surrounding code.)
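A rough sketch of what option 3 could look like, assuming the cache entry really is in-memory only. CachedEntry and its fields are hypothetical and do not reflect the actual layout of CachedKeys.

```rust
use std::time::{Duration, Instant};

/// Hypothetical in-memory cache entry (illustrative names only).
struct CachedEntry {
    /// Wall-clock timestamp, kept for logging or other wall-clock uses.
    at_ms: u64,
    /// Monotonic timestamp taken at insertion; used only for staleness.
    inserted_at: Instant,
}

impl CachedEntry {
    fn new(at_ms: u64) -> Self {
        Self { at_ms, inserted_at: Instant::now() }
    }

    /// Wall-clock insertion time in milliseconds since the Unix epoch.
    fn wall_clock_ms(&self) -> u64 {
        self.at_ms
    }

    /// Elapsed time on the monotonic clock. Wall-clock adjustments
    /// do not affect this value, so there is no error case to handle.
    fn stale_time(&self) -> Duration {
        self.inserted_at.elapsed()
    }
}
```

One caveat worth checking before adopting this: whether a monotonic clock keeps counting while the machine is suspended is platform-dependent, so staleness measured with Instant may under-count time spent asleep on some targets.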

Related class of bugs

There are 15+ other SystemTime::now().duration_since(...).unwrap() / .expect() call sites in the codebase that have the same fundamental problem. Found via gh search code 'duration_since repo:OpenBubbles/rustpush':

  • src/statuskit.rs — 2 sites
  • src/passwords.rs — 1 site
  • src/findmy.rs — 1 site
  • src/ids/user.rs — 1 site
  • src/util.rs — 2 sites (duration_since(SystemTime::UNIX_EPOCH).unwrap())
  • src/auth.rs — mme_refreshed weekly check
  • src/facetime.rs, src/imessage/aps_client.rs, src/imessage/messages.rs, src/icloud/keychain.rs, src/ids/identity_manager.rs, cloudkit-proto/src/lib.rs — various
  • src/sharedstreams.rs — round_seconds()

identity_manager.rs#L45 is the one I caught panicking, but any of these can panic the same way under the right clock-skew conditions. Worth fixing as a class — perhaps a small helper duration_since_safe() in util.rs that returns Duration::ZERO (or whatever the call site needs) on Err.
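One possible shape for such a helper is sketched below. Only the name duration_since_safe comes from the suggestion above; the signature, the explicit fallback parameter, and the unix_millis companion (aimed at the duration_since(UNIX_EPOCH).unwrap() pattern) are assumptions.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Returns `later - earlier`, or `fallback` when `earlier` is actually
/// ahead of `later` (for example after a backward clock adjustment).
pub fn duration_since_safe(later: SystemTime, earlier: SystemTime, fallback: Duration) -> Duration {
    later.duration_since(earlier).unwrap_or(fallback)
}

/// Milliseconds since the Unix epoch for `t`, saturating to 0 for
/// pre-epoch times and to `u64::MAX` on overflow instead of panicking.
pub fn unix_millis(t: SystemTime) -> u64 {
    let since_epoch = duration_since_safe(t, UNIX_EPOCH, Duration::ZERO);
    since_epoch.as_millis().min(u64::MAX as u128) as u64
}
```

Taking the fallback as an argument lets each call site pick the behaviour it needs: Duration::ZERO where "treat as fresh" is acceptable, Duration::MAX where a refresh should be forced.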

Workaround for users until fixed

Disable CloudKit sync via flutter.cloudSyncingEnabled = false and flutter.attachmentSyncEnabled = false in shared_preferences.json. iMessages still deliver via APS reflection; only cross-device iCloud history sync is lost.


Happy to send a PR for option 1 or 3 if you have a preferred approach.
