Make LogicalMeterActor resilient to wall-clock time jumps #33
Conversation
Adds a WallClockTimer that schedules ticks against the wall clock while
sleeping on tokio's monotonic clock. On entry to each tick the timer
compares wall-elapsed against monotonic-elapsed since the previous
observation; if they diverge by more than one interval (in either
direction) it realigns to the wall clock and returns
TickInfo { resynced: true } immediately, so callers can rebuild any
timestamp-keyed state.
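A minimal sketch of the caller contract described above, assuming the `TickInfo { resynced }` shape from the description; the loop and names are illustrative stand-ins, not the crate's actual code:

```rust
// Hedged sketch of the caller contract. `TickInfo` follows the description
// above; a plain iterator stands in for the real async timer.
struct TickInfo {
    resynced: bool,
}

fn handle_ticks(ticks: impl IntoIterator<Item = TickInfo>) -> Vec<&'static str> {
    let mut actions = Vec::new();
    for tick in ticks {
        if tick.resynced {
            // Wall clock jumped: timestamp-keyed state is stale, rebuild it.
            actions.push("rebuild");
        } else {
            actions.push("resample");
        }
    }
    actions
}

fn main() {
    let ticks = vec![
        TickInfo { resynced: false },
        TickInfo { resynced: true },
        TickInfo { resynced: false },
    ];
    assert_eq!(handle_ticks(ticks), vec!["resample", "rebuild", "resample"]);
}
```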
Why wall-vs-monotonic, rather than just sleeping until a wall-clock
deadline: a slow caller (event-loop lateness, long downstream work)
advances both clocks by the same amount, so the difference stays
bounded and the timer doesn't mistake catch-up for a jump. A real
NTP-style jump shifts only the wall clock, so the difference exceeds
the threshold and resync fires.
Why strict `>` (not `>=`) on the threshold: a drift of exactly one
interval is treated as scheduling jitter, not a jump. Pinned by
exact-boundary tests.
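The divergence check and the strict-`>` boundary can be sketched as follows, using signed millisecond elapsed times in place of the crate's actual duration types (the function name and signature are assumptions, not the real API):

```rust
// Hedged sketch of the wall-vs-monotonic divergence check described above.
fn needs_resync(wall_elapsed_ms: i64, mono_elapsed_ms: i64, interval_ms: i64) -> bool {
    // A slow caller advances both clocks equally, so the difference stays
    // bounded; an NTP-style jump shifts only the wall clock.
    let drift = wall_elapsed_ms - mono_elapsed_ms;
    // Strict `>`: drift of exactly one interval is jitter, not a jump.
    drift.abs() > interval_ms
}

fn main() {
    assert!(!needs_resync(1_000, 1_000, 1_000)); // on time
    assert!(!needs_resync(2_000, 1_000, 1_000)); // exactly one interval: jitter
    assert!(needs_resync(31_000, 1_000, 1_000)); // forward jump
    assert!(needs_resync(-29_000, 1_000, 1_000)); // backward jump
}
```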
Why a custom timer instead of tokio::time::interval: the latter sleeps
to a monotonic deadline, so it can't realign when the wall clock jumps
— which is exactly the failure mode this change is meant to survive.
The Clock trait plus the test-only TokioSyncedClock helper let
paused-time tests drive both telemetry timestamps and timer wall-now
from a single shared anchor, simulating a whole-machine NTP jump.
Also adds the InvalidConfig ErrorKind variant for the construction-time
validations (non-positive interval, interval too large to represent as a
std::time::Duration).
Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Replaces tokio::time::interval with the new WallClockTimer in the
resampling loop. On a resync tick:
- rebuild every inner frequenz_resampling::Resampler with its `start`
aligned one interval before the realigned current tick, draining
any buffered telemetry from the jumped-over window (the buffered
samples are timestamped on the old wall-clock frame and would
pollute the freshly-aligned resampler);
- emit a single `None` sample at the realigned tick, preserving the
every-interval cadence across the jump;
- real values resume on the next tick.
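The resync steps above can be sketched as one function; `Resampler` here is a minimal stand-in struct, not `frequenz_resampling`'s actual API, and the field and function names are assumptions:

```rust
// Hedged sketch of the post-jump rebuild described above.
struct Resampler {
    start_ms: i64,           // alignment frame for output timestamps
    buffer: Vec<(i64, f64)>, // buffered (timestamp_ms, value) telemetry
}

fn rebuild_after_jump(r: &mut Resampler, realigned_tick_ms: i64, interval_ms: i64) -> Option<f64> {
    // Align `start` one interval before the realigned current tick.
    r.start_ms = realigned_tick_ms - interval_ms;
    // Drain telemetry stamped on the old wall-clock frame; it would pollute
    // the freshly aligned resampler.
    r.buffer.clear();
    // Emit a single `None` at the realigned tick, preserving the cadence.
    None
}

fn main() {
    let mut r = Resampler { start_ms: 0, buffer: vec![(500, 1.0), (1500, 2.0)] };
    let emitted = rebuild_after_jump(&mut r, 60_000, 1_000);
    assert_eq!(emitted, None);
    assert_eq!(r.start_ms, 59_000);
    assert!(r.buffer.is_empty());
}
```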
A backward jump produces a sample timestamped in the past relative to
the previous one — this is documented in the release notes; consumers
that assume monotonically increasing Sample timestamps must handle it.
Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
The variant has no remaining constructors: the only TimeDelta arithmetic that previously surfaced ChronoError now lives inside WallClockTimer, and resampling-interval misconfiguration is reported as InvalidConfig. This is a breaking public API change for downstream code that matches exhaustively on ErrorKind; called out in RELEASE_NOTES.md Upgrading. Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Force-pushed 53311fa to 61fa5e0.
Pull request overview
Adds wall-clock jump resilience to logical meter resampling by introducing a wall-clock–aligned timer and resampler rebuild behavior when NTP-style jumps are detected.
Changes:
- Added an internal `WallClockTimer` abstraction with pluggable wall-clock `Clock` implementations (system + tokio-synced test clock).
- Updated `LogicalMeterActor` to drive resampling via `WallClockTimer` and rebuild inner resamplers on detected wall-clock jumps (emitting a single `None` at the resync tick).
- Updated mock/test utilities and release notes to support and document the new behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/wall_clock_timer.rs | New wall-clock aligned timer + unit tests for alignment and jump detection. |
| src/logical_meter/logical_meter_handle.rs | Adds try_new_with_clock() for injecting a wall clock (tests) and wires actor construction accordingly. |
| src/logical_meter/logical_meter_actor.rs | Replaces tokio interval scheduling with WallClockTimer, adds resampler rebuild/drain logic, and new jump-recovery tests. |
| src/lib.rs | Registers internal wall_clock_timer module. |
| src/error.rs | Removes ChronoError error kind (no longer needed by actor init). |
| src/client/test_utils/tokio_synced_clock.rs | Adds a tokio-paused-time-friendly wall clock for deterministic jump tests. |
| src/client/test_utils.rs | Updates mock telemetry timestamps to use the shared test clock and supports injecting whole-machine wall jumps. |
| RELEASE_NOTES.md | Documents the new wall-clock jump resilience behavior. |
Force-pushed 4b0e32d to 95dc8a6.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
```rust
tokio::time::sleep(to_next.to_std().unwrap_or(std::time::Duration::ZERO)).await;
slept_this_call = true;
}
```
tokio::time::sleep(to_next.to_std().unwrap_or(Duration::ZERO)) can turn a conversion failure into a zero-duration sleep, which will busy-loop the tick() loop. This can happen when to_next (and/or interval) is too large to fit into std::time::Duration (e.g., very large resampling intervals), and it will peg a core. Prefer validating the interval at construction (e.g., require interval.to_std() succeeds) and/or handle the Err case by returning an error instead of sleeping for zero.
```rust
        "resampling_interval must be positive, got {:?}",
        config.resampling_interval
    )));
}
```
resampling_interval is now only validated as positive, but not that it fits into std::time::Duration. WallClockTimer::tick() ultimately calls TimeDelta::to_std() for sleeping, and if that conversion fails (interval too large) it falls back to a zero sleep, causing a tight loop. Consider rejecting intervals that can’t be represented as std::time::Duration here (return ErrorKind::InvalidConfig) so misconfiguration can’t trigger a CPU spin.
```rust
    }
}
if config.resampling_interval.to_std().is_err() {
    return Err(Error::invalid_config(format!(
        "resampling_interval must fit into std::time::Duration, got {:?}",
        config.resampling_interval
    )));
}
```
```rust
ErrorKind!(
    (ComponentGraphError, component_graph_error),
    (ComponentDataError, component_data_error),
    (ConnectionFailure, connection_failure),
    (ChronoError, chrono_error),
    (DroppedUnusedFormulas, dropped_unused_formulas),
    (FormulaEngineError, formula_engine_error),
    (InvalidComponent, invalid_component),
    (InvalidConfig, invalid_config),
    (Internal, internal),
    (APIServerError, api_server_error),
);
```
Removing the ChronoError variant from the public ErrorKind enum is a breaking change for downstream users that pattern-match on ErrorKind. If this isn’t intended to be a semver-major change, consider keeping ChronoError (even if unused) and introducing InvalidConfig in addition, or document the breaking change prominently (e.g., in RELEASE_NOTES.md Upgrading section + version bump).
My understanding, which might still be wrong, is that the problem is: samples are wall-clock stamped. I'm not sure if that's when they were collected or when they arrive, but this is all on the same machine so there won't be much difference. It might be simpler if the resampler is asked for the "last sample", which is the current wall-clock time, floored to some point (e.g. whole seconds) to make sure different clients are aligned, minus the resampling interval, and no clock is kept at all. The client wakes up based on the monotonic clock, asks for the last sample, and it doesn't matter about shifting clocks. While simpler, there may be use cases this would break that I am unaware of; I don't know if people try to get historical data, for example, or multiple samples. This isn't the approach of the PR, though. This is just for other confused readers, as it took me a while to get my head around this and the problem has popped up a few times.
Interesting. Not sure about the resampler in this repo, but on the SDK (this is based on the SDK timer AFAIK) that's not good enough, because you could stop receiving data, so asking for the last sample might give you a very old sample (and the resampler will still send resampled data anyway, worst case with value = None, but we want it to keep ticking). So it's not very reliable for that scenario.
It wouldn't, because it would be outside of the time window. It wouldn't go back searching for data, just use things it already has within the window.
So if you don't consider old data, how do you know whether there is no data because you stopped receiving data or because you jumped into the future? How do you decide which timestamp to use to emit those samples to keep the continuous ticking? I have the feeling it's not that simple, but again, I went deep into this a long time ago so I don't have it fresh. Since I was already asking AI to do a PR review, I asked it to help me consider this simpler approach. The reply is long, but it seems to align with my intuition that for continuous aligned ticking it might not be that trivial.

AI analysis:

That proposal makes sense as a description of why time jumps break the current resampling windows, but it does not look like a complete replacement for the PR/SDK approach.

Short version: the proposed "just wake up on the monotonic clock and resample the latest wall-clock bucket" can work for a different API, but not cleanly for the current one, because the current logical meter needs a stream with one tick per interval, aligned to wall-clock boundaries as closely as possible, including when there is no input data.

Why the simpler idea is not a drop-in replacement:
If you compute Without that:
So "no clock is kept at all" is not really true once you define correct behavior.
Example:
Now what? Possible behaviors are all problematic:
So clock shifts do still matter; this proposal just postpones the policy decision.
If wall clock jumps forward by 30 seconds, or the actor is blocked for 30 seconds, then "give me the latest completed bucket" only gives you one bucket. That means you either:
The current code expects exactly one resampled output per tick. You can see that in
Those are different. With the proposal, you wake on a monotonic cadence and only then ask "what was the latest completed wall-clock bucket?". That means the sample may be produced noticeably after the bucket boundary. The current timer tries to do better: it sleeps on monotonic time but schedules ticks against wall-clock boundaries. So if the requirement is "tick every interval, as aligned to the wall clock as possible", the simpler proposal is weaker.
The current resampler is stateful. It has a start/alignment and is advanced over time. The actor rebuilds it after a jump because that internal alignment frame becomes stale. So the simpler idea would fit better with a different resampler API like:
But that is not the current design. With the current design, backward jumps and re-asking old buckets are especially awkward.
Your intuition here is correct. If no telemetry arrives for a while, the logical meter still needs to emit one output per interval, typically A "latest bucket" query can do that only if you are still waking reliably once per interval and tracking which buckets have already been emitted. Again, that brings back timer/state logic. What the proposal gets right The proposal is right about the core bug mechanism:
That part is valid.

What it misses: the problem is not only "which bucket is latest now?", but also:
Those are exactly the reasons the PR/SDK use a wall-clock-aligned timer plus explicit resync behavior.

Bottom line: the proposal is feasible only if you accept a different contract:
It is not equivalent to:
For the current logical meter semantics, the simpler proposal has real flaws, and your intuition is not wrong.
llucax left a comment:
OK, this is a first AI-assisted round (I asked it to compare to the SDK implementation, and did a quick check on the findings). To be honest, to do an in-depth round myself will probably take too much time, as it's been a while since I did the SDK implementation and I'm not familiar with rust. We can see if it is necessary after this round.
Some extra comments about commits:
- `f7e9d09` (Add NTP-jump-resilient wall clock timer) and `abfa4b4` (Use WallClockTimer in LogicalMeterActor for time-jump resilience) are both substantial functional commits, but both have empty commit bodies. Given how much behavior changed here, a short explanation of the design tradeoffs in the commit messages would make the history much easier to follow later.
- `56ebc5e` (Remove ChronoError variant from ErrorKind) removes the public `ErrorKind::ChronoError` variant, but the commit message does not explain why that removal is safe / intended. Since that is upgrade-significant, the rationale should ideally be visible in history too.
There is no simple way. I did say there may be hidden requirements I was unaware of - this just gets you samples from the last sample period and not much else. If we want to monitor the stream and do some stats and e.g. give reasons there were no samples in a window I would probably have that separately - but it would depend on exactly what monitoring we wanted to perform.
I'm not sure what continuous ticking you want; I was just solving the problem of "samples in the last window". If there are other requirements, I am unaware of them and they won't be taken into account. This probably isn't the place for this discussion, so if you want to continue it we should write out the requirements and then see if there is anything simpler. On the AI comments: most assume more requirements than the last window, for example the ability to fetch historical windows. That could be added, but again, without clear requirements thinking about it is fruitless. I added the comment because the PR was hard to understand (for me) given I didn't understand the design, and didn't understand the reason for the design, and I thought others might be in the same boat.
Force-pushed 8dde103 to 1af1b40.
Extracts the broadcast-receiver drain loop into a free poll_telemetry function. Both call sites (the regular resample loop and the post-jump drain in rebuild_resamplers_after_jump) shared the same Empty / Lagged / Closed handling — pulling it into one helper keeps the two paths from drifting apart. Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Both `start_resamplers` and `rebuild_resamplers_after_jump` constructed their inner `frequenz_resampling::Resampler` with the same lookup chain (metric override → configured default → Average) and the same `max_age_in_intervals` clamp. Pulling that into `build_resampler` keeps the two paths consistent as `LogicalMeterConfig` evolves — important now that the rebuild path is correctness-critical for jump recovery. Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Lagged means the receiver fell behind enough that the channel dropped samples — a real data-loss event, not just an internal scheduling detail. It can fire during a wall-clock jump (the server may burst samples to refill the buffer) or under sustained back-pressure; either way operators want to see it without a debug filter. Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
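The shared drain behavior from the last three commit messages can be sketched as one helper. `PollError` below mirrors the shape of tokio's broadcast `TryRecvError` (Empty / Lagged / Closed) but is a local stand-in, and `poll_telemetry`'s signature is an assumption, not the crate's actual code:

```rust
// Hedged sketch of a shared telemetry drain helper.
enum PollError {
    Empty,
    Lagged(u64),
    Closed,
}

/// Drains all currently available samples into `out`. Returns false once the
/// sender side is gone, so both call sites stop identically.
fn poll_telemetry<T>(
    mut try_recv: impl FnMut() -> Result<T, PollError>,
    out: &mut Vec<T>,
) -> bool {
    loop {
        match try_recv() {
            Ok(sample) => out.push(sample),
            Err(PollError::Empty) => return true,
            Err(PollError::Lagged(n)) => {
                // Real data loss, not a scheduling detail: surface at warn level.
                eprintln!("telemetry receiver lagged, dropped {n} samples");
            }
            Err(PollError::Closed) => return false,
        }
    }
}

fn main() {
    // Events are popped from the back: Ok(1), Lagged(3), Ok(2), Empty.
    let mut events = vec![Err(PollError::Empty), Ok(2), Err(PollError::Lagged(3)), Ok(1)];
    let mut out = Vec::new();
    let alive = poll_telemetry(|| events.pop().unwrap(), &mut out);
    assert!(alive);
    assert_eq!(out, vec![1, 2]);
}
```

Centralizing the match means a future change to, say, the Lagged policy lands in both the regular resample loop and the post-jump drain at once.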
`WallClockTimer::try_new` already rejects this internally, but
surfacing the check at the actor boundary gives a slightly more
domain-specific error message ("resampling_interval must be positive")
and pins the contract that callers see InvalidConfig — not whatever
the timer happens to report — for misconfigured intervals.
Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
The inner `frequenz_resampling::Resampler` takes its max-age as `i32`, so the actor previously had to clamp `config.max_age_in_intervals` (`u32`) with `.min(i32::MAX as u32) as i32`. That silently truncated user configuration without telling them their effective retention window had been altered. Validate at construction instead and surface InvalidConfig, so an out-of-range value fails loudly. The clamp in build_resampler is no longer needed and the cast becomes a plain `as i32`. Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
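The fail-loud check described above amounts to a checked `u32` to `i32` conversion at construction time; the function name and error shape here are illustrative, not the crate's API:

```rust
// Hedged sketch of validating max_age_in_intervals at construction instead
// of silently clamping it to i32::MAX in build_resampler.
fn validate_max_age(max_age_in_intervals: u32) -> Result<i32, String> {
    i32::try_from(max_age_in_intervals).map_err(|_| {
        format!("max_age_in_intervals must fit into i32, got {max_age_in_intervals}")
    })
}

fn main() {
    assert_eq!(validate_max_age(3), Ok(3));
    // Values above i32::MAX now fail loudly instead of being truncated.
    assert!(validate_max_age(u32::MAX).is_err());
}
```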
Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Force-pushed 1af1b40 to c7e476f.
`new()` slept up to ~1 s of wall-clock time aligning to the next whole second so the mock's telemetry timestamps would land on the resampler's interval boundaries. The anchor is now built directly from `next_sec_secs` and handed to `TokioSyncedClock::with_wall_anchor`, so the boundary is reported at construction without waiting for real time to reach it. The sleep was incidentally syncing the mock's clock with a later `TokioSyncedClock::new()` (anchored to `Utc::now()`) that `new_logical_meter_handle` was passing to `LogicalMeterHandle`; expose the mock's clock via a new `clock()` accessor and share it with the actor so the two sides remain aligned without depending on real wall time. Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Address every lint clippy reports under `--all-targets`:

- `inconsistent_digit_grouping` in a frequency test (regroup the literal at the standard 3-digit boundary).
- `excessive_precision` in current/energy formatting tests (truncate the f32 literals to a representable mantissa; the formatter still rounds to the asserted output).
- `useless_vec` in two `bounds` test fixtures (use array literals).
- `redundant_closure` in a `logical_meter_handle` test helper (pass the function reference directly).
- `clone_on_copy` on `Option<Timestamp>` and a stray `into_iter()` in `test_utils`'s mock telemetry stream.
- `collapsible_if` in `receive_electrical_component_telemetry_stream` (fold the two guards into a single let-chain, edition 2024).
- `type_complexity` on `MockComponent::metrics` (factor out a `MockMetricRow` typedef).

`cargo fmt --check` was already clean, no formatting changes.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
llucax left a comment:
Approving from the logical part, not sure if anyone else with more rust-fu should approve too, I leave that to you to decide.
Adds a new internal `WallClockTimer` that schedules resampler ticks against the wall clock while sleeping on tokio's monotonic clock. When the two diverge by more than one resampling interval in either direction, the timer realigns and the actor rebuilds its inner resamplers against the new clock frame.

Subscribers see a single `None` sample at the resync tick (preserving the every-interval cadence) and real values resume on the next tick: no catch-up burst, no duplicate samples.