
Make LogicalMeterActor resilient to wall-clock time jumps#33

Merged
shsms merged 12 commits into frequenz-floss:v0.x.x from shsms:resampler-timejump
May 4, 2026

Conversation


@shsms shsms commented Apr 23, 2026

Adds a new internal WallClockTimer that schedules resampler ticks against the
wall clock while sleeping on tokio's monotonic clock.

When the two diverge by more than one resampling interval in either direction,
the timer realigns and the actor rebuilds its inner resamplers against the new
clock frame.

Subscribers see a single None sample at the resync tick (preserving the
every-interval cadence) and real values resume on the next tick — no catch-up
burst, no duplicate samples.

shsms added 4 commits April 23, 2026 09:59
Adds a WallClockTimer that schedules ticks against the wall clock while
sleeping on tokio's monotonic clock. On entry to each tick the timer
compares wall-elapsed against monotonic-elapsed since the previous
observation; if they diverge by more than one interval (in either
direction) it realigns to the wall clock and returns
TickInfo { resynced: true } immediately, so callers can rebuild any
timestamp-keyed state.

Why wall-vs-monotonic, rather than just sleeping until a wall-clock
deadline: a slow caller (event-loop lateness, long downstream work)
advances both clocks by the same amount, so the difference stays
bounded and the timer doesn't mistake catch-up for a jump. A real
NTP-style jump shifts only the wall clock, so the difference exceeds
the threshold and resync fires.

Why strict `>` (not `>=`) on the threshold: a drift of exactly one
interval is treated as scheduling jitter, not a jump. Pinned by
exact-boundary tests.

Why a custom timer instead of tokio::time::interval: the latter sleeps
to a monotonic deadline, so it can't realign when the wall clock jumps
— which is exactly the failure mode this change is meant to survive.

The Clock trait plus the test-only TokioSyncedClock helper let
paused-time tests drive both telemetry timestamps and timer wall-now
from a single shared anchor, simulating a whole-machine NTP jump.

Also adds the InvalidConfig ErrorKind variant for the construction-time
validations (non-positive interval, interval > Duration::MAX).
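The wall-vs-monotonic divergence check described above can be sketched with std types alone (a minimal illustration with assumed names; the real WallClockTimer uses tokio's monotonic clock and chrono's TimeDelta):

```rust
use std::time::{Duration, Instant, SystemTime};

// Hypothetical stand-in for the timer's divergence check; field and
// method names are illustrative, not the crate's API.
struct DriftCheck {
    interval: Duration,
    last_wall: SystemTime,
    last_mono: Instant,
}

impl DriftCheck {
    // True when wall-elapsed and monotonic-elapsed since the previous
    // observation diverge by strictly more than one interval.
    fn resync_needed(&mut self, wall_now: SystemTime, mono_now: Instant) -> bool {
        // The wall clock may run backwards, so keep a signed difference.
        let wall_ms = match wall_now.duration_since(self.last_wall) {
            Ok(d) => d.as_millis() as i128,
            Err(e) => -(e.duration().as_millis() as i128),
        };
        let mono_ms = mono_now.duration_since(self.last_mono).as_millis() as i128;
        self.last_wall = wall_now;
        self.last_mono = mono_now;
        // Strict `>`: a drift of exactly one interval counts as jitter.
        (wall_ms - mono_ms).abs() > self.interval.as_millis() as i128
    }
}
```

A slow caller advances both clocks by the same amount, so the difference stays near zero; only a wall-clock jump pushes it past the threshold.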

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Replaces tokio::time::interval with the new WallClockTimer in the
resampling loop. On a resync tick:

  - rebuild every inner frequenz_resampling::Resampler with its `start`
    aligned one interval before the realigned current tick, draining
    any buffered telemetry from the jumped-over window (the buffered
    samples are timestamped on the old wall-clock frame and would
    pollute the freshly-aligned resampler);
  - emit a single `None` sample at the realigned tick, preserving the
    every-interval cadence across the jump;
  - real values resume on the next tick.

A backward jump produces a sample timestamped in the past relative to
the previous one — this is documented in the release notes; consumers
that assume monotonically increasing Sample timestamps must handle it.
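The realignment arithmetic described above can be illustrated in miniature (millisecond integers instead of chrono types; `realigned_window` is an assumed name, not the actor's API):

```rust
// Given the post-jump wall-clock "now", compute the realigned tick on the
// interval grid and the rebuilt resampler's `start` one interval earlier.
fn realigned_window(wall_now_ms: i64, interval_ms: i64) -> (i64, i64) {
    // First tick strictly after `now`, aligned to the wall-clock interval grid.
    let tick = wall_now_ms - wall_now_ms.rem_euclid(interval_ms) + interval_ms;
    // The resampler is rebuilt with `start` one interval before that tick,
    // so the first window after the jump is exactly [start, tick).
    (tick - interval_ms, tick)
}
```

Using `rem_euclid` rather than `%` keeps the grid alignment correct even for negative (pre-epoch) timestamps.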

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
The variant has no remaining constructors: the only TimeDelta arithmetic
that previously surfaced ChronoError now lives inside WallClockTimer,
and resampling-interval misconfiguration is reported as InvalidConfig.

This is a breaking public API change for downstream code that matches
exhaustively on ErrorKind; called out in RELEASE_NOTES.md Upgrading.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Copilot AI left a comment


Pull request overview

Adds wall-clock jump resilience to logical meter resampling by introducing a wall-clock–aligned timer and resampler rebuild behavior when NTP-style jumps are detected.

Changes:

  • Added an internal WallClockTimer abstraction with pluggable wall-clock Clock implementations (system + tokio-synced test clock).
  • Updated LogicalMeterActor to drive resampling via WallClockTimer and rebuild inner resamplers on detected wall-clock jumps (emitting a single None at the resync tick).
  • Updated mock/test utilities and release notes to support and document the new behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/wall_clock_timer.rs New wall-clock aligned timer + unit tests for alignment and jump detection.
src/logical_meter/logical_meter_handle.rs Adds try_new_with_clock() for injecting a wall clock (tests) and wires actor construction accordingly.
src/logical_meter/logical_meter_actor.rs Replaces tokio interval scheduling with WallClockTimer, adds resampler rebuild/drain logic, and new jump-recovery tests.
src/lib.rs Registers internal wall_clock_timer module.
src/error.rs Removes ChronoError error kind (no longer needed by actor init).
src/client/test_utils/tokio_synced_clock.rs Adds a tokio-paused-time-friendly wall clock for deterministic jump tests.
src/client/test_utils.rs Updates mock telemetry timestamps to use the shared test clock and supports injecting whole-machine wall jumps.
RELEASE_NOTES.md Documents the new wall-clock jump resilience behavior.


@shsms shsms force-pushed the resampler-timejump branch 2 times, most recently from 4b0e32d to 95dc8a6 Compare April 23, 2026 12:55
@shsms shsms requested a review from Copilot April 23, 2026 12:57

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.



Comment thread src/wall_clock_timer.rs Outdated
Comment on lines +129 to +131
tokio::time::sleep(to_next.to_std().unwrap_or(std::time::Duration::ZERO)).await;
slept_this_call = true;
}

Copilot AI Apr 23, 2026


tokio::time::sleep(to_next.to_std().unwrap_or(Duration::ZERO)) can turn a conversion failure into a zero-duration sleep, which will busy-loop the tick() loop. This can happen when to_next (and/or interval) is too large to fit into std::time::Duration (e.g., very large resampling intervals), and it will peg a core. Prefer validating the interval at construction (e.g., require interval.to_std() succeeds) and/or handle the Err case by returning an error instead of sleeping for zero.

"resampling_interval must be positive, got {:?}",
config.resampling_interval
)));
}

Copilot AI Apr 23, 2026


resampling_interval is now only validated as positive, but not that it fits into std::time::Duration. WallClockTimer::tick() ultimately calls TimeDelta::to_std() for sleeping, and if that conversion fails (interval too large) it falls back to a zero sleep, causing a tight loop. Consider rejecting intervals that can’t be represented as std::time::Duration here (return ErrorKind::InvalidConfig) so misconfiguration can’t trigger a CPU spin.

Suggested change
}
}
if config.resampling_interval.to_std().is_err() {
return Err(Error::invalid_config(format!(
"resampling_interval must fit into std::time::Duration, got {:?}",
config.resampling_interval
)));
}

Comment thread src/error.rs
Comment on lines 55 to 65
ErrorKind!(
(ComponentGraphError, component_graph_error),
(ComponentDataError, component_data_error),
(ConnectionFailure, connection_failure),
(ChronoError, chrono_error),
(DroppedUnusedFormulas, dropped_unused_formulas),
(FormulaEngineError, formula_engine_error),
(InvalidComponent, invalid_component),
(InvalidConfig, invalid_config),
(Internal, internal),
(APIServerError, api_server_error),
);

Copilot AI Apr 23, 2026


Removing the ChronoError variant from the public ErrorKind enum is a breaking change for downstream users that pattern-match on ErrorKind. If this isn’t intended to be a semver-major change, consider keeping ChronoError (even if unused) and introducing InvalidConfig in addition, or document the breaking change prominently (e.g., in RELEASE_NOTES.md Upgrading section + version bump).

@hannah-stevenson-frequenz

my understanding, which might still be wrong, is that the problem is:

  • samples are wall-clock stamped. i'm not sure if it's when they were collected or when they arrive, but this is all on the same machine so there won't be much difference.
  • the resampler has a `start` variable, set at instantiation time.
  • when the actor asks for a sample it passes in `resampler_ts`, which is a free-running clock, updated by a fixed amount every time the monotonic clock says an interval has passed, but disconnected from the wall clock.
  • the sample window is [start, resampler_ts).
  • when the wall clock has drifted too much, the sampling window no longer reflects the past resample period.

it might be simpler if the resampler is asked for the "last sample": the current wall-clock time, floored to some point (e.g. whole seconds) to make sure different clients are aligned, minus the resampling interval, with no clock kept at all. the client wakes up based on the monotonic clock, asks for the last sample, and it doesn't matter about shifting clocks. whilst simpler, there may be use cases this would break that i am unaware of - i don't know if ppl try to get historical data for example, or multiple samples. this isn't the approach of the PR though.
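The flooring idea above, sketched in whole seconds (names are made up for illustration; as noted, this is explicitly not the approach the PR takes):

```rust
// "Last sample window": floor the current wall-clock time to the interval
// grid so all clients align, then look back one resampling interval.
fn last_window(wall_now_secs: u64, interval_secs: u64) -> (u64, u64) {
    let end = wall_now_secs - wall_now_secs % interval_secs;
    (end - interval_secs, end)
}
```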

this is just for other confused readers as it took me a while to get my head around this and the problem has popped up a few times.

@llucax llucax self-requested a review April 27, 2026 09:52
@llucax

llucax commented Apr 27, 2026

it might be simpler if the resampler is asked for the "last sample" which is the current wall clock time, floored to some point (e.g. whole seconds) to make sure different clients are aligned, minus the resampling interval and no clock is kept at all.

Interesting. Not sure about the resampler in this repo, but on the SDK (this is based on the SDK timer AFAIK) that's not good enough, because you could stop receiving data, so asking for the last sample might give you a very old sample (and the resampler will still send resampled data anyway, worst case with value = None, but we want it to keep ticking), so it's not very reliable for that scenario.

@hannah-stevenson-frequenz

so asking for the last sample might give you a very old sample

it wouldn't, because it would be outside of the time window. it wouldn't go back searching for data, just use things it already has within the window.

@llucax

llucax commented Apr 27, 2026

it wouldn't, because it would be outside of the time window. it wouldn't go back searching for data, just use things it already has within the window.

So if you don't consider old data, how do you know if there is no data because you stopped receiving data or because you jumped into the future? How do you decide which timestamp to use to emit those samples to keep the continuous ticking?

I have the feeling it's not that simple, but again, I went deep into this a long time ago so I don't have it fresh. Since I was already asking AI to do a PR review, I asked it to help me consider this simple approach. The reply is long, but it seems to align with my intuition that for continuous aligned ticking it might not be that trivial.

AI analysis

That proposal makes sense as a description of why time jumps break the current resampling windows, but it does not look like a complete replacement for the PR/SDK approach.

Short version

The proposed "just wake up on the monotonic clock and resample the latest wall-clock bucket" can work for a different API, but not cleanly for the current one, because the current logical meter needs a stream with one tick per interval, aligned to wall-clock boundaries as closely as possible, including when there is no input data.

Why the simpler idea is not a drop-in replacement

  1. It still needs state.

If you compute bucket = floor(wall_now, interval) on every wake-up, you still need to remember what bucket you emitted last.

Without that:

  • if you wake twice in the same bucket, you duplicate output
  • if you wake late and skip buckets, you silently drop intervals
  • if the wall clock jumps backwards, you may ask for an already-emitted bucket again

So "no clock is kept at all" is not really true once you define correct behavior.
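A minimal sketch of the bookkeeping this point refers to (hypothetical names, whole-second buckets):

```rust
// Tracking the last emitted bucket is the minimum state needed to avoid
// duplicates on same-bucket wake-ups and re-emission after a backward jump.
struct BucketEmitter {
    interval: u64,
    last_emitted: Option<u64>,
}

impl BucketEmitter {
    // Returns the bucket end to emit now, or None if that bucket was
    // already emitted (same-bucket wake-up or backward wall-clock jump).
    fn next_bucket(&mut self, wall_now: u64) -> Option<u64> {
        let end = wall_now - wall_now % self.interval;
        match self.last_emitted {
            Some(prev) if end <= prev => None,
            _ => {
                self.last_emitted = Some(end);
                Some(end)
            }
        }
    }
}
```

Note that even with this state, the buckets skipped by a forward jump are silently dropped, which is the next objection.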

  2. Backward jumps are the hardest case, and this proposal does not solve them.

Example:

  • you already emitted the bucket ending at 12:00:10
  • wall clock jumps back to 11:59:40
  • next monotonic wake computes a much earlier bucket

Now what?

Possible behaviors are all problematic:

  • emit that older bucket again: duplicates / time goes backwards
  • refuse to emit until wall clock catches back up: you get a long gap with no ticks
  • rebuild and realign immediately: that is basically what this PR does

So clock shifts do still matter; this proposal just postpones the policy decision.

  1. Forward jumps or slow processing can skip intervals.

If wall clock jumps forward by 30 seconds, or the actor is blocked for 30 seconds, then "give me the latest completed bucket" only gives you one bucket.

That means you either:

  • drop 29 intervals, which breaks the current streaming semantics
  • or emit all missed intervals, which again requires remembered state and catch-up logic

The current code expects exactly one resampled output per tick. You can see that in src/logical_meter/logical_meter_actor.rs:511-517.

  4. It does not give wall-clock-aligned ticks, only wall-clock-aligned timestamps.

Those are different.

With the proposal, you wake on a monotonic cadence and only then ask "what was the latest completed wall-clock bucket?". That means the sample may be produced noticeably after the bucket boundary.

The current timer tries to do better: it sleeps on monotonic time but schedules ticks against wall-clock boundaries.

So if the requirement is "tick every interval, as aligned to the wall clock as possible", the simpler proposal is weaker.

  5. It does not match the current resampler model very well.

The current resampler is stateful. It has a start/alignment and is advanced over time. The actor rebuilds it after a jump because that internal alignment frame becomes stale.

So the simpler idea would fit better with a different resampler API like:

  • "given [start, end), compute that bucket from retained history"

But that is not the current design. With the current design, backward jumps and re-asking old buckets are especially awkward.

  6. Missing-data periods still need regular ticks.

Your intuition here is correct.

If no telemetry arrives for a while, the logical meter still needs to emit one output per interval, typically None.

A "latest bucket" query can do that only if you are still waking reliably once per interval and tracking which buckets have already been emitted. Again, that brings back timer/state logic.

What the proposal gets right

The proposal is right about the core bug mechanism:

  • telemetry samples are stamped in wall-clock time
  • resampling windows must therefore also be defined in wall-clock time
  • if your output schedule drifts away from the wall-clock frame, windows become wrong

That part is valid.

What it misses

It misses that the problem is not only "which bucket is latest now?", but also:

  • exactly-once emission per interval
  • behavior on backward jumps
  • behavior when several intervals are crossed between wake-ups
  • emitting None during silent periods
  • keeping multiple subscribers aligned

Those are exactly the reasons the PR/SDK use a wall-clock-aligned timer plus explicit resync behavior.

Bottom line

The proposal is feasible only if you accept a different contract:

  • "give me the latest completed bucket when I happen to wake up"

It is not equivalent to:

  • "produce a continuous per-interval stream aligned to wall-clock boundaries, including through time jumps and missing-data periods"

For the current logical meter semantics, the simpler proposal has real flaws, and your intuition is not wrong.


@llucax llucax left a comment


OK, this is a first AI-assisted round (I asked it to compare to the SDK implementation, and did a quick check on the findings). To be honest, to do an in-depth round myself will probably take too much time, as it's been a while since I did the SDK implementation and I'm not familiar with rust. We can see if it is necessary after this round.

Some extra comments about commits:

  1. f7e9d09 (Add NTP-jump-resilient wall clock timer) and abfa4b4 (Use WallClockTimer in LogicalMeterActor for time-jump resilience) are both substantial functional commits, but both have empty commit bodies. Given how much behavior changed here, a short explanation of the design tradeoffs in the commit messages would make the history much easier to follow later.
  2. 56ebc5e (Remove ChronoError variant from ErrorKind) removes the public ErrorKind::ChronoError variant, but the commit message does not explain why that removal is safe / intended. Since that is upgrade-significant, the rationale should ideally be visible in history too.

@hannah-stevenson-frequenz

So if you don't consider old data, how do you know if there is no data because you stopped receiving data or because you jumped into the future?

There is no simple way. I did say there may be hidden requirements I was unaware of - this just gets you samples from the last sample period and not much else. If we want to monitor the stream and do some stats and e.g. give reasons there were no samples in a window I would probably have that separately - but it would depend on exactly what monitoring we wanted to perform.

How do you decide which timestamp to use to emit those samples to keep the continuous ticking?

I'm not sure what the continuous ticking you want is. I was just solving the problem of "samples in the last window". If there are other requirements, I am unaware of them and they won't be taken into account. This probably isn't the place for this discussion, so if you want to continue it we should write out the requirements and then see if there is anything simpler.

On the AI comments, most assume more requirements than the last window - the ability to fetch historical windows for example. That could be added but again without clear requirements thinking about it is fruitless.

I added the comment because the PR was hard to understand (for me) given I didn't understand the design, and didn't understand the reason for the design, and I thought others might be in the same boat.

@shsms shsms force-pushed the resampler-timejump branch 3 times, most recently from 8dde103 to 1af1b40 Compare April 30, 2026 15:06
shsms added 6 commits April 30, 2026 15:51
Extracts the broadcast-receiver drain loop into a free poll_telemetry
function. Both call sites (the regular resample loop and the post-jump
drain in rebuild_resamplers_after_jump) shared the same Empty / Lagged /
Closed handling — pulling it into one helper keeps the two paths from
drifting apart.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Both `start_resamplers` and `rebuild_resamplers_after_jump` constructed
their inner `frequenz_resampling::Resampler` with the same lookup
chain (metric override → configured default → Average) and the same
`max_age_in_intervals` clamp. Pulling that into `build_resampler` keeps
the two paths consistent as `LogicalMeterConfig` evolves — important
now that the rebuild path is correctness-critical for jump recovery.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Lagged means the receiver fell behind enough that the channel dropped
samples — a real data-loss event, not just an internal scheduling
detail. It can fire during a wall-clock jump (the server may burst
samples to refill the buffer) or under sustained back-pressure;
either way operators want to see it without a debug filter.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
`WallClockTimer::try_new` already rejects this internally, but
surfacing the check at the actor boundary gives a slightly more
domain-specific error message ("resampling_interval must be positive")
and pins the contract that callers see InvalidConfig — not whatever
the timer happens to report — for misconfigured intervals.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
The inner `frequenz_resampling::Resampler` takes its max-age as `i32`,
so the actor previously had to clamp `config.max_age_in_intervals`
(`u32`) with `.min(i32::MAX as u32) as i32`. That silently truncated
user configuration without telling them their effective retention
window had been altered.

Validate at construction instead and surface InvalidConfig, so an
out-of-range value fails loudly. The clamp in build_resampler is no
longer needed and the cast becomes a plain `as i32`.
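The replaced clamp versus the new check, in sketch form (function names and the error type are illustrative; the actual code surfaces the crate's InvalidConfig):

```rust
// Before: silent truncation - a u32 above i32::MAX was clamped without
// telling the user their effective retention window had changed.
fn clamped_max_age(max_age_in_intervals: u32) -> i32 {
    max_age_in_intervals.min(i32::MAX as u32) as i32
}

// After: fail loudly at construction time instead.
fn validated_max_age(max_age_in_intervals: u32) -> Result<i32, String> {
    i32::try_from(max_age_in_intervals).map_err(|_| {
        format!("max_age_in_intervals {max_age_in_intervals} exceeds i32::MAX")
    })
}
```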

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
@shsms shsms force-pushed the resampler-timejump branch from 1af1b40 to c7e476f Compare April 30, 2026 15:57
shsms added 2 commits May 4, 2026 09:39
`new()` slept up to ~1 s of wall-clock time aligning to the next whole
second so the mock's telemetry timestamps would land on the resampler's
interval boundaries. The anchor is now built directly from
`next_sec_secs` and handed to `TokioSyncedClock::with_wall_anchor`, so
the boundary is reported at construction without waiting for real time
to reach it. The sleep was incidentally syncing the mock's clock with a
later `TokioSyncedClock::new()` (anchored to `Utc::now()`) that
`new_logical_meter_handle` was passing to `LogicalMeterHandle`; expose
the mock's clock via a new `clock()` accessor and share it with the
actor so the two sides remain aligned without depending on real wall
time.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
Address every lint clippy reports under `--all-targets`:

- `inconsistent_digit_grouping` in a frequency test (regroup the
  literal at the standard 3-digit boundary).
- `excessive_precision` in current/energy formatting tests (truncate
  the f32 literals to a representable mantissa; the formatter still
  rounds to the asserted output).
- `useless_vec` in two `bounds` test fixtures (use array literals).
- `redundant_closure` in a `logical_meter_handle` test helper (pass
  the function reference directly).
- `clone_on_copy` on `Option<Timestamp>` and a stray `into_iter()` in
  `test_utils`'s mock telemetry stream.
- `collapsible_if` in `receive_electrical_component_telemetry_stream`
  (fold the two guards into a single let-chain, edition 2024).
- `type_complexity` on `MockComponent::metrics` (factor out a
  `MockMetricRow` typedef).

`cargo fmt --check` was already clean, no formatting changes.

Signed-off-by: Sahas Subramanian <sahas.subramanian@proton.me>
@shsms shsms force-pushed the resampler-timejump branch from c7e476f to 05d64e2 Compare May 4, 2026 09:51
@shsms shsms requested a review from llucax May 4, 2026 09:54

@llucax llucax left a comment


Approving from the logical part, not sure if anyone else with more rust-fu should approve too, I leave that to you to decide.

@shsms shsms force-pushed the resampler-timejump branch from 05d64e2 to 1cb49de Compare May 4, 2026 13:39
@shsms shsms added this pull request to the merge queue May 4, 2026
Merged via the queue into frequenz-floss:v0.x.x with commit ef35ed6 May 4, 2026
3 checks passed
@shsms shsms deleted the resampler-timejump branch May 4, 2026 13:43