ComponentTelemetryTracker leaks and logs an error every tick once its mpsc receiver drops #43

Description

@shsms

Symptom

After a BatteryPool is dropped, the per-component ComponentTelemetryTracker tasks it spawned (transitively, via BatteryPoolTelemetryTracker → InverterBatteryGroupTelemetryTracker → ComponentTelemetryTracker) keep running forever. Each one emits

ERROR Failed to send component status: channel closed

at every missing_data_tolerance tick (~1 s) for the lifetime of the process. The error rate scales linearly with the number of batteries × the number of times the pool has been recreated, so a control app that rebuilds its Microgrid on topology changes (switchyard does this: every world-connect triggers a rebuild) adds one more error line per second, per battery, with each rebuild.

The other trackers in the chain (BatteryPoolTelemetryTracker, InverterBatteryGroupTelemetryTracker) exit cleanly on send failure. Only ComponentTelemetryTracker keeps looping.

Root cause

src/microgrid/telemetry_tracker/component_telemetry_tracker.rs, in run()'s interval.tick() arm:

_ = interval.tick() => {
    let status = ComponentHealthStatus::Unhealthy(self.component_id, None);
    if let Err(e) = self.component_status_tx.send(status).await {
        tracing::error!("Failed to send component status: {}", e);
    }
}

When the mpsc receiver is dropped, send returns an error. The arm logs that error but does not break, so the loop carries on to the next interval.tick() and fails the same way again.

The component_data_rx arm exits properly on RecvError::Closed (it drops the tx and breaks the loop), but the upstream MicrogridClientActor keeps its per-component broadcast Sender alive across BatteryPool recreations (the actor outlives any single LM / pool), so the broadcast-closed path never fires.
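Why the closed path never fires can be sketched with the standard library alone. This is only an illustration: std::sync::mpsc stands in for the broadcast channel used by the crate, and the names actor_tx, tracker_rx and poll_once are hypothetical. The point it shows is that a receiver reports "closed" only once every sender is gone, so a sender held by a long-lived actor keeps the tracker waiting.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Polls the receiver once and reports what a tracker would observe.
/// (Hypothetical helper; std mpsc is a stand-in for the broadcast channel.)
fn poll_once(rx: &mpsc::Receiver<u32>) -> &'static str {
    match rx.recv_timeout(Duration::from_millis(10)) {
        Ok(_) => "data",
        // A sender still exists somewhere, so the channel is merely idle.
        Err(RecvTimeoutError::Timeout) => "no data, channel still open",
        // Only reached once every sender has been dropped.
        Err(RecvTimeoutError::Disconnected) => "closed",
    }
}

fn main() {
    // actor_tx plays the role of the sender kept alive by the actor.
    let (actor_tx, tracker_rx) = mpsc::channel::<u32>();

    // While the actor holds its sender, the tracker never sees "closed".
    println!("{}", poll_once(&tracker_rx));

    // Only dropping the last sender closes the channel.
    drop(actor_tx);
    println!("{}", poll_once(&tracker_rx));
}
```

Under these assumptions, the receive side cannot be relied on to stop the tracker; the send side is the only place where the pool's disappearance becomes visible.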

Suggested fix

Mirror the RecvError::Closed arm and exit the loop when the status-send fails:

_ = interval.tick() => {
    let status = ComponentHealthStatus::Unhealthy(self.component_id, None);
    if self.component_status_tx.send(status).await.is_err() {
        // Downstream receiver is gone — nothing left to consume
        // our status updates. Drop the broadcast subscriber and
        // exit so we don't loop forever.
        break;
    }
}
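The difference the break makes can be reproduced in a small, self-contained sketch. It is not the crate's code: std::sync::mpsc replaces the async mpsc channel, a plain for loop replaces the interval ticks, and run_tracker with its exit_on_closed flag is a hypothetical helper that counts failed sends instead of logging them.

```rust
use std::sync::mpsc;

/// Simulates `ticks` iterations of the interval arm and returns how many
/// sends failed. `exit_on_closed` toggles the suggested fix.
/// (Hypothetical helper; a synchronous stand-in for the async loop.)
fn run_tracker(tx: mpsc::Sender<u64>, ticks: u32, exit_on_closed: bool) -> u32 {
    let mut failed_sends = 0;
    for _ in 0..ticks {
        // Stand-in for ComponentHealthStatus::Unhealthy(component_id, None).
        let status: u64 = 7;
        if tx.send(status).is_err() {
            // The original code logs an error at this point.
            failed_sends += 1;
            if exit_on_closed {
                // Suggested fix: the receiver is gone, so stop looping.
                break;
            }
        }
    }
    failed_sends
}

fn main() {
    // Current behaviour: every tick after the receiver drops fails.
    let (tx, rx) = mpsc::channel();
    drop(rx);
    println!("without break: {} failed sends", run_tracker(tx, 5, false));

    // With the fix: the first failed send ends the loop.
    let (tx, rx) = mpsc::channel();
    drop(rx);
    println!("with break: {} failed sends", run_tracker(tx, 5, true));
}
```

In this sketch the unfixed loop fails once per tick for as long as it runs, while the fixed loop fails exactly once and then returns, which is the behaviour the other trackers in the chain already have.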

The same pattern likely belongs anywhere a long-running tracker sends into a channel whose receiver is owned by the pool. It is worth auditing future PvPool / EvChargerPool implementations against this shape too.

Caller-side workaround

Filter the frequenz_microgrid::microgrid::telemetry_tracker::component_telemetry_tracker target out of the log subscriber until the upstream fix lands. The orphaned trackers are otherwise harmless, with no measurable CPU draw; only the log spam is user-facing.