Prevent sending "Recovered" notification when no prior "Down" notification was sent for short outages #3438

@akashmannil

Description

A monitor sometimes sends a "Recovered" (Up) notification to Discord even though no "Down" notification was ever sent for that incident. The monitor only failed for ~1 minute (shorter than the configured incident threshold), so a Down notification should not have been sent — and conversely a Recovered notification should not be emitted if there was never a Down notification for the same incident.

Steps to reproduce

  1. Configure a monitor with:
     - Check frequency: 1 minute
     - Sliding window size: 5 checks
     - Threshold: 60% (so 3 failing checks required to mark Down)
  2. Allow a single check to fail (one failed ping) and then have the monitor succeed on the next check (so the total outage is ~1 minute).
  3. Observe notifications delivered to configured notification channels (Discord, etc).

Observed behavior (actual)

A "Recovered" / Monitor Recovered notification is delivered to Discord after the service returns to Up, despite no "Down" notification having been delivered earlier for the same outage.

Expected behavior

No "Recovered" notification should be sent unless a "Down" notification was previously sent for that incident.
Alternatively, add a configuration option allowing teams to suppress Recovered-only notifications (i.e., only notify on recovery when a Down notification was sent).

Relevant configuration

Check frequency: 1 minute
Sliding window checks: 5
Threshold: 60% (=> 3/5 checks required to mark Down)

These settings imply an outage must persist ~3 minutes to be reported as Down; the observed outage was ~1 minute, so no Down notification should have been issued (see image2).
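To make the arithmetic concrete, here is a minimal sketch of how the threshold could be evaluated. The function names are hypothetical and this is not the project's actual code; it also assumes the monitor is marked Down when the failure count reaches the threshold (>=), which matches the "3/5 checks" reading above.

```python
import math
from typing import Sequence


def failures_required(window_size: int, threshold_pct: int) -> int:
    """Smallest number of failing checks that meets the threshold (assumed >=)."""
    return math.ceil(window_size * threshold_pct / 100)


def window_is_down(results: Sequence[bool], window_size: int, threshold_pct: int) -> bool:
    """results holds check outcomes, True = success, most recent last."""
    recent = results[-window_size:]
    failures = sum(1 for ok in recent if not ok)
    return failures >= failures_required(window_size, threshold_pct)


# Settings from this report: 5-check window, 60% threshold, 1-minute frequency.
needed = failures_required(5, 60)
minutes_to_down = needed * 1  # one check per minute

# A single failed check (the ~1 minute outage) stays below the threshold.
one_failure = [True, True, True, False, True]
three_failures = [True, True, False, False, False]
```

With these settings `needed` is 3, so roughly 3 minutes of consecutive failures are required before the window reports Down; `one_failure` does not qualify, while `three_failures` does.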

Possible root causes / hypotheses

Notification logic is triggered on any state change derived from the latest check results, but notification suppression for the “Down” event (because the sliding window hadn’t yet met the threshold) is not being coordinated with the “Recovered” notification logic.
The system may announce a recovery as soon as a single recent check succeeds, if it derives the current status from the latest check alone instead of requiring that a Down notification was previously sent for the incident.
The notification pipeline may not track whether an incident/Down notification has been emitted for the current outage, allowing “Recovered” messages to be emitted for transient failures.
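One possible way to address the hypotheses above is to keep a per-monitor flag that records whether a Down notification was actually emitted, and to gate Recovered on that flag. The sketch below is illustrative only: `NotificationGate`, `on_check`, and the `suppress_recovered_only` option are hypothetical names, not the project's existing API, and `window_down` is assumed to be the threshold-based status computed elsewhere.

```python
from typing import Optional


class NotificationGate:
    """Tracks whether a Down notification went out for the current incident."""

    def __init__(self, suppress_recovered_only: bool = True) -> None:
        # Hypothetical config option: when True, Recovered is only sent
        # if a Down notification was sent for the same incident.
        self.suppress_recovered_only = suppress_recovered_only
        self.down_notified = False
        self.failure_seen = False

    def on_check(self, check_ok: bool, window_down: bool) -> Optional[str]:
        """Return "down", "recovered", or None for the latest check."""
        if not check_ok:
            self.failure_seen = True
        if window_down:
            if not self.down_notified:
                self.down_notified = True
                return "down"
            return None  # already notified for this incident
        if not check_ok:
            return None  # failing, but below the threshold: stay quiet
        # Latest check succeeded and the window is not Down: incident is over.
        had_down, had_failure = self.down_notified, self.failure_seen
        self.down_notified = False
        self.failure_seen = False
        if had_down:
            return "recovered"
        if had_failure and not self.suppress_recovered_only:
            return "recovered"  # legacy behavior described in this issue
        return None


# Transient ~1 minute outage: one failed check, then a success.
gate = NotificationGate()
transient = [gate.on_check(False, False), gate.on_check(True, False)]

# Sustained outage: threshold reached on the third failure, then recovery.
gate = NotificationGate()
sustained = [
    gate.on_check(False, False),
    gate.on_check(False, False),
    gate.on_check(False, True),
    gate.on_check(True, True),
    gate.on_check(True, False),
]
```

In this sketch the transient outage produces no notifications at all, while the sustained outage produces exactly one "down" followed by one "recovered". Setting `suppress_recovered_only=False` reproduces the Recovered-only behavior reported here, which is how the suggested configuration option could preserve the current behavior for teams that want it.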

[image1] [image2] [image3]

Labels

bug