
[pull] master from DataDog:master #592

Merged
pull[bot] merged 1 commit into ConnectionMaster:master from DataDog:master on Jun 10, 2026
pull[bot] commented Jun 10, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

* feat(nutanix): alert lifecycle tracking with per-state metrics

Replace the cursor-based alert collection with a reconciliation loop
against the v4.0 unresolved-alerts API. Ship per-state lifecycle gauges
and a default monitor template:

  - nutanix.alert.open       — 1 while alert is unresolved + unacknowledged
  - nutanix.alert.acknowledged — 1 while alert is unresolved + acknowledged
  - nutanix.alert.resolved   — 1 once when alert enters the resolved state

State transitions emit explicit zeros to the previous state's metric so
per-alert monitor cases recover cleanly when the alert leaves a state.
Each metric carries ext_id for monitor grouping; the metric name itself
encodes the state, so monitor queries don't need a tag filter.
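
The explicit-zero behavior described above can be sketched as follows. This is a minimal illustration, not the integration's code: `STATE_METRIC` and `transition_points` are hypothetical names, and the tag uses the `ext_id` name from this commit (a later commit renames it).

```python
# Hypothetical mapping from lifecycle state to metric name.
# The metric names themselves come from the commit message.
STATE_METRIC = {
    "open": "nutanix.alert.open",
    "acknowledged": "nutanix.alert.acknowledged",
}


def transition_points(ext_id, previous, current):
    """Return (metric, value, tags) points for one alert changing state.

    `previous` is None for an alert that was not tracked before.
    """
    tags = [f"ext_id:{ext_id}"]
    points = []
    if previous is not None and previous != current:
        # Explicit zero so a monitor grouped on the old state recovers.
        points.append((STATE_METRIC[previous], 0, tags))
    points.append((STATE_METRIC[current], 1, tags))
    return points


ack_points = transition_points("a1", "open", "acknowledged")
new_points = transition_points("a1", None, "open")
```

Without the zero, the last reported value of the previous state's gauge would stay at 1 for that alert until the series went stale.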

Lifecycle events (in addition to the metrics):

  - "Alert: <title>"               — created (or re-opened from resolved)
  - "Alert acknowledged: <title>"  — open -> acknowledged transition
  - "Alert reopened: <title>"      — acknowledged -> open transition
  - "Alert Resolved: <title>"      — resolution with resolvedTime / by /
                                     auto_resolved metadata

Reconciliation is the source of truth each cycle: alerts in the API but
not in the in-memory cache are new (emit open event); alerts in the cache
but absent from the API are resolved or deleted (emit resolution event +
.open or .acknowledged = 0, .resolved = 1); alerts in both have their
cached metadata refreshed and ack-state transitions emit dedicated
events. No state is persisted between agent runs: after a restart the
check re-derives state from the API, and the aggregation_key collapses
any duplicate creation events the restart produces.
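
The reconciliation step amounts to a set difference between the cached IDs and the IDs the API returns. A minimal sketch, assuming alerts are keyed by their extId; `diff_alerts` is a hypothetical helper name, not one from the integration:

```python
def diff_alerts(tracked_ids, api_ids):
    """Split alert IDs into new, gone, and still-tracked sets.

    All three sets are computed up front, before the tracking cache is
    mutated, so later loops do not depend on iteration order.
    """
    tracked, current = set(tracked_ids), set(api_ids)
    new_ids = current - tracked        # in the API, not yet cached
    gone_ids = tracked - current       # cached, no longer in the API
    still_tracked = tracked & current  # present in both
    return new_ids, gone_ids, still_tracked


new_ids, gone_ids, still_tracked = diff_alerts({"a", "b"}, {"b", "c"})
```

Here `c` would get an open event, `a` a resolution event, and `b` a metadata refresh plus an ack-state check.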

Hardening:
- on transient API failure, re-emit cached gauges before re-raising so
  per-alert monitors don't auto-resolve while the alert is still open.
- pre-compute new/gone/still-tracked sets before mutating _open_alerts
  so loop ordering is safe.
- v4.2 fallback removed; v4.0 endpoint with $filter=isResolved eq false
  is the only path. The pre-existing client-side filter remains as a
  safety net.
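
The first hardening item (re-emit cached gauges, then re-raise) could look roughly like this. It is a sketch under assumptions: `collect_unresolved`, `fetch_unresolved`, and the `gauge` callback are hypothetical stand-ins, and the cache is simplified to a mapping of extId to state name.

```python
def collect_unresolved(fetch_unresolved, cache, gauge):
    """Fetch unresolved alerts; on failure, re-emit cached gauges first.

    `cache` maps ext_id -> "open" or "acknowledged" (simplified).
    `gauge` is any callable taking (metric_name, value, tags).
    """
    try:
        return fetch_unresolved()
    except Exception:
        # Keep per-alert series alive so monitors do not auto-resolve
        # while the API is unreachable, then surface the failure.
        for ext_id, state in cache.items():
            gauge(f"nutanix.alert.{state}", 1, [f"ext_id:{ext_id}"])
        raise


emitted = []


def failing_fetch():
    raise ConnectionError("simulated transient failure")


try:
    collect_unresolved(
        failing_fetch,
        {"a1": "open", "b2": "acknowledged"},
        lambda name, value, tags: emitted.append((name, value, tags)),
    )
    failure_propagated = False
except ConnectionError:
    failure_propagated = True
```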

Tags added to alert events and metrics:
  - ext_id, ntnx_alert_type, ntnx_alert_severity, ntnx_alert_status
    (events only — redundant on metrics where the name encodes state)
  - ntnx_originating_cluster_name, ntnx_alert_user_defined,
    ntnx_alert_service (Tier 1 — distinguish federated cluster, custom
    vs platform alerts, and Nutanix subsystem when present)
  - ntnx_cluster_name, ntnx_alert_classification, ntnx_alert_impact,
    ntnx_alert_auto_resolved (resolution events only), source-entity tags

Default monitor template at assets/monitors/alerts.json computes
nutanix.alert.open + nutanix.alert.acknowledged - nutanix.alert.resolved
(clamped to non-negative) to alert on any unresolved alert. The monitor
auto-resolves on the one-shot resolved point. The description notes the
agent-restart re-broadcast trade-off.

Test coverage: state transitions (open<->ack, ack->resolved from each
prior state), filter-add edge case (treated as spurious resolution),
deleted-alert (_get_alert returns None) graceful fallback, empty
unresolved list cold-start, and per-tag assertions for the new Tier 1
tags. The four "complete output" alertType tests are parametrized.
conftest mock has a _filter_after helper for the time-based fixture
branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): consolidate alert changelog entries

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): revert version bump and shorten changelog

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* working monitor

* feat(nutanix): heartbeat open alerts each cycle for event-based monitors

Re-emit one alert event per tracked alert per check cycle so event-count
monitors (last("Nm") > 0) stay firing while the alert is open. Transition
cycles skip the heartbeat — the dedicated transition event already lands
under the same aggregation_key. Resolved alerts are popped from
_open_alerts before the heartbeat loop, so they don't get a duplicate
heartbeat alongside their resolution event.
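
The heartbeat skip rules can be sketched as a selection over the tracking dict. This is an illustration only: `heartbeat_targets` and its arguments are hypothetical, and the dict stands in for `_open_alerts`.

```python
def heartbeat_targets(open_alerts, resolved_ids, transitioned_ids):
    """Return the alert IDs that should receive a heartbeat this cycle."""
    # Resolved alerts leave the tracking dict before the heartbeat loop,
    # so they never get a heartbeat next to their resolution event.
    for ext_id in resolved_ids:
        open_alerts.pop(ext_id, None)
    # Alerts that changed ack state already emitted a transition event
    # under the same aggregation_key, so they are skipped as well.
    return sorted(i for i in open_alerts if i not in transitioned_ids)


tracked = {"a": {}, "b": {}, "c": {}}
targets = heartbeat_targets(tracked, resolved_ids={"a"}, transitioned_ids={"b"})
```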

Ship the default monitor template at assets/monitors/alerts.json with a
real title and description.

Tests cover the new heartbeat skip-list (transitions, resolutions,
filter-exclusion), aggregation_key consistency across the full alert
lifecycle, and the cached-gauges-no-events contract on transient API
failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): revert version bump and simplify monitor threshold

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): address review feedback on alert lifecycle tracking

- Drop the leaky self.alerts cache; _get_alert is now a one-shot fetch.
- Warn loudly when the client-side isResolved safety net drops alerts.
- Note that lastUpdatedTime is the closest signal for ack->open transitions.
- Fix triple-space in monitor title, fix date format to YYYY-MM-DD.
- Document the alert lifecycle and agent-restart behavior in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nutanix): address alert tracking review feedback

- Repoint NutanixCheck.alerts at _open_alerts so the public property
  matches what the activity monitor actually carries.
- Restore a per-cycle dedup cache on _get_alert so _process_task does
  not issue N x M GETs for tasks referencing the same alerts.
- Extract _reconcile_alerts into named helpers (new, resolved,
  transitioned, heartbeat, cached-gauge fallback) so the coordinator
  reads cleanly.
- Namespace ext_id as ntnx_alert_ext_id on metrics, events, the
  monitor template, and metadata.csv to avoid colliding with other
  sources in the global ext_id tag.
- Log a warning when the unresolved-alerts list call fails, and
  another when a gone alert cannot be fetched back from Prism Central.
- Switch nutanix.alert.resolved from gauge to count so a resolved
  alert reopening with the same extId does not leave a stuck
  resolved=1 series.
- Monitor template: query threshold > 0, use is_recovery consistently.
- Update record_fixtures.py to use the production isResolved eq false
  filter so re-recorded fixtures match what the integration queries.
- Document the alert lifecycle, agent restart behavior, and recommended
  metric monitor patterns in the README.
- Add a test for the resolved to open lifecycle with the same extId.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nutanix): quote metadata.csv description containing a comma

The nutanix.alert.resolved description contained an unquoted comma,
splitting the row into 12 columns at parse time and breaking metadata
validation.
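
The effect of the unquoted comma can be reproduced with Python's `csv` module. The rows below are simplified, hypothetical three-column stand-ins for a metadata.csv row; the description text is invented for illustration.

```python
import csv
import io

# Hypothetical rows: metric name, type, description.
unquoted = "nutanix.alert.resolved,count,Incremented once, when an alert resolves\n"
quoted = 'nutanix.alert.resolved,count,"Incremented once, when an alert resolves"\n'


def column_count(line):
    """Parse one CSV line and return how many fields it yields."""
    return len(next(csv.reader(io.StringIO(line))))


unquoted_columns = column_count(unquoted)  # the stray comma adds a field
quoted_columns = column_count(quoted)      # quoting keeps it in one field
```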

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nutanix): correct alert reconciliation edge cases from review

Two distinct correctness fixes to the alert lifecycle tracking path:

- _get_alert no longer swallows every exception. A 404 still returns None
  (the alert was deleted upstream), but transient HTTP failures (5xx,
  network, timeout) now propagate, so they are no longer silently
  misclassified as deletions that emit degraded "Resolved" events.
  _emit_resolved_alerts catches the propagated HTTPError per alert,
  restores tracking, and retries on the next cycle. The returned count
  now reflects actual emissions.

- _reconcile_alerts now distinguishes alerts that truly left the
  unresolved-alerts API (resolved/deleted upstream) from alerts that are
  still open in Prism Central but no longer match the configured
  resource_filters. The latter are dropped from tracking silently with an
  info log; no resolution event or nutanix.alert.resolved increment is
  emitted, since the alert is not resolved.
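
The 404-versus-transient split in the first fix can be sketched as below. `HTTPStatusError` is a hypothetical stand-in for `requests.exceptions.HTTPError` so the example runs without third-party packages, and `get_alert` only mirrors the behavior described above, not the real method.

```python
class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def get_alert(fetch, ext_id):
    """Return the alert, None if it was deleted (404), else re-raise."""
    try:
        return fetch(ext_id)
    except HTTPStatusError as exc:
        if exc.status_code == 404:
            return None  # deleted upstream: a real "gone" signal
        raise  # 5xx and similar: let the caller retry next cycle


def fetch_deleted(ext_id):
    raise HTTPStatusError(404)


def fetch_flaky(ext_id):
    raise HTTPStatusError(503)


deleted_result = get_alert(fetch_deleted, "a1")
try:
    get_alert(fetch_flaky, "a1")
    transient_raised = False
except HTTPStatusError:
    transient_raised = True
```

A caller that catches the re-raised error per alert can put the alert back into tracking and try again on the next cycle, which is the behavior described for `_emit_resolved_alerts`.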

Test updates:
- New test_get_alert_returns_none_on_404,
  test_get_alert_propagates_on_transient_http_error[500/502/503/504], and
  test_transient_alert_get_failure_preserves_tracking.
- Existing test_alert_filter_excludes_tracked_alert_emits_spurious_resolution
  rewritten as test_alert_filter_excludes_tracked_alert_drops_without_resolution
  (the old test pinned the prior bug; the new one asserts the correct behavior).
- New test_resolution_event_still_fires_when_alert_truly_leaves_unresolved_api
  guards against a regression in gone_ids semantics.
- conftest 404 mocks switched from bare Exception to requests.exceptions.HTTPError(response=...) so the 404 branch is actually exercised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

* fix(nutanix): shorten monitor description to under 300 chars

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(nutanix): address review feedback on alert tracking

- Guard the best-effort _get_alert call in _process_task with try/except
  HTTPError so a transient failure no longer aborts the task collection
  cycle, matching the guard already used in _emit_resolved_alerts.
- Report currently-tracked open alerts in the check summary log instead
  of the per-cycle change count, which read 0 on quiet cycles. Only the
  INFO summary is affected; events and metrics are unchanged.
- Drop the leading underscore from the SEVERITY_TO_ALERT_TYPE constant.
- Move the fixture_alert helper to conftest.py and inline the
  complete-output parametrize cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(nutanix): stamp open alert event at observation time

The open/heartbeat alert event was timestamped at the alert's
creationTime. Event monitors window on occurrence time, so back-dated
heartbeats never entered the recommended monitor's trailing 5m window:
the monitor fired once at creation, then auto-recovered ~5m later and
stayed OK regardless of the alert's real state in Prism Central.

Stamp the open/heartbeat event at observation time so it lands in the
monitor's rolling window. Recovery is now driven by heartbeats ceasing,
so the monitor recovers ~5m after the alert actually resolves.

Resolution and transition events keep their real timestamps; they are
not counted by the status:open query and only feed the timeline.
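
The windowing problem can be shown with plain arithmetic. A minimal sketch, assuming a trailing 5-minute window and epoch-second timestamps; `in_trailing_window` and the sample times are hypothetical.

```python
WINDOW_SECONDS = 5 * 60  # trailing window of the recommended monitor


def in_trailing_window(event_ts, now, window=WINDOW_SECONDS):
    """True if an event stamped at event_ts falls inside [now - window, now]."""
    return now - window <= event_ts <= now


creation_time = 1_000_000.0  # when the alert was created
now = creation_time + 3600   # a check cycle one hour later

# Heartbeat stamped at creationTime: outside the window, never counted.
backdated_counts = in_trailing_window(creation_time, now)
# Heartbeat stamped at observation time: inside the window.
observed_counts = in_trailing_window(now, now)
```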

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(nutanix): drop fixed changelog fragments

Keep only the added fragments on this branch; the fixed entries are
removed at the maintainer's request.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators Jun 10, 2026
pull[bot] added the ⤵️ pull label Jun 10, 2026
pull[bot] merged commit b8dceee into ConnectionMaster:master Jun 10, 2026
1 check passed