feat(nutanix): add support for alerts tracking (DataDog#23538)
* feat(nutanix): alert lifecycle tracking with per-state metrics
Replace the cursor-based alert collection with a reconciliation loop
against the v4.0 unresolved-alerts API. Ship per-state lifecycle gauges
and a default monitor template:
- nutanix.alert.open — 1 while alert is unresolved + unacknowledged
- nutanix.alert.acknowledged — 1 while alert is unresolved + acknowledged
- nutanix.alert.resolved — 1 once when alert enters the resolved state
State transitions emit explicit zeros to the previous state's metric so
per-alert monitor cases recover cleanly when the alert leaves a state.
Each metric carries ext_id for monitor grouping; the metric name itself
encodes the state, so monitor queries don't need a tag filter.
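The transition rule above can be sketched roughly as follows. This is an illustrative sketch rather than the integration's code: the helper name gauges_for_transition and the tuple shape are hypothetical, and the tag is written as ext_id to match this commit's wording.

```python
# Hypothetical helper: which per-state metrics to emit on a state change.
STATE_METRICS = {
    "open": "nutanix.alert.open",
    "acknowledged": "nutanix.alert.acknowledged",
}


def gauges_for_transition(previous, current, ext_id):
    """Return (metric, value, tags) tuples for one alert's state change."""
    tags = [f"ext_id:{ext_id}"]
    out = []
    # Explicit zero on the state being left, so a per-alert monitor recovers.
    if previous in STATE_METRICS and previous != current:
        out.append((STATE_METRICS[previous], 0, tags))
    if current in STATE_METRICS:
        out.append((STATE_METRICS[current], 1, tags))
    elif current == "resolved":
        # One-shot signal when the alert enters the resolved state.
        out.append(("nutanix.alert.resolved", 1, tags))
    return out
```

Calling it with previous=None models a newly seen alert, which only yields the value for its current state.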
Lifecycle events (in addition to the metrics):
- "Alert: <title>" — created (or re-opened from resolved)
- "Alert acknowledged: <title>" — open -> acknowledged transition
- "Alert reopened: <title>" — acknowledged -> open transition
- "Alert Resolved: <title>" — resolution with resolvedTime / by /
auto_resolved metadata
Reconciliation is the source of truth each cycle: alerts in the API but
not in the in-memory cache are new (emit open event); alerts in the cache
but absent from the API are resolved or deleted (emit resolution event +
.open or .acknowledged = 0, .resolved = 1); alerts in both have their
cached metadata refreshed and ack-state transitions emit dedicated
events. Alert state is held in memory only and is never persisted: agent
restarts re-derive state from the API, and the aggregation_key collapses
any duplicate creation events that become visible after a restart.
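The reconciliation described above comes down to set arithmetic over alert IDs. A minimal sketch, assuming the cache and the API response have both been reduced to collections of extId strings (the function name partition_alerts is hypothetical):

```python
def partition_alerts(tracked_ids, api_ids):
    """Split alert IDs into (new, gone, still_tracked) sets.

    tracked_ids: IDs in the in-memory cache from the previous cycle.
    api_ids: IDs returned by the unresolved-alerts API this cycle.
    """
    tracked, current = set(tracked_ids), set(api_ids)
    new_ids = current - tracked      # in the API, not cached: emit open event
    gone_ids = tracked - current     # cached, not in the API: resolved or deleted
    still_ids = tracked & current    # in both: refresh metadata, check ack state
    return new_ids, gone_ids, still_ids
```

Computing all three sets before touching the cache keeps the later loops independent of mutation order.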
Hardening:
- on transient API failure, re-emit cached gauges before re-raising so
per-alert monitors don't auto-resolve while the alert is still open.
- pre-compute new/gone/still-tracked sets before mutating _open_alerts
so loop ordering is safe.
- v4.2 fallback removed; v4.0 endpoint with $filter=isResolved eq false
is the only path. The pre-existing client-side filter remains as a
safety net.
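The first hardening point (re-emit cached gauges, then re-raise) might look something like this. It is a sketch under assumptions: fetch_unresolved, emit_gauge, and the ext_id-to-state mapping are hypothetical stand-ins, not the check's real interfaces.

```python
def fetch_or_reemit(fetch_unresolved, open_alerts, emit_gauge):
    """Fetch unresolved alerts; on failure, re-emit cached gauges and re-raise.

    open_alerts maps ext_id -> "open" or "acknowledged" (hypothetical shape).
    """
    try:
        return fetch_unresolved()
    except Exception:
        # Keep per-alert series alive so monitors do not auto-resolve
        # while the API is unreachable.
        for ext_id, state in open_alerts.items():
            emit_gauge(f"nutanix.alert.{state}", 1, [f"ext_id:{ext_id}"])
        raise  # the failure still surfaces to the caller
```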
Tags added to alert events and metrics:
- ext_id, ntnx_alert_type, ntnx_alert_severity, ntnx_alert_status
(events only — redundant on metrics where the name encodes state)
- ntnx_originating_cluster_name, ntnx_alert_user_defined,
ntnx_alert_service (Tier 1 — distinguish federated cluster, custom
vs platform alerts, and Nutanix subsystem when present)
- ntnx_cluster_name, ntnx_alert_classification, ntnx_alert_impact,
ntnx_alert_auto_resolved (resolution events only), source-entity tags
Default monitor template at assets/monitors/alerts.json combines
nutanix.alert.open + nutanix.alert.acknowledged minus
nutanix.alert.resolved to alert on any unresolved alert (clamped to
non-negative). Auto-resolves on the resolved one-shot. Description
notes the agent-restart re-broadcast trade-off.
Test coverage: state transitions (open<->ack, ack->resolved from each
prior state), filter-add edge case (treated as spurious resolution),
deleted-alert (_get_alert returns None) graceful fallback, empty
unresolved list cold-start, and per-tag assertions for the new Tier 1
tags. The four "complete output" alertType tests are parametrized.
conftest mock has a _filter_after helper for the time-based fixture
branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): consolidate alert changelog entries
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): revert version bump and shorten changelog
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* working monitor
* feat(nutanix): heartbeat open alerts each cycle for event-based monitors
Re-emit one alert event per tracked alert per check cycle so event-count
monitors (last("Nm") > 0) stay firing while the alert is open. Transition
cycles skip the heartbeat — the dedicated transition event already lands
under the same aggregation_key. Resolved alerts are popped from
_open_alerts before the heartbeat loop, so they don't get a duplicate
heartbeat alongside their resolution event.
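A rough sketch of the heartbeat selection described above. The function name and the event dictionary are hypothetical; the aggregation key format follows the nutanix-alert-<extId> pattern documented in the README, and resolved alerts are assumed to have been removed from open_alerts already.

```python
def heartbeat_events(open_alerts, transitioned_ids):
    """Build one heartbeat event per tracked alert, skipping transitions.

    open_alerts maps ext_id -> alert title (hypothetical shape).
    transitioned_ids holds alerts that already emitted a transition event
    this cycle under the same aggregation key.
    """
    events = []
    for ext_id, title in open_alerts.items():
        if ext_id in transitioned_ids:
            continue  # the transition event already covers this cycle
        events.append({
            "msg_title": f"Alert: {title}",
            "aggregation_key": f"nutanix-alert-{ext_id}",
        })
    return events
```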
Ship the default monitor template at assets/monitors/alerts.json with a
real title and description.
Tests cover the new heartbeat skip-list (transitions, resolutions,
filter-exclusion), aggregation_key consistency across the full alert
lifecycle, and the cached-gauges-no-events contract on transient API
failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): revert version bump and simplify monitor threshold
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): address review feedback on alert lifecycle tracking
- Drop the leaky self.alerts cache; _get_alert is now a one-shot fetch.
- Warn loudly when the client-side isResolved safety net drops alerts.
- Note that lastUpdatedTime is the closest signal for ack->open transitions.
- Fix triple-space in monitor title, fix date format to YYYY-MM-DD.
- Document the alert lifecycle and agent-restart behavior in the README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(nutanix): address alert tracking review feedback
- Repoint NutanixCheck.alerts at _open_alerts so the public property
matches what the activity monitor actually carries.
- Restore a per-cycle dedup cache on _get_alert so _process_task does
not issue N x M GETs for tasks referencing the same alerts.
- Extract _reconcile_alerts into named helpers (new, resolved,
transitioned, heartbeat, cached-gauge fallback) so the coordinator
reads cleanly.
- Namespace ext_id as ntnx_alert_ext_id on metrics, events, the
monitor template, and metadata.csv to avoid colliding with other
sources in the global ext_id tag.
- Log a warning when the unresolved-alerts list call fails, and
another when a gone alert cannot be fetched back from Prism Central.
- Switch nutanix.alert.resolved from gauge to count so a resolved
alert reopening with the same extId does not leave a stuck
resolved=1 series.
- Monitor template: query threshold > 0, use is_recovery consistently.
- Update record_fixtures.py to use the production isResolved eq false
filter so re-recorded fixtures match what the integration queries.
- Document the alert lifecycle, agent restart behavior, and recommended
metric monitor patterns in the README.
- Add a test for the resolved to open lifecycle with the same extId.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(nutanix): quote metadata.csv description containing a comma
The nutanix.alert.resolved description contained an unquoted comma,
splitting the row into 12 columns at parse time and breaking metadata
validation.
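The failure mode can be reproduced with Python's csv module. The row below is illustrative (the description text is made up, not the real metadata.csv entry); it only shows that an unquoted comma turns an 11-field row into 12 columns, while the default csv.writer quoting keeps it at 11.

```python
import csv
import io

# Illustrative 11-field row; the description contains a comma.
fields = [
    "nutanix.alert.resolved", "count", "", "", "",
    "Incremented once, per resolution.",
    "0", "nutanix", "alert resolved", "", "ntnx_alert_ext_id",
]

# Naive join: the comma inside the description splits the field in two.
unquoted_row = ",".join(fields)
bad = next(csv.reader([unquoted_row]))

# csv.writer quotes fields containing the delimiter (QUOTE_MINIMAL default).
buf = io.StringIO()
csv.writer(buf).writerow(fields)
good = next(csv.reader(buf.getvalue().splitlines()))

print(len(bad), len(good))
```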
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(nutanix): correct alert reconciliation edge cases from review
Two distinct correctness fixes to the alert lifecycle tracking path:
- _get_alert no longer swallows every exception. A 404 still returns None
(the alert was deleted upstream), but transient HTTP failures (5xx,
network, timeout) now propagate, so they are no longer silently
misclassified as deletions that emit degraded "Resolved" events.
_emit_resolved_alerts catches the propagated HTTPError per alert, restores
tracking, and retries on the next cycle. The returned count now reflects
actual emissions.
- _reconcile_alerts now distinguishes alerts that truly left the
unresolved-alerts API (resolved/deleted upstream) from alerts that are
still open in Prism Central but no longer match the configured
resource_filters. The latter are dropped from tracking silently with an
info log; no resolution event or nutanix.alert.resolved increment is
emitted, since the alert is not resolved.
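The first fix can be sketched as below. This is not the integration's code: HTTPError here is a local stand-in for requests.exceptions.HTTPError so the example runs with the standard library only, and get_alert, emit_resolved, fetch, and emit are hypothetical names.

```python
from types import SimpleNamespace


class HTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError (carries .response)."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.response = SimpleNamespace(status_code=status_code)


def get_alert(fetch, ext_id):
    """Return the alert, None on 404, and re-raise any other HTTP error."""
    try:
        return fetch(ext_id)
    except HTTPError as exc:
        if exc.response is not None and exc.response.status_code == 404:
            return None  # deleted upstream
        raise  # transient failure: not a deletion


def emit_resolved(gone_ids, open_alerts, fetch, emit):
    """Emit resolution events for gone alerts; return the number emitted."""
    emitted = 0
    for ext_id in sorted(gone_ids):
        cached = open_alerts.pop(ext_id)
        try:
            alert = get_alert(fetch, ext_id)
        except HTTPError:
            open_alerts[ext_id] = cached  # restore tracking, retry next cycle
            continue
        # Fall back to cached metadata when the alert was deleted (404).
        emit(ext_id, alert if alert is not None else cached)
        emitted += 1
    return emitted
```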
Test updates:
- New test_get_alert_returns_none_on_404, test_get_alert_propagates_on_transient_http_error[500/502/503/504], and test_transient_alert_get_failure_preserves_tracking.
- Existing test_alert_filter_excludes_tracked_alert_emits_spurious_resolution rewritten as test_alert_filter_excludes_tracked_alert_drops_without_resolution (the old test pinned the prior bug; the new one asserts the correct behavior).
- New test_resolution_event_still_fires_when_alert_truly_leaves_unresolved_api guards against a regression in the gone_ids semantics.
- conftest 404 mocks switched from bare Exception to requests.exceptions.HTTPError(response=...) so the 404 branch is actually exercised.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Apply suggestions from code review
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
* fix(nutanix): shorten monitor description to under 300 chars
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(nutanix): address review feedback on alert tracking
- Guard the best-effort _get_alert call in _process_task with try/except
HTTPError so a transient failure no longer aborts the task collection
cycle, matching the guard already used in _emit_resolved_alerts.
- Report currently-tracked open alerts in the check summary log instead
of the per-cycle change count, which read 0 on quiet cycles. Only the
INFO summary is affected; events and metrics are unchanged.
- Drop the leading underscore from the SEVERITY_TO_ALERT_TYPE constant.
- Move the fixture_alert helper to conftest.py and inline the
complete-output parametrize cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nutanix): stamp open alert event at observation time
The open/heartbeat alert event was timestamped at the alert's
creationTime. Event monitors window on occurrence time, so back-dated
heartbeats never entered the recommended monitor's trailing 5m window:
the monitor fired once at creation, then auto-recovered ~5m later and
stayed OK regardless of the alert's real state in Prism Central.
Stamp the open/heartbeat event at observation time so it lands in the
monitor's rolling window. Recovery is now driven by heartbeats ceasing,
so the monitor recovers ~5m after the alert actually resolves.
Resolution and transition events keep their real timestamps; they are
not counted by the status:open query and only feed the timeline.
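The effect of the timestamp change can be illustrated with a small sketch. The helper names, the event dictionary, and the alert's title key are hypothetical; the 5-minute window mirrors the trailing window mentioned above.

```python
import time


def build_open_event(alert, now=None):
    """Build the open/heartbeat event stamped at observation time."""
    observed = int(now if now is not None else time.time())
    return {
        "timestamp": observed,  # observation time, not the alert's creationTime
        "msg_title": f"Alert: {alert['title']}",
        "aggregation_key": f"nutanix-alert-{alert['extId']}",
    }


def in_trailing_window(event_ts, now, window_s=300):
    """True if an event timestamp falls inside the trailing window."""
    return now - window_s <= event_ts <= now


now = 1_700_000_000
alert = {"extId": "abc", "title": "Disk full", "creationTime": now - 3600}

# Back-dated to creationTime, the heartbeat never enters the 5m window.
print(in_trailing_window(alert["creationTime"], now))
# Stamped at observation time, it does.
print(in_trailing_window(build_open_event(alert, now)["timestamp"], now))
```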
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(nutanix): drop fixed changelog fragments
Keep only the added fragments on this branch; the fixed entries are
removed at the maintainer's request.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
nutanix/README.md (12 additions, 0 deletions)
@@ -63,6 +63,18 @@ Use the `collect_events`, `collect_alerts`, `collect_tasks`, and `collect_audits
 **Note**: By default, only parent tasks are collected. Set `collect_subtasks: true` to include subtasks.

+**Alert lifecycle.** Alerts are reconciled against Prism Central's unresolved-alerts API on every check cycle. While an alert is open, a heartbeat event (`msg_title: Alert: ...`) is emitted each cycle so event-based monitors stay firing; the first occurrence acts as the creation event. Transition events are emitted when an alert is acknowledged or reopened, and a resolution event is emitted when the alert is resolved or deleted. All events for the same alert share `aggregation_key=nutanix-alert-<extId>`, which collapses them into a single entry in the Events Explorer.
+
+**Agent restart.** The integration is stateless across restarts. On startup it fetches all currently-unresolved alerts and re-emits a heartbeat event for each; `aggregation_key` collapses these duplicates with any prior events. State changes (acknowledgement, reopening) that happen during Agent downtime are not retroactively emitted as transition events. The next check cycle picks up the current state and proceeds normally.
+
+**Building metric-based monitors for alerts.** The state of an alert is captured by `nutanix.alert.open` and `nutanix.alert.acknowledged` (gauges). `nutanix.alert.resolved` is a `count` of resolution transitions, not a state. Recommended patterns:
+
+- Active alerts: `avg:nutanix.alert.open{*}.default_zero() > 0` by `ntnx_alert_ext_id`.
+- Active or acknowledged: `avg:nutanix.alert.open{*} + avg:nutanix.alert.acknowledged{*}` with `default_zero` and threshold `> 0`, grouped by `ntnx_alert_ext_id`.
+- Resolution rate: `sum:nutanix.alert.resolved{*}.as_count()` for dashboards or backlog monitors.
+
+Because `nutanix.alert.resolved` is a count, do not subtract it from the open or acknowledged gauges; an alert can transition from resolved back to open with the same `ntnx_alert_ext_id`, and `.open` alone is the correct state signal.
"title": "Nutanix alert is open in Prism Central",
+"description": "Tracks open Nutanix alerts from Prism Central. Fires when an alert is unresolved, auto-recovers on acknowledgement or resolution. Lifecycle events (created, acknowledged, reopened, resolved) are emitted to the Events Explorer under the same aggregation key.",
Track each Nutanix alert through its lifecycle (open, acknowledged, resolved) with dedicated metrics, transition events, and a default monitor template.
+nutanix.alert.acknowledged,gauge,,,,1 while a Nutanix alert is acknowledged but not yet resolved; 0 emitted once when leaving the acknowledged state. Tagged per-alert via ntnx_alert_ext_id.,0,nutanix,alert acknowledged,,ntnx_alert_ext_id
+nutanix.alert.open,gauge,,,,1 while a Nutanix alert is unresolved and unacknowledged; 0 emitted once when leaving the open state (acknowledged or resolved). Tagged per-alert via ntnx_alert_ext_id.,0,nutanix,alert open,,ntnx_alert_ext_id
+nutanix.alert.resolved,count,,,,"Incremented once each time a Nutanix alert is detected as resolved or deleted. Use for resolution-rate dashboards or backlog monitors; not a state metric, since alerts can transition from resolved back to open with the same ntnx_alert_ext_id. Use nutanix.alert.open for state.",0,nutanix,alert resolved,,ntnx_alert_ext_id
 nutanix.api.rate_limited,count,,,,Count of HTTP 429 rate limit responses from the Prism Central API.,0,nutanix,rate_limited,,
 nutanix.cluster.aggregate_hypervisor.memory_usage,gauge,,,,Total memory usage across all hypervisors in the cluster.,0,nutanix,usage,,
 nutanix.cluster.controller.avg_io_latency,gauge,,,,Average I/O latency of the cluster storage controller.,0,nutanix,latency,,