[pull] master from DataDog:master by pull[bot] · Pull Request #592 · ConnectionMaster/integrations-core

pull · 2026-06-10T08:26:24Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

* feat(nutanix): alert lifecycle tracking with per-state metrics

  Replace the cursor-based alert collection with a reconciliation loop against the v4.0 unresolved-alerts API. Ship per-state lifecycle gauges and a default monitor template:

  - nutanix.alert.open: 1 while alert is unresolved + unacknowledged
  - nutanix.alert.acknowledged: 1 while alert is unresolved + acknowledged
  - nutanix.alert.resolved: 1 once when alert enters the resolved state

  State transitions emit explicit zeros to the previous state's metric so per-alert monitor cases recover cleanly when the alert leaves a state. Each metric carries ext_id for monitor grouping; the metric name itself encodes the state, so monitor queries don't need a tag filter.

  Lifecycle events (in addition to the metrics):

  - "Alert: <title>": created (or re-opened from resolved)
  - "Alert acknowledged: <title>": open -> acknowledged transition
  - "Alert reopened: <title>": acknowledged -> open transition
  - "Alert Resolved: <title>": resolution with resolvedTime / by / auto_resolved metadata

  Reconciliation is the source of truth each cycle: alerts in the API but not in the in-memory cache are new (emit open event); alerts in the cache but absent from the API are resolved or deleted (emit resolution event + .open or .acknowledged = 0, .resolved = 1); alerts in both have their cached metadata refreshed and ack-state transitions emit dedicated events.

  Stateless across check cycles in terms of persistence: agent restarts re-derive state from the API; the aggregation_key collapses any visible duplicate creation events on restart.

  Hardening:

  - on transient API failure, re-emit cached gauges before re-raising so per-alert monitors don't auto-resolve while the alert is still open.
  - pre-compute new/gone/still-tracked sets before mutating _open_alerts so loop ordering is safe.
  - v4.2 fallback removed; v4.0 endpoint with $filter=isResolved eq false is the only path. The pre-existing client-side filter remains as a safety net.
  Tags added to alert events and metrics:

  - ext_id, ntnx_alert_type, ntnx_alert_severity, ntnx_alert_status (events only; redundant on metrics where the name encodes state)
  - ntnx_originating_cluster_name, ntnx_alert_user_defined, ntnx_alert_service (Tier 1: distinguish federated cluster, custom vs platform alerts, and Nutanix subsystem when present)
  - ntnx_cluster_name, ntnx_alert_classification, ntnx_alert_impact, ntnx_alert_auto_resolved (resolution events only), source-entity tags

  Default monitor template at assets/monitors/alerts.json combines nutanix.alert.open + nutanix.alert.acknowledged minus nutanix.alert.resolved to alert on any unresolved alert (clamped to non-negative). Auto-resolves on the resolved one-shot. Description notes the agent-restart re-broadcast trade-off.

  Test coverage: state transitions (open<->ack, ack->resolved from each prior state), filter-add edge case (treated as spurious resolution), deleted-alert (_get_alert returns None) graceful fallback, empty unresolved list cold-start, and per-tag assertions for the new Tier 1 tags. The four "complete output" alertType tests are parametrized. conftest mock has a _filter_after helper for the time-based fixture branches.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): consolidate alert changelog entries

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): revert version bump and shorten changelog

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* working monitor

* feat(nutanix): heartbeat open alerts each cycle for event-based monitors

  Re-emit one alert event per tracked alert per check cycle so event-count monitors (last("Nm") > 0) stay firing while the alert is open. Transition cycles skip the heartbeat; the dedicated transition event already lands under the same aggregation_key.
  Resolved alerts are popped from _open_alerts before the heartbeat loop, so they don't get a duplicate heartbeat alongside their resolution event.

  Ship the default monitor template at assets/monitors/alerts.json with a real title and description.

  Tests cover the new heartbeat skip-list (transitions, resolutions, filter-exclusion), aggregation_key consistency across the full alert lifecycle, and the cached-gauges-no-events contract on transient API failures.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): revert version bump and simplify monitor threshold

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(nutanix): address review feedback on alert lifecycle tracking

  - Drop the leaky self.alerts cache; _get_alert is now a one-shot fetch.
  - Warn loudly when the client-side isResolved safety net drops alerts.
  - Note that lastUpdatedTime is the closest signal for ack->open transitions.
  - Fix triple-space in monitor title, fix date format to YYYY-MM-DD.
  - Document the alert lifecycle and agent-restart behavior in the README.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nutanix): address alert tracking review feedback

  - Repoint NutanixCheck.alerts at _open_alerts so the public property matches what the activity monitor actually carries.
  - Restore a per-cycle dedup cache on _get_alert so _process_task does not issue N x M GETs for tasks referencing the same alerts.
  - Extract _reconcile_alerts into named helpers (new, resolved, transitioned, heartbeat, cached-gauge fallback) so the coordinator reads cleanly.
  - Namespace ext_id as ntnx_alert_ext_id on metrics, events, the monitor template, and metadata.csv to avoid colliding with other sources in the global ext_id tag.
  - Log a warning when the unresolved-alerts list call fails, and another when a gone alert cannot be fetched back from Prism Central.
  - Switch nutanix.alert.resolved from gauge to count so a resolved alert reopening with the same extId does not leave a stuck resolved=1 series.
  - Monitor template: query threshold > 0, use is_recovery consistently.
  - Update record_fixtures.py to use the production isResolved eq false filter so re-recorded fixtures match what the integration queries.
  - Document the alert lifecycle, agent restart behavior, and recommended metric monitor patterns in the README.
  - Add a test for the resolved to open lifecycle with the same extId.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nutanix): quote metadata.csv description containing a comma

  The nutanix.alert.resolved description contained an unquoted comma, splitting the row into 12 columns at parse time and breaking metadata validation.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nutanix): correct alert reconciliation edge cases from review

  Two distinct correctness fixes to the alert lifecycle tracking path:

  - _get_alert no longer swallows every exception. A 404 still returns None (the alert was deleted upstream), but transient HTTP failures (5xx, network, timeout) propagate so they are not silently misclassified as deletions and emit degraded "Resolved" events. _emit_resolved_alerts catches the propagated HTTPError per-alert, restores tracking, and retries on the next cycle. Returned count now reflects actual emissions.
  - _reconcile_alerts now distinguishes alerts that truly left the unresolved-alerts API (resolved/deleted upstream) from alerts that are still open in Prism Central but no longer match the configured resource_filters. The latter are dropped from tracking silently with an info log; no resolution event or nutanix.alert.resolved increment is emitted, since the alert is not resolved.
  Test updates:

  - New test_get_alert_returns_none_on_404, test_get_alert_propagates_on_transient_http_error[500/502/503/504], and test_transient_alert_get_failure_preserves_tracking.
  - Existing test_alert_filter_excludes_tracked_alert_emits_spurious_resolution rewritten as test_alert_filter_excludes_tracked_alert_drops_without_resolution (pinned the prior bug; now asserts the correct behavior).
  - New test_resolution_event_still_fires_when_alert_truly_leaves_unresolved_api guards the gone_ids semantics regression.
  - conftest 404 mocks switched from bare Exception to requests.exceptions.HTTPError(response=...) so the 404 branch is actually exercised.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply suggestions from code review

  Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

* fix(nutanix): shorten monitor description to under 300 chars

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(nutanix): address review feedback on alert tracking

  - Guard the best-effort _get_alert call in _process_task with try/except HTTPError so a transient failure no longer aborts the task collection cycle, matching the guard already used in _emit_resolved_alerts.
  - Report currently-tracked open alerts in the check summary log instead of the per-cycle change count, which read 0 on quiet cycles. Only the INFO summary is affected; events and metrics are unchanged.
  - Drop the leading underscore from the SEVERITY_TO_ALERT_TYPE constant.
  - Move the fixture_alert helper to conftest.py and inline the complete-output parametrize cases.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(nutanix): stamp open alert event at observation time

  The open/heartbeat alert event was timestamped at the alert's creationTime.
  Event monitors window on occurrence time, so back-dated heartbeats never entered the recommended monitor's trailing 5m window: the monitor fired once at creation, then auto-recovered ~5m later and stayed OK regardless of the alert's real state in Prism Central.

  Stamp the open/heartbeat event at observation time so it lands in the monitor's rolling window. Recovery is now driven by heartbeats ceasing, so the monitor recovers ~5m after the alert actually resolves. Resolution and transition events keep their real timestamps; they are not counted by the status:open query and only feed the timeline.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(nutanix): drop fixed changelog fragments

  Keep only the added fragments on this branch; the fixed entries are removed at the maintainer's request.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
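
The reconciliation step described in the first commit (pre-compute the new/gone/still-tracked sets, then emit per-state values with explicit zeros on transitions) can be illustrated with a minimal sketch. This is not the integration's code: `AlertTracker`, `reconcile`, and `state_of` are hypothetical names, the input is simplified to a mapping of ext_id to an acknowledged flag, event titles use the ext_id in place of the alert title, and later refinements from the commit list (resolved as a count, resource_filters handling, heartbeats, HTTP error handling) are left out.

```python
from dataclasses import dataclass, field

PREFIX = "nutanix.alert."


def state_of(acknowledged):
    """Map the acknowledged flag to the lifecycle state name."""
    return "acknowledged" if acknowledged else "open"


@dataclass
class AlertTracker:
    """Hypothetical in-memory tracker: ext_id -> acknowledged flag."""

    open_alerts: dict = field(default_factory=dict)

    def reconcile(self, unresolved):
        """Compare the API's unresolved alerts against the cache.

        Returns (metrics, events), where metrics is a list of
        (metric_name, ext_id, value) tuples and events is a list of titles.
        """
        current, tracked = set(unresolved), set(self.open_alerts)
        # Pre-compute all three sets before mutating the cache.
        new_ids = current - tracked
        gone_ids = tracked - current
        kept_ids = current & tracked
        metrics, events = [], []

        for ext_id in sorted(new_ids):
            state = state_of(unresolved[ext_id])
            metrics.append((PREFIX + state, ext_id, 1))
            events.append(f"Alert: {ext_id}")
            self.open_alerts[ext_id] = unresolved[ext_id]

        for ext_id in sorted(gone_ids):
            previous = state_of(self.open_alerts.pop(ext_id))
            # Explicit zero on the state the alert is leaving.
            metrics.append((PREFIX + previous, ext_id, 0))
            metrics.append((PREFIX + "resolved", ext_id, 1))
            events.append(f"Alert Resolved: {ext_id}")

        for ext_id in sorted(kept_ids):
            previous = state_of(self.open_alerts[ext_id])
            state = state_of(unresolved[ext_id])
            if state != previous:
                metrics.append((PREFIX + previous, ext_id, 0))
                verb = "acknowledged" if state == "acknowledged" else "reopened"
                events.append(f"Alert {verb}: {ext_id}")
            metrics.append((PREFIX + state, ext_id, 1))
            self.open_alerts[ext_id] = unresolved[ext_id]

        return metrics, events
```

Calling `reconcile` once per check cycle with the current unresolved list yields the samples and event titles for that cycle. Because the three sets are computed up front, the order of the loops does not affect which alerts are classified as new, gone, or still tracked.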

pull Bot locked and limited conversation to collaborators Jun 10, 2026

pull Bot added the ⤵️ pull label Jun 10, 2026

pull Bot merged commit b8dceee into ConnectionMaster:master Jun 10, 2026
1 check passed

pull[bot] merged 1 commit into ConnectionMaster:master from DataDog:master

1 participant