Merged
12 changes: 12 additions & 0 deletions nutanix/README.md
@@ -63,6 +63,18 @@ Use the `collect_events`, `collect_alerts`, `collect_tasks`, and `collect_audits

**Note**: By default, only parent tasks are collected. Set `collect_subtasks: true` to include subtasks.

**Alert lifecycle.** Alerts are reconciled against Prism Central's unresolved-alerts API on every check cycle. While an alert is open, a heartbeat event (`msg_title: Alert: ...`) is emitted each cycle so that event-based monitors keep firing; the first occurrence acts as the creation event. Transition events are emitted when an alert is acknowledged or reopened, and a resolution event is emitted when the alert is resolved or deleted. All events for the same alert share `aggregation_key=nutanix-alert-<extId>`, which collapses them into a single entry in the Events Explorer.

**Agent restart.** The integration is stateless across restarts. On startup it fetches all currently-unresolved alerts and re-emits a heartbeat event for each; `aggregation_key` collapses these duplicates with any prior events. State changes (acknowledgement, reopening) that happen during Agent downtime are not retroactively emitted as transition events. The next check cycle picks up the current state and proceeds normally.
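
The lifecycle and restart behavior described above can be sketched as a pure reconciliation step. This is an illustrative sketch only, not the code in `activity_monitor.py` (that diff is not rendered on this page). The names `reconcile` and `aggregation_key`, the state strings, and the event labels are hypothetical, and the sketch assumes heartbeats are emitted only for alerts currently in the open state.

```python
OPEN = "open"
ACKNOWLEDGED = "acknowledged"


def aggregation_key(ext_id: str) -> str:
    # Key format taken from the README text: nutanix-alert-<extId>.
    return f"nutanix-alert-{ext_id}"


def reconcile(previous: dict[str, str], current: dict[str, str]) -> list[tuple[str, str]]:
    """Compare last cycle's alert states with the current unresolved alerts.

    Both arguments map an alert extId to OPEN or ACKNOWLEDGED (hypothetical
    representation). Returns (aggregation_key, event_kind) pairs.
    """
    events = []
    for ext_id in sorted(current):
        state, prior = current[ext_id], previous.get(ext_id)
        # A transition event needs a known prior state that differs from now.
        if prior is not None and prior != state:
            kind = "acknowledged" if state == ACKNOWLEDGED else "reopened"
            events.append((aggregation_key(ext_id), kind))
        # Heartbeat each cycle while open; the first one doubles as "created".
        if state == OPEN:
            events.append((aggregation_key(ext_id), "heartbeat"))
    # Alerts that vanished from the unresolved list were resolved or deleted.
    for ext_id in sorted(previous.keys() - current.keys()):
        events.append((aggregation_key(ext_id), "resolved"))
    return events
```

Calling `reconcile({}, current)` models an Agent restart under these assumptions: with no remembered state, only heartbeats are produced and no transition events can be inferred.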

**Building metric-based monitors for alerts.** The state of an alert is captured by `nutanix.alert.open` and `nutanix.alert.acknowledged` (gauges). `nutanix.alert.resolved` is a `count` of resolution transitions, not a state. Recommended patterns:

- Active alerts: `avg:nutanix.alert.open{*}.default_zero() > 0` by `ntnx_alert_ext_id`.
- Active or acknowledged: `avg:nutanix.alert.open{*} + avg:nutanix.alert.acknowledged{*}` with `default_zero` and threshold `> 0`, grouped by `ntnx_alert_ext_id`.
- Resolution rate: `sum:nutanix.alert.resolved{*}.as_count()` for dashboards or backlog monitors.

Because `nutanix.alert.resolved` is a count, do not subtract it from the open or acknowledged gauges; an alert can transition from resolved back to open with the same `ntnx_alert_ext_id`, and `.open` alone is the correct state signal.
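
The gauge and count semantics above (and in the `metadata.csv` descriptions) can be illustrated with a small model. This is a sketch under stated assumptions rather than the integration's implementation: `alert_samples` is a hypothetical helper, and alert state is represented as a plain dict from `ntnx_alert_ext_id` to `"open"` or `"acknowledged"`.

```python
def alert_samples(previous: dict[str, str], current: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (metric_name, value, ntnx_alert_ext_id) samples for one check cycle."""
    samples = []
    for ext_id in sorted(previous.keys() | current.keys()):
        before, now = previous.get(ext_id), current.get(ext_id)
        for state in ("open", "acknowledged"):
            metric = f"nutanix.alert.{state}"
            if now == state:
                # Gauge is 1 for as long as the alert is in this state.
                samples.append((metric, 1, ext_id))
            elif before == state:
                # A single 0 is emitted on the cycle the alert leaves the state.
                samples.append((metric, 0, ext_id))
        if before is not None and now is None:
            # The count increments once per resolution (or deletion).
            samples.append(("nutanix.alert.resolved", 1, ext_id))
    return samples
```

Because an alert that is resolved and later reopened adds to the `resolved` count without ever decrementing it, subtracting that count from the gauges would drift over time, which is why the gauges alone carry state.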

### Service Checks

The integration does not emit any service checks.
35 changes: 35 additions & 0 deletions nutanix/assets/monitors/alerts.json
@@ -0,0 +1,35 @@
{
"version": 2,
"created_at": "2026-05-14",
"last_updated_at": "2026-05-14",
"title": "Nutanix alert is open in Prism Central",
"description": "Tracks open Nutanix alerts from Prism Central. Fires when an alert is unresolved, auto-recovers on acknowledgement or resolution. Lifecycle events (created, acknowledged, reopened, resolved) are emitted to the Events Explorer under the same aggregation key.",
"definition": {
"id": 281752599,
"name": "{{#is_alert}}OPEN{{/is_alert}}{{#is_recovery}}RESOLVED{{/is_recovery}} [Nutanix {{ntnx_alert_severity.name}}] [{{ntnx_alert_impact.name}}] [{{ntnx_originating_cluster_name.name}}] - {{ntnx_alert_ext_id.name}}",
"type": "event-v2 alert",
"query": "events(\"source:nutanix ntnx_type:alert ntnx_alert_status:open\").rollup(\"cardinality\", \"@aggregation_key\").by(\"@aggregation_key,ntnx_alert_severity,ntnx_alert_impact,ntnx_originating_cluster_name\").last(\"5m\") > 0",
"message": "{{#is_alert}}A Nutanix alert has been raised or escalated.{{/is_alert}}\n {{#is_recovery}}The underlying Nutanix alert has recovered.{{/is_recovery}}\n\n **Alert:** `{{@aggregation_key.name}}`\n **Severity:** `{{ntnx_alert_severity.name}}`\n **Impact:** `{{ntnx_alert_impact.name}}`\n **Originating cluster:** `{{ntnx_originating_cluster_name.name}}`\n\n **Transition observed at:** {{last_triggered_at}}",
"tags": [],
"options": {
"thresholds": {
"critical": 0
},
"enable_logs_sample": false,
"notify_audit": false,
"on_missing_data": "default",
"include_tags": true,
"new_group_delay": 60,
"renotify_interval": 0,
"escalation_message": "",
"silenced": {}
},
"priority": null,
"restriction_policy": {
"bindings": []
}
},
"tags": [
"integration:nutanix"
]
}
1 change: 1 addition & 0 deletions nutanix/changelog.d/23538.added
@@ -0,0 +1 @@
Track each Nutanix alert through its lifecycle (open, acknowledged, resolved) with dedicated metrics, transition events, and a default monitor template.
409 changes: 291 additions & 118 deletions nutanix/datadog_checks/nutanix/activity_monitor.py

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion nutanix/datadog_checks/nutanix/check.py
@@ -72,7 +72,7 @@ def audits(self):

@property
def alerts(self):
return self.activity_monitor.alerts
return self.activity_monitor._open_alerts

@property
def tasks(self):
4 changes: 3 additions & 1 deletion nutanix/manifest.json
@@ -62,7 +62,9 @@
"Nutanix - Overview": "assets/dashboards/nutanix_overview.json",
"Nutanix - Activity Monitoring": "assets/dashboards/nutanix_activity_monitoring.json"
},
"monitors": {},
"monitors": {
"Nutanix alert is open": "assets/monitors/alerts.json"
},
"saved_views": {}
},
"author": {
3 changes: 3 additions & 0 deletions nutanix/metadata.csv
@@ -1,4 +1,7 @@
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name,curated_metric,sample_tags
nutanix.alert.acknowledged,gauge,,,,1 while a Nutanix alert is acknowledged but not yet resolved; 0 emitted once when leaving the acknowledged state. Tagged per-alert via ntnx_alert_ext_id.,0,nutanix,alert acknowledged,,ntnx_alert_ext_id
nutanix.alert.open,gauge,,,,1 while a Nutanix alert is unresolved and unacknowledged; 0 emitted once when leaving the open state (acknowledged or resolved). Tagged per-alert via ntnx_alert_ext_id.,0,nutanix,alert open,,ntnx_alert_ext_id
nutanix.alert.resolved,count,,,,"Incremented once each time a Nutanix alert is detected as resolved or deleted. Use for resolution-rate dashboards or backlog monitors; not a state metric, since alerts can transition from resolved back to open with the same ntnx_alert_ext_id. Use nutanix.alert.open for state.",0,nutanix,alert resolved,,ntnx_alert_ext_id
nutanix.api.rate_limited,count,,,,Count of HTTP 429 rate limit responses from the Prism Central API.,0,nutanix,rate_limited,,
nutanix.cluster.aggregate_hypervisor.memory_usage,gauge,,,,Total memory usage across all hypervisors in the cluster.,0,nutanix,usage,,
nutanix.cluster.controller.avg_io_latency,gauge,,,,Average I/O latency of the cluster storage controller.,0,nutanix,latency,,
113 changes: 36 additions & 77 deletions nutanix/tests/conftest.py
@@ -5,12 +5,24 @@

import json
import os
from datetime import datetime

import pytest
from requests.exceptions import HTTPError

from datadog_checks.dev import docker_run, get_docker_hostname, get_here
from datadog_checks.dev.conditions import CheckEndpoints


def _filter_after(records, field, filter_param):
"""Filter & sort records whose `field` ISO-8601 timestamp is after the value in `<field> gt …`."""
threshold = datetime.fromisoformat(filter_param.split(f"{field} gt ")[-1].strip().replace("Z", "+00:00"))
return sorted(
(r for r in records if r.get(field) and datetime.fromisoformat(r[field].replace("Z", "+00:00")) > threshold),
key=lambda r: datetime.fromisoformat(r[field].replace("Z", "+00:00")),
)


HERE = get_here()
HOST = get_docker_hostname()
DOCKER_DIR = os.path.join(HERE, 'docker')
@@ -43,6 +55,15 @@ def load_fixture_page(filename, page):
return {"data": [], "metadata": {"totalAvailableResults": 0}}


def fixture_alert(alert_type, **overrides):
"""Load the first fixture alert with the given alertType and apply overrides."""
for page in load_fixture('alerts.json'):
for alert in page.get('data', []):
if alert.get('alertType') == alert_type:
return {**alert, **overrides}
raise ValueError(f"No alert with alertType={alert_type} in fixture")


# Test instance configurations
INSTANCE = {
"pc_ip": "10.0.0.197",
@@ -232,25 +253,8 @@ def mock_response(url, params=None, *args, **kwargs):

filter_param = params.get('$filter', '') if params else ''
if 'creationTime gt' in filter_param:
from datetime import datetime

filter_time_str = filter_param.split('creationTime gt ')[-1].strip()
filter_time = datetime.fromisoformat(filter_time_str.replace('Z', '+00:00'))

filtered_data = []
for event in response_data.get('data', []):
event_time_str = event.get('creationTime', '')
if event_time_str:
event_time = datetime.fromisoformat(event_time_str.replace('Z', '+00:00'))
if event_time > filter_time:
filtered_data.append(event)

filtered_data.sort(
key=lambda t: datetime.fromisoformat(t.get('creationTime', '').replace('Z', '+00:00'))
)

response_data = dict(response_data)
response_data['data'] = filtered_data
response_data['data'] = _filter_after(response_data.get('data', []), 'creationTime', filter_param)

mock_resp.json = mocker.Mock(return_value=response_data)
return mock_resp
@@ -260,33 +264,16 @@ def mock_response(url, params=None, *args, **kwargs):

filter_param = params.get('$filter', '') if params else ''
if 'creationTime gt' in filter_param:
from datetime import datetime

filter_time_str = filter_param.split('creationTime gt ')[-1].strip()
filter_time = datetime.fromisoformat(filter_time_str.replace('Z', '+00:00'))

filtered_data = []
for audit in response_data.get('data', []):
audit_time_str = audit.get('creationTime', '')
if audit_time_str:
audit_time = datetime.fromisoformat(audit_time_str.replace('Z', '+00:00'))
if audit_time > filter_time:
filtered_data.append(audit)

filtered_data.sort(
key=lambda t: datetime.fromisoformat(t.get('creationTime', '').replace('Z', '+00:00'))
)

response_data = dict(response_data)
response_data['data'] = filtered_data
response_data['data'] = _filter_after(response_data.get('data', []), 'creationTime', filter_param)

mock_resp.json = mocker.Mock(return_value=response_data)
return mock_resp

# Individual alert fetch by ID (e.g. /alerts/{uuid})
import re

alert_id_match = re.search(r'api/monitoring/v4\.\d/serviceability/alerts/([0-9a-f-]{36})', url)
alert_id_match = re.search(r'api/monitoring/v4\.0/serviceability/alerts/([0-9a-f-]{36})', url)
if alert_id_match:
alert_ext_id = alert_id_match.group(1)
all_alerts = load_fixture_page("alerts.json", 0).get('data', [])
@@ -295,33 +282,22 @@ def mock_response(url, params=None, *args, **kwargs):
mock_resp.json = mocker.Mock(return_value={"data": alert_data})
else:
mock_resp.status_code = 404
mock_resp.raise_for_status = mocker.Mock(side_effect=Exception("404 Not Found"))
mock_resp.raise_for_status = mocker.Mock(side_effect=HTTPError(response=mock_resp))
return mock_resp

if 'api/monitoring/v4.0/serviceability/alerts' in url or 'api/monitoring/v4.2/serviceability/alerts' in url:
if 'api/monitoring/v4.0/serviceability/alerts' in url:
response_data = load_fixture_page("alerts.json", page)

filter_param = params.get('$filter', '') if params else ''
if 'creationTime gt' in filter_param:
from datetime import datetime

filter_time_str = filter_param.split('creationTime gt ')[-1].strip()
filter_time = datetime.fromisoformat(filter_time_str.replace('Z', '+00:00'))

filtered_data = []
for alert in response_data.get('data', []):
alert_time_str = alert.get('creationTime', '')
if alert_time_str:
alert_time = datetime.fromisoformat(alert_time_str.replace('Z', '+00:00'))
if alert_time > filter_time:
filtered_data.append(alert)

filtered_data.sort(
key=lambda t: datetime.fromisoformat(t.get('creationTime', '').replace('Z', '+00:00'))
)

if 'isResolved eq false' in filter_param:
response_data = dict(response_data)
response_data['data'] = filtered_data
response_data['data'] = [a for a in response_data.get('data', []) if not a.get('isResolved')]
elif 'lastUpdatedTime gt' in filter_param:
response_data = dict(response_data)
response_data['data'] = _filter_after(response_data.get('data', []), 'lastUpdatedTime', filter_param)
elif 'creationTime gt' in filter_param:
response_data = dict(response_data)
response_data['data'] = _filter_after(response_data.get('data', []), 'creationTime', filter_param)

mock_resp.json = mocker.Mock(return_value=response_data)
return mock_resp
@@ -330,32 +306,15 @@ def mock_response(url, params=None, *args, **kwargs):

filter_param = params.get('$filter', '') if params else ''
if 'createdTime gt' in filter_param:
from datetime import datetime

filter_time_str = filter_param.split('createdTime gt ')[-1].strip()
filter_time = datetime.fromisoformat(filter_time_str.replace('Z', '+00:00'))

filtered_data = []
for task in response_data.get('data', []):
task_time_str = task.get('createdTime', '')
if task_time_str:
task_time = datetime.fromisoformat(task_time_str.replace('Z', '+00:00'))
if task_time > filter_time:
filtered_data.append(task)

filtered_data.sort(
key=lambda t: datetime.fromisoformat(t.get('createdTime', '').replace('Z', '+00:00'))
)

response_data = dict(response_data)
response_data['data'] = filtered_data
response_data['data'] = _filter_after(response_data.get('data', []), 'createdTime', filter_param)

mock_resp.json = mocker.Mock(return_value=response_data)
return mock_resp

print(f"[MOCK ERROR] No matching endpoint for URL: {url}")
mock_resp.status_code = 404
mock_resp.raise_for_status = mocker.Mock(side_effect=Exception("404 Not Found"))
mock_resp.raise_for_status = mocker.Mock(side_effect=HTTPError(response=mock_resp))
return mock_resp

return mocker.patch('requests.Session.get', side_effect=mock_response)
1 change: 0 additions & 1 deletion nutanix/tests/docker/README.md
@@ -100,7 +100,6 @@ The Flask server mocks the following Nutanix Prism Central v4 APIs:
- `GET /api/monitoring/v4.0/serviceability/events` - List events (paginated, time-filtered)
- `GET /api/monitoring/v4.0/serviceability/audits` - List audits (paginated, time-filtered)
- `GET /api/monitoring/v4.0/serviceability/alerts` - List alerts (paginated, time-filtered)
- `GET /api/monitoring/v4.2/serviceability/alerts` - List alerts v4.2 (paginated, time-filtered)
- `GET /api/prism/v4.0/config/tasks` - List tasks (paginated, time-filtered)

### Metadata APIs
1 change: 0 additions & 1 deletion nutanix/tests/docker/mock_server.py
@@ -181,7 +181,6 @@ def audits():


@app.route('/api/monitoring/v4.0/serviceability/alerts')
@app.route('/api/monitoring/v4.2/serviceability/alerts')
def alerts():
"""Alerts endpoint (paginated with time filtering)."""
page = int(request.args.get('$page', 0))
6 changes: 6 additions & 0 deletions nutanix/tests/metrics.py
@@ -9,6 +9,12 @@
# Host storage_* metric names — derived so test guards stay in sync with the production map.
HOST_STORAGE_METRICS: frozenset[str] = frozenset(f"nutanix.{HOST_STATS_METRICS[k]}" for k in HOST_STORAGE_STAT_KEYS)

ALERT_METRICS_OPTIONAL = [
"nutanix.alert.open",
"nutanix.alert.acknowledged",
"nutanix.alert.resolved",
]

CLUSTER_STATS_METRICS_REQUIRED = [
"nutanix.cluster.aggregate_hypervisor.memory_usage",
"nutanix.cluster.controller.avg_io_latency",
27 changes: 6 additions & 21 deletions nutanix/tests/scripts/record_fixtures.py
@@ -395,34 +395,19 @@ def record_audits() -> None:


def record_alerts() -> None:
"""Record alerts fixture."""
# Get alerts from last 24 hours
now = datetime.now(timezone.utc)
start_time = now - timedelta(hours=24)
start_time_str = start_time.isoformat().replace("+00:00", "Z")

print(f"\nRecording alerts (from {start_time_str})")
"""Record alerts fixture (currently-unresolved snapshot, matching production query)."""
print("\nRecording unresolved alerts")

params = {
"$filter": f"creationTime gt {start_time_str}",
"$orderBy": "creationTime asc",
"$filter": "isResolved eq false",
"$orderBy": "lastUpdatedTime asc",
}

# Try v4.2 first, fallback to v4.0
try:
print(" Trying alerts API v4.2...")
pages = fetch_paginated_endpoint("api/monitoring/v4.2/serviceability/alerts", params=params)
pages = fetch_paginated_endpoint("api/monitoring/v4.0/serviceability/alerts", params=params)
save_fixture("alerts.json", pages)
except requests.exceptions.HTTPError as e:
print(f" ⚠ v4.2 failed: {e}")
try:
print(" Falling back to alerts API v4.0...")
# v4.0 doesn't support filters
params_v40 = {}
pages = fetch_paginated_endpoint("api/monitoring/v4.0/serviceability/alerts", params=params_v40)
save_fixture("alerts.json", pages)
except requests.exceptions.HTTPError as e2:
print(f" ⚠ v4.0 also failed: {e2}")
print(f" ⚠ alerts fetch failed: {e}")


def record_tasks() -> None: