Commit 61fcf15

NouemanKHAL, claude, and rtrieu authored
Add option to exclude filtered resources from cluster capacity (DataDog#22997)
* fix cluster capacity counting for hostless/filtered vms
* changelog
* fix multi cluster case, improve testing fixtures
* improve test
* address review: use ClusterCapacity for hostless VMs, extract helper, fix test style
  - Replace dict[str, tuple[int, int]] with dict[str, ClusterCapacity] for _hostless_vm_capacity_by_cluster to align with existing pattern
  - Extract _extract_vm_capacity helper to eliminate duplication between _accumulate_vm_capacity and _report_vm_capacity_metrics
  - Add debug log when a hostless VM has no cluster ID
  - Add comment documenting non-batch mode limitation for hostless VMs
  - Convert TestMultiClusterHostlessVMCapacity class to plain test functions
  - Fix misleading comment: OFF VM has a host, it is not a hostless VM
  - Import HOST_NAME from constants.py instead of redefining locally
  - Parametrize test_exclude_vm_by_id for both batch modes
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* address review: cleanup code, separation of concerns, and flat unit tests
* change health_check failure log message to error instead of warning
* more refactoring and code cleanup
* fix unused HOST_NAME import in test_resource_filters
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* add exclude_filtered_resources_from_cluster_capacity property + regenerate models
* separate cluster capacity computation from metrics collection + add exclude_filtered_resources_from_cluster_capacity + cleanup + added tests
* address review + more cleanup + improve tests
* Update nutanix/README.md
  Co-authored-by: Rosa Trieu <107086888+rtrieu@users.noreply.github.com>
* Address review feedback: reset state before health check, consolidate VM capacity extraction
  - Move reset_state() before health check to guarantee clean state on every run
  - Consolidate _extract_vm_capacity to return all fields, eliminating duplicate parsing in _report_vm_capacity_metrics
  - Fix trailing whitespace and extra blank lines
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Clarify VM collection mode decoupling in docstrings
  Make it explicit that _vms_by_host is populated by either batch or non-batch mode, and that hostless VMs (the "" key) are only present in batch mode.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Include hostless VMs in cluster capacity for all collection modes
  Add _get_hostless_vms to lazily fetch and cache hostless VMs (the "" key) in non-batch mode. Split _report_cluster_capacity_metrics into two clear loops: hosted VMs (scoped by host_ids) and hostless VMs (scoped by cluster_id). Parameterize hostless VM tests on batch_vm_collection.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Revert _get_hostless_vms: hostless VMs only available in batch mode
  Fetching all VMs to extract hostless ones defeats the purpose of non-batch mode. Hostless VMs are only captured via _build_vms_by_host_cache (batch mode), which is the expected trade-off.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Rosa Trieu <107086888+rtrieu@users.noreply.github.com>
1 parent dba8652 commit 61fcf15
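
The commit message describes grouping VMs by host, keeping hostless VMs under the `""` key, and summing cluster capacity in two loops: hosted VMs scoped by host IDs, then hostless VMs scoped by cluster ID. The following is a minimal sketch of that idea, not the code from this commit. The `ClusterCapacity` fields, the VM dictionary keys (`host_id`, `cluster_id`, `vcpus`, `memory_bytes`), and both function names are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClusterCapacity:
    # Hypothetical fields; the real ClusterCapacity class is not shown in this diff.
    vcpus_allocated: int = 0
    memory_allocated_bytes: int = 0


def build_vms_by_host(vms):
    """Group VMs by host ID; VMs without a host land under the "" key."""
    vms_by_host = defaultdict(list)
    for vm in vms:
        vms_by_host[vm.get("host_id") or ""].append(vm)
    return vms_by_host


def cluster_vm_capacity(cluster_id, host_ids, vms_by_host):
    """Sum VM allocations for one cluster in two passes."""
    capacity = ClusterCapacity()
    # Pass 1: hosted VMs, scoped by the hosts that belong to the cluster.
    for host_id in host_ids:
        for vm in vms_by_host.get(host_id, []):
            capacity.vcpus_allocated += vm["vcpus"]
            capacity.memory_allocated_bytes += vm["memory_bytes"]
    # Pass 2: hostless VMs, scoped by their own cluster ID.
    for vm in vms_by_host.get("", []):
        if vm.get("cluster_id") == cluster_id:
            capacity.vcpus_allocated += vm["vcpus"]
            capacity.memory_allocated_bytes += vm["memory_bytes"]
    return capacity


# Illustrative data only.
vms = [
    {"host_id": "h1", "cluster_id": "c1", "vcpus": 4, "memory_bytes": 8},
    {"host_id": None, "cluster_id": "c1", "vcpus": 2, "memory_bytes": 4},
    {"host_id": None, "cluster_id": "c2", "vcpus": 8, "memory_bytes": 16},
]
grouped = build_vms_by_host(vms)
c1 = cluster_vm_capacity("c1", ["h1"], grouped)
c2 = cluster_vm_capacity("c2", [], grouped)
```

Scoping the second pass by cluster ID keeps a hostless VM from one cluster out of another cluster's totals, which is the multi-cluster case the commit message mentions.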

22 files changed

Lines changed: 1066 additions & 613 deletions

nutanix/README.md

Lines changed: 4 additions & 0 deletions
@@ -101,6 +101,10 @@ resource_filters:

 Category tags use the Nutanix category key as the tag name (e.g., `Environment:Production`). Set `prefix_category_tags: true` to prefix them with `ntnx_` (e.g., `ntnx_Environment:Production`) to avoid collisions with existing Datadog tags.

+### Cluster capacity planning
+
+Cluster-level capacity metrics (such as `cluster.cpu.total_cores`, `cluster.cpu.vcpus_allocated`, `cluster.memory.allocated_bytes`) aggregate resources from all hosts and VMs. By default, all resources contribute regardless of `resource_filters`. This gives a complete view of provisioned capacity. Set `exclude_filtered_resources_from_cluster_capacity: true` to count only resources that pass filter checks.
+
 ### Duplicate hostnames

 The Nutanix API does not expose the real hostname of VMs. VM metrics use the VM name from Prism Central as the hostname. If the Datadog Agent is installed on a Nutanix VM, its auto-detected hostname may differ from the VM name, causing duplicate hosts in Datadog. To fix this, set `hostname` in `datadog.yaml` (or the `DD_HOSTNAME` environment variable) to match the VM name in Prism Central.
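
The README paragraph added above distinguishes two aggregation behaviors. As a rough illustration only, the sketch below sums host cores with and without the exclusion flag. The function name `cluster_total_cores`, the `passes_filters` callback, and the host dictionaries are hypothetical and do not come from the integration's code.

```python
def cluster_total_cores(hosts, passes_filters, exclude_filtered=False):
    """Sum host cores; optionally skip hosts rejected by the filter callback."""
    total = 0
    for host in hosts:
        # With the flag off (the default), filtered hosts still count.
        if exclude_filtered and not passes_filters(host):
            continue
        total += host["cores"]
    return total


def is_prod(host):
    # Stand-in for resource_filters: keep only hosts named "prod-*".
    return host["name"].startswith("prod-")


hosts = [
    {"name": "prod-1", "cores": 32},
    {"name": "dev-1", "cores": 16},
]
default_total = cluster_total_cores(hosts, is_prod)
filtered_total = cluster_total_cores(hosts, is_prod, exclude_filtered=True)
```

With the default, both hosts contribute to the total; with the flag enabled, only the host that passes the filter does.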

nutanix/assets/configuration/spec.yaml

Lines changed: 11 additions & 0 deletions
@@ -191,6 +191,17 @@ files:
       value:
         type: boolean
         example: false
+    - name: exclude_filtered_resources_from_cluster_capacity
+      description: |
+        Whether to exclude filtered resources (hosts and VMs excluded by power state or
+        resource_filters) from cluster-level capacity metrics.
+        When false (default), cluster capacity reflects total provisioned resources across
+        all hosts and VMs, regardless of whether individual metrics are reported.
+        When true, only resources that pass filter checks contribute to cluster capacity
+        metrics (total_cores, total_threads, total_bytes, vcpus_allocated, memory_allocated_bytes).
+      value:
+        type: boolean
+        example: false
     - name: batch_vm_collection
       description: |
         Whether to fetch all VMs in a single paginated API call instead of per-host.

nutanix/changelog.d/22997.added

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Add `exclude_filtered_resources_from_cluster_capacity` option to control whether filtered resources contribute to cluster capacity metrics.

nutanix/datadog_checks/nutanix/activity_monitor.py

Lines changed: 46 additions & 56 deletions
@@ -27,6 +27,7 @@ def __missing__(self, key):
 class ActivityMonitor:
     def __init__(self, check: NutanixCheck):
         self.check = check
+        self._pc_label = f"PC:{self.check.pc_ip}:{self.check.pc_port}"
         self.last_event_collection_time = self.check.read_persistent_cache("last_event_collection_time")
         self.last_task_collection_time = self.check.read_persistent_cache("last_task_collection_time")
         self.last_audit_collection_time = self.check.read_persistent_cache("last_audit_collection_time")
@@ -36,6 +37,11 @@ def __init__(self, check: NutanixCheck):
         self.audits: dict[str, dict] = {}
         self.alerts: dict[str, dict] = {}
         self.tasks: dict[str, dict] = {}
+        # Entity counters
+        self.events_count = 0
+        self.tasks_count = 0
+        self.audits_count = 0
+        self.alerts_count = 0
         # Read boolean flag from cache (stored as string)
         cached_value = self.check.read_persistent_cache("alerts_v42_supported")
         if cached_value == "True":
@@ -46,11 +52,15 @@ def __init__(self, check: NutanixCheck):
             self.alerts_v42_supported = None

     def reset_state(self) -> None:
-        """Reset in-memory caches for a new collection run."""
+        """Reset in-memory caches and counters for a new collection run."""
         self.events = {}
         self.audits = {}
         self.alerts = {}
         self.tasks = {}
+        self.events_count = 0
+        self.tasks_count = 0
+        self.audits_count = 0
+        self.alerts_count = 0

     def _collect(
         self,
@@ -68,24 +78,18 @@ def _collect(
         now = get_current_datetime()
         start_time = (now - timedelta(seconds=self.check.sampling_interval)).isoformat().replace("+00:00", "Z")

-        self.check.log.debug(
-            "[PC:%s:%s] Collecting %ss since: %s", self.check.pc_ip, self.check.pc_port, activity_kind, start_time
-        )
+        self.check.log.debug("[%s] Collecting %ss since: %s", self._pc_label, activity_kind, start_time)

         items = list_fn(start_time)
         if not items:
-            self.check.log.debug("[PC:%s:%s] No %ss found", self.check.pc_ip, self.check.pc_port, activity_kind)
+            self.check.log.debug("[%s] No %ss found", self._pc_label, activity_kind)
             return 0

-        self.check.log.debug(
-            "[PC:%s:%s] Fetched %d %ss from API", self.check.pc_ip, self.check.pc_port, len(items), activity_kind
-        )
+        self.check.log.debug("[%s] Fetched %d %ss from API", self._pc_label, len(items), activity_kind)

         items = self._filter_after_time(items, last_time, time_field)
         if not items:
-            self.check.log.debug(
-                "[PC:%s:%s] No new %ss after filtering", self.check.pc_ip, self.check.pc_port, activity_kind
-            )
+            self.check.log.debug("[%s] No new %ss after filtering", self._pc_label, activity_kind)
             return 0

         # Advance past all fetched items before applying resource filters
@@ -103,9 +107,8 @@ def _collect(
             cache[ext_id] = item

         self.check.log.debug(
-            "[PC:%s:%s] Processing %d %ss after filtering",
-            self.check.pc_ip,
-            self.check.pc_port,
+            "[%s] Processing %d %ss after filtering",
+            self._pc_label,
             len(items),
             activity_kind,
         )
@@ -117,9 +120,8 @@ def _collect(
             setattr(self, cache_key, most_recent_time_str)
             self.check.write_persistent_cache(cache_key, most_recent_time_str)
             self.check.log.debug(
-                "[PC:%s:%s] Updated %s to: %s",
-                self.check.pc_ip,
-                self.check.pc_port,
+                "[%s] Updated %s to: %s",
+                self._pc_label,
                 cache_key,
                 most_recent_time_str,
             )
@@ -132,24 +134,22 @@ def _safe_collect(self, activity_kind: str, collect_fn: Callable[[], int]) -> in
             return collect_fn()
         except HTTPError as e:
             self.check.log.error(
-                "[PC:%s:%s] Failed to collect %ss: HTTP %s",
-                self.check.pc_ip,
-                self.check.pc_port,
+                "[%s] Failed to collect %ss: HTTP %s",
+                self._pc_label,
                 activity_kind,
                 e.response.status_code if e.response else "error",
             )
             return 0
         except Exception:
             self.check.log.exception(
-                "[PC:%s:%s] Unexpected error collecting %ss",
-                self.check.pc_ip,
-                self.check.pc_port,
+                "[%s] Unexpected error collecting %ss",
+                self._pc_label,
                 activity_kind,
             )
             return 0

-    def collect_events(self) -> int:
-        return self._safe_collect(
+    def collect_events(self) -> None:
+        self.events_count = self._safe_collect(
             "event",
             lambda: self._collect(
                 activity_kind="event",
@@ -160,13 +160,13 @@ def collect_events(self) -> int:
             ),
         )

-    def collect_tasks(self) -> int:
+    def collect_tasks(self) -> None:
         def _filter_subtasks(tasks: list[dict]) -> list[dict]:
             if not self.check.collect_subtasks_enabled:
                 return [t for t in tasks if not t.get("parentTask")]
             return tasks

-        return self._safe_collect(
+        self.tasks_count = self._safe_collect(
             "task",
             lambda: self._collect(
                 activity_kind="task",
@@ -178,8 +178,8 @@ def _filter_subtasks(tasks: list[dict]) -> list[dict]:
             ),
         )

-    def collect_audits(self) -> int:
-        return self._safe_collect(
+    def collect_audits(self) -> None:
+        self.audits_count = self._safe_collect(
             "audit",
             lambda: self._collect(
                 activity_kind="audit",
@@ -190,8 +190,8 @@ def collect_audits(self) -> int:
             ),
         )

-    def collect_alerts(self) -> int:
-        return self._safe_collect(
+    def collect_alerts(self) -> None:
+        self.alerts_count = self._safe_collect(
             "alert",
             lambda: self._collect(
                 activity_kind="alert",
@@ -218,30 +218,26 @@ def _list_alerts(self, start_time_str: str) -> list[dict]:
         }

         if self.alerts_v42_supported is False:
-            self.check.log.debug(
-                "[PC:%s:%s] Using alerts API v4.0 (v4.2 not supported)", self.check.pc_ip, self.check.pc_port
-            )
+            self.check.log.debug("[%s] Using alerts API v4.0 (v4.2 not supported)", self._pc_label)
             del params["$filter"]
             return self.check._get_paginated_request_data("api/monitoring/v4.0/serviceability/alerts", params=params)

         try:
-            self.check.log.debug("[PC:%s:%s] Attempting to use alerts API v4.2", self.check.pc_ip, self.check.pc_port)
+            self.check.log.debug("[%s] Attempting to use alerts API v4.2", self._pc_label)
             result = self.check._get_paginated_request_data("api/monitoring/v4.2/serviceability/alerts", params=params)
             if self.alerts_v42_supported is None:
                 self.check.log.debug(
-                    "[PC:%s:%s] Alerts API v4.2 is supported, caching for future use",
-                    self.check.pc_ip,
-                    self.check.pc_port,
+                    "[%s] Alerts API v4.2 is supported, caching for future use",
+                    self._pc_label,
                 )
                 self.alerts_v42_supported = True
                 self.check.write_persistent_cache("alerts_v42_supported", "True")
             return result
         except HTTPError as e:
             if e.response is not None and e.response.status_code == 404:
                 self.check.log.debug(
-                    "[PC:%s:%s] Alerts API v4.2 not supported, falling back to v4.0 permanently",
-                    self.check.pc_ip,
-                    self.check.pc_port,
+                    "[%s] Alerts API v4.2 not supported, falling back to v4.0 permanently",
+                    self._pc_label,
                 )
                 self.alerts_v42_supported = False
                 self.check.write_persistent_cache("alerts_v42_supported", "False")
@@ -261,28 +257,25 @@ def _get_alert(self, alert_ext_id: str) -> dict | None:
             endpoint = "api/monitoring/v4.2/serviceability/alerts"

         self.check.log.debug(
-            "[PC:%s:%s] Alert %s not in cache, fetching from API",
-            self.check.pc_ip,
-            self.check.pc_port,
+            "[%s] Alert %s not in cache, fetching from API",
+            self._pc_label,
             alert_ext_id,
         )
         try:
             alert = self.check._get_request_data(f"{endpoint}/{alert_ext_id}")
             if alert:
                 self.alerts[alert_ext_id] = alert
                 self.check.log.debug(
-                    "[PC:%s:%s] Fetched alert %s: %s",
-                    self.check.pc_ip,
-                    self.check.pc_port,
+                    "[%s] Fetched alert %s: %s",
+                    self._pc_label,
                     alert_ext_id,
                     alert.get("title", ""),
                 )
             return alert
         except Exception as e:
             self.check.log.debug(
-                "[PC:%s:%s] Failed to fetch alert %s: %s",
-                self.check.pc_ip,
-                self.check.pc_port,
+                "[%s] Failed to fetch alert %s: %s",
+                self._pc_label,
                 alert_ext_id,
                 e,
             )
@@ -341,9 +334,8 @@ def _process_audit(self, audit: dict) -> None:

         # Log audit submission for duplicate debugging
         self.check.log.debug(
-            "[PC:%s:%s]%s Submitting audit - ID: %s, CreationTime: %s",
-            self.check.pc_ip,
-            self.check.pc_port,
+            "[%s]%s Submitting audit - ID: %s, CreationTime: %s",
+            self._pc_label,
             cluster_label,
             audit_id,
             audit.get("creationTime", "unknown"),
@@ -547,9 +539,7 @@ def _parse_iso(self, timestamp_str: str) -> datetime | None:
         try:
             return datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
         except (ValueError, AttributeError):
-            self.check.log.warning(
-                "[PC:%s:%s] Failed to parse timestamp: %s", self.check.pc_ip, self.check.pc_port, timestamp_str
-            )
+            self.check.log.warning("[%s] Failed to parse timestamp: %s", self._pc_label, timestamp_str)
             return None

     def _parse_timestamp(self, timestamp_str: str) -> int | None:
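
The `activity_monitor.py` changes move the per-run counts from return values onto the monitor itself: each `collect_*` method stores its count, `reset_state()` zeroes the counters, and `_safe_collect` yields 0 when collection fails. The sketch below reduces that pattern to a standalone class. It is a simplified illustration, not the integration's code; the class name `ActivityCounters` and the `fetch` callables are hypothetical.

```python
import logging

log = logging.getLogger("activity_sketch")


class ActivityCounters:
    """Keeps per-run counts on the instance instead of returning them."""

    def __init__(self):
        self.events_count = 0
        self.tasks_count = 0

    def reset_state(self):
        # Called at the start of every run so stale counts never leak through.
        self.events_count = 0
        self.tasks_count = 0

    def _safe_collect(self, kind, collect_fn):
        # Any failure is logged and counted as zero collected items.
        try:
            return collect_fn()
        except Exception:
            log.exception("Unexpected error collecting %ss", kind)
            return 0

    def collect_events(self, fetch):
        self.events_count = self._safe_collect("event", fetch)

    def collect_tasks(self, fetch):
        self.tasks_count = self._safe_collect("task", fetch)


def failing_fetch():
    raise RuntimeError("simulated API failure")


monitor = ActivityCounters()
monitor.collect_events(lambda: 3)
monitor.collect_tasks(failing_fetch)
```

A caller that skips a collector, for example because it is disabled in the configuration, still reads a well-defined 0 from the counter, so it no longer needs its own fallback value.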

nutanix/datadog_checks/nutanix/check.py

Lines changed: 19 additions & 13 deletions
@@ -60,6 +60,10 @@ def _parse_config(self):

         self.batch_vm_collection = is_affirmative(self.instance.get("batch_vm_collection", True))

+        self.exclude_filtered_resources_from_cluster_capacity = is_affirmative(
+            self.instance.get("exclude_filtered_resources_from_cluster_capacity", False)
+        )
+
         self.prefix_category_tags = is_affirmative(self.instance.get("prefix_category_tags", False))

         self.resource_filters = parse_resource_filters(self.instance.get("resource_filters") or [], self.log)
@@ -144,9 +148,8 @@ def check(self, _):
         self.activity_monitor.reset_state()

         if not self._check_health():
-            self.log.warning("[PC:%s:%s] Health check failed, aborting", self.pc_ip, self.pc_port)
+            self.log.error("[PC:%s:%s] Health check failed, aborting", self.pc_ip, self.pc_port)
             return
-
         self.infrastructure_monitor.init_collection_time_window()
         start_time, end_time = self.infrastructure_monitor.collection_time_window
         window_seconds = (datetime.fromisoformat(end_time) - datetime.fromisoformat(start_time)).total_seconds()
@@ -162,7 +165,7 @@ def check(self, _):

         self.infrastructure_monitor.collect_cluster_metrics()

-        events_count, tasks_count, audits_count, alerts_count = self._collect_activity()
+        self._collect_activity()

         if self.infrastructure_monitor.external_tags:
             self.set_external_tags(self.infrastructure_monitor.external_tags)
@@ -173,19 +176,22 @@ def check(self, _):
             self.infrastructure_monitor.cluster_count,
             self.infrastructure_monitor.host_count,
             self.infrastructure_monitor.vm_count,
-            events_count,
-            tasks_count,
-            audits_count,
-            alerts_count,
+            self.activity_monitor.events_count,
+            self.activity_monitor.tasks_count,
+            self.activity_monitor.audits_count,
+            self.activity_monitor.alerts_count,
         )

-    def _collect_activity(self) -> tuple[int, int, int, int]:
+    def _collect_activity(self) -> None:
         """Collect events, tasks, audits, and alerts if enabled."""
-        events_count = self.activity_monitor.collect_events() if self.collect_events_enabled else 0
-        alerts_count = self.activity_monitor.collect_alerts() if self.collect_alerts_enabled else 0
-        tasks_count = self.activity_monitor.collect_tasks() if self.collect_tasks_enabled else 0
-        audits_count = self.activity_monitor.collect_audits() if self.collect_audits_enabled else 0
-        return events_count, tasks_count, audits_count, alerts_count
+        if self.collect_events_enabled:
+            self.activity_monitor.collect_events()
+        if self.collect_alerts_enabled:
+            self.activity_monitor.collect_alerts()
+        if self.collect_tasks_enabled:
+            self.activity_monitor.collect_tasks()
+        if self.collect_audits_enabled:
+            self.activity_monitor.collect_audits()

     def _check_health(self):
         try:

nutanix/datadog_checks/nutanix/config_models/defaults.py

Lines changed: 4 additions & 0 deletions
@@ -52,6 +52,10 @@ def instance_enable_legacy_tags_normalization():
     return True


+def instance_exclude_filtered_resources_from_cluster_capacity():
+    return False
+
+
 def instance_kerberos_auth():
     return 'disabled'

nutanix/datadog_checks/nutanix/config_models/instance.py

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ class InstanceConfig(BaseModel):
     disable_generic_tags: Optional[bool] = None
     empty_default_hostname: Optional[bool] = None
     enable_legacy_tags_normalization: Optional[bool] = None
+    exclude_filtered_resources_from_cluster_capacity: Optional[bool] = None
     extra_headers: Optional[MappingProxyType[str, Any]] = None
     headers: Optional[MappingProxyType[str, Any]] = None
     kerberos_auth: Optional[Literal['required', 'optional', 'disabled']] = None

nutanix/datadog_checks/nutanix/data/conf.yaml.example

Lines changed: 10 additions & 0 deletions
@@ -167,6 +167,16 @@ instances:
     #
     # collect_subtasks: false

+    ## @param exclude_filtered_resources_from_cluster_capacity - boolean - optional - default: false
+    ## Whether to exclude filtered resources (hosts and VMs excluded by power state or
+    ## resource_filters) from cluster-level capacity metrics.
+    ## When false (default), cluster capacity reflects total provisioned resources across
+    ## all hosts and VMs, regardless of whether individual metrics are reported.
+    ## When true, only resources that pass filter checks contribute to cluster capacity
+    ## metrics (total_cores, total_threads, total_bytes, vcpus_allocated, memory_allocated_bytes).
+    #
+    # exclude_filtered_resources_from_cluster_capacity: false
+
     ## @param batch_vm_collection - boolean - optional - default: true
     ## Whether to fetch all VMs in a single paginated API call instead of per-host.
     ## When true, VMs are fetched once and grouped by host in-memory, significantly
