Commit 3bce4ec
authored
Fix silent drop of ztunnel counter metrics in Istio ambient mode (DataDog#23707)
* Fix silent drop of ztunnel counter metrics in Istio ambient mode
Ztunnel emits the modern OpenMetrics counter convention (TYPE declared with
the base name, samples emitted with the _total suffix). The integration's
ztunnel sub-scraper was routed through the legacy prometheus_client parser,
which does not add _total to a counter's allowed sample names and yields
each sample as an untyped singleton. Combined with construct_metrics_config
stripping _total from registration keys, every ztunnel counter was silently
dropped (failure visible only at DEBUG level).
Force use_latest_spec=True on the ztunnel sub-scraper so the OpenMetrics
parser is used. Scoped to ztunnel only; waypoint (Envoy, legacy convention)
and istiod (Go client_golang, legacy convention) are unaffected. The user
can still opt out by setting use_latest_spec: false on the instance.
Adds an ambient e2e environment (py3.13-1.24-ambient) that installs
Istio 1.24.3 with the ambient profile, deploys bookinfo, applies a
waypoint, and runs a traffic generator so ztunnel counters are non-zero.
The unit test fixture is now the real output captured from ztunnel
running in that environment; the prior hand-written 1.5/ztunnel.txt
fixture used a non-realistic TYPE convention and is removed.
Adds test_ambient_ztunnel_legacy_parser_drops_counters that pins the
broken behavior under use_latest_spec=false so a future regression that
removes the default cannot reintroduce the silent drop.
* Add changelog entry
* Make ambient e2e teardown robust to missing kubectl.pid
In CI, ddev env test runs the dd_environment fixture's setup and
teardown in different pytest invocations. The DDEV_E2E_ENV_* env vars
that map per-port-forward TempDir names to actual paths do not survive
that transition, so KillProcess opens a freshly mkdtemp'd path and
raises FileNotFoundError on kubectl.pid lookup. The kubectl
port-forward processes are already dying on their own because kind
delete cluster has removed the API server they were forwarding to, so
swallowing the missing-pid-file error is safe. Wrap port_forward in a
safe_port_forward context manager that suppresses FileNotFoundError on
exit. Sidecar (single port-forward) is not affected; ambient (three
port-forwards) hits this every run.
* Use a Service for ztunnel port-forward instead of a dynamic pod
The previous attempt port-forwarded pod/ztunnel-<random-suffix>. The
dd_environment fixture runs in two pytest invocations under ddev env
test (setup and teardown), and on the teardown invocation
_get_first_ztunnel_pod runs against a kind cluster that has already
been deleted, returning an empty string. port_forward then builds a
different TempDir key than setup, the env-var bridge misses, and
KillProcess opens an empty mkdtemp'd dir, raising FileNotFoundError on
kubectl.pid. Other integrations like kuma avoid this by port-forwarding
a Service (stable name).
Create a Service in setup_istio_ambient via `kubectl expose daemonset
ztunnel`, port-forward service/ztunnel-metrics. Drop the _get_first_-
ztunnel_pod helper and the safe_port_forward workaround.
* Create ztunnel Service via manifest, not kubectl expose
kubectl expose does not support DaemonSets (error: cannot expose a
DaemonSet.apps). The previous attempt to use kubectl expose silently
failed and the port-forward of service/ztunnel-metrics then connected
to nothing, surfacing as 'Connection refused' on the agent side. Apply
a tracked Service manifest at tests/kind/ztunnel_service.yaml instead.
* Restructure matrix orthogonally: version x mode
The matrix previously folded mode into a single version-like dimension
(version = ["1.13", "1.24-ambient"]), which scales poorly: adding
ambient support for a new Istio version meant introducing a new
"1.xx-ambient" pseudo-version alongside the plain "1.xx" entry. Split
into orthogonal dimensions:
version = ["1.13"] mode = ["sidecar"] # block 1
version = ["1.24"] mode = ["ambient"] # block 2
Two matrix blocks express the constraint that 1.13 does not support
ambient (it was GA in 1.24). Adding a future version is a one-line
change to the relevant block; adding a new mode on an existing version
is a one-line change too once the matching setup function is in place.
Env names become py3.13-1.13-sidecar and py3.13-1.24-ambient, with
ISTIO_VERSION and ISTIO_MODE env vars driven independently from the
two matrix axes. conftest's fallthrough switches to `else` (i.e. the
sidecar branch) to match the new matrix semantics rather than gating
on a hard-coded VERSION string.
* Use matrix-axis shorthand for ISTIO_MODE env var
* Add Istio 1.24 to the sidecar e2e matrix
Migrate setup_istio off the baked IstioOperator manifest and onto
`istioctl install --set profile=demo`, which is version-agnostic. Drop
the now-unused 1.13-only manifest and extend the sidecar matrix to also
cover 1.24, so the integration is exercised end-to-end against a
modern Istio release in sidecar mode.
* Mark Istio galley validation pass/fail metrics intermittent
A fresh Istio 1.24 install on bookinfo emits istio.galley.validation.passed
but not the .failed variant (no validation errors occur). The .passed.count
v2 sibling was also missing from the list. Move both pairs into the
intermittent list so the e2e tests pass on a clean cluster.
* Mark gc_cpu_fraction intermittent for Istio 1.24
* Mark memstats.lookups intermittent for Istio 1.24
* Mark all pilot.conflict variants intermittent
* Split legacy Go metrics from intermittent in Istio e2e
The Go runtime metrics istio.go.memstats.gc_cpu_fraction and lookups_total
are no longer emitted by Istio 1.24's binary (deprecated in Go 1.20 and
later removed from client_golang's default exposition), but they are still
emitted by Istio 1.13. Calling them "intermittent" on 1.13 silently hid any
collection bug there.
Pull them out into a LEGACY_GO_METRICS set that is strictly asserted on
1.13 envs and skipped on 1.24+. Genuinely-conditional metrics (validation
failures, listener conflicts, sidecar injection events, etc.) remain in
INTERMITTENT_METRICS and are asserted at_least=0 on every version.
* Parse Istio version numerically for legacy check
* Use packaging.version for Istio version comparison
* Address PR review feedback
- test_e2e_ambient: replace the dead at_least=0 conditional on
ISTIOD_V2_METRICS with the existing _assert_istiod_metric helper so
ambient mode also strictly asserts non-intermittent istiod metrics.
- check.py: log a warning when the user opts out of the OpenMetrics
parser (use_latest_spec: false) while ztunnel is being scraped.
Comment the asymmetry between namespace (always restored) and
use_latest_spec (user-overridable default) inside _generate_config.
- conftest.py: replace the hard-coded 15-second sleep after the traffic
generator pod is created with a 300-second polling loop that waits
for ztunnel to report a non-zero TCP connection counter.
- conftest.py: return osx-amd64 (not osx) for intel macOS in the Istio
release suffix helper so local-dev downloads do not 404.
- changelog: rephrase to name the affected version (9.4.0) so customers
searching the changelog for the regression window can find this fix.
* Fix ztunnel poll target and add Istio 1.29 to the matrix
- _wait_for_ztunnel_traffic was exec'ing into the ztunnel pod, which
runs a minimal Rust binary without curl. Move the kubectl exec into
the traffic-gen pod (curlimages/curl) and target the ztunnel-metrics
Service applied earlier in setup. Capture stderr instead of leaking
it through CI logs.
- Extend the matrix with Istio 1.29 (the current supported release as
of 2026-05-18) in both sidecar and ambient mode. 1.24 stays since
that is where ambient GA'd; 1.13 stays for the legacy Go-runtime
safety net.
* Address round-2 review feedback
- check.py: use is_affirmative for the use_latest_spec opt-out guard
so YAML string booleans ("false") also trigger the warning.
- check.py: replace use_latest_spec_default kwarg with a keyword-only
scraper_defaults mapping; precedence (defaults -> instance -> per-
scraper restore) is now visible at the signature without an inline
comment.
- test_unit_istio_v2: assert directly that the ztunnel scraper config
carries use_latest_spec=True, so a refactor that drops the default
fails on shape rather than on fixture parsing.
- test_e2e: treat unparseable ISTIO_VERSION as legacy so misconfigured
local runs fail loudly instead of silently skipping LEGACY_GO_METRICS.
- conftest: justify the per-line ValueError skip in _ztunnel_has_traffic
with an inline comment.
* Address round-3 review feedback
- check.py: restore the invariant comment next to `config.update`
so the namespace-must-follow-update rule is visible in code shape.
- metadata.csv: rename the four ztunnel TCP counters from .total to
.count to match the names the OpenMetrics V2 transformer actually
submits (construct_metrics_config strips the _total suffix). This
closes the metadata loop on the same regression class the PR fixes.
- test_unit_istio_v2: add assert_metrics_using_metadata to the ztunnel
metrics test so future mismatches between metrics.py, the transformer,
and metadata.csv fail fast.
- conftest: kubectl wait for traffic-gen Ready before polling so the
300s budget is spent on the actual wait-for-traffic signal instead
of on "container not found" exec failures while the image pulls.
- conftest: capture the last non-zero exec stderr and surface it in
the RuntimeError so CI failures point at the real cause instead of
a generic timeout message.
* Address round-4 review feedback
- Raise on the kubectl wait for traffic-gen so a timeout / pod-schedule
failure surfaces immediately instead of being swallowed and then
misattributed by the 300s polling loop downstream.
- Build the ztunnel /stats/prometheus URL from the existing module
constants so a future port or service rename touches one place.
- Capture the last stdout excerpt on iterations where curl exits 0 but
ztunnel reports no traffic (-sm 5 does not pass -f, so HTTP 4xx/5xx
bodies land in stdout); include it in the final RuntimeError so the
failure points at the real cause rather than reading "<none>".
* Rename mechanical ztunnel .total counters to .count
The OpenMetrics V2 transformer strips _total from the registered name
and appends .count for counter-type metrics, so eight more ztunnel
counter rows in metadata.csv (DNS, on_demand_dns, xds.connection_
terminations, connection.{opens,closes,termination}) were declared
under the wrong suffix and went orphan once the V2 parser engaged.
Basenames already match what metrics.py registers and what ztunnel
1.24 emits, so the rename is mechanical.
The four remaining .total rows (proxies_{started,stopped}, active/
pending_proxy_count) have wrong basenames at the source — ztunnel
1.24 emits these under workload_manager_* — and need their
registration corrected in a follow-up, not a suffix-only rewrite.
* Realign ztunnel metric registration with real exposition
The four in-pod proxy management metrics were registered under
istio_proxies_started_total, istio_active_proxy_count_total etc.,
but ztunnel 1.24 emits them under the workload_manager_* family.
Two more counters ztunnel emits (istio_xds_message_total and
istio_xds_message_bytes_total) were not registered at all.
Realign the registrations against the captured ztunnel exposition,
update the four corresponding metadata.csv rows (correct suffix per
gauge/counter type), add metadata entries for the two new xds
counters, and expand V2_ZTUNNEL_METRICS so the unit and e2e tests
assert all nine ztunnel metrics that the integration is now
collecting. Split V2_ZTUNNEL_METRICS into counter / gauge sub-lists
so the legacy-parser regression test pins the broken behavior on
the counter set only (gauges are not affected by the parser bug).
* Tighten prose hygiene and warning-silence regression
- conftest: drop the seven-line docstring on the private
_wait_for_ztunnel_traffic helper for a one-liner per AGENTS.md,
with a short inline comment at the call site explaining the
curl-from-traffic-gen indirection.
- test_unit_istio_v2: assert in the happy-path ztunnel test that
the opt-out warning stays silent, so a future change that emits
it unconditionally fails the test instead of shipping noise.
- common.py: collapse the multi-line block introducing the
V2_ZTUNNEL_COUNTER_METRICS / V2_ZTUNNEL_GAUGE_METRICS split into
two one-line comments.
* Collapse prose comments per AGENTS.md
- check.py: replace the four-line block above the ztunnel
if-block with a single line. The scraper_defaults kwarg and
the opt-out warning already explain the why.
- common.py: replace the wrapped comment above
V2_ZTUNNEL_COUNTER_METRICS with a single line that names the
format-level reason (TYPE base name + _total samples) instead
of the PR-narrative "before this PR's fix" framing.
* Expand changelog to cover full ambient mode fix
* Consolidate ambient changelog into a single fix entry
* Tighten changelog wording
* Drop dead istio_connection_* registrations from ZTUNNEL_METRICS
These three counter registrations point at metric names ztunnel never
emits. A search across the ztunnel source confirms only the
tcp_connections_* and xds_connection_terminations families exist; the
istio_connection_opens_total / closes_total / termination_total entries
were dead from the start and carried matching orphan rows in
metadata.csv. Removing both.
* Document ambient mode and mark its settings fleet-configurable
PR DataDog#22581 introduced ambient mode but never updated the docs or spec
metadata. The three instance options it added (istio_mode,
ztunnel_endpoint, waypoint_endpoint) were the only instance-level
settings in the spec without fleet_configurable: true, so customers
managing Istio from Datadog Fleet Automation could remote-configure
every Istio setting except those three. The README similarly had no
mention of ambient mode at all.
- Add fleet_configurable: true to the three ambient settings in
assets/configuration/spec.yaml; regenerate config_models.
- Add an "Ambient mode configuration" subsection to the README under
Metric collection, showing the ztunnel annotation pattern and
pointing at waypoint as the L7 option.
* Correct README: one ambient instance scrapes all three endpoints
The previous wording suggested a second instance for waypoint metrics,
but _parse_ambient_config reads ztunnel_endpoint, waypoint_endpoint,
and istiod_endpoint from the same instance and spawns sub-scrapers
for each. Replace the per-pod annotation example with a static
conf.yaml covering all three components in one instance.
* Drop hard-coded ports from ambient README prose
Port 15020 is the Istio default but configurable; mentioning specific
numbers in the prose can mislead a reader running a non-default
exposition. The example URLs still show the conventional defaults,
which is the right place for that information since users edit the
URLs directly when their setup differs.
* Address documentation review feedback on ambient README section
- Spell out GA as 'generally available' on first mention.
- Split the sentence joined by an em dash into two complete sentences.
- Capitalize Autodiscovery (Datadog feature name).
* Restore license headers on regenerated config_models
Local ddev's model regeneration produced these files without the
Datadog license header that the master branch tracks. CI's validate
models check caught the drift.1 parent 87825d7 commit 3bce4ec
15 files changed
Lines changed: 486 additions & 11456 deletions
File tree
- istio
- assets/configuration
- changelog.d
- datadog_checks/istio
- tests
- fixtures
- 1.24
- 1.5
- kind
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
90 | 90 | | |
91 | 91 | | |
92 | 92 | | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
93 | 112 | | |
94 | 113 | | |
95 | 114 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| 27 | + | |
27 | 28 | | |
28 | 29 | | |
29 | 30 | | |
| |||
35 | 36 | | |
36 | 37 | | |
37 | 38 | | |
| 39 | + | |
38 | 40 | | |
39 | 41 | | |
40 | 42 | | |
| |||
45 | 47 | | |
46 | 48 | | |
47 | 49 | | |
| 50 | + | |
48 | 51 | | |
49 | 52 | | |
50 | 53 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
6 | | - | |
| 6 | + | |
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| |||
72 | 72 | | |
73 | 73 | | |
74 | 74 | | |
75 | | - | |
| 75 | + | |
76 | 76 | | |
77 | 77 | | |
78 | | - | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
79 | 93 | | |
80 | 94 | | |
81 | 95 | | |
| |||
86 | 100 | | |
87 | 101 | | |
88 | 102 | | |
89 | | - | |
| 103 | + | |
90 | 104 | | |
91 | 105 | | |
92 | | - | |
93 | | - | |
94 | | - | |
95 | | - | |
96 | | - | |
| 106 | + | |
| 107 | + | |
97 | 108 | | |
98 | | - | |
99 | 109 | | |
100 | 110 | | |
101 | 111 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
276 | 276 | | |
277 | 277 | | |
278 | 278 | | |
279 | | - | |
280 | | - | |
281 | | - | |
282 | | - | |
283 | | - | |
| 279 | + | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
284 | 285 | | |
285 | 286 | | |
286 | | - | |
287 | | - | |
288 | | - | |
289 | | - | |
| 287 | + | |
| 288 | + | |
290 | 289 | | |
291 | 290 | | |
292 | 291 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
5 | 5 | | |
6 | 6 | | |
7 | 7 | | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
8 | 24 | | |
9 | 25 | | |
10 | | - | |
| 26 | + | |
| 27 | + | |
11 | 28 | | |
12 | 29 | | |
13 | 30 | | |
14 | 31 | | |
| 32 | + | |
| 33 | + | |
15 | 34 | | |
| 35 | + | |
16 | 36 | | |
17 | 37 | | |
18 | 38 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
447 | 447 | | |
448 | 448 | | |
449 | 449 | | |
450 | | - | |
451 | | - | |
452 | | - | |
453 | | - | |
454 | | - | |
455 | | - | |
456 | | - | |
| 450 | + | |
| 451 | + | |
| 452 | + | |
| 453 | + | |
| 454 | + | |
| 455 | + | |
| 456 | + | |
457 | 457 | | |
458 | 458 | | |
459 | | - | |
460 | | - | |
461 | | - | |
462 | | - | |
463 | | - | |
464 | | - | |
465 | | - | |
466 | | - | |
467 | | - | |
| 459 | + | |
| 460 | + | |
| 461 | + | |
| 462 | + | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
| 466 | + | |
468 | 467 | | |
469 | 468 | | |
470 | 469 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
409 | 409 | | |
410 | 410 | | |
411 | 411 | | |
412 | | - | |
413 | | - | |
| 412 | + | |
| 413 | + | |
| 414 | + | |
414 | 415 | | |
415 | 416 | | |
416 | 417 | | |
417 | 418 | | |
418 | | - | |
419 | | - | |
420 | | - | |
421 | | - | |
422 | | - | |
| 419 | + | |
| 420 | + | |
| 421 | + | |
423 | 422 | | |
424 | 423 | | |
| 424 | + | |
| 425 | + | |
| 426 | + | |
| 427 | + | |
| 428 | + | |
| 429 | + | |
| 430 | + | |
| 431 | + | |
425 | 432 | | |
426 | 433 | | |
427 | 434 | | |
| |||
0 commit comments