Skip to content

[pull] master from DataDog:master#549

Merged
pull[bot] merged 1 commit into
ConnectionMaster:masterfrom
DataDog:master
May 20, 2026
Merged

[pull] master from DataDog:master#549
pull[bot] merged 1 commit into
ConnectionMaster:masterfrom
DataDog:master

Conversation

@pull

@pull pull Bot commented May 20, 2026

Copy link
Copy Markdown

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

)

* Fix silent drop of ztunnel counter metrics in Istio ambient mode

Ztunnel emits the modern OpenMetrics counter convention (TYPE declared with
the base name, samples emitted with the _total suffix). The integration's
ztunnel sub-scraper was routed through the legacy prometheus_client parser,
which does not add _total to a counter's allowed sample names and yields
each sample as an untyped singleton. Combined with construct_metrics_config
stripping _total from registration keys, every ztunnel counter was silently
dropped (failure visible only at DEBUG level).

Force use_latest_spec=True on the ztunnel sub-scraper so the OpenMetrics
parser is used. Scoped to ztunnel only; waypoint (Envoy, legacy convention)
and istiod (Go client_golang, legacy convention) are unaffected. The user
can still opt out by setting use_latest_spec: false on the instance.

Adds an ambient e2e environment (py3.13-1.24-ambient) that installs
Istio 1.24.3 with the ambient profile, deploys bookinfo, applies a
waypoint, and runs a traffic generator so ztunnel counters are non-zero.
The unit test fixture is now the real output captured from ztunnel
running in that environment; the prior hand-written 1.5/ztunnel.txt
fixture used a non-realistic TYPE convention and is removed.

Adds test_ambient_ztunnel_legacy_parser_drops_counters that pins the
broken behavior under use_latest_spec=false so a future regression that
removes the default cannot reintroduce the silent drop.

* Add changelog entry

* Make ambient e2e teardown robust to missing kubectl.pid

In CI, ddev env test runs the dd_environment fixture's setup and
teardown in different pytest invocations. The DDEV_E2E_ENV_* env vars
that map per-port-forward TempDir names to actual paths do not survive
that transition, so KillProcess opens a freshly mkdtemp'd path and
raises FileNotFoundError on kubectl.pid lookup. The kubectl
port-forward processes are already dying on their own because kind
delete cluster has removed the API server they were forwarding to, so
swallowing the missing-pid-file error is safe. Wrap port_forward in a
safe_port_forward context manager that suppresses FileNotFoundError on
exit. Sidecar (single port-forward) is not affected; ambient (three
port-forwards) hits this every run.

* Use a Service for ztunnel port-forward instead of a dynamic pod

The previous attempt port-forwarded pod/ztunnel-<random-suffix>. The
dd_environment fixture runs in two pytest invocations under ddev env
test (setup and teardown), and on the teardown invocation
_get_first_ztunnel_pod runs against a kind cluster that has already
been deleted, returning an empty string. port_forward then builds a
different TempDir key than setup, the env-var bridge misses, and
KillProcess opens an empty mkdtemp'd dir, raising FileNotFoundError on
kubectl.pid. Other integrations like kuma avoid this by port-forwarding
a Service (stable name).

Create a Service in setup_istio_ambient via `kubectl expose daemonset
ztunnel`, port-forward service/ztunnel-metrics. Drop the _get_first_-
ztunnel_pod helper and the safe_port_forward workaround.

* Create ztunnel Service via manifest, not kubectl expose

kubectl expose does not support DaemonSets (error: cannot expose a
DaemonSet.apps). The previous attempt to use kubectl expose silently
failed and the port-forward of service/ztunnel-metrics then connected
to nothing, surfacing as 'Connection refused' on the agent side. Apply
a tracked Service manifest at tests/kind/ztunnel_service.yaml instead.

* Restructure matrix orthogonally: version x mode

The matrix previously folded mode into a single version-like dimension
(version = ["1.13", "1.24-ambient"]), which scales poorly: adding
ambient support for a new Istio version meant introducing a new
"1.xx-ambient" pseudo-version alongside the plain "1.xx" entry. Split
into orthogonal dimensions:

  version = ["1.13"] mode = ["sidecar"]   # block 1
  version = ["1.24"] mode = ["ambient"]   # block 2

Two matrix blocks express the constraint that 1.13 does not support
ambient (it was GA in 1.24). Adding a future version is a one-line
change to the relevant block; adding a new mode on an existing version
is a one-line change too once the matching setup function is in place.
Env names become py3.13-1.13-sidecar and py3.13-1.24-ambient, with
ISTIO_VERSION and ISTIO_MODE env vars driven independently from the
two matrix axes. conftest's fallthrough switches to `else` (i.e. the
sidecar branch) to match the new matrix semantics rather than gating
on a hard-coded VERSION string.

* Use matrix-axis shorthand for ISTIO_MODE env var

* Add Istio 1.24 to the sidecar e2e matrix

Migrate setup_istio off the baked IstioOperator manifest and onto
`istioctl install --set profile=demo`, which is version-agnostic. Drop
the now-unused 1.13-only manifest and extend the sidecar matrix to also
cover 1.24, so the integration is exercised end-to-end against a
modern Istio release in sidecar mode.

* Mark Istio galley validation pass/fail metrics intermittent

A fresh Istio 1.24 install on bookinfo emits istio.galley.validation.passed
but not the .failed variant (no validation errors occur). The .passed.count
v2 sibling was also missing from the list. Move both pairs into the
intermittent list so the e2e tests pass on a clean cluster.

* Mark gc_cpu_fraction intermittent for Istio 1.24

* Mark memstats.lookups intermittent for Istio 1.24

* Mark all pilot.conflict variants intermittent

* Split legacy Go metrics from intermittent in Istio e2e

The Go runtime metrics istio.go.memstats.gc_cpu_fraction and lookups_total
are no longer emitted by Istio 1.24's binary (deprecated in Go 1.20 and
later removed from client_golang's default exposition), but they are still
emitted by Istio 1.13. Calling them "intermittent" on 1.13 silently hid any
collection bug there.

Pull them out into a LEGACY_GO_METRICS set that is strictly asserted on
1.13 envs and skipped on 1.24+. Genuinely-conditional metrics (validation
failures, listener conflicts, sidecar injection events, etc.) remain in
INTERMITTENT_METRICS and are asserted at_least=0 on every version.

* Parse Istio version numerically for legacy check

* Use packaging.version for Istio version comparison

* Address PR review feedback

- test_e2e_ambient: replace the dead at_least=0 conditional on
  ISTIOD_V2_METRICS with the existing _assert_istiod_metric helper so
  ambient mode also strictly asserts non-intermittent istiod metrics.
- check.py: log a warning when the user opts out of the OpenMetrics
  parser (use_latest_spec: false) while ztunnel is being scraped.
  Comment the asymmetry between namespace (always restored) and
  use_latest_spec (user-overridable default) inside _generate_config.
- conftest.py: replace the hard-coded 15-second sleep after the traffic
  generator pod is created with a 300-second polling loop that waits
  for ztunnel to report a non-zero TCP connection counter.
- conftest.py: return osx-amd64 (not osx) for intel macOS in the Istio
  release suffix helper so local-dev downloads do not 404.
- changelog: rephrase to name the affected version (9.4.0) so customers
  searching the changelog for the regression window can find this fix.

* Fix ztunnel poll target and add Istio 1.29 to the matrix

- _wait_for_ztunnel_traffic was exec'ing into the ztunnel pod, which
  runs a minimal Rust binary without curl. Move the kubectl exec into
  the traffic-gen pod (curlimages/curl) and target the ztunnel-metrics
  Service applied earlier in setup. Capture stderr instead of leaking
  it through CI logs.
- Extend the matrix with Istio 1.29 (the current supported release as
  of 2026-05-18) in both sidecar and ambient mode. 1.24 stays since
  that is where ambient GA'd; 1.13 stays for the legacy Go-runtime
  safety net.

* Address round-2 review feedback

- check.py: use is_affirmative for the use_latest_spec opt-out guard
  so YAML string booleans ("false") also trigger the warning.
- check.py: replace use_latest_spec_default kwarg with a keyword-only
  scraper_defaults mapping; precedence (defaults -> instance -> per-
  scraper restore) is now visible at the signature without an inline
  comment.
- test_unit_istio_v2: assert directly that the ztunnel scraper config
  carries use_latest_spec=True, so a refactor that drops the default
  fails on shape rather than on fixture parsing.
- test_e2e: treat unparseable ISTIO_VERSION as legacy so misconfigured
  local runs fail loudly instead of silently skipping LEGACY_GO_METRICS.
- conftest: justify the per-line ValueError skip in _ztunnel_has_traffic
  with an inline comment.

* Address round-3 review feedback

- check.py: restore the invariant comment next to `config.update`
  so the namespace-must-follow-update rule is visible in code shape.
- metadata.csv: rename the four ztunnel TCP counters from .total to
  .count to match the names the OpenMetrics V2 transformer actually
  submits (construct_metrics_config strips the _total suffix). This
  closes the metadata loop on the same regression class the PR fixes.
- test_unit_istio_v2: add assert_metrics_using_metadata to the ztunnel
  metrics test so future mismatches between metrics.py, the transformer,
  and metadata.csv fail fast.
- conftest: kubectl wait for traffic-gen Ready before polling so the
  300s budget is spent on the actual wait-for-traffic signal instead
  of on "container not found" exec failures while the image pulls.
- conftest: capture the last non-zero exec stderr and surface it in
  the RuntimeError so CI failures point at the real cause instead of
  a generic timeout message.

* Address round-4 review feedback

- Raise on the kubectl wait for traffic-gen so a timeout / pod-schedule
  failure surfaces immediately instead of being swallowed and then
  misattributed by the 300s polling loop downstream.
- Build the ztunnel /stats/prometheus URL from the existing module
  constants so a future port or service rename touches one place.
- Capture the last stdout excerpt on iterations where curl exits 0 but
  ztunnel reports no traffic (-sm 5 does not pass -f, so HTTP 4xx/5xx
  bodies land in stdout); include it in the final RuntimeError so the
  failure points at the real cause rather than reading "<none>".

* Rename mechanical ztunnel .total counters to .count

The OpenMetrics V2 transformer strips _total from the registered name
and appends .count for counter-type metrics, so eight more ztunnel
counter rows in metadata.csv (DNS, on_demand_dns, xds.connection_
terminations, connection.{opens,closes,termination}) were declared
under the wrong suffix and went orphan once the V2 parser engaged.
Basenames already match what metrics.py registers and what ztunnel
1.24 emits, so the rename is mechanical.

The four remaining .total rows (proxies_{started,stopped}, active/
pending_proxy_count) have wrong basenames at the source — ztunnel
1.24 emits these under workload_manager_* — and need their
registration corrected in a follow-up, not a suffix-only rewrite.

* Realign ztunnel metric registration with real exposition

The four in-pod proxy management metrics were registered under
istio_proxies_started_total, istio_active_proxy_count_total etc.,
but ztunnel 1.24 emits them under the workload_manager_* family.
Two more counters ztunnel emits (istio_xds_message_total and
istio_xds_message_bytes_total) were not registered at all.

Realign the registrations against the captured ztunnel exposition,
update the four corresponding metadata.csv rows (correct suffix per
gauge/counter type), add metadata entries for the two new xds
counters, and expand V2_ZTUNNEL_METRICS so the unit and e2e tests
assert all nine ztunnel metrics that the integration is now
collecting. Split V2_ZTUNNEL_METRICS into counter / gauge sub-lists
so the legacy-parser regression test pins the broken behavior on
the counter set only (gauges are not affected by the parser bug).

* Tighten prose hygiene and warning-silence regression

- conftest: drop the seven-line docstring on the private
  _wait_for_ztunnel_traffic helper for a one-liner per AGENTS.md,
  with a short inline comment at the call site explaining the
  curl-from-traffic-gen indirection.
- test_unit_istio_v2: assert in the happy-path ztunnel test that
  the opt-out warning stays silent, so a future change that emits
  it unconditionally fails the test instead of shipping noise.
- common.py: collapse the multi-line block introducing the
  V2_ZTUNNEL_COUNTER_METRICS / V2_ZTUNNEL_GAUGE_METRICS split into
  two one-line comments.

* Collapse prose comments per AGENTS.md

- check.py: replace the four-line block above the ztunnel
  if-block with a single line. The scraper_defaults kwarg and
  the opt-out warning already explain the why.
- common.py: replace the wrapped comment above
  V2_ZTUNNEL_COUNTER_METRICS with a single line that names the
  format-level reason (TYPE base name + _total samples) instead
  of the PR-narrative "before this PR's fix" framing.

* Expand changelog to cover full ambient mode fix

* Consolidate ambient changelog into a single fix entry

* Tighten changelog wording

* Drop dead istio_connection_* registrations from ZTUNNEL_METRICS

These three counter registrations point at metric names ztunnel never
emits. A search across the ztunnel source confirms only the
tcp_connections_* and xds_connection_terminations families exist; the
istio_connection_opens_total / closes_total / termination_total entries
were dead from the start and carried matching orphan rows in
metadata.csv. Removing both.

* Document ambient mode and mark its settings fleet-configurable

PR #22581 introduced ambient mode but never updated the docs or spec
metadata. The three instance options it added (istio_mode,
ztunnel_endpoint, waypoint_endpoint) were the only instance-level
settings in the spec without fleet_configurable: true, so customers
managing Istio from Datadog Fleet Automation could remote-configure
every Istio setting except those three. The README similarly had no
mention of ambient mode at all.

- Add fleet_configurable: true to the three ambient settings in
  assets/configuration/spec.yaml; regenerate config_models.
- Add an "Ambient mode configuration" subsection to the README under
  Metric collection, showing the ztunnel annotation pattern and
  pointing at waypoint as the L7 option.

* Correct README: one ambient instance scrapes all three endpoints

The previous wording suggested a second instance for waypoint metrics,
but _parse_ambient_config reads ztunnel_endpoint, waypoint_endpoint,
and istiod_endpoint from the same instance and spawns sub-scrapers
for each. Replace the per-pod annotation example with a static
conf.yaml covering all three components in one instance.

* Drop hard-coded ports from ambient README prose

Port 15020 is the Istio default but configurable; mentioning specific
numbers in the prose can mislead a reader running a non-default
exposition. The example URLs still show the conventional defaults,
which is the right place for that information since users edit the
URLs directly when their setup differs.

* Address documentation review feedback on ambient README section

- Spell out GA as 'generally available' on first mention.
- Split the sentence joined by an em dash into two complete sentences.
- Capitalize Autodiscovery (Datadog feature name).

* Restore license headers on regenerated config_models

Local ddev's model regeneration produced these files without the
Datadog license header that the master branch tracks. CI's validate
models check caught the drift.
@pull pull Bot locked and limited conversation to collaborators May 20, 2026
@pull pull Bot added the ⤵️ pull label May 20, 2026
@pull pull Bot merged commit 3bce4ec into ConnectionMaster:master May 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant