Commit 656fac7

Add missing envoy.vhost.vcluster.upstream_rq_time.99_5percentile to metadata.csv (DataDog#23770)
* Add missing envoy.vhost.vcluster.upstream_rq_time.99_5percentile to metadata.csv

* Add changelog entry for DataDog#23770

* Remove changelog entry (metadata-only change)

* Exercise Envoy listener immediately before E2E check scrape

  Add a function-scoped exercise_envoy fixture that issues HTTP requests to the listener right before each E2E test reads /stats. Without this, the time between env setup (where the conftest's requests previously lived) and the agent's check invocation can span several of Envoy's 5s flush windows, by which point the histogram interval values have been reset to nan and the parser silently drops them.

  Also temporarily drop the metadata entry for envoy.vhost.vcluster.upstream_rq_time.99_5percentile to confirm CI now reliably catches missing metadata.

* Restore conftest warm-up requests for integration tests

  The integration test (test_check) relies on Envoy having processed traffic before the check runs to assert metrics like envoy.cluster.ext_authz.error.count. Keep the dd_environment warm-up requests for that and have exercise_envoy re-fire just before each E2E scrape.

* Use exercise_envoy fixture for integration tests too

  Move the Envoy listener warm-up out of dd_environment and into the function-scoped exercise_envoy fixture so it's shared by both the integration tests (which previously relied on a side-effect inside dd_environment) and the E2E tests. Single source of truth for "make sure Envoy has traffic before this test runs."

* Wait for an Envoy stats flush after exercising the listener

  Firing the requests immediately before the agent's scrape isn't enough: Envoy only rolls samples into the histogram interval view at each 5s flush, and the parser drops percentiles whose interval value is nan. Sleep 6s so the scrape lands after the flush that captured the samples but before the next empty flush resets them.

* Add envoy.vhost.vcluster.upstream_rq_time.99_5percentile to metadata.csv

  Envoy 1.14+ emits a 99.5th percentile by default for all histograms, including vhost.vcluster.upstream_rq_time. The other upstream_rq_time families (cluster, cluster.external, etc.) already carry this entry; this one was overlooked when those were added.

* Drive continuous traffic for one full Envoy flush interval

  The previous single burst + 6s sleep relied on Envoy's flush cycle aligning with the test's request time. While that landed in the safe window in practice, the alignment isn't designed; it depends on docker_run timing happening to be a multiple of the flush interval. Spreading requests across the window removes that dependency: the most recent completed flush always has samples, so the interval percentiles are never reset to nan.

* Temporarily remove 99_5percentile metadata to validate continuous-load fixture

* Derive exercise_envoy timings from a flush-interval constant

* Restore envoy.vhost.vcluster.upstream_rq_time.99_5percentile in metadata.csv

* Document safe-scrape budget of exercise_envoy

* Move exercise_envoy to a background thread

  Replace the synchronous loop+sleep fixture with a threading.Thread + Event so requests keep firing through the entire test, including while the agent's check is in flight. This removes the finite "safe scrape window" the previous approach relied on: every flush window during the test, including those that close mid-scrape, now has samples.

  Also drop the 99_5percentile metadata entry temporarily to validate the fixture continues to reliably trigger emission on master CI.

* Restore envoy.vhost.vcluster.upstream_rq_time.99_5percentile in metadata.csv
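The nan-dropping behaviour the commit message describes can be illustrated with a small sketch. It assumes that histogram lines in Envoy's text /stats output look like `name: P99(interval,cumulative) P99.5(interval,cumulative) ...`; the function name `parse_interval_percentiles` and the sample lines are hypothetical, and this is not the integration's actual parser.

```python
import math
import re

# Matches one quantile entry such as "P99.5(12.5,13.0)" or "P50(nan,0)".
# Assumed layout: P<quantile>(<interval value>,<cumulative value>).
_QUANTILE = re.compile(r'P(?P<q>[\d.]+)\((?P<interval>[^,]+),(?P<cumulative>[^)]+)\)')


def parse_interval_percentiles(line):
    """Return {metric_name: interval_value}, skipping nan interval values."""
    name, _, body = line.partition(': ')
    metrics = {}
    for match in _QUANTILE.finditer(body):
        value = float(match.group('interval'))
        if math.isnan(value):
            # An empty flush window leaves nothing to report for this quantile.
            continue
        suffix = match.group('q').replace('.', '_') + 'percentile'
        metrics['{}.{}'.format(name, suffix)] = value
    return metrics


# Hypothetical sample lines: one scraped after traffic, one after an empty flush.
busy = 'vhost.v.vcluster.c.upstream_rq_time: P99(4.9,5.0) P99.5(4.95,5.0)'
idle = 'vhost.v.vcluster.c.upstream_rq_time: P99(nan,5.0) P99.5(nan,5.0)'
```

With these inputs, `parse_interval_percentiles(busy)` yields both percentiles, while `parse_interval_percentiles(idle)` yields an empty dict. A test that scrapes during an idle window would therefore never see the 99_5percentile metric, and so could not notice that its metadata entry was missing.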
1 parent 967373d commit 656fac7

6 files changed

Lines changed: 51 additions & 7 deletions

envoy/metadata.csv

Lines changed: 1 addition & 0 deletions
@@ -801,6 +801,7 @@ envoy.vhost.vcluster.upstream_rq_time.75percentile,gauge,,millisecond,,[Legacy]
 envoy.vhost.vcluster.upstream_rq_time.90percentile,gauge,,millisecond,,[Legacy] Request time milliseconds 90-percentile,-1,envoy,,
 envoy.vhost.vcluster.upstream_rq_time.95percentile,gauge,,millisecond,,[Legacy] Request time milliseconds 95-percentile,-1,envoy,,
 envoy.vhost.vcluster.upstream_rq_time.99percentile,gauge,,millisecond,,[Legacy] Request time milliseconds 99-percentile,-1,envoy,,
+envoy.vhost.vcluster.upstream_rq_time.99_5percentile,gauge,,millisecond,,[Legacy] Request time milliseconds 99.5-percentile,-1,envoy,,
 envoy.vhost.vcluster.upstream_rq_time.99_9percentile,gauge,,millisecond,,[Legacy] Request time milliseconds 99.9-percentile,-1,envoy,,
 envoy.vhost.vcluster.upstream_rq_time.100percentile,gauge,,millisecond,,[Legacy] Request time milliseconds 100-percentile,-1,envoy,,
 envoy.http.dynamodb.operation.upstream_rq_time.0percentile,gauge,,millisecond,,[Legacy] Time spent on operation_name tag 0-percentile,-1,envoy,,
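A gap like the one this hunk closes could, in principle, be spotted mechanically. The following is a minimal sketch, not part of the commit: `PERCENTILE_SUFFIXES` is an assumed list of the quantiles Envoy reports by default, and the helper names and the abbreviated CSV sample are hypothetical.

```python
import csv
import io

# Assumed default quantile suffixes, in the naming style used by metadata.csv.
PERCENTILE_SUFFIXES = ['0', '25', '50', '75', '90', '95', '99', '99_5', '99_9', '100']


def metric_names_from_csv(text):
    """Collect the first column (the metric name) of every non-empty CSV row."""
    return {row[0] for row in csv.reader(io.StringIO(text)) if row}


def missing_percentiles(metric_names, family):
    """List the '<family>.<suffix>percentile' names absent from metric_names."""
    expected = ['{}.{}percentile'.format(family, s) for s in PERCENTILE_SUFFIXES]
    return [name for name in expected if name not in metric_names]


# Hypothetical, abbreviated rows mimicking the file before this commit:
# every suffix except 99_5 is present.
family = 'envoy.vhost.vcluster.upstream_rq_time'
sample = '\n'.join(
    '{}.{}percentile,gauge,,millisecond,,[Legacy] sample row,-1,envoy,,'.format(family, s)
    for s in PERCENTILE_SUFFIXES
    if s != '99_5'
)
missing = missing_percentiles(metric_names_from_csv(sample), family)
```

Here `missing` contains only the 99_5percentile name, which is the row the hunk above adds.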

envoy/tests/conftest.py

Lines changed: 42 additions & 3 deletions
@@ -3,6 +3,8 @@
 # Licensed under a 3-clause BSD style license (see LICENSE)
 import copy
 import os
+import threading
+import time

 import pytest
 import requests
@@ -13,6 +15,13 @@
 from .common import DEFAULT_INSTANCE, DOCKER_DIR, FIXTURE_DIR, HOST, URL
 from .legacy.common import FLAVOR, INSTANCES

+# Envoy's default stats_flush_interval (seconds). The exercise_envoy
+# fixture drives traffic for one full interval so the most recent
+# completed flush window always has samples; if Envoy's default changes
+# (or we ever set the interval explicitly in the test bootstrap config),
+# update this constant and the fixture timings follow.
+ENVOY_STATS_FLUSH_INTERVAL = 5
+

 @pytest.fixture(scope='session')
 def fixture_path():
@@ -35,12 +44,42 @@ def dd_environment():
         attempts=5,
         attempts_wait=10,
     ):
-        # Exercising envoy a bit will trigger extra metrics
-        requests.get('http://{}:8000/service/1'.format(HOST))
-        requests.get('http://{}:8000/service/2'.format(HOST))
         yield instance


+@pytest.fixture
+def exercise_envoy():
+    # Drive continuous traffic through Envoy's listener for the entire
+    # lifetime of the test. A background thread keeps firing requests
+    # until the fixture tears down, so every flush window, including
+    # those that close while the agent's check is in flight, has
+    # samples. Envoy's text /stats endpoint reports per-interval
+    # quantile values that get recomputed on every flush; an empty
+    # flush resets the interval percentiles to nan (see
+    # hist_approx_quantile in libcircllhist), which the parser would
+    # then filter out.
+    stop = threading.Event()
+
+    def fire_loop():
+        while not stop.is_set():
+            try:
+                requests.get('http://{}:8000/service/1'.format(HOST))
+                requests.get('http://{}:8000/service/2'.format(HOST))
+            except requests.RequestException:
+                pass
+            stop.wait(ENVOY_STATS_FLUSH_INTERVAL / 10)
+
+    thread = threading.Thread(target=fire_loop, daemon=True)
+    thread.start()
+    # Wait one full flush interval so the first non-empty flush rolls
+    # samples into the interval percentile view before the test body
+    # starts scraping.
+    time.sleep(ENVOY_STATS_FLUSH_INTERVAL + 1)
+    yield
+    stop.set()
+    thread.join(timeout=2)
+
+
 @pytest.fixture
 def check():
     return lambda instance: Envoy('envoy', {}, [instance])
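The thread-plus-Event pattern used by exercise_envoy can be tried in isolation, without Envoy or pytest. The sketch below is an illustration rather than the commit's code: the `background_load` context manager and its `fire`, `period`, and `warm_up` parameters are hypothetical names, and a plain callable stands in for the HTTP requests.

```python
import threading
from contextlib import contextmanager


@contextmanager
def background_load(fire, period, warm_up):
    """Call fire() every `period` seconds on a daemon thread until the block exits."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            try:
                fire()
            except Exception:
                # A failed call should not end the load loop.
                pass
            # Event.wait doubles as a sleep that returns early once stop is set.
            stop.wait(period)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    # Let the load run for the warm-up period before handing control back.
    stop.wait(warm_up)
    try:
        yield thread
    finally:
        stop.set()
        thread.join(timeout=2)


# Usage with a stand-in callable and short, arbitrary timings.
calls = []
with background_load(lambda: calls.append(1), period=0.01, warm_up=0.05) as worker:
    during = len(calls)
```

Using `stop.wait(period)` instead of `time.sleep(period)` inside the loop means teardown does not have to wait out a full period: setting the event wakes the thread immediately. The `try`/`finally` is needed here only because this is a plain context manager; a pytest yield fixture runs its teardown code after the test either way.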

envoy/tests/legacy/test_e2e.py

Lines changed: 1 addition & 1 deletion
@@ -282,7 +282,7 @@
 ]


-def test_e2e(dd_agent_check):
+def test_e2e(dd_agent_check, exercise_envoy):
     instance = {"stats_url": "http://{}:8001/stats".format(HOST)}
     aggregator = dd_agent_check(instance, rate=True)
     for metric in METRICS:

envoy/tests/legacy/test_integration.py

Lines changed: 5 additions & 1 deletion
@@ -12,7 +12,11 @@
 CHECK_NAME = 'envoy'
 UNIQUE_METRICS = EXT_AUTHZ_METRICS + RBAC_METRICS

-pytestmark = [pytest.mark.integration, pytest.mark.usefixtures('dd_environment'), pytest.mark.flaky]
+pytestmark = [
+    pytest.mark.integration,
+    pytest.mark.usefixtures('dd_environment', 'exercise_envoy'),
+    pytest.mark.flaky,
+]


 def test_success(aggregator, check, dd_run_check):

envoy/tests/test_e2e.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@


 @pytest.mark.e2e
-def test_e2e(dd_agent_check):
+def test_e2e(dd_agent_check, exercise_envoy):
     aggregator = dd_agent_check(DEFAULT_INSTANCE, rate=True)

     for metric in PROMETHEUS_METRICS + LOCAL_RATE_LIMIT_METRICS + CONNECTION_LIMIT_METRICS + TLS_INSPECTOR_METRICS:

envoy/tests/test_integration.py

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@
 pytestmark = [
     requires_new_environment,
     pytest.mark.integration,
-    pytest.mark.usefixtures('dd_environment'),
+    pytest.mark.usefixtures('dd_environment', 'exercise_envoy'),
     pytest.mark.flaky,
 ]
