[SVLS-8757] Add instance enhanced metric in Azure Functions #114
kathiehuang wants to merge 9 commits into main from
Conversation
Pull request overview
Adds Azure Functions “enhanced” instance identity metric emission and ensures azure.functions.* metrics are classified as Serverless Enhanced in the DogStatsD origin classifier, while refactoring serverless-compat metrics startup so enhanced metrics can flush even when DogStatsD ingestion is disabled.
Changes:
- Classify `azure.functions.*` metrics as `ServerlessEnhanced` in origin service selection and add a unit test.
- Introduce `datadog-metrics-collector` crate to submit `azure.functions.enhanced.instance` with shared Azure/env tags.
- Split serverless-compat metrics startup into "aggregator" vs "DogStatsD listener", and update CI workflows/features accordingly.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| crates/dogstatsd/src/origin.rs | Adds azure.functions prefix handling to mark service as Serverless Enhanced + test. |
| crates/datadog-serverless-compat/src/main.rs | Starts aggregator independently, optionally starts DogStatsD listener, and schedules enhanced metric collection. |
| crates/datadog-serverless-compat/Cargo.toml | Adds dependency/feature wiring for the new metrics collector crate. |
| crates/datadog-metrics-collector/src/tags.rs | Builds shared enhanced-metrics tags from Azure metadata + env vars. |
| crates/datadog-metrics-collector/src/lib.rs | Defines the new crate’s module surface. |
| crates/datadog-metrics-collector/src/instance.rs | Resolves instance ID from env and submits the instance gauge metric. |
| crates/datadog-metrics-collector/Cargo.toml | New crate manifest and feature stub. |
| Cargo.lock | Adds new crate + additional libdd-common version entry. |
| .github/workflows/cargo.yml | Enables windows-enhanced-metrics feature for Windows test runs. |
| .github/workflows/build-datadog-serverless-compat.yml | Enables windows-enhanced-metrics feature for Windows builds. |
```rust
// Build tags: start with shared tags, add instance
let instance_tag = format!("instance:{}", instance_id);
let tags = match &self.tags {
    Some(existing) => {
        let mut combined = existing.clone();
        if let Ok(id_tag) = SortedTags::parse(&instance_tag) {
            combined.extend(&id_tag);
        }
        Some(combined)
    }
    None => SortedTags::parse(&instance_tag).ok(),
};
```

**Copilot:** Tag construction uses `format!("instance:{}", instance_id)` and then `SortedTags::parse`. If `instance_id` contains a comma, this will be interpreted as multiple tags and can corrupt the tag set; if it contains invalid tag characters, parsing may fail and the metric may be submitted without the instance tag. Consider validating/sanitizing the instance ID for DogStatsD tag constraints (or skipping submission when the value would produce invalid tags) so the metric reliably includes a correct instance tag.

**Author:** Azure controls the format of these platform-set values, so they should not contain invalid input.
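If sanitization were ever needed, a minimal sketch could replace the reserved DogStatsD delimiters before building the tag. The `sanitize_tag_value` helper below is hypothetical, not part of this PR:

```rust
/// Hypothetical helper: DogStatsD treats ',' as a tag separator and '|' as a
/// field separator in its wire format, so replace both with underscores.
fn sanitize_tag_value(value: &str) -> String {
    value
        .chars()
        .map(|c| if c == ',' || c == '|' { '_' } else { c })
        .collect()
}
```

The sanitized value could then be passed to `format!("instance:{}", ...)` in place of the raw ID.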
```rust
for (tag_name, env_var) in [
    ("region", "REGION_NAME"),
    ("plan_tier", "WEBSITE_SKU"),
    ("service", "DD_SERVICE"),
    ("env", "DD_ENV"),
    ("version", "DD_VERSION"),
    ("serverless_compat_version", "DD_SERVERLESS_COMPAT_VERSION"),
] {
    if let Ok(val) = env::var(env_var)
        && !val.is_empty()
    {
        tag_parts.push(format!("{}:{}", tag_name, val));
    }
}
```

**Copilot:** `build_enhanced_metrics_tags` concatenates raw environment variable values into a comma-separated tag string and then parses it. If any value contains a comma, it will be split into multiple tags (producing incorrect tags or parse failures). Consider sanitizing/escaping tag values (or dropping values containing reserved delimiters like `,`/`|`) before building the tag list.
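The drop-values option the reviewer mentions could be sketched with a hypothetical `safe_tag` helper (not part of this PR) that refuses to build a tag from a value containing reserved delimiters:

```rust
/// Hypothetical helper: build a "name:value" tag only when the value is
/// non-empty and free of the reserved DogStatsD delimiters ',' and '|'.
fn safe_tag(tag_name: &str, value: &str) -> Option<String> {
    if value.is_empty() || value.contains(',') || value.contains('|') {
        None // drop the tag rather than emit a corrupt tag set
    } else {
        Some(format!("{}:{}", tag_name, value))
    }
}
```

Each loop iteration would then push `safe_tag(tag_name, &val)` results instead of raw `format!` output.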
```toml
[dependencies]
dogstatsd = { path = "../dogstatsd", default-features = true }
tracing = { version = "0.1", default-features = false }
libdd-common = { git = "https://github.com/DataDog/libdatadog", rev = "d52ee90209cb12a28bdda0114535c1a985a29d95", default-features = false }
```

**Copilot:** This crate pins libdd-common to a different git rev than other crates in the workspace (e.g. datadog-trace-agent). That introduces multiple libdd-common versions in Cargo.lock and increases build/binary size. If possible, align to the same libdatadog rev used elsewhere in the workspace (or centralize it as a workspace dependency) to avoid duplicate versions.

Suggested change:

```diff
-libdd-common = { git = "https://github.com/DataDog/libdatadog", rev = "d52ee90209cb12a28bdda0114535c1a985a29d95", default-features = false }
+libdd-common = { workspace = true, default-features = false }
```
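Centralizing the rev as a workspace dependency would look roughly like this; the rev shown is the one this crate already pins, and the exact workspace layout is an assumption:

```toml
# Root Cargo.toml: declare the git dependency once for the whole workspace.
[workspace.dependencies]
libdd-common = { git = "https://github.com/DataDog/libdatadog", rev = "d52ee90209cb12a28bdda0114535c1a985a29d95", default-features = false }

# Member crate Cargo.toml: inherit the single pinned revision.
[dependencies]
libdd-common = { workspace = true, default-features = false }
```

With this layout, bumping the rev in one place updates every member crate, and Cargo.lock keeps a single libdd-common entry.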
```rust
// Only enable enhanced metrics for Linux Azure Functions
#[cfg(not(feature = "windows-enhanced-metrics"))]
let dd_enhanced_metrics = env_type == EnvironmentType::AzureFunction
    && env::var("DD_ENHANCED_METRICS_ENABLED")
        .map(|val| val.to_lowercase() != "false")
        .unwrap_or(true);

// Enhanced metrics are not yet supported in Windows environments
#[cfg(feature = "windows-enhanced-metrics")]
let dd_enhanced_metrics = false;
```

**Copilot:** `dd_enhanced_metrics` is gated by the windows-enhanced-metrics Cargo feature rather than by the target OS. If someone builds datadog-serverless-compat for Windows without enabling that feature, enhanced metrics can be enabled at runtime even though the comment says Windows isn't supported. Consider keying this off `cfg(windows)` (or `cfg!(windows)` at runtime) and keeping the feature strictly for compilation toggles, so Windows always forces enhanced metrics off.

Suggested change:

```diff
-// Only enable enhanced metrics for Linux Azure Functions
-#[cfg(not(feature = "windows-enhanced-metrics"))]
-let dd_enhanced_metrics = env_type == EnvironmentType::AzureFunction
-    && env::var("DD_ENHANCED_METRICS_ENABLED")
-        .map(|val| val.to_lowercase() != "false")
-        .unwrap_or(true);
-// Enhanced metrics are not yet supported in Windows environments
-#[cfg(feature = "windows-enhanced-metrics")]
-let dd_enhanced_metrics = false;
+// Enhanced metrics are not yet supported in Windows environments.
+// Only enable enhanced metrics for non-Windows Azure Functions.
+let dd_enhanced_metrics = if cfg!(windows) {
+    false
+} else {
+    env_type == EnvironmentType::AzureFunction
+        && env::var("DD_ENHANCED_METRICS_ENABLED")
+            .map(|val| val.to_lowercase() != "false")
+            .unwrap_or(true)
+};
```

**Author:** Our build includes this feature, so this shouldn't happen.

Note: We'll need to make sure that when this is released, the feature is included in serverless-compat-self-monitoring.
```rust
tokio::spawn(async move {
    metrics_flusher.flush().await;
});
```

**Copilot:** The metrics flusher is now executed in a detached task every tick. If `metrics_flusher.flush()` takes longer than the interval (network stalls/backoff), multiple flush tasks can overlap and accumulate, increasing memory/CPU and potentially causing extra intake traffic. Consider awaiting the flush (as before), or track a single in-flight JoinHandle (skip spawning if one is still running), or use a bounded work queue to ensure only one flush runs at a time.

Suggested change:

```diff
-tokio::spawn(async move {
-    metrics_flusher.flush().await;
-});
+metrics_flusher.flush().await;
```

**Author:** There's a 5-second timeout (`DOGSTATSD_TIMEOUT_DURATION`) with a retry strategy of at most 3 retries (`LinearBackoff(3, 1)`). A flush would have to take longer than 10 seconds for multiple flush tasks to overlap and pile up, which would be very unusual, and the flush time is bounded by 15 seconds anyway. If flush tasks do overlap, the second flush task would just see an empty aggregator and send nothing.
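If single-flight behavior were ever wanted, the reviewer's JoinHandle idea could be sketched as below. This is illustrative only: it uses `std::thread` rather than tokio so it stands alone, and `FlushGuard` is a hypothetical name, not a type in this PR:

```rust
use std::thread;

/// Hypothetical single-flight guard: at most one flush runs at a time.
struct FlushGuard {
    inflight: Option<thread::JoinHandle<()>>,
}

impl FlushGuard {
    fn new() -> Self {
        Self { inflight: None }
    }

    /// Starts `work` unless a previous flush is still running.
    /// Returns true if a new flush was started, false if this tick was skipped.
    fn tick(&mut self, work: impl FnOnce() + Send + 'static) -> bool {
        if let Some(handle) = &self.inflight {
            if !handle.is_finished() {
                return false; // previous flush still in progress: skip this tick
            }
        }
        // Reap the finished handle (if any) before starting a new flush.
        if let Some(handle) = self.inflight.take() {
            let _ = handle.join();
        }
        self.inflight = Some(thread::spawn(work));
        true
    }
}
```

The tokio equivalent would hold a `tokio::task::JoinHandle` and apply the same is-finished check before spawning.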
```rust
const DOGSTATSD_FLUSH_INTERVAL: u64 = 10;
const ENHANCED_METRICS_COLLECTION_INTERVAL_SECS: u64 = 10;
```

**Copilot:** The PR description mentions collecting data every second, but `ENHANCED_METRICS_COLLECTION_INTERVAL_SECS` is set to 10 and the collector only runs on that interval. If the intended behavior is per-second collection (with a 10s flush), the constant/interval should be 1 (or the description should be updated to match the actual cadence).

**Author:** Updated in fc24a42.

I ended up changing it to 3 seconds since we wanted the 1 second interval for CPU metrics.
```rust
/// Resolves the instance ID from explicit values (used by tests).
fn resolve_instance_id_from(
    website_instance_id: Option<&str>,
    website_pod_name: Option<&str>,
    container_name: Option<&str>,
) -> Option<String> {
    website_instance_id
        .or(website_pod_name)
        .or(container_name)
        .map(String::from)
}
```

**Copilot:** `resolve_instance_id_from` treats empty strings as valid instance IDs. If (for example) `WEBSITE_INSTANCE_ID` is set but empty, it will block fallbacks and produce an `instance:` tag with an empty value. Consider filtering out empty/whitespace-only strings before applying the fallback chain.

**Author:** These env vars should either be injected by Azure with a value or not set at all.
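If empty values ever did appear, the fallback chain could filter them out. A self-contained sketch of the reviewer's suggestion (the `resolve_instance_id_filtered` name is illustrative):

```rust
/// Sketch: treat empty or whitespace-only values as unset so they
/// do not block the fallback chain or produce an empty `instance:` tag.
fn resolve_instance_id_filtered(
    website_instance_id: Option<&str>,
    website_pod_name: Option<&str>,
    container_name: Option<&str>,
) -> Option<String> {
    [website_instance_id, website_pod_name, container_name]
        .into_iter()
        .flatten()
        .find(|s| !s.trim().is_empty())
        .map(String::from)
}
```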
```rust
let instance_id = resolve_instance_id();
if let Some(ref id) = instance_id {
    info!("Instance ID resolved: {}", id);
} else {
    debug!("No instance ID found, instance metric will not be submitted");
}
```

**Copilot:** Logging the resolved instance ID at info level may leak a high-cardinality identifier into customer logs. Since this value is primarily useful for troubleshooting collector behavior, consider logging it at debug level (or logging only that an ID was resolved without printing the full value).
```rust
let needs_aggregator = dd_use_dogstatsd || dd_enhanced_metrics;

// The aggregator is shared between dogstatsd and enhanced metrics.
// It is started independently so that either can be enabled without the other.
// Only dogstatsd needs the dogstatsd listener
let (metrics_flusher, aggregator_handle) = if needs_aggregator {
    debug!("Creating metrics flusher and aggregator");

    let (flusher, handle) = start_aggregator(
        dd_api_key.clone(),
        dd_site,
        https_proxy.clone(),
        dogstatsd_tags,
        dd_statsd_metric_namespace,
        #[cfg(all(windows, feature = "windows-pipes"))]
        dd_dogstatsd_windows_pipe_name.clone(),
    )
```

**Copilot:** `needs_aggregator` is true whenever `DD_ENHANCED_METRICS_ENABLED` is on, even if metrics flushing can't be created (e.g., `DD_API_KEY` missing). In that case `instance_collector` is disabled anyway (`metrics_flusher.is_some()`), so the aggregator task is started but never used. Consider incorporating `dd_api_key.is_some()` (or the computed `metrics_flusher.is_some()`) into the gating so enhanced-metrics-only mode doesn't spin up the aggregator when it can't submit anything.

**Author:** `start_aggregator` checks for `DD_API_KEY` and logs an error if it's missing, so this misconfiguration will be surfaced without coupling aggregator creation to flusher validation.
```toml
[features]
default = []
windows-pipes = ["datadog-trace-agent/windows-pipes", "dogstatsd/windows-pipes"]
windows-enhanced-metrics = ["datadog-metrics-collector/windows-enhanced-metrics"]
```

**Copilot:** The windows-enhanced-metrics feature currently only forwards to datadog-metrics-collector/windows-enhanced-metrics, but that feature is empty and there's no `cfg(feature = ...)` usage in code. This makes the feature (and CI enabling it) effectively a no-op. Consider either removing it, or making enhanced-metrics compilation/runtime behavior actually conditional on it (e.g., optional dependency + `cfg(feature)` guards).

Suggested change:

```diff
-windows-enhanced-metrics = ["datadog-metrics-collector/windows-enhanced-metrics"]
```

**Author:** Fair, I removed the windows-specific feature for now since it's unneeded for instance metrics. We can add it back in the CPU enhanced metrics PR.
```toml
[features]
windows-enhanced-metrics = []
```

**Copilot:** windows-enhanced-metrics is defined but has no effect (empty feature set) and isn't referenced via `cfg(feature = "windows-enhanced-metrics")` anywhere in this crate. If it's meant to gate Windows support/compilation, consider adding the appropriate cfg usage (or removing the feature to avoid implying a toggle that doesn't exist).

Suggested change:

```diff
-[features]
-windows-enhanced-metrics = []
```
```diff
     retention-days: 3
   - if: ${{ inputs.runner == 'windows-2022' }}
     shell: bash
-    run: cargo build --release -p datadog-serverless-compat --features windows-pipes
+    run: cargo build --release -p datadog-serverless-compat --features windows-pipes,windows-enhanced-metrics
   - if: ${{ inputs.runner == 'windows-2022' }}
```

**Copilot:** These workflows enable windows-enhanced-metrics, but that Cargo feature is currently a no-op (no `cfg(feature=...)` usage and it doesn't change dependencies). Consider removing it from CI/build invocations unless/until it actually gates Windows enhanced-metrics behavior, to keep the build matrix meaningful.

```diff
     run: |
       if [[ "${{ inputs.runner }}" == "windows-2022" ]]; then
-        cargo nextest run --workspace --features datadog-serverless-compat/windows-pipes
+        cargo nextest run --workspace --features datadog-serverless-compat/windows-pipes,datadog-serverless-compat/windows-enhanced-metrics
```

**Copilot:** datadog-serverless-compat/windows-enhanced-metrics is enabled here, but the feature is currently a no-op (it doesn't gate compilation or behavior). Consider dropping it from the Windows test run unless it's made functional, so CI signals are clearer.

Suggested change:

```diff
-cargo nextest run --workspace --features datadog-serverless-compat/windows-pipes,datadog-serverless-compat/windows-enhanced-metrics
+cargo nextest run --workspace --features datadog-serverless-compat/windows-pipes
```
```rust
let mut flush_interval = interval(Duration::from_secs(DOGSTATSD_FLUSH_INTERVAL));
let mut enhanced_metrics_collection_interval = interval(Duration::from_secs(
    INSTANCE_METRICS_COLLECTION_INTERVAL_SECS,
));
flush_interval.tick().await; // discard first tick, which is instantaneous
enhanced_metrics_collection_interval.tick().await;
```

**Copilot:** `enhanced_metrics_collection_interval` is created and driven in the main `tokio::select!` loop even when `instance_collector` is None (e.g., enhanced metrics disabled). That makes the loop wake every 3 seconds unnecessarily, increasing CPU wakeups in the common "dogstatsd-only" configuration. Consider only creating/running the enhanced-metrics interval when a collector exists (e.g., spawn a separate task for the collector loop, or make the interval optional and omit the select branch when disabled).

**Author:** Added a precondition for the collector in the `tokio::select!` loop in ec977b2.
```rust
if dd_use_dogstatsd {
    debug!("Starting dogstatsd");
    let _ = start_dogstatsd_listener(
        dd_dogstatsd_port,
        handle.clone(),
        dd_statsd_metric_namespace,
        #[cfg(all(windows, feature = "windows-pipes"))]
        dd_dogstatsd_windows_pipe_name.clone(),
    )
    .await;
}
```

**Copilot:** `start_dogstatsd_listener` returns a CancellationToken, but the caller immediately discards it (`let _ = ...`). This makes the return value misleading and removes any ability to gracefully stop the DogStatsD listener task. Consider either storing the token and wiring it into a shutdown path, or changing `start_dogstatsd_listener` to return `()` if cancellation isn't needed.

**Author:** This is pre-existing behavior; the CancellationToken also gets discarded on the main branch.
```rust
// Build tags: start with shared tags, add instance
let instance_tag = format!("instance:{}", instance_id);
let tags = match &self.tags {
    Some(existing) => {
        let mut combined = existing.clone();
        if let Ok(id_tag) = SortedTags::parse(&instance_tag) {
            combined.extend(&id_tag);
        }
        Some(combined)
    }
    None => SortedTags::parse(&instance_tag).ok(),
};
```

**Copilot:** `collect_and_submit` rebuilds the combined tag set on every tick by cloning `self.tags` and reparsing the `instance:` tag. Since `instance_id` is resolved once and never changes, consider precomputing the final SortedTags (shared tags + instance tag) in `new()` and storing it, so each tick only emits the metric without extra allocations/parsing.

**Author:** Updated to precompute the tag set in `new()`.
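The precompute-once pattern can be sketched independently of `SortedTags`; the `InstanceCollector` struct here is illustrative, not the PR's actual type:

```rust
/// Illustrative collector that builds its full tag set once at construction,
/// so each collection tick reuses the same tags with no per-tick parsing.
struct InstanceCollector {
    tags: Vec<String>,
}

impl InstanceCollector {
    fn new(shared_tags: &[&str], instance_id: &str) -> Self {
        let mut tags: Vec<String> = shared_tags.iter().map(|t| t.to_string()).collect();
        tags.push(format!("instance:{}", instance_id));
        tags.sort(); // mirror SortedTags' sorted ordering
        Self { tags }
    }

    /// Called every tick: no allocation or parsing, just a borrow.
    fn tags(&self) -> &[String] {
        &self.tags
    }
}
```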
```rust
#[test]
fn test_resolve_instance_id_falls_back_to_pod_name() {
    let id = resolve_instance_id_from(None, Some("pod-xyz"), Some("container-123"));
    assert_eq!(id, Some("pod-xyz".to_string()));
}

#[test]
fn test_resolve_instance_id_falls_back_to_container_name() {
    let id = resolve_instance_id_from(None, None, Some("container-123"));
    assert_eq!(id, Some("container-123".to_string()));
}
```

**Copilot:** The instance ID resolution tests only cover the fallback cases. To fully lock in the intended precedence, add a test asserting `WEBSITE_INSTANCE_ID` wins over `WEBSITE_POD_NAME`/`CONTAINER_NAME`, and (optionally) a test for the all-None case returning None.

**Author:** If any of these env vars exist, they should give the correct instance value, so I think this test case is unnecessary. The all-None case also isn't relevant since these are env vars injected by Azure. End-to-end tests to ensure these instance-identifying environment variables don't change would be more helpful.
…mit(), change missing instance log to warn
Adds a new enhanced metric to Azure Functions to track the unique identifier of the currently executing function instance.
What does this PR do?
- Gated behind `DD_ENHANCED_METRICS_ENABLED=true` (default on)
- Submits `azure.functions.enhanced.instance`, a gauge metric (value 1.0) tagged with the instance value in the `instance` tag
- Pulls shared tags from `libdd-common`: `resource_group`, `subscription_id`, `name`, `region`, `plan_tier`, `service`, `env`, `version`, `serverless_compat_version`
- Resolves the instance ID from `WEBSITE_INSTANCE_ID`, `WEBSITE_POD_NAME`, and `CONTAINER_NAME`: plans that set `WEBSITE_INSTANCE_ID` use it to identify the instance, and Flex Consumption and Consumption plans use `WEBSITE_POD_NAME` or `CONTAINER_NAME`
- Classifies `azure.functions.*` metrics as `ServerlessEnhanced` origin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in the Datadog Metrics Summary
- Splits `start_dogstatsd` into two functions: `start_aggregator`, which starts the aggregator service and metrics flusher, and `start_dogstatsd_listener`, which enables custom metrics to be received from user code
- Allows enhanced metrics to flush even when `DD_USE_DOGSTATSD` is off, as both DogStatsD and enhanced metrics use the metrics aggregator
https://datadoghq.atlassian.net/browse/SVLS-8757
Describe how to test/QA your changes
- Verify `azure.functions.enhanced.instance` is emitted
- Verify origin classification as `ServerlessEnhanced`