fix: emit durable_function:true tag on first On-Demand cold start invocation #1185
Closed
jchrostek-dd wants to merge 2 commits into main
…ocation Fixes SVLS-8800. In On-Demand mode, PlatformInitStart telemetry arrives after the Invoke event, so the invocation metric was emitted before the durable_function:true tag could be set. Fix: defer metric emission on On-Demand cold start until PlatformInitStart (or PlatformInitReport as fallback) is received. SnapStart restores are explicitly excluded from deferral (they fire PlatformRestoreStart/Report, not PlatformInitStart). New unit tests cover: durable race scenario, non-durable race, and SnapStart immediate emission. Integration tests validate real AWS behavior.
What
Fixes a bug (SVLS-8800) where the `aws.lambda.enhanced.invocation` metric was missing the `durable_function:true` tag for the first invocation after a cold start in On-Demand Lambda mode.

Root Cause
In On-Demand mode, the Telemetry API delivers `PlatformInitStart` asynchronously: it arrives at the extension after the `Invoke` event from `next_event()`. The extension was emitting the invocation metric immediately on receiving the `Invoke` event, before `on_platform_init_start` had a chance to call `set_durable_function_tag()`.

This race was already noted in a code comment (`processor.rs:343`) for span context purposes, but was not accounted for in the metric tag path.

Fix
Defer invocation metric emission for On-Demand cold starts until `PlatformInitStart` is received (which sets the `durable_function` tag). `PlatformInitReport` serves as a fallback if `PlatformInitStart` is never received.

SnapStart restores are explicitly excluded from the deferral path: they fire `PlatformRestoreStart`/`PlatformRestoreReport` instead of `PlatformInitStart`/`PlatformInitReport`, so deferring would cause the metric to be silently dropped.

Changes
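As an illustration of the deferral described under Fix, here is a minimal, self-contained Rust sketch. It is not the code from this PR: the `InitKind` enum, the `emitted` vector, the `durable_function` boolean, and the `is_durable` parameter are hypothetical stand-ins for the real metrics pipeline and for however the extension detects a durable runtime. Only the names `platform_init_start_received`, `pending_invocation_metric_timestamp`, and `on_platform_init_start` are taken from the description.

```rust
/// Simplified stand-in for how the execution environment was initialized.
/// (Hypothetical type, not from the PR.)
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum InitKind {
    OnDemandColdStart,
    SnapStartRestore,
    Warm,
}

/// Simplified model of the processor state touched by this change.
#[derive(Default)]
struct Processor {
    platform_init_start_received: bool,
    pending_invocation_metric_timestamp: Option<i64>,
    /// Assumption: a plain flag standing in for the durable_function tag state.
    durable_function: bool,
    /// Assumption: emitted metrics recorded as (timestamp, tags) for inspection.
    emitted: Vec<(i64, Vec<String>)>,
}

impl Processor {
    /// Called when the Invoke event arrives.
    fn on_invoke(&mut self, kind: InitKind, timestamp: i64) {
        // Defer only for On-Demand cold starts whose PlatformInitStart has not
        // arrived yet. SnapStart restores never send PlatformInitStart, so
        // deferring them would drop the metric.
        let defer =
            kind == InitKind::OnDemandColdStart && !self.platform_init_start_received;
        if defer {
            self.pending_invocation_metric_timestamp = Some(timestamp);
        } else {
            self.emit(timestamp);
        }
    }

    /// Called when PlatformInitStart telemetry arrives.
    fn on_platform_init_start(&mut self, is_durable: bool) {
        self.platform_init_start_received = true;
        if is_durable {
            self.durable_function = true;
        }
        // Emit the deferred metric only after the tag state is settled.
        self.flush_pending();
    }

    /// Fallback: PlatformInitReport flushes a metric that is still pending.
    fn on_platform_init_report(&mut self) {
        self.flush_pending();
    }

    fn flush_pending(&mut self) {
        // take() clears the pending timestamp, so the metric is emitted once.
        if let Some(timestamp) = self.pending_invocation_metric_timestamp.take() {
            self.emit(timestamp);
        }
    }

    fn emit(&mut self, timestamp: i64) {
        let mut tags = Vec::new();
        if self.durable_function {
            tags.push("durable_function:true".to_string());
        }
        self.emitted.push((timestamp, tags));
    }
}
```

In this model, calling `on_invoke(InitKind::OnDemandColdStart, ts)` before `on_platform_init_start(true)` records nothing until the telemetry event arrives, at which point a single metric carrying `durable_function:true` is emitted with the original timestamp. Storing the timestamp in an `Option` and draining it with `take()` keeps the fallback path from emitting the metric a second time.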
`bottlecap/src/lifecycle/invocation/processor.rs`: Added a `platform_init_start_received` flag and a `pending_invocation_metric_timestamp` field to `Processor`. An On-Demand cold start defers the metric; `on_platform_init_start` emits it after setting the tag. SnapStart is excluded from deferral.

Tests
Unit tests (3 new, 525 total pass):
- `test_durable_function_tag_present_on_first_on_demand_cold_start_invocation`: simulates the race, asserts `durable_function:true` is present
- `test_invocation_metric_emitted_after_platform_init_start_for_regular_runtime`: same race, non-durable runtime, no `durable_function` tag
- `test_invocation_metric_emitted_immediately_for_snapstart_restore`: asserts SnapStart emits immediately, with no deferral

Integration tests (new, all pass against real AWS):
- should emit aws.lambda.enhanced.invocation metric for first cold-start invocation: real Lambda cold start, real Datadog metric
- invocation metric should have a positive count
- should NOT emit a metric tagged durable_function:true for a non-durable runtime

Jira: https://datadoghq.atlassian.net/browse/SVLS-8800