fix: emit durable_function:true tag on first On-Demand cold start invocation #1185

Closed
jchrostek-dd wants to merge 2 commits into main from john/svls-8800

@jchrostek-dd
Contributor

What

Fixes a bug (SVLS-8800) where the aws.lambda.enhanced.invocation metric was missing the durable_function:true tag on the first invocation after a cold start in On-Demand Lambda mode.

Root Cause

In On-Demand mode, the Telemetry API delivers PlatformInitStart asynchronously — it arrives at the extension after the Invoke event from next_event(). The extension was emitting the invocation metric immediately on receiving the Invoke event, before on_platform_init_start had a chance to call set_durable_function_tag().

This race was already noted in a code comment (processor.rs:343) for span context purposes, but was not accounted for in the metric tag path.
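
To make the ordering problem concrete, here is a minimal sketch of the pre-fix behavior. It is not the actual bottlecap code: `NaiveProcessor`, the `emitted` vector, and the `is_durable_runtime` parameter are hypothetical stand-ins, and the tag assignment only approximates what set_durable_function_tag() does.

```rust
/// Hypothetical stand-in for the pre-fix behavior (not the real Processor).
#[derive(Default)]
pub struct NaiveProcessor {
    /// Set once PlatformInitStart has been handled for a durable runtime.
    durable_function: bool,
    /// Tags attached to each emitted invocation metric, in emission order.
    pub emitted: Vec<Vec<String>>,
}

impl NaiveProcessor {
    /// Emits the invocation metric immediately, using whatever tags are known now.
    pub fn on_invoke(&mut self) {
        let mut tags = Vec::new();
        if self.durable_function {
            tags.push("durable_function:true".to_string());
        }
        self.emitted.push(tags);
    }

    /// Assumption: durability is known when PlatformInitStart is handled.
    /// This stands in for the call to set_durable_function_tag().
    pub fn on_platform_init_start(&mut self, is_durable_runtime: bool) {
        if is_durable_runtime {
            self.durable_function = true;
        }
    }
}
```

Feeding this sketch the On-Demand ordering (on_invoke, then on_platform_init_start(true), then a second on_invoke) leaves the first metric untagged and only the second one tagged, which matches the symptom described above.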

Fix

Defer invocation metric emission for On-Demand cold starts until PlatformInitStart is received (which sets the durable_function tag). PlatformInitReport serves as a fallback if PlatformInitStart is never received.

SnapStart restores are explicitly excluded from the deferral path — they fire PlatformRestoreStart/PlatformRestoreReport instead of PlatformInitStart/PlatformInitReport, so deferring would cause the metric to be silently dropped.
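
The deferral logic can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the two field names come from the description, while `InitKind`, the `emitted` vector, `flush_pending`, and the `is_durable_runtime` parameter are hypothetical.

```rust
/// Hypothetical classification of how the sandbox was initialized.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InitKind {
    OnDemandColdStart,
    SnapStartRestore,
    Warm,
}

/// Simplified model of the processor state touched by the fix.
#[derive(Default)]
pub struct Processor {
    platform_init_start_received: bool,
    pending_invocation_metric_timestamp: Option<i64>,
    durable_function: bool,
    /// Emitted metrics as (timestamp, tags); hypothetical, for inspection only.
    pub emitted: Vec<(i64, Vec<String>)>,
}

impl Processor {
    /// Defer only on an On-Demand cold start whose PlatformInitStart is still outstanding.
    pub fn on_invoke(&mut self, kind: InitKind, timestamp: i64) {
        let must_wait =
            kind == InitKind::OnDemandColdStart && !self.platform_init_start_received;
        if must_wait {
            self.pending_invocation_metric_timestamp = Some(timestamp);
        } else {
            // SnapStart restores and warm invocations emit right away.
            self.emit(timestamp);
        }
    }

    /// Sets the tag first, then releases any deferred metric.
    pub fn on_platform_init_start(&mut self, is_durable_runtime: bool) {
        self.platform_init_start_received = true;
        if is_durable_runtime {
            self.durable_function = true; // stand-in for set_durable_function_tag()
        }
        self.flush_pending();
    }

    /// Fallback: release the metric if PlatformInitStart never arrived.
    pub fn on_platform_init_report(&mut self) {
        self.flush_pending();
    }

    /// Emits the deferred metric at most once, since take() clears the slot.
    fn flush_pending(&mut self) {
        if let Some(timestamp) = self.pending_invocation_metric_timestamp.take() {
            self.emit(timestamp);
        }
    }

    fn emit(&mut self, timestamp: i64) {
        let mut tags = Vec::new();
        if self.durable_function {
            tags.push("durable_function:true".to_string());
        }
        self.emitted.push((timestamp, tags));
    }
}
```

Holding the pending timestamp in an Option keeps the original invocation time on the metric and ensures the fallback in on_platform_init_report cannot emit it a second time after on_platform_init_start has already flushed it.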

Changes

  • bottlecap/src/lifecycle/invocation/processor.rs: Added a platform_init_start_received flag and a pending_invocation_metric_timestamp field to Processor. On an On-Demand cold start the invocation metric is deferred, and on_platform_init_start emits it after setting the tag. SnapStart restores are excluded from deferral.

Tests

Unit tests (3 new, 525 total pass):

  • test_durable_function_tag_present_on_first_on_demand_cold_start_invocation — simulates the race, asserts durable_function:true present
  • test_invocation_metric_emitted_after_platform_init_start_for_regular_runtime — same race, non-durable runtime, no durable_function tag
  • test_invocation_metric_emitted_immediately_for_snapstart_restore — asserts SnapStart emits immediately, no deferral

Integration tests (new, all pass against real AWS):

  • should emit aws.lambda.enhanced.invocation metric for first cold-start invocation — real Lambda cold start, real Datadog metric
  • invocation metric should have a positive count
  • should NOT emit a metric tagged durable_function:true for a non-durable runtime

Jira: https://datadoghq.atlassian.net/browse/SVLS-8800

…ocation

Fixes SVLS-8800. In On-Demand mode, PlatformInitStart telemetry arrives
after the Invoke event, so the invocation metric was emitted before the
durable_function:true tag could be set.

Fix: defer metric emission on On-Demand cold start until PlatformInitStart
(or PlatformInitReport as fallback) is received. SnapStart restores are
explicitly excluded from deferral (they fire PlatformRestoreStart/Report,
not PlatformInitStart).

New unit tests cover: durable race scenario, non-durable race, and
SnapStart immediate emission. Integration tests validate real AWS behavior.