Skip to content

Commit 06d0872

Browse files
lym953claude
andauthored
[SVLS-9175] feat: emit OOM metric on memory equality with per-request dedup (#1241)
## Background From our knowledge (before this PR), here's the behavior when each runtime OOMs: - emits runtime-specific error message. This can happen on **Java**, **Node** (case 1 in the table below) and **.NET** - In `PlatformRuntimeDone` event, `error_type` is `Runtime.OutOfMemory`. This can happen on **Python** and **Ruby**. - In `PlatformReport` event, `max_memory_used == memory_size`. This can happen on **Python**, **Ruby**, **Node** and **Go**. To capture OOM for all these scenarios (except Node case 2, which was just called out in #1237) without double counting, right now the extension emits `aws.lambda.enhanced.out_of_memory` metric in these scenarios: - when we see runtime-specific error messages for Java, Node and .NET - when we see `Runtime.OutOfMemory` - when we see `max_memory_used == memory_size` for Go, i.e. only when runtime is `provided.al2`. We don't do this for other runtimes (Python, Ruby, Node) to avoid double counting. <img width="768" height="323" alt="image" src="https://github.com/user-attachments/assets/549a8820-6b86-462d-a857-0269d2990a02" /> ## Problem In issue #1237, a customer called out a new scenario: "Node (case 2)" in the table. The only evidence of OOM is `max_memory_used == memory_size`, and there is no runtime-specific log message. As a result, OOMs like this are not captured by the OOM enhanced metric. ## This PR - Regardless of runtime, use all the three ways to capture OOM. - In addition, dedup by request_id to avoid double counting. - Add one integration test per runtime (except for Node, which has 2 tests) ## Test plan Passed the added unit tests and integration tests. ## To reviewers Most of the code changes are for integration tests. ## Details (generated by Claude Code) Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory limit (`Memory Size 192 MB / Max Memory Used 192 MB`, `Status: timeout`) did not emit `aws.lambda.enhanced.out_of_memory` because none of the three existing detection paths matched. - **Why the existing paths missed it.** V8 spent its budget in GC rather than declaring `JavaScript heap out of memory`, so the runtime log-line match never fired. The runtime crashed on a wall-clock timeout, so `PlatformRuntimeDone` reported no `error_type`. And the `max_memory_used_mb == memory_size_mb` check in `PlatformReport` was gated on `runtime.starts_with("provided.al")` to avoid double-counting against the log path, so Node was excluded. - **What changes.** Drop the `provided.al*` restriction so the equality check applies to every runtime. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and `Runtime.OutOfMemory` simultaneously), add a per-`Context` `oom_emitted` flag. All three detection paths funnel through a new `Processor::try_increment_oom_metric`, which checks/sets the flag and is a no-op on subsequent calls for the same `request_id`. - **Plumbing.** `Event::OutOfMemory` now carries an `Option<String> request_id`. The log-path detector reads it from `LambdaProcessor::invocation_context.request_id` (set on `PlatformStart`, cleared on `PlatformRuntimeDone`/`PlatformReport`). `None` is only realistic in Managed Instance mode (extensions can't subscribe to INVOKE there); the helper falls back to a best-effort emit without dedup in that case. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a5742fc commit 06d0872

32 files changed

Lines changed: 1271 additions & 39 deletions

File tree

.gitlab/datasources/test-suites.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,3 +4,5 @@ test_suites:
44
- name: snapstart
55
- name: lmi
66
- name: auth
7+
- name: oom
8+
- name: lmi-oom

.gitlab/templates/pipeline.yaml.tpl

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -505,6 +505,40 @@ build node lambdas:
505505
- cd integration-tests
506506
- ./scripts/build-node.sh
507507

508+
build ruby lambdas:
509+
stage: integration-tests
510+
image: registry.ddbuild.io/images/docker:27.3.1
511+
tags: ["docker-in-docker:arm64"]
512+
rules:
513+
- when: on_success
514+
needs: []
515+
artifacts:
516+
expire_in: 1 hour
517+
paths:
518+
- integration-tests/lambda/*/*.rb
519+
script:
520+
- cd integration-tests
521+
- ./scripts/build-ruby.sh
522+
523+
build go lambdas:
524+
stage: integration-tests
525+
image: registry.ddbuild.io/images/docker:27.3.1
526+
tags: ["docker-in-docker:arm64"]
527+
rules:
528+
- when: on_success
529+
needs: []
530+
cache:
531+
key: go-mod-cache-${CI_COMMIT_REF_SLUG}
532+
paths:
533+
- integration-tests/.cache/go-mod/
534+
artifacts:
535+
expire_in: 1 hour
536+
paths:
537+
- integration-tests/lambda/*/bin/bootstrap
538+
script:
539+
- cd integration-tests
540+
- ./scripts/build-go.sh
541+
508542
# Integration Tests - Publish arm64 layer with integration test prefix
509543
publish integration layer (arm64):
510544
stage: integration-tests
@@ -581,12 +615,16 @@ integration-suite:
581615
- build dotnet lambdas
582616
- build python lambdas
583617
- build node lambdas
618+
- build ruby lambdas
619+
- build go lambdas
584620
dependencies:
585621
- publish integration layer (arm64)
586622
- build java lambdas
587623
- build dotnet lambdas
588624
- build python lambdas
589625
- build node lambdas
626+
- build ruby lambdas
627+
- build go lambdas
590628
variables:
591629
IDENTIFIER: ${CI_COMMIT_SHORT_SHA}
592630
AWS_DEFAULT_REGION: us-east-1

bottlecap/src/bin/bottlecap/main.rs

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -841,9 +841,12 @@ async fn handle_event_bus_event(
841841
stats_concentrator: StatsConcentratorHandle,
842842
) -> Option<TelemetryEvent> {
843843
match event {
844-
Event::OutOfMemory(event_timestamp) => {
844+
Event::OutOfMemory {
845+
request_id,
846+
timestamp,
847+
} => {
845848
if let Err(e) = invocation_processor_handle
846-
.on_out_of_memory_error(event_timestamp)
849+
.on_out_of_memory_error(request_id, timestamp)
847850
.await
848851
{
849852
error!("Failed to send out of memory error to processor: {}", e);

bottlecap/src/event_bus/mod.rs

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,13 @@ mod constants;
77
#[derive(Debug)]
88
pub enum Event {
99
Telemetry(TelemetryEvent),
10-
OutOfMemory(i64),
10+
OutOfMemory {
11+
/// Lambda `request_id` of the invocation the OOM belongs to, when known.
12+
/// Used by the invocation processor to dedupe against other OOM detection
13+
/// paths (`PlatformRuntimeDone` `error_type`, `PlatformReport` memory equality).
14+
request_id: Option<String>,
15+
timestamp: i64,
16+
},
1117
Tombstone,
1218
}
1319

bottlecap/src/lifecycle/invocation/context.rs

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,12 @@ pub struct Context {
4343
/// tracing.
4444
///
4545
pub extracted_span_context: Option<SpanContext>,
46+
/// Whether the `aws.lambda.enhanced.out_of_memory` metric has already been
47+
/// emitted for this invocation. Multiple detection paths can fire for the
48+
/// same OOM (runtime log, `Runtime.OutOfMemory` `error_type` in
49+
/// `PlatformRuntimeDone`, `max_memory_used == memory_size` in `PlatformReport`);
50+
/// this flag dedupes them.
51+
pub oom_emitted: bool,
4652
}
4753

4854
/// Struct containing the information needed to reparent a span.
@@ -94,6 +100,7 @@ impl Default for Context {
94100
snapstart_restore_span: None,
95101
tracer_span: None,
96102
extracted_span_context: None,
103+
oom_emitted: false,
97104
}
98105
}
99106
}

bottlecap/src/lifecycle/invocation/processor.rs

Lines changed: 189 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -508,7 +508,7 @@ impl Processor {
508508
debug!(
509509
"Invocation Processor | PlatformRuntimeDone | Got Runtime.OutOfMemory. Incrementing OOM metric."
510510
);
511-
self.enhanced_metrics.increment_oom_metric(timestamp);
511+
self.try_increment_oom_metric(Some(request_id), timestamp);
512512
}
513513
}
514514

@@ -909,25 +909,25 @@ impl Processor {
909909

910910
/// Handles `OnDemand` mode platform report processing.
911911
///
912-
/// Processes OnDemand-specific metrics including OOM detection for provided.al runtimes
913-
/// and post-runtime duration calculation.
912+
/// Processes OnDemand-specific metrics including OOM detection by memory-size
913+
/// equality and post-runtime duration calculation.
914914
fn handle_ondemand_report(
915915
&mut self,
916916
request_id: &String,
917917
metrics: OnDemandReportMetrics,
918918
timestamp: i64,
919919
) {
920-
// For provided.al runtimes, if the last invocation hit the memory limit, increment the OOM metric.
921-
// We do this for provided.al runtimes because we didn't find another way to detect this under provided.al.
922-
// We don't do this for other runtimes to avoid double counting.
923-
if let Some(runtime) = &self.runtime
924-
&& runtime.starts_with("provided.al")
925-
&& metrics.max_memory_used_mb == metrics.memory_size_mb
926-
{
920+
// If the invocation hit the memory limit, increment the OOM metric. This catches
921+
// OOM-induced failures that don't surface through a runtime-specific log line or a
922+
// `Runtime.OutOfMemory` error_type — most notably the suppressed-init / timeout-at-cap
923+
// pattern reported in datadog-lambda-extension#1237 (Node). Best-effort dedup
924+
// against the other two detection paths is handled by `try_increment_oom_metric`
925+
// (it can still double-count in edge cases — see that function's doc).
926+
if metrics.max_memory_used_mb == metrics.memory_size_mb {
927927
debug!(
928928
"Invocation Processor | PlatformReport | Last invocation hit memory limit. Incrementing OOM metric."
929929
);
930-
self.enhanced_metrics.increment_oom_metric(timestamp);
930+
self.try_increment_oom_metric(Some(request_id), timestamp);
931931
}
932932

933933
// Calculate and set post-runtime duration if context is available
@@ -1395,7 +1395,52 @@ impl Processor {
13951395
Some(error_tags)
13961396
}
13971397

1398-
pub fn on_out_of_memory_error(&mut self, timestamp: i64) {
1398+
pub fn on_out_of_memory_error(&mut self, request_id: Option<&String>, timestamp: i64) {
1399+
self.try_increment_oom_metric(request_id, timestamp);
1400+
}
1401+
1402+
/// Best-effort dedup wrapper around `enhanced_metrics.increment_oom_metric`.
1403+
/// The metric MAY be double-counted in edge cases — see below.
1404+
///
1405+
/// Several detection paths can fire for the same invocation:
1406+
/// 1. A runtime-specific OOM log line (logs processor → `Event::OutOfMemory`)
1407+
/// 2. `error_type == "Runtime.OutOfMemory"` in `PlatformRuntimeDone`
1408+
/// 3. `max_memory_used_mb == memory_size_mb` in `PlatformReport`
1409+
///
1410+
/// When `request_id` is supplied AND the matching context is still in the
1411+
/// buffer, the per-invocation `Context::oom_emitted` flag guarantees one
1412+
/// emission per `request_id`. The metric is double-counted when either:
1413+
/// - `request_id` is `None` (log line beat `PlatformStart` to
1414+
/// `LambdaProcessor`, or it landed after `PlatformRuntimeDone` cleared
1415+
/// the slot) and another path subsequently emits with `Some(rid)`; or
1416+
/// - the context has been evicted from the buffer (capacity is fixed —
1417+
/// see `MAX_CONTEXT_BUFFER_SIZE`) between `PlatformStart` and this
1418+
/// call, so the flag has nowhere to live.
1419+
///
1420+
/// Both branches still emit (so OOMs are never under-counted) and log a
1421+
/// `debug!` line.
1422+
fn try_increment_oom_metric(&mut self, request_id: Option<&String>, timestamp: i64) {
1423+
if let Some(rid) = request_id {
1424+
if let Some(ctx) = self.context_buffer.get_mut(rid) {
1425+
if ctx.oom_emitted {
1426+
debug!(
1427+
"Invocation Processor | OOM metric already emitted for request_id {}, skipping",
1428+
rid
1429+
);
1430+
return;
1431+
}
1432+
ctx.oom_emitted = true;
1433+
} else {
1434+
debug!(
1435+
"Invocation Processor | Emitting OOM metric without dedup: context not found for request_id {} (likely evicted from context buffer)",
1436+
rid
1437+
);
1438+
}
1439+
} else {
1440+
debug!(
1441+
"Invocation Processor | Emitting OOM metric without dedup: no request_id available (OOM log processed before PlatformStart or after PlatformRuntimeDone)"
1442+
);
1443+
}
13991444
self.enhanced_metrics.increment_oom_metric(timestamp);
14001445
}
14011446

@@ -2445,4 +2490,136 @@ mod tests {
24452490
"pre-existing _dd.appsec.enabled value must not be overwritten"
24462491
);
24472492
}
2493+
2494+
/// Two OOM signals for the same `request_id` increment the metric exactly once.
2495+
/// Exercises the `Context::oom_emitted` dedup flag.
2496+
#[tokio::test]
2497+
async fn test_try_increment_oom_metric_dedupes_same_request_id() {
2498+
let mut p = setup();
2499+
// Insert the context directly so we don't go through `on_invoke_event`, which
2500+
// would populate dynamic tags (`cold_start:true`) and complicate the query.
2501+
let request_id = String::from("req-dedup");
2502+
p.context_buffer.start_context(&request_id, Span::default());
2503+
2504+
let now: i64 = std::time::UNIX_EPOCH
2505+
.elapsed()
2506+
.expect("clock")
2507+
.as_secs()
2508+
.try_into()
2509+
.unwrap_or_default();
2510+
2511+
p.on_out_of_memory_error(Some(&request_id), now);
2512+
p.on_out_of_memory_error(Some(&request_id), now);
2513+
2514+
let ts = (now / 10) * 10;
2515+
let entry = p
2516+
.enhanced_metrics
2517+
.aggr_handle
2518+
.get_entry_by_id(
2519+
crate::metrics::enhanced::constants::OUT_OF_MEMORY_METRIC.into(),
2520+
None,
2521+
ts,
2522+
)
2523+
.await
2524+
.unwrap()
2525+
.expect("OOM metric must be emitted at least once");
2526+
2527+
let sketch = entry.value.get_sketch().expect("distribution sketch");
2528+
let sum = sketch.sum().expect("sketch sum");
2529+
assert!(
2530+
(sum - 1.0).abs() < f64::EPSILON,
2531+
"OOM sum must be 1.0 (deduped), got {sum}"
2532+
);
2533+
2534+
// And the context flag should now reflect that we emitted.
2535+
assert!(
2536+
p.context_buffer
2537+
.get(&request_id)
2538+
.expect("context")
2539+
.oom_emitted,
2540+
"oom_emitted flag must be set after the first emission"
2541+
);
2542+
}
2543+
2544+
/// OOM signals for different `request_id`s each emit a metric — dedup is scoped
2545+
/// per request, not globally.
2546+
#[tokio::test]
2547+
async fn test_try_increment_oom_metric_distinct_request_ids_emit_separately() {
2548+
let mut p = setup();
2549+
let req1 = String::from("req-a");
2550+
let req2 = String::from("req-b");
2551+
p.context_buffer.start_context(&req1, Span::default());
2552+
p.context_buffer.start_context(&req2, Span::default());
2553+
2554+
let now: i64 = std::time::UNIX_EPOCH
2555+
.elapsed()
2556+
.expect("clock")
2557+
.as_secs()
2558+
.try_into()
2559+
.unwrap_or_default();
2560+
2561+
p.on_out_of_memory_error(Some(&req1), now);
2562+
p.on_out_of_memory_error(Some(&req2), now);
2563+
2564+
let ts = (now / 10) * 10;
2565+
let entry = p
2566+
.enhanced_metrics
2567+
.aggr_handle
2568+
.get_entry_by_id(
2569+
crate::metrics::enhanced::constants::OUT_OF_MEMORY_METRIC.into(),
2570+
None,
2571+
ts,
2572+
)
2573+
.await
2574+
.unwrap()
2575+
.expect("OOM metric must be emitted");
2576+
2577+
let sketch = entry.value.get_sketch().expect("distribution sketch");
2578+
let sum = sketch.sum().expect("sketch sum");
2579+
assert!(
2580+
(sum - 2.0).abs() < f64::EPSILON,
2581+
"OOM sum must be 2.0 (one per request_id), got {sum}"
2582+
);
2583+
}
2584+
2585+
/// In `handle_ondemand_report`, when `max_memory_used_mb == memory_size_mb`,
2586+
/// the OOM metric should be incremented exactly once for that invocation.
2587+
#[tokio::test]
2588+
async fn test_handle_ondemand_report_emits_oom_on_memory_equality() {
2589+
let mut p = setup();
2590+
let request_id = String::from("req-eq");
2591+
p.context_buffer.start_context(&request_id, Span::default());
2592+
2593+
let now: i64 = std::time::UNIX_EPOCH
2594+
.elapsed()
2595+
.expect("clock")
2596+
.as_secs()
2597+
.try_into()
2598+
.unwrap_or_default();
2599+
2600+
let metrics = OnDemandReportMetrics {
2601+
duration_ms: 100.0,
2602+
billed_duration_ms: 100,
2603+
memory_size_mb: 1024,
2604+
max_memory_used_mb: 1024,
2605+
init_duration_ms: None,
2606+
restore_duration_ms: None,
2607+
};
2608+
p.handle_ondemand_report(&request_id, metrics, now);
2609+
2610+
let ts = (now / 10) * 10;
2611+
assert!(
2612+
p.enhanced_metrics
2613+
.aggr_handle
2614+
.get_entry_by_id(
2615+
crate::metrics::enhanced::constants::OUT_OF_MEMORY_METRIC.into(),
2616+
None,
2617+
ts
2618+
)
2619+
.await
2620+
.unwrap()
2621+
.is_some(),
2622+
"OOM must be emitted when max_memory_used_mb == memory_size_mb"
2623+
);
2624+
}
24482625
}

bottlecap/src/lifecycle/invocation/processor_service.rs

Lines changed: 12 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -118,6 +118,7 @@ pub enum ProcessorCommand {
118118
execution_status: Option<String>,
119119
},
120120
OnOutOfMemoryError {
121+
request_id: Option<String>,
121122
timestamp: i64,
122123
},
123124
OnShutdownEvent,
@@ -407,10 +408,14 @@ impl InvocationProcessorHandle {
407408

408409
pub async fn on_out_of_memory_error(
409410
&self,
411+
request_id: Option<String>,
410412
timestamp: i64,
411413
) -> Result<(), mpsc::error::SendError<ProcessorCommand>> {
412414
self.sender
413-
.send(ProcessorCommand::OnOutOfMemoryError { timestamp })
415+
.send(ProcessorCommand::OnOutOfMemoryError {
416+
request_id,
417+
timestamp,
418+
})
414419
.await
415420
}
416421

@@ -632,8 +637,12 @@ impl InvocationProcessorService {
632637
)
633638
.await;
634639
}
635-
ProcessorCommand::OnOutOfMemoryError { timestamp } => {
636-
self.processor.on_out_of_memory_error(timestamp);
640+
ProcessorCommand::OnOutOfMemoryError {
641+
request_id,
642+
timestamp,
643+
} => {
644+
self.processor
645+
.on_out_of_memory_error(request_id.as_ref(), timestamp);
637646
}
638647
ProcessorCommand::OnShutdownEvent => {
639648
self.processor.on_shutdown_event();

0 commit comments

Comments
 (0)