[SVLS-9175] feat: emit OOM metric on memory equality with per-request dedup (#1241)
## Background
To our knowledge, before this PR an OOM surfaced in one or more of the
following ways, depending on the runtime:
- The runtime emits a runtime-specific error message. This can happen on
**Java**, **Node** (case 1 in the table below) and **.NET**.
- In the `PlatformRuntimeDone` event, `error_type` is
`Runtime.OutOfMemory`. This can happen on **Python** and **Ruby**.
- In the `PlatformReport` event, `max_memory_used == memory_size`. This
can happen on **Python**, **Ruby**, **Node** and **Go**.
To capture OOMs in all these scenarios (except Node case 2, which was
only recently called out in #1237) without double counting, the
extension currently emits the `aws.lambda.enhanced.out_of_memory` metric
in these cases:
- when we see runtime-specific error messages for Java, Node and .NET
- when we see `Runtime.OutOfMemory`
- when we see `max_memory_used == memory_size` for Go, i.e. only when
runtime is `provided.al2`. We don't do this for other runtimes (Python,
Ruby, Node) to avoid double counting.
<img width="768" height="323" alt="image"
src="https://github.com/user-attachments/assets/549a8820-6b86-462d-a857-0269d2990a02"
/>
## Problem
In issue #1237, a customer called out a new scenario: "Node (case 2)" in
the table. The only evidence of OOM is `max_memory_used == memory_size`,
and there is no runtime-specific log message. As a result, OOMs like
this are not captured by the OOM enhanced metric.
## This PR
- Regardless of runtime, use all three ways to capture OOM.
- In addition, dedup by `request_id` to avoid double counting.
- Add one integration test per runtime (except for Node, which has 2
tests).
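
As a rough illustration of "use all three ways regardless of runtime", the check can be thought of as a plain OR over the three signals. This is a minimal sketch, not the extension's code: `InvocationSignals` and `looks_like_oom` are hypothetical names invented here, and only the `Runtime.OutOfMemory` string and the memory-equality condition come from the description above.

```rust
/// Hypothetical summary of the OOM-related signals seen for one invocation.
/// The struct and its field names are assumptions made for this sketch.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct InvocationSignals {
    /// True if a runtime-specific OOM log line was seen (Java, Node, .NET).
    pub runtime_oom_log_seen: bool,
    /// `error_type` reported in `PlatformRuntimeDone`, if any.
    pub error_type: Option<String>,
    /// `max_memory_used` from `PlatformReport`, in MB, if the report was seen.
    pub max_memory_used_mb: Option<u64>,
    /// `memory_size` from `PlatformReport`, in MB, if the report was seen.
    pub memory_size_mb: Option<u64>,
}

/// Returns true if any of the three signals indicates an OOM.
/// No runtime name is consulted, so the memory-equality check applies everywhere.
pub fn looks_like_oom(s: &InvocationSignals) -> bool {
    let error_type_oom = s.error_type.as_deref() == Some("Runtime.OutOfMemory");
    let memory_equal = matches!(
        (s.max_memory_used_mb, s.memory_size_mb),
        (Some(used), Some(size)) if used == size
    );
    s.runtime_oom_log_seen || error_type_oom || memory_equal
}
```

Under this sketch, "Node (case 2)" corresponds to an invocation where only the memory-equality signal is set. Because several signals can be true for the same invocation, the check alone is not enough to prevent double counting, which is why the dedup by `request_id` is needed.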
## Test plan
Passed the added unit tests and integration tests.
## To reviewers
Most of the code changes are for integration tests.
## Details (generated by Claude Code)
Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory
limit (`Memory Size 192 MB / Max Memory Used 192 MB`, `Status: timeout`)
did not emit `aws.lambda.enhanced.out_of_memory` because none of the
three existing detection paths matched.
- **Why the existing paths missed it.** V8 spent its budget in GC rather
than declaring `JavaScript heap out of memory`, so the runtime log-line
match never fired. The runtime crashed on a wall-clock timeout, so
`PlatformRuntimeDone` reported no `error_type`. And the
`max_memory_used_mb == memory_size_mb` check in `PlatformReport` was
gated on `runtime.starts_with("provided.al")` to avoid double-counting
against the log path, so Node was excluded.
- **What changes.** Drop the `provided.al*` restriction so the equality
check applies to every runtime. To avoid double-counting against the two
pre-existing paths (some invocations satisfy both equality and
`Runtime.OutOfMemory` simultaneously), add a per-`Context` `oom_emitted`
flag. All three detection paths funnel through a new
`Processor::try_increment_oom_metric`, which checks/sets the flag and is
a no-op on subsequent calls for the same `request_id`.
- **Plumbing.** `Event::OutOfMemory` now carries an `Option<String>
request_id`. The log-path detector reads it from
`LambdaProcessor::invocation_context.request_id` (set on
`PlatformStart`, cleared on `PlatformRuntimeDone`/`PlatformReport`).
`None` is only realistic in Managed Instance mode (extensions can't
subscribe to INVOKE there); the helper falls back to a best-effort emit
without dedup in that case.
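
The dedup described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the `HashMap` standing in for the context buffer, the `oom_count` counter standing in for the metric, and `on_platform_start` are all hypothetical; only the `oom_emitted` flag, the `try_increment_oom_metric` name, and the best-effort fallback come from the description.

```rust
use std::collections::HashMap;

/// Per-invocation state. Only the `oom_emitted` flag is taken from the
/// PR description; everything else in this sketch is an assumption.
#[derive(Debug, Default)]
pub struct Context {
    pub oom_emitted: bool,
}

/// Simplified stand-in for the invocation processor.
#[derive(Debug, Default)]
pub struct Processor {
    /// Hypothetical context store keyed by request_id.
    contexts: HashMap<String, Context>,
    /// Stand-in for the `aws.lambda.enhanced.out_of_memory` metric.
    pub oom_count: u64,
}

impl Processor {
    /// Hypothetical hook: register a context when an invocation starts.
    pub fn on_platform_start(&mut self, request_id: &str) {
        self.contexts.entry(request_id.to_string()).or_default();
    }

    /// Increments the metric at most once per known request_id.
    /// Returns true if the metric was incremented by this call.
    pub fn try_increment_oom_metric(&mut self, request_id: Option<&str>) -> bool {
        let ctx = match request_id {
            Some(rid) => self.contexts.get_mut(rid),
            None => None,
        };
        if let Some(ctx) = ctx {
            if ctx.oom_emitted {
                // Another detection path already counted this invocation.
                return false;
            }
            ctx.oom_emitted = true;
        }
        // Either the first detection for this request, or no usable context
        // (missing request_id or unknown context): best-effort emit without dedup.
        self.oom_count += 1;
        true
    }
}
```

In this sketch, calling the helper twice with the same known `request_id` increments the counter once, while calls with `None` or with an unknown `request_id` always increment it, mirroring the best-effort fallback.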
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"Invocation Processor | Emitting OOM metric without dedup: context not found for request_id {} (likely evicted from context buffer)",
rid
);
}
} else {
debug!(
"Invocation Processor | Emitting OOM metric without dedup: no request_id available (OOM log processed before PlatformStart or after PlatformRuntimeDone)"