fix(lifecycle): release invocation context after platform report to prevent memory leak #1050
Force-pushed from da979a7 to 53cbf0b
Thanks for the fix!
I'm worried that telemetry events can sometimes arrive out of order, e.g. PlatformReport arriving before other events such as PlatformRuntimeDone. To capture this, can we add a log when context_buffer.get(request_id) returns None?
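The guard being asked for can be sketched as follows. This is an illustrative stand-in, not the actual bottlecap types: `ContextBuffer`, its fields, and the `on_platform_report` signature here are simplified assumptions, and `eprintln!` stands in for the project's logging macros.

```rust
use std::collections::HashMap;

// Hypothetical minimal stand-in for the extension's context buffer.
struct ContextBuffer {
    contexts: HashMap<String, String>,
}

impl ContextBuffer {
    fn get(&self, request_id: &str) -> Option<&String> {
        self.contexts.get(request_id)
    }
}

// Log (rather than silently return) when a PlatformReport arrives
// before its context was buffered, so out-of-order events are visible.
fn on_platform_report(buffer: &ContextBuffer, request_id: &str) -> bool {
    match buffer.get(request_id) {
        Some(_context) => {
            // process runtime duration and enhanced metrics here
            true
        }
        None => {
            eprintln!("no context found for request_id {request_id}");
            false
        }
    }
}
```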
```rust
debug!(
    "Context released (buffer size after remove: {})",
    self.context_buffer.size()
);
```
Do we expect this to be used frequently for debugging? If not, maybe change it to the "trace" level.

https://github.com/DataDog/datadog-lambda-extension/blob/main/bottlecap/src/lifecycle/invocation/context.rs#L202 will log in case the request id could not be found.
Force-pushed from 53cbf0b to d0b4b3d
…revent memory leak

Contexts added to ContextBuffer on every invocation were never removed after processing, causing unbounded memory growth across warm invocations. The growth was most visible when DD_CAPTURE_LAMBDA_PAYLOAD=true with large response payloads (issue #1049), but it affects all invocations.

Remove the context at the end of on_platform_report, which is the last point in the lifecycle where downstream code still reads context fields (runtime_duration_ms for post-runtime metrics, enhanced_metric_data for network/CPU metrics). Both the on-demand and managed instance paths are fixed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
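The shape of the fix can be sketched as below. This is a simplified illustration under assumed names (`ContextBuffer`, `insert`, `remove`, `size`), not the actual bottlecap implementation: the point is only that the entry is dropped once the last reader is done with it.

```rust
use std::collections::HashMap;

// Illustrative stand-in for the extension's context buffer.
struct ContextBuffer {
    contexts: HashMap<String, Vec<u8>>,
}

impl ContextBuffer {
    fn insert(&mut self, request_id: &str, context: Vec<u8>) {
        self.contexts.insert(request_id.to_string(), context);
    }
    fn remove(&mut self, request_id: &str) -> Option<Vec<u8>> {
        self.contexts.remove(request_id)
    }
    fn size(&self) -> usize {
        self.contexts.len()
    }
}

fn on_platform_report(buffer: &mut ContextBuffer, request_id: &str) {
    // ...read runtime_duration_ms / enhanced_metric_data first...
    // PlatformReport is the last lifecycle event that needs the
    // context, so removing it here caps the buffer instead of
    // letting it grow across warm invocations.
    buffer.remove(request_id);
}
```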
I remember we removed this at some point because we could have multiple invocations whose data was not yet available for us to process/flush while we were receiving the report and doing what this PR does. It might be good to check what the real reason was and how we can avoid it.
Force-pushed from d0b4b3d to 03f60b4
The buffer size increase was introduced in #801. It might be worth digging into the history to find when we removed the code that this PR is adding back.
Found them. I think my change is safe: by the time on_platform_report is called, the cold start span has already been sent in on_platform_runtime_done, so the context is no longer needed for that purpose. The remove() at the end just cleans up after all legitimate reads are complete. Let me know if I missed something.
@litianningdatadog thanks for finding them! I think we might need to verify this with a load test of hundreds of thousands of invocations and see if we are missing any data. I'm worried about a Lambda with high throughput (i.e. a Go service) that might generate more data without allowing flushes to happen as we expect, in which case we could be losing data. Happy to reintroduce the code if we confirm that everything is in place, mainly because this is one of those tricky removals.
Stress testing shows no sign of missed requests. |
JIRA: https://datadoghq.atlassian.net/browse/SVLS-8625
GitHub issue reported: #1049
What does this PR do?
Fixes a memory leak in the extension where `ContextBuffer` accumulated `Context` objects indefinitely across warm Lambda invocations.

Root cause: `on_platform_report()` reads the `Context` for runtime duration and CPU/network enhanced metrics, but never calls `context_buffer.remove()` afterwards. Each warm invocation appended a new `Context` to the buffer; none were ever freed.

Fix: Call `self.context_buffer.remove(request_id)` at the end of `on_platform_report()`, after all processing for the invocation is complete.

Impact: Most visible when `DD_CAPTURE_LAMBDA_PAYLOAD=true`: each retained `Context` holds the full captured request/response payload as span metadata (~500 KB per invocation). Without the fix, a Lambda function invoked thousands of times would accumulate hundreds of MB of leaked memory in the extension process.

Testing
```
buffer size after remove: 52
buffer size after remove: 1
```
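A back-of-envelope check of the impact estimate above (~500 KB retained per warm invocation, never freed before the fix; the helper name is ours, for illustration only):

```rust
// Rough leak size in MB for a given invocation count and per-invocation
// payload size in KB, matching the PR description's estimate.
fn leaked_mb(invocations: u64, payload_kb: u64) -> u64 {
    invocations * payload_kb / 1024
}
```

At 500 KB per invocation, a thousand warm invocations already pin down roughly half a GB in the extension process, consistent with the "hundreds of MB" figure in the description.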