
fix(lifecycle): release invocation context after platform report to prevent memory leak#1050

Merged
litianningdatadog merged 1 commit into main from tianning.li/SVLS-8625 on Mar 5, 2026

Conversation

@litianningdatadog
Contributor

@litianningdatadog litianningdatadog commented Mar 2, 2026

JIRA: https://datadoghq.atlassian.net/browse/SVLS-8625
GitHub issue reported: #1049

What does this PR do?

Fixes a memory leak in the extension where ContextBuffer accumulated Context objects indefinitely across warm Lambda invocations.

Root cause: on_platform_report() reads the Context for the runtime duration and CPU/network enhanced metrics, but never calls context_buffer.remove() afterwards. Each warm invocation appended a new Context to the buffer, and none was ever freed.

Fix: Call self.context_buffer.remove(request_id) at the end of on_platform_report(), after all processing for the invocation is complete.

Impact: Most visible when DD_CAPTURE_LAMBDA_PAYLOAD=true — each retained Context holds the full captured request/response payload as span metadata (~500 KB per invocation). Without the fix, a Lambda function invoked thousands of times would accumulate hundreds of MB of leaked memory in the extension process.
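The leak-and-release pattern described above can be sketched with a toy stand-in. Note this is a minimal illustration, not the real bottlecap code: the `ContextBuffer` struct, its method signatures, and the payload type here are all assumptions, and `on_platform_report` only models the release step.

```rust
use std::collections::HashMap;

// Toy stand-in for the extension's ContextBuffer: the struct, its methods,
// and the payload type are illustrative assumptions, not the real
// bottlecap types.
struct ContextBuffer {
    buffer: HashMap<String, Vec<u8>>, // request_id -> captured payload
}

impl ContextBuffer {
    fn new() -> Self {
        Self { buffer: HashMap::new() }
    }
    fn insert(&mut self, request_id: &str, payload: Vec<u8>) {
        self.buffer.insert(request_id.to_string(), payload);
    }
    fn remove(&mut self, request_id: &str) {
        self.buffer.remove(request_id);
    }
    fn size(&self) -> usize {
        self.buffer.len()
    }
}

// Models the fixed handler: read whatever metrics are needed, then
// release the context so warm invocations cannot accumulate.
fn on_platform_report(buf: &mut ContextBuffer, request_id: &str) {
    // ... emit runtime duration and CPU/network enhanced metrics here ...
    buf.remove(request_id); // the fix: free the context once processing is done
}

fn main() {
    let mut buf = ContextBuffer::new();
    for i in 0..50 {
        let id = format!("req-{i}");
        buf.insert(&id, vec![0u8; 1024]); // each warm invocation appends a context
        on_platform_report(&mut buf, &id); // with the fix, it is released again
    }
    // Without the remove() call inside on_platform_report, size() would be 50 here.
    assert_eq!(buf.size(), 0);
    println!("buffer size after remove: {}", buf.size());
}
```

Without the `remove()` call, the buffer grows by one entry (plus its captured payload) per warm invocation for the lifetime of the extension process.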

Testing

  • Unit tests
  • A test with 50+ invocations shows:

    | without the fix              | with the fix                |
    | ---------------------------- | --------------------------- |
    | buffer size after remove: 52 | buffer size after remove: 1 |

@litianningdatadog litianningdatadog requested a review from a team as a code owner March 2, 2026 02:25
@litianningdatadog litianningdatadog force-pushed the tianning.li/SVLS-8625 branch 4 times, most recently from da979a7 to 53cbf0b on March 2, 2026 17:31
Contributor

@lym953 lym953 left a comment


Thanks for the fix!
I'm worried that sometimes telemetry events can arrive out of order, e.g. PlatformReport arrives before other events such as PlatformRuntimeDone. To capture this, can we add some log when context_buffer.get(request_id) returns None?
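The suggestion above could look roughly like the following. This is a hypothetical sketch, assuming a map-backed buffer: the `get_context` name, the buffer's value type, and the use of stderr are all illustrative, not the real bottlecap API (the real lookup lives in context.rs and would go through the extension's logger).

```rust
use std::collections::HashMap;

// Hypothetical sketch of logging when a context lookup misses, e.g. when
// platform.report arrives before platform.runtimeDone. The buffer type and
// function name are illustrative stand-ins, not the real bottlecap API.
fn get_context<'a>(
    buffer: &'a HashMap<String, String>, // request_id -> context (stand-in)
    request_id: &str,
) -> Option<&'a String> {
    let ctx = buffer.get(request_id);
    if ctx.is_none() {
        // In the real extension this would use its logger at warn/debug
        // level rather than printing to stderr.
        eprintln!("no context for request_id {request_id}; telemetry events may be out of order");
    }
    ctx
}

fn main() {
    let buffer = HashMap::from([("req-1".to_string(), "ctx".to_string())]);
    assert!(get_context(&buffer, "req-1").is_some());
    assert!(get_context(&buffer, "req-404").is_none()); // logs the miss
}
```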

Comment on lines +778 to +781
debug!(
"Context released (buffer size after remove: {})",
self.context_buffer.size()
);
Contributor


Do we expect this to be frequently used for debugging? If no, maybe change it to "trace" level.

Contributor Author


Will do

@litianningdatadog
Contributor Author

Thanks for the fix! I'm worried that sometimes telemetry events can arrive out of order, e.g. PlatformReport arrives before other events such as PlatformRuntimeDone. To capture this, can we add some log when context_buffer.get(request_id) returns None?

https://github.com/DataDog/datadog-lambda-extension/blob/main/bottlecap/src/lifecycle/invocation/context.rs#L202 will log in case the request id could not be found.

…revent memory leak

Contexts added to ContextBuffer on every invocation were never removed
after processing, causing unbounded memory growth across warm invocations.
The growth was most visible when DD_CAPTURE_LAMBDA_PAYLOAD=true with large
response payloads (issue #1049), but affects all invocations.

Remove the context at the end of on_platform_report, which is the last
point in the lifecycle where downstream code still reads context fields
(runtime_duration_ms for post-runtime metrics, enhanced_metric_data for
network/CPU metrics). Both on-demand and managed instance paths are fixed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@duncanista
Contributor

I remember we removed this because, at some point, we could have multiple invocations whose data was not yet available for us to process/flush while we were receiving the report and doing what this PR does. It might be good to check what the real reason was and how we can avoid this.

@duncanista
Contributor

This PR introduced increasing the buffer size: #801

It might be worth digging into the history to find when the code this PR adds back was removed.

@litianningdatadog
Contributor Author

litianningdatadog commented Mar 3, 2026

This PR introduced increasing the buffer size: #801

Might be good digging in the code when we removed this code that we are adding back on this PR

Found them:

I think my change is safe: by the time on_platform_report is called, the cold start span has already been sent in on_platform_runtime_done. The context is no longer needed for that purpose. remove() at the end just cleans up after all legitimate reads are complete. Let me know if I missed something.

@duncanista
Contributor

@litianningdatadog thanks for finding them!

I think we might need to run a load test with hundreds of thousands of invocations and see if we are missing any data. I'm worried about a Lambda with high throughput (e.g. a Go service) that might generate more data without allowing flushes to happen as we expect, and we might be losing data.

Happy to reintroduce if we confirm that everything is in place – mainly because this is one of those tricky removals

@litianningdatadog
Contributor Author

Stress testing shows no sign of missed requests.

@litianningdatadog litianningdatadog merged commit db59a28 into main Mar 5, 2026
50 checks passed
@litianningdatadog litianningdatadog deleted the tianning.li/SVLS-8625 branch March 5, 2026 21:47