
Commit 7bc7e7c

fix: handle_end_invocation race condition when extracting payload (#1024)
## Overview

1. The first thing [handle_end_invocation](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/listener.rs#L163) does is spawn a task, let's call it `anonymousTask`
2. `handle_end_invocation` then [immediately returns 200](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/listener.rs#L181) so the tracer can continue
3. `anonymousTask` is busy with a complex body in `extract_request_body` for the time being
4. Because the tracer has continued, eventually `PlatformRuntimeDone` [is processed](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/invocation/processor_service.rs#L481)
5. Given our customer is not managed (`initialization_type: SnapStart`), `PlatformRuntimeDone` [tries to pair_platform_runtime_done_event](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/invocation/processor.rs#L496), which is `None` because `anonymousTask` is still busy with the body
6. We then [jump to process_on_platform_runtime_done](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/invocation/processor.rs#L506)
7. Span and trace ids [are not there yet](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/invocation/processor.rs#L517), and they are never checked again after this
8. `anonymousTask` finally completes, but that's irrelevant because `send_ctx_spans` is only run on `PlatformRuntimeDone`, which assumes [universal_instrumentation_end has already been sent](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/listener.rs#L177)

## Why this looks likely

In the customer's logs we can see:

* `05:11:48.463` `datadog.trace.agent.core.DDSpan - Finished span (WRITTEN): DDSpan [ t_id=2742542901019652192`
* `05:11:48.489` `PlatformRuntimeDone` received
* `05:11:48.630` REPORT RequestId `1db22159-7200-43c8-bec1-11b89df4f099` (the last log emitted in an execution)
* `05:11:53.784` START RequestId: `8c801767-e21b-43f7-bd11-078bb64bc430` (a new request id, 5s later)
* `05:11:53.789` [Received end invocation request from headers: {"x-datadog-trace-id": "2742542901019652192"...](https://github.com/DataDog/datadog-lambda-extension/blob/2c61aca453a3ab41df58d407e1659705cdcee273/bottlecap/src/lifecycle/listener.rs#L258) -> we are now trying to finish the span after the request is long gone 🙃

In this specific run, the lambda even had time to stop before continuing with the anonymous task from `handle_end_invocation`.

## Motivation

[SLES-2666](https://datadoghq.atlassian.net/browse/SLES-2666)

## Performance

This PR makes the reading of the body synchronous with the response. This delays handing execution back to the outside of the extension until the body is read. That delay is acceptable: reading the body and sending `universal_instrumentation_end` before relinquishing control is a requirement.

## Testing

Suggestions very welcome

[SLES-2666]: https://datadoghq.atlassian.net/browse/SLES-2666?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---

Co-authored-by: Jordan Gonzalez <30836115+duncanista@users.noreply.github.com>
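The ordering described in steps 1-8 can be sketched as a minimal, self-contained model. This is a hypothetical illustration, not the extension's actual code: std threads stand in for tokio tasks, a `Mutex<Option<u64>>` stands in for the stored span/trace context, and a sleep simulates `extract_request_body` on a complex body.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical model of the race. Returns what the simulated
/// `PlatformRuntimeDone` handler saw, and the context value after the
/// spawned task eventually finished.
fn simulate_race() -> (Option<u64>, Option<u64>) {
    let trace_ctx: Arc<Mutex<Option<u64>>> = Arc::new(Mutex::new(None));

    // The spawned "anonymousTask": reading a complex body is slow, and only
    // afterwards are the span/trace ids recorded.
    let ctx = Arc::clone(&trace_ctx);
    let anonymous_task = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // simulated extract_request_body
        *ctx.lock().unwrap() = Some(2_742_542_901_019_652_192);
    });

    // `handle_end_invocation` has already returned 200, so the tracer
    // continues and PlatformRuntimeDone is processed right away -- pairing
    // fails because the ids are not there yet.
    let seen_at_runtime_done = *trace_ctx.lock().unwrap();

    anonymous_task.join().unwrap();
    // The ids do arrive eventually, but nothing re-checks them after the
    // failed pairing, so the span is orphaned.
    let seen_after_join = *trace_ctx.lock().unwrap();
    (seen_at_runtime_done, seen_after_join)
}

fn main() {
    let (at_done, after) = simulate_race();
    println!("PlatformRuntimeDone saw {at_done:?}; task eventually wrote {after:?}");
}
```

The sketch makes the failure mode concrete: the context transitions from `None` to `Some(...)` only after the point where it was needed, and that point is never revisited.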
1 parent cd9d1bb commit 7bc7e7c

File tree

1 file changed: +11 / -8 lines changed

bottlecap/src/lifecycle/listener.rs

Lines changed: 11 additions & 8 deletions
```diff
@@ -168,16 +168,19 @@ impl Listener {
         State((invocation_processor_handle, _, tasks)): State<ListenerState>,
         request: Request,
     ) -> Response {
+        // IMPORTANT: Extract the body synchronously before returning the response.
+        // If this is moved into the spawned task, PlatformRuntimeDone may be
+        // processed before the body is read, causing orphaned traces. (SLES-2666)
+        let (parts, body) = match extract_request_body(request).await {
+            Ok(r) => r,
+            Err(e) => {
+                error!("Failed to extract request body: {e}");
+                return (StatusCode::OK, json!({}).to_string()).into_response();
+            }
+        };
+
         let mut join_set = tasks.lock().await;
         join_set.spawn(async move {
-            let (parts, body) = match extract_request_body(request).await {
-                Ok(r) => r,
-                Err(e) => {
-                    error!("Failed to extract request body: {e}");
-                    return;
-                }
-            };
-
             Self::universal_instrumentation_end(&parts.headers, body, invocation_processor_handle)
                 .await;
         });
```
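The same toy model as above can show why the fix works: when the slow body read happens synchronously, before control is handed back, the simulated `PlatformRuntimeDone` is guaranteed to observe a populated context. Again, names and types are illustrative (std threads standing in for the tokio runtime), not the extension's real API.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical model of the fixed ordering: the body is read and the ids
/// recorded before the handler "returns", so by the time PlatformRuntimeDone
/// is processed the trace context is always present.
fn simulate_fixed() -> Option<u64> {
    let trace_ctx: Arc<Mutex<Option<u64>>> = Arc::new(Mutex::new(None));

    // Synchronous with the response: the slow read and the context write
    // both complete before control is relinquished to the tracer.
    thread::sleep(Duration::from_millis(50)); // simulated extract_request_body
    *trace_ctx.lock().unwrap() = Some(2_742_542_901_019_652_192);

    // Only now does the tracer continue and PlatformRuntimeDone fire:
    let ctx = Arc::clone(&trace_ctx);
    let runtime_done = thread::spawn(move || *ctx.lock().unwrap());
    runtime_done.join().unwrap()
}

fn main() {
    println!("PlatformRuntimeDone saw {:?}", simulate_fixed());
}
```

The cost, as noted in the Performance section, is that the 50ms stand-in delay now sits on the response path; the benefit is that the happens-before relationship the pairing logic relies on is enforced rather than assumed.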
