fix(otlp-http): limit response body size to prevent memory exhaustion#3501
fix(otlp-http): limit response body size to prevent memory exhaustion#3501SATVIKsynopsis wants to merge 19 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3501 +/- ##
=====================================
Coverage 82.9% 82.9%
=====================================
Files 130 130
Lines 27484 27523 +39
=====================================
+ Hits 22800 22839 +39
Misses 4684 4684 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
The current fix checks the size after the full body is already buffered. To truly prevent the allocation, the body needs to be read in chunks and aborted early as suggested earlier as well. I have an approach using |
Yes, that approach sounds right. The key requirement is that the limit is enforced while the body is being read, before it becomes Can we try to accommodate this in next release - #3508 - @cijothomas if it can be fixed soon |
|
Updated the fix to use chunked reading in all three HTTP client implementations as in reqwest async, reqwest blocking and hyper so the limit is enforced while the body is being read before it becomes |
|
I have added the |
|
Hi, just checking in. let me know if any further changes are needed or if there's anything blocking the review. Happy to address any feedback. |
lalitb
left a comment
There was a problem hiding this comment.
LGTM. Since this is a behavior change in the built-in opentelemetry-http clients, we should add a short vNext changelog entry for opentelemetry-http before merging/releasing.
Thanks. will add a short vNext changelog entry for before merging/releasing. |
|
I have added the vNext changelog. |
Is that included in the PR? I am not seeing tests added.? |
| let status = response.status(); | ||
| let headers = std::mem::take(response.headers_mut()); | ||
|
|
||
| let mut body_bytes = bytes::BytesMut::new(); |
There was a problem hiding this comment.
could we start with some capacity other than 0 to avoid reallocation?
There was a problem hiding this comment.
Good point. Let me check whether we can leverage the response body size hint or content length to preallocate an appropriate capacity rather than starting from 0.
There was a problem hiding this comment.
i checked both implementations and was able to leverage the response size information for pre allocation and in both cases the capacity is capped at MAX_RESPONSE_BODY_BYTES before allocating.
There was a problem hiding this comment.
let mut body_bytes = bytes::BytesMut::new();
I don't see pre-allocation still?
There was a problem hiding this comment.
you're right, that's still showing the previous version. the pre allocation change is currently only in my local branch using the response size hint/content length); I'll push it shortly.
I had added a test earlier in |
| .take(MAX_RESPONSE_BODY_BYTES as u64 + 1) | ||
| .read_to_end(&mut body_bytes)?; | ||
| if body_bytes.len() > MAX_RESPONSE_BODY_BYTES { | ||
| return Err("response body too large".into()); |
There was a problem hiding this comment.
I don't think this is correct. If we return Err from here, I think this would cause OTLP Exporter to consider the request as failed, and likely initiate retry waves.
If the response is too big, we don't need to retry (we shouldn't).
There was a problem hiding this comment.
good point. returning an Err here would indeed make the exporter treat it as a request failure and potentially retry so then i will take a look at how the response body is consumed by the oltp exporter and adjust the handling so oversized responses don't trigger retries.
cijothomas
left a comment
There was a problem hiding this comment.
Thanks for working on this. We need to do a more holistic approach to handle this - when response size is bigger, this PR returns Err and the OTLP Exporter would retry.
We should prevent that. Most likely, this'd mean conditionally reading the response body (we only need the body for partial response case, and then for loggging purposes.)
(Also - small note on pre-allocating the capacity for reading response).
|
Thanks for the feedback. looking through the oltp http flow, successful response bodies appear to be consumed primarily for partial success parsing so for oversized responses, would you prefer that we treat them similarly to an empty response body like skip partial success parsing and continue without retrying or would you rather move/condition the response body reading so it's only performed when needed for partial success or logging scenarios? |
|
I have added the pre-allocation step in all three, added the http axum mock server, and the oltp exporter retry fix. All tests pass now. happy to address any feedback. |
| let mut response = self.execute(request)?.error_for_status()?; | ||
| let status = response.status(); | ||
| let headers = std::mem::take(response.headers_mut()); | ||
| let mut body_bytes = Vec::new(); |
There was a problem hiding this comment.
this can also be pre-allocated right?
| tokio = { workspace = true, features = ["time"], optional = true } | ||
|
|
||
| [dev-dependencies] | ||
| axum = { version = "0.8" } |
There was a problem hiding this comment.
I understand its dev-dev only, but wondering if we can avoid this ?
There was a problem hiding this comment.
I'll switch to a minimal raw TCP test server instead so we don't need to add axum.
|
I have added the pre-allocation on |
Fixes #3439
Changes
I traced all HTTP response handling to export_http_once in
opentelemetry-oltp/src/exporter/http/mod.rsand since all three signal exporters funnel through this one function, fixing it here covered everything. I also added aMAX_RESPONSE_BODY_BYTES (10 mb)cap on 2xx response bodies and stopped reading non-2xx bodies entirely. I also added a test with a mock client that returns an oversized body to verify the cap works.Merge requirement checklist
CHANGELOG.mdfiles updated for non-trivial, user-facing changes