feat: cap fetched body size in LinkContentFetcher #11228

Open
etairl wants to merge 2 commits into deepset-ai:main from etairl:limit-link-content-fetcher-response-size

Conversation

etairl commented Apr 30, 2026

Summary

  • LinkContentFetcher previously read full response bodies into memory via response.text / response.content with no upper bound, so a remote server returning an unexpectedly large body forced a proportionally large allocation.
  • Adds a max_response_size constructor parameter (default 10 MiB).
  • Both the sync and async fetch paths now use httpx.Client.stream / AsyncClient.stream and abort the request when:
    • the advertised Content-Length exceeds the cap, or
    • more than the cap's worth of bytes have been read from the body.
  • The captured bytes are written back onto the httpx.Response object so existing handlers (response.text, response.content) keep working with no further changes.
  • Pass max_response_size=None to restore the previous unbounded behavior.
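
The two abort conditions listed above can be illustrated with a minimal, library-independent sketch. This is not the code from this PR: `read_capped` and `ResponseTooLargeError` are hypothetical names, and the sketch takes a plain iterable of byte chunks rather than a live httpx response.

```python
from collections.abc import Iterable
from typing import Optional


class ResponseTooLargeError(Exception):
    """Hypothetical error type; the PR itself reports an httpx.RequestError."""


def read_capped(
    chunks: Iterable[bytes],
    max_response_size: Optional[int],
    content_length: Optional[str] = None,
) -> bytes:
    """Collect body chunks, aborting once the cap would be exceeded."""
    # Condition 1: reject up front if the advertised Content-Length is over the cap.
    if (
        max_response_size is not None
        and content_length is not None
        and content_length.isdigit()
        and int(content_length) > max_response_size
    ):
        raise ResponseTooLargeError("advertised Content-Length exceeds the cap")

    body = bytearray()
    for chunk in chunks:
        # Condition 2: stop before buffering a chunk that would push us past the cap.
        if max_response_size is not None and len(body) + len(chunk) > max_response_size:
            raise ResponseTooLargeError("body exceeded the cap while streaming")
        body.extend(chunk)
    return bytes(body)
```

With httpx, the chunks would presumably come from `response.iter_bytes()` inside a `with client.stream("GET", url) as response:` block, and `content_length` from `response.headers.get("Content-Length")`. Checking the size before extending the buffer keeps the peak allocation at or below the cap, rather than the cap plus one chunk.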

Test plan

  • CI runs LinkContentFetcher unit tests against the streamed sync and async paths.
  • Manual: fetch a known small URL — body is returned identically to before.
  • Manual: point at a server that returns a body larger than max_response_size — request is aborted with an httpx.RequestError instead of materializing the body in memory.
  • Manual: configure max_response_size=None — large bodies are read as before.
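
The three manual cases above can also be approximated offline for the async path. The sketch below is an assumption-laden stand-in, not the PR's test suite: `fake_body`, `read_capped_async`, and `run_checks` are hypothetical names, the fake generator replaces a real server, and `ValueError` stands in for the `httpx.RequestError` the PR raises.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Optional


async def fake_body(total_bytes: int, chunk_size: int = 1024) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a server streaming total_bytes of data."""
    sent = 0
    while sent < total_bytes:
        size = min(chunk_size, total_bytes - sent)
        yield b"x" * size
        sent += size


async def read_capped_async(
    chunks: AsyncIterator[bytes], max_response_size: Optional[int]
) -> bytes:
    """Collect streamed chunks, aborting before the cap would be exceeded."""
    body = bytearray()
    async for chunk in chunks:
        if max_response_size is not None and len(body) + len(chunk) > max_response_size:
            # Stand-in for the httpx.RequestError described in the test plan.
            raise ValueError("response body exceeds max_response_size")
        body.extend(chunk)
    return bytes(body)


async def run_checks() -> dict:
    cap = 4096
    results = {}
    # Small body: returned in full.
    results["small"] = len(await read_capped_async(fake_body(100), cap))
    # Oversized body: aborted instead of being materialized.
    try:
        await read_capped_async(fake_body(10_000), cap)
        results["large"] = "read"
    except ValueError:
        results["large"] = "aborted"
    # Cap disabled: large body is read as before.
    results["uncapped"] = len(await read_capped_async(fake_body(10_000), None))
    return results


if __name__ == "__main__":
    print(asyncio.run(run_checks()))
```

In a real test, the fake generator would be replaced by `response.aiter_bytes()` from an `httpx.AsyncClient.stream(...)` call against a test server or mock transport.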

🤖 Generated with Claude Code

Both the sync and async fetch paths read entire response bodies into
memory via response.text / response.content with no upper bound. A
remote that returns or streams an unexpectedly large body therefore
forces a proportional Python allocation, which can pressure or exhaust
the worker's memory.

Add a max_response_size constructor parameter (default 10 MiB) and
switch both fetch paths to httpx.Client.stream / httpx.AsyncClient.stream
so the connection is torn down as soon as the cap is reached. The
captured bytes are stashed back on the response object so existing
content handlers (text and binary) keep working unchanged. Pass
max_response_size=None to restore the previous unbounded behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
etairl requested a review from a team as a code owner April 30, 2026 10:22
etairl requested review from sjrl and removed request for a team April 30, 2026 10:22
vercel Bot commented Apr 30, 2026

@etairl is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

github-actions Bot added the type:documentation (Improvements on the docs) label Apr 30, 2026
CLAassistant commented Apr 30, 2026

CLA assistant check
All committers have signed the CLA.

@anakin87
Member

See #11226 (review)

Release-note linter rejects single backticks for inline code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


3 participants