Skip to content

Fix: S3Downloader concurrency races in cache writing and eviction#3301

Closed
ritikraj2425 wants to merge 1 commit into
deepset-ai:mainfrom
ritikraj2425:fix-s3-downloader-concurrency
Closed

Fix: S3Downloader concurrency races in cache writing and eviction#3301
ritikraj2425 wants to merge 1 commit into
deepset-ai:mainfrom
ritikraj2425:fix-s3-downloader-concurrency

Conversation

@ritikraj2425
Copy link
Copy Markdown

Proposed Changes:

This PR fixes two critical race conditions in the S3Downloader component that occur when multiple pipeline runs execute in parallel with a shared cache directory:

  1. Partial Write Corruption: _download_file previously wrote directly to the final cache path. Concurrent readers could observe a partially written file, leading to parsing errors (e.g., PDFium: Data format error).
    • Solution: Implemented atomic writes by downloading to a unique temporary file (.tmp.<uuid>) and using pathlib.Path.replace() only once the download is complete.
  2. Premature Cache Eviction: _cleanup_cache would identify files from other active runs as "misses" and delete them if the cache was full, leading to KeyError or missing files for downstream components.
    • Solution: Updated the eviction logic to calculate overflow correctly but strictly protect any file accessed within the last 60 seconds. It also now safely ignores and cleans up stale .tmp. files older than 5 minutes.

How did you test it?

  • Unit Tests: Ran the full suite using hatch run test:unit.
  • Regression Testing: Updated test_cleanup_cache_evicts_old_files in tests/test_s3_downloader.py to verify that stale files are still evicted while recently accessed files (the new "fresh" file) are correctly protected.
  • Quality Checks: Verified all checks pass with hatch run fmt and hatch run test:types.

Notes for the reviewer

  • The 60-second time-based heuristic for cache protection was chosen as a lightweight alternative to complex file locking, effectively preventing races in concurrent pipeline executions without adding new dependencies. This approach was refined using AI assistance to balance safety and performance.

Checklist

@ritikraj2425 ritikraj2425 requested a review from a team as a code owner May 13, 2026 13:24
@ritikraj2425 ritikraj2425 requested review from bogdankostic and removed request for a team May 13, 2026 13:24
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions
Copy link
Copy Markdown
Contributor

Heads-up for maintainers

This PR is from a fork and touches integrations whose integration tests require API keys.
Those tests are skipped in CI because fork PRs don't have access to repo secrets for security reasons.

Affected integrations:

  • amazon_bedrock

Please run the integration tests locally (hatch run test:integration inside each folder) before approving.

@github-actions
Copy link
Copy Markdown
Contributor

Coverage report (amazon_bedrock)

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  integrations/amazon_bedrock/src/haystack_integrations/components/downloaders/s3
  s3_downloader.py 216-219, 238-242, 258
Project Total  

This report was generated by python-coverage-comment-action

@sjrl
Copy link
Copy Markdown
Contributor

sjrl commented May 13, 2026

Hey @ritikraj2425 this is a sensitive fix that we would like to tackle internally. So thanks for the contribution but I'll close this PR

@sjrl sjrl closed this May 13, 2026
@ritikraj2425
Copy link
Copy Markdown
Author

No worries! Thanks for letting me know, @sjrl. Happy to help where I can!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

S3Downloader cache races cause corrupt downloads under concurrent pipeline runs

3 participants