S3Downloader cache races cause corrupt downloads under concurrent pipeline runs #3294

@medsriha

Describe the bug

When multiple pipeline runs execute in parallel, S3Downloader corrupts files in its on-disk cache, causing downstream components like DocumentToImageContent to fail with errors such as PDFium: Data format error, Failed to load page, or KeyError on a missing page.

The component shares one cache directory (file_root_path) across all concurrent run() calls, and at least two races compound (there may be more):

  1. _download_file writes straight to the final cache path, so a concurrent reader can observe a partial PDF mid-write
  2. _cleanup_cache only protects files referenced by the current run, so it can delete files another in-flight run is about to read

To Reproduce

  1. Wire S3Downloader → DocumentToImageContent in a pipeline.
  2. Call pipeline.run() from multiple threads simultaneously, requesting documents whose total file count exceeds max_cache_size (default 100). This forces cache eviction while runs are in flight.
  3. Observe intermittent PDFium failures in DocumentToImageContent. The failure rate scales with parallelism.
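
A small harness along these lines can drive step 2 and count failures. It is a sketch only: run_concurrently is a hypothetical helper, and run_once stands in for a function that calls pipeline.run() with the inputs for the i-th batch of documents (the exact input dict depends on how the pipeline is wired, so it is not shown).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrently(
    run_once: Callable[[int], object], n_runs: int, n_threads: int
) -> list[BaseException]:
    """Call run_once(i) for i in range(n_runs) on n_threads worker threads.

    Returns the exceptions raised by failed runs, in submission order.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(run_once, i) for i in range(n_runs)]
        for future in futures:
            exc = future.exception()  # blocks until this run has finished
            if exc is not None:
                failures.append(exc)
    return failures
```

Comparing len(failures) across several n_threads values is one way to check the observation that the failure rate scales with parallelism.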
