Describe the bug
When multiple pipeline runs execute in parallel, S3Downloader corrupts files in its on-disk cache, causing downstream components like DocumentToImageContent to fail with errors such as PDFium: Data format error, Failed to load page, or KeyError on a missing page.
The component shares one cache directory (file_root_path) across all concurrent run() calls, and at least two races compound:
- _download_file writes straight to the final cache path, so a concurrent reader can observe a partial PDF mid-write.
- _cleanup_cache only protects files referenced by the current run, so it can delete files that another in-flight run is about to read.
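One common way to close the first race is to write to a temporary file in the same directory and then rename it into place. The sketch below is not the S3Downloader implementation; `atomic_write` is a hypothetical helper, and it assumes the downloaded bytes are already in memory. It does nothing about the eviction race, which would need some form of coordination across runs.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write(final_path: Path, data: bytes) -> None:
    """Write data so a reader sees either the old state or the complete new file."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the name in one step, so no partial file is ever
        # visible under final_path.
        os.replace(tmp_name, final_path)
    except BaseException:
        # Remove the leftover temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

With this pattern a concurrent reader that opens the final path gets a complete file or a missing-file error, never a truncated PDF.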
To Reproduce
- Wire S3Downloader → DocumentToImageContent in a pipeline.
- Call pipeline.run() from multiple threads simultaneously, requesting documents whose total file count exceeds max_cache_size (default 100). This forces cache eviction while runs are in flight.
- Observe intermittent PDFium failures in DocumentToImageContent. The failure rate scales with parallelism.
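A small harness along these lines could drive the steps above and measure the failure rate. It is only a sketch: `count_failures` and `run_once` are hypothetical names, and `run_once` is assumed to wrap a single pipeline.run() call with whatever inputs fit your bucket and documents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def count_failures(run_once: Callable[[int], object], runs: int, workers: int) -> int:
    """Call run_once(i) for i in range(runs) on a thread pool; return how many raised."""

    def attempt(i: int) -> bool:
        try:
            run_once(i)
            return False
        except Exception:
            # Any exception counts as a failed run (for example a PDFium error).
            return True

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(attempt, range(runs)))
```

Raising `workers` while keeping `runs` fixed should show whether the failure count grows with parallelism.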