Skip to content

fix: prevent S3 path conflicts using tempfile#569

Merged
CyMule merged 11 commits intomainfrom
fix/s3-path-conflicts-hash-isolation
Jul 31, 2025
Merged

fix: prevent S3 path conflicts using tempfile#569
CyMule merged 11 commits intomainfrom
fix/s3-path-conflicts-hash-isolation

Conversation

@CyMule
Copy link
Copy Markdown
Contributor

@CyMule CyMule commented Jul 29, 2025

Problem

S3 downloads were sometimes failing with NotADirectoryError and FileExistsError when S3 buckets contained objects with conflicting naming patterns that cannot be represented in traditional filesystem hierarchies.

Example conflict:

  • S3 object: foo (file)
  • S3 object: foo/documents (file requiring foo to be a directory)

This created a race condition where download order determined success/failure

Solution

Used tempfile to create unique download paths for each S3 object:

Before:

S3: "foo" → Local: /downloads/foo
S3: "foo/documents" → Local: /downloads/foo/documents
Conflict: foo cannot be both file and directory

After:

S3: "foo" → Local: /downloads/a1b2c3d4e5f6/foo
S3: "foo/documents" → Local: /downloads/9g8h7i6j5k4l/documents
No conflicts: Each file gets unique directory

Future Work

This PR targets only the s3 downloads. I think it would make sense to use tempfiles for all downloads (as in PR #571), but that requires more extensive changes to implement cleanly. This fix provides immediate relief from the path conflict issues while we work on the more comprehensive tempfile solution.

@cmscmadd
Copy link
Copy Markdown
Contributor

Does this fix the file not found errors we sometimes see as an S3 source?

@CyMule CyMule force-pushed the fix/s3-path-conflicts-hash-isolation branch from 933316b to 7dca020 Compare July 30, 2025 19:44
Comment on lines +179 to +181
expected_filenames.sort()
actual_filenames.sort()
assert expected_filenames == actual_filenames, (
Copy link
Copy Markdown

@PastelStorm PastelStorm Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super important here and shouldn't be a blocker but in general I would avoid this pattern in the code.

I did some math and you should get about 10x-12x speedup if you create and compare two sets because TimSort has O(n*log n) complexity. Comparing two sets or two lists is the same O(n).

For 100k files this would be a difference of 3.5kk operations (sorted lists) vs. 300k operations (sets)

expected_filenames = {Path(s3_key).name for s3_key in s3_keys}
actual_filenames = {Path(download_file).name for download_file in download_files}

assert expected_filenames == actual_filenames

Comment on lines +275 to +280
if not file_data.source_identifiers:
return None

filename = file_data.source_identifiers.filename
if not filename:
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define both booleans as variables, join them with an and and return None once

mkdir_concurrent_safe(self.download_dir)

temp_dir = tempfile.mkdtemp(
prefix="unstructured_",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this a class-level constant

Copy link
Copy Markdown

@PastelStorm PastelStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits but otherwise LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants