fix: prevent S3 path conflicts using tempfile #569
Merged
Does this fix the file not found errors we sometimes see as an S3 source?
933316b to 7dca020
7dca020 to eedf93c
PastelStorm reviewed Jul 31, 2025
Comment on lines +179 to +181
expected_filenames.sort()
actual_filenames.sort()
assert expected_filenames == actual_filenames, (
It's not super important here and shouldn't be a blocker, but in general I would avoid this pattern in the code.
I did some math and you should get about a 10x-12x speedup if you create and compare two sets, because TimSort has O(n*log n) complexity, while building sets is O(n). Comparing two sets or two lists is the same O(n).
For 100k files this would be a difference of about 3.5M operations (sorted lists) vs. 300k operations (sets):
expected_filenames = {Path(s3_key).name for s3_key in s3_keys}
actual_filenames = {Path(download_file).name for download_file in download_files}
assert expected_filenames == actual_filenames
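One caveat worth checking before adopting the set version: a set drops duplicates, so it cannot detect a missing file when two S3 keys share the same base name. The sketch below compares the three options; the function names are hypothetical and only illustrate the trade-off. collections.Counter keeps the O(n) cost while still counting duplicates.

```python
from collections import Counter
from pathlib import Path


def same_filenames_sorted(s3_keys, download_files):
    # O(n log n): sort both lists of base names, then compare element-wise.
    expected = sorted(Path(key).name for key in s3_keys)
    actual = sorted(Path(path).name for path in download_files)
    return expected == actual


def same_filenames_set(s3_keys, download_files):
    # O(n): duplicates collapse, so multiplicity is NOT checked.
    expected = {Path(key).name for key in s3_keys}
    actual = {Path(path).name for path in download_files}
    return expected == actual


def same_filenames_counter(s3_keys, download_files):
    # O(n): a multiset comparison that still notices a missing duplicate.
    expected = Counter(Path(key).name for key in s3_keys)
    actual = Counter(Path(path).name for path in download_files)
    return expected == actual
```

If every downloaded file is expected to have a unique base name, the set version is sufficient; otherwise the Counter version keeps the stricter semantics of the sorted-list assertion.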
PastelStorm reviewed Jul 31, 2025
Comment on lines +275 to +280
if not file_data.source_identifiers:
    return None

filename = file_data.source_identifiers.filename
if not filename:
    return None
Define both conditions as boolean variables, combine them with a single and, then return None once.
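A minimal sketch of that suggestion follows. The dataclasses and the function name get_filename are hypothetical stand-ins for the real types, which are not shown in this excerpt. Note that the second check has to short-circuit on the first, because filename cannot be read from a missing source_identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceIdentifiers:  # hypothetical stand-in for the real class
    filename: str


@dataclass
class FileData:  # hypothetical stand-in for the real class
    source_identifiers: Optional[SourceIdentifiers] = None


def get_filename(file_data: FileData) -> Optional[str]:
    identifiers = file_data.source_identifiers
    has_identifiers = bool(identifiers)
    # `and` short-circuits, so .filename is only read when identifiers exist.
    has_filename = has_identifiers and bool(identifiers.filename)
    if not (has_identifiers and has_filename):
        return None
    return identifiers.filename
```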
PastelStorm reviewed Jul 31, 2025
mkdir_concurrent_safe(self.download_dir)

temp_dir = tempfile.mkdtemp(
    prefix="unstructured_",
I'd make this a class-level constant
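For illustration, the prefix could be hoisted like this. The class name DownloaderSketch, the constant name TEMP_DIR_PREFIX, and the dir= argument are assumptions, since the rest of the mkdtemp call is not visible in the diff excerpt above.

```python
import tempfile
from pathlib import Path


class DownloaderSketch:  # hypothetical class, not the real downloader
    # Class-level constant instead of an inline string literal.
    TEMP_DIR_PREFIX = "unstructured_"

    def __init__(self, download_dir):
        self.download_dir = Path(download_dir)

    def make_temp_dir(self) -> Path:
        self.download_dir.mkdir(parents=True, exist_ok=True)
        # Assumption: the temp dir is created inside download_dir.
        return Path(
            tempfile.mkdtemp(prefix=self.TEMP_DIR_PREFIX, dir=self.download_dir)
        )
```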
Problem
S3 downloads were sometimes failing with NotADirectoryError and FileExistsError when S3 buckets contained objects with conflicting naming patterns that cannot be represented in traditional filesystem hierarchies.
Example conflict:
foo (file)
foo/documents (file requiring foo to be a directory)
This created a race condition where download order determined success or failure.
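The conflict can be reproduced locally without S3. The sketch below is an assumption about how keys map to paths (each key is written verbatim under one shared root) and uses a hypothetical helper, reproduce_conflict. Whichever key arrives second fails, and the exact OSError subclass depends on the platform and on which call hits the conflict.

```python
import tempfile
from pathlib import Path


def reproduce_conflict(keys):
    """Write each key under one shared root and record which keys fail."""
    errors = []
    with tempfile.TemporaryDirectory() as root:
        for key in keys:
            target = Path(root) / key
            try:
                # Fails if a parent component already exists as a file.
                target.parent.mkdir(parents=True, exist_ok=True)
                # Fails if the target already exists as a directory.
                target.write_text("data")
            except OSError as exc:
                errors.append((key, type(exc).__name__))
    return errors
```

Calling reproduce_conflict(["foo", "foo/documents"]) and then reproduce_conflict(["foo/documents", "foo"]) shows that the failing key changes with the order, which matches the order-dependent behavior described above.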
Solution
Used tempfile to create unique download paths for each S3 object:
Before:
After:
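As an illustration of the idea only (this is not the PR's actual diff), here is a minimal sketch. It assumes a hypothetical helper, unique_download_path, and assumes the temporary directory is created under the download directory with the unstructured_ prefix seen in the review excerpt.

```python
import tempfile
from pathlib import Path


def unique_download_path(download_dir: Path, s3_key: str) -> Path:
    """Return a download path inside a fresh temp dir for one S3 object."""
    download_dir.mkdir(parents=True, exist_ok=True)
    # Each object gets its own directory, so "foo" and "foo/documents"
    # never share a parent path on the local filesystem.
    temp_dir = tempfile.mkdtemp(prefix="unstructured_", dir=download_dir)
    return Path(temp_dir) / Path(s3_key).name
```

With this layout, the local path no longer mirrors the S3 key hierarchy, so download order stops mattering for the conflicting keys.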
Future Work
This PR targets only the S3 downloads. I think it would make sense to use tempfiles for all downloads (as in PR #571), but that requires more extensive changes to implement cleanly. This fix provides immediate relief from the path conflict issues while we work on the more comprehensive tempfile solution.