fix: prevent S3 path conflicts using tempfile by CyMule · Pull Request #569 · Unstructured-IO/unstructured-ingest

CyMule · 2025-07-29T20:17:34Z

Problem

S3 downloads sometimes failed with NotADirectoryError or FileExistsError when a bucket contained objects whose keys conflict in a way that a traditional filesystem hierarchy cannot represent.

Example conflict:

S3 object: foo (file)
S3 object: foo/documents (file requiring foo to be a directory)

This created a race condition in which download order determined success or failure.
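The conflict can be reproduced outside of S3 with a minimal sketch. The helpers `naive_download` and `try_order` are hypothetical stand-ins that mirror each key directly onto a local path; they are not the connector's code. In this simplified version both orders fail, and the exact exception type depends on which key is written first and on the platform.

```python
import tempfile
from pathlib import Path


def naive_download(root: Path, key: str) -> None:
    """Hypothetical stand-in: write a placeholder file at root/key."""
    target = root / key
    # Fails if a parent component already exists as a regular file.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Fails if the target already exists as a directory.
    target.write_text("data")


def try_order(keys: list[str]) -> str:
    """Download keys in order; return the first error's type name, or 'ok'."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            for key in keys:
                naive_download(Path(tmp), key)
        except OSError as exc:
            return type(exc).__name__
    return "ok"


if __name__ == "__main__":
    print(try_order(["foo", "foo/documents"]))  # "foo" is a file first
    print(try_order(["foo/documents", "foo"]))  # "foo" is a directory first
    print(try_order(["foo", "bar/documents"]))  # no shared path component
```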

Solution

Used tempfile to create unique download paths for each S3 object:

Before:

S3: "foo" → Local: /downloads/foo
S3: "foo/documents" → Local: /downloads/foo/documents
Conflict: foo cannot be both file and directory

After:

S3: "foo" → Local: /downloads/a1b2c3d4e5f6/foo
S3: "foo/documents" → Local: /downloads/9g8h7i6j5k4l/documents
No conflicts: each file gets its own unique directory
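The before/after mapping can be sketched roughly as follows. The function name `isolated_download_path` is hypothetical, and passing `dir=download_dir` to `tempfile.mkdtemp` is an assumption made to match the `/downloads/<unique>/<name>` layout shown above; the PR's actual arguments may differ.

```python
import tempfile
from pathlib import Path


def isolated_download_path(download_dir: Path, s3_key: str) -> Path:
    """Return a local path for s3_key inside a fresh, unique subdirectory.

    Hypothetical helper: only the basename of the key is kept, so "foo" and
    "foo/documents" never share a path component on disk.
    """
    download_dir.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates the directory and guarantees a unique name.
    temp_dir = tempfile.mkdtemp(prefix="unstructured_", dir=download_dir)
    return Path(temp_dir) / Path(s3_key).name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for key in ("foo", "foo/documents"):
            path = isolated_download_path(Path(root), key)
            path.write_text("data")  # both writes succeed
            print(key, "->", path)
```

One consequence of this layout is that the key's directory structure is no longer mirrored locally, so anything that needs the original key has to carry it separately from the download path.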

Future Work

This PR targets only S3 downloads. I think it would make sense to use tempfiles for all downloads (as in PR #571), but that requires more extensive changes to implement cleanly. This fix provides immediate relief from the path conflict issues while we work on the more comprehensive tempfile solution.

cmscmadd · 2025-07-29T20:53:12Z

Does this fix the file-not-found errors we sometimes see when using S3 as a source?

PastelStorm · 2025-07-31T02:05:28Z

+        expected_filenames.sort()
+        actual_filenames.sort()
+        assert expected_filenames == actual_filenames, (


It's not super important here and shouldn't be a blocker, but in general I would avoid this pattern in the code.

I did some math and you should get about a 10x-12x speedup if you build and compare two sets instead, because sorting (Timsort) has O(n log n) complexity, while the comparison itself is O(n) for both sets and lists.

For 100k files this would be a difference of roughly 3.5 million operations (sorted lists) vs. 300k operations (sets).

expected_filenames = {Path(s3_key).name for s3_key in s3_keys}
actual_filenames = {Path(download_file).name for download_file in download_files}
assert expected_filenames == actual_filenames
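One caveat with the set-based comparison is that sets discard duplicates, so two listings with the same distinct basenames but different counts compare equal. The sketch below uses made-up sample keys and paths to show the three comparisons side by side; `collections.Counter` is offered here as an O(n) alternative that still checks counts, not as something proposed in the review.

```python
from collections import Counter
from pathlib import Path

# Made-up sample data for illustration only.
s3_keys = ["foo", "foo/documents", "bar/report.pdf"]
download_files = [
    "/downloads/x1/documents",
    "/downloads/x2/foo",
    "/downloads/x3/report.pdf",
]

expected = [Path(key).name for key in s3_keys]
actual = [Path(path).name for path in download_files]

# All three agree when basenames are unique.
sorted_equal = sorted(expected) == sorted(actual)
set_equal = set(expected) == set(actual)
counter_equal = Counter(expected) == Counter(actual)

# With repeated basenames, only the set comparison misses the mismatch.
expected_dup = ["a", "a", "b"]
actual_dup = ["a", "b", "b"]
dup_sorted_equal = sorted(expected_dup) == sorted(actual_dup)
dup_set_equal = set(expected_dup) == set(actual_dup)
dup_counter_equal = Counter(expected_dup) == Counter(actual_dup)
```

If basenames are guaranteed unique in the test fixture, the set version is sufficient; otherwise the multiset comparison keeps the linear cost without losing the count check.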

PastelStorm · 2025-07-31T02:09:04Z

+        if not file_data.source_identifiers:
+            return None
+
+        filename = file_data.source_identifiers.filename
+        if not filename:
+            return None


Define both checks as boolean variables, join them with "and", and return None once.
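One way to read this suggestion is sketched below. `SourceIdentifiers`, `FileData`, and `get_filename` are simplified hypothetical stand-ins, not the library's real classes, and returning the filename after the guard is an assumption since the diff excerpt does not show what follows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceIdentifiers:
    """Hypothetical stand-in carrying only the field used here."""
    filename: str


@dataclass
class FileData:
    """Hypothetical stand-in carrying only the field used here."""
    source_identifiers: Optional[SourceIdentifiers] = None


def get_filename(file_data: FileData) -> Optional[str]:
    identifiers = file_data.source_identifiers
    has_identifiers = identifiers is not None
    # Short-circuit: .filename is only read when identifiers exist.
    has_filename = has_identifiers and bool(identifiers.filename)
    if not has_filename:
        return None  # single early exit instead of two
    return identifiers.filename
```

The two original guards are equivalent to this single one; the named booleans mainly document what is being checked.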

PastelStorm · 2025-07-31T02:13:35Z

+        mkdir_concurrent_safe(self.download_dir)
+
+        temp_dir = tempfile.mkdtemp(
+            prefix="unstructured_", 


I'd make this a class-level constant
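A minimal sketch of what that could look like, assuming a downloader class with a `download_dir` attribute. The class name `S3DownloaderSketch`, the constant name `TEMP_DIR_PREFIX`, and the `dir=` argument are all hypothetical; only the `"unstructured_"` prefix comes from the diff.

```python
import tempfile
from pathlib import Path


class S3DownloaderSketch:
    """Hypothetical class illustrating the review suggestion."""

    # Class-level constant instead of a string literal at the call site.
    TEMP_DIR_PREFIX = "unstructured_"

    def __init__(self, download_dir: Path) -> None:
        self.download_dir = download_dir

    def make_temp_dir(self) -> Path:
        """Create and return a unique directory under download_dir."""
        self.download_dir.mkdir(parents=True, exist_ok=True)
        return Path(
            tempfile.mkdtemp(prefix=self.TEMP_DIR_PREFIX, dir=self.download_dir)
        )
```

Keeping the prefix in one named place makes it easier for tests or cleanup code to refer to the same value.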

PastelStorm

A few nits but otherwise LGTM!

CyMule force-pushed the fix/s3-path-conflicts-hash-isolation branch from 933316b to 7dca020 Compare July 30, 2025 19:44

CyMule added 5 commits July 30, 2025 15:46

fix: prevent S3 path conflicts using hash-based directory isolation

7f9e207

version

ef832cc

update test

aca99e5

cleanup

3102775

target s3

eedf93c

CyMule force-pushed the fix/s3-path-conflicts-hash-isolation branch from 7dca020 to eedf93c Compare July 30, 2025 19:47

temp files

35736b9

CyMule added 3 commits July 30, 2025 16:47

fix

966fdda

mkdir and download dir

7dd4231

remove line

bc8e2ed

PastelStorm reviewed Jul 31, 2025

PastelStorm approved these changes Jul 31, 2025

CyMule and others added 2 commits July 31, 2025 08:41

addres feedback

ed2a837

Merge branch 'main' into fix/s3-path-conflicts-hash-isolation

9d1ee57

CyMule merged 11 commits into main from fix/s3-path-conflicts-hash-isolation
