Implement energon prepare with a remote dataset (using local temp) #221

Merged
philipp-fischer merged 6 commits into develop from feature/remote_dataprep
May 28, 2026

@voegtlel
Collaborator

@voegtlel voegtlel commented Apr 8, 2026

Fixes #91

Implements energon prepare for remote datasets (including media metadata preparation).

@voegtlel voegtlel requested a review from philipp-fischer April 8, 2026 16:01
@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 8c093f4 to 70caf01 Compare April 8, 2026 16:03
@radulescupetru

I get the following error: ValueError: Path msc://default/path_to_s3_dataset is not local

Preparing filesystem dataset and computing media metadata...
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/conda_env/bin/energon", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/megatron/energon/tools/prepare_media.py", line 110, in command
    stored = prepare_filesystem_dataset(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/megatron/energon/media/filesystem_prepare.py", line 44, in prepare_filesystem_dataset
    root = root_path.local_path()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/megatron/energon/epathlib/epath.py", line 258, in local_path
    raise ValueError(f"Path {self} is not local")

There's also a TODO in that method:

Only supporting local file system, because sqlite does not support remote file systems.

TODO: Implement remote file systems. Maybe create locally in tmp then upload?
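
The "create locally in tmp then upload" idea from the TODO can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: build_index_then_upload, copy_uploader, and the media table schema are hypothetical names, and the upload step is a plain local file copy standing in for a real object-store upload.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def build_index_then_upload(
    rows: list[tuple[str, int]],
    upload: Callable[[Path], None],
) -> None:
    """Build a SQLite file in a local temp dir, then pass it to `upload`.

    SQLite needs a local file, so the database is written locally first
    and only the finished file is handed to the (hypothetical) uploader.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        db_path = Path(tmp_dir) / "index.sqlite"
        conn = sqlite3.connect(db_path)
        try:
            # Hypothetical schema, for illustration only.
            conn.execute("CREATE TABLE media (path TEXT PRIMARY KEY, size INTEGER)")
            conn.executemany("INSERT INTO media VALUES (?, ?)", rows)
            conn.commit()
        finally:
            # Close before uploading so the file is fully flushed.
            conn.close()
        # The temp dir is deleted on exit, so upload must finish in here.
        upload(db_path)


def copy_uploader(dest: Path) -> Callable[[Path], None]:
    """Stand-in for a remote upload: copies the file to a local destination."""

    def _upload(src: Path) -> None:
        shutil.copyfile(src, dest)

    return _upload
```

Calling build_index_then_upload(rows, copy_uploader(some_path)) leaves a complete SQLite file at some_path. In a real setup the callable would presumably write to the remote store instead of copying locally.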

@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 9905571 to 9cca0d0 Compare April 30, 2026 11:31
@voegtlel
Collaborator Author

@radulescupetru Sorry for the delay, I had other priorities. This is now also implemented for filesystem:// links. Can you try again?

@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 9cca0d0 to 6a3b40c Compare April 30, 2026 11:33
Comment thread src/megatron/energon/local_copy.py
@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 25a9d35 to 598db74 Compare May 6, 2026 12:09
Comment thread src/megatron/energon/epathlib/epath.py Outdated
Comment thread src/megatron/energon/epathlib/epath.py
Comment thread src/megatron/energon/media/filesystem_prepare.py Outdated
Comment thread src/megatron/energon/media/filesystem_prepare.py
Comment thread src/megatron/energon/media/filesystem_prepare.py Outdated
Comment thread src/megatron/energon/media/filesystem_prepare.py Outdated
Comment thread src/megatron/energon/tools/prepare_media.py
Comment thread src/megatron/energon/flavors/webdataset/prepare.py
Comment thread src/megatron/energon/flavors/webdataset/prepare.py Outdated
Comment thread src/megatron/energon/local_copy.py
…path handling. Fix S3 emulator timestamp handling
@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 582a007 to ac7dc96 Compare May 21, 2026 16:24
Comment thread src/megatron/energon/epathlib/epath.py
Comment thread src/megatron/energon/flavors/webdataset/prepare.py
@philipp-fischer philipp-fischer merged commit 4a1d360 into develop May 28, 2026
9 checks passed
Development

Successfully merging this pull request may close these issues.

Support dataprep in object store
