Interrupt and resume merge_datasets #784

@aRI0U

Description

🚀 Feature

The ability to interrupt and resume the merge_datasets process.

Motivation

When dealing with very large datasets, ld.optimize cannot process the whole dataset in one go on a single machine. The workflow then becomes:

split the dataset into pieces -> run ld.optimize on every piece (possibly in parallel) -> run merge_datasets to reconstruct the full dataset

merge_datasets can take a significant amount of time and sometimes crashes (interrupted connection, dead workers). In that situation, it would be great if re-executing merge_datasets resumed the merging operation instead of failing. Right now, I have to delete the partially merged output folder and restart from the beginning.
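For reference, the current workaround (delete the partial output, then merge again) can be sketched as follows. This is a minimal sketch, not litdata code: `merge_from_scratch` is a hypothetical helper, and `merge_fn` stands in for any merge routine that accepts a list of input directories and an output directory, which is how I assume `merge_datasets` would be passed in.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Sequence

# Assumed shape of the merge routine: (input_dirs, output_dir) -> None.
MergeFn = Callable[[List[str], str], None]


def merge_from_scratch(merge_fn: MergeFn, input_dirs: Sequence[str], output_dir: str) -> None:
    """Remove any partially merged output, then run the merge from the beginning."""
    out = Path(output_dir)
    if out.exists():
        # Leftover from a crashed attempt: nothing here is reused.
        shutil.rmtree(out)
    merge_fn(list(input_dirs), str(out))
```

Every crash costs a full re-run with this approach, which is the cost the feature request is meant to avoid.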

Pitch

When merge_datasets crashes and I execute it again, instead of failing it should scan the already created folder and resume the merging operation.

Alternatives

In the meantime, would it be viable to call merge_datasets "recursively"? Let's say my dataset is split into 20 parts: could I call merge_datasets separately on parts 0-4, 5-9, 10-14 and 15-19, and then call merge_datasets again on the four resulting folders? Would that be equivalent to calling merge_datasets only once on everything?
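To make the question concrete, here is a rough sketch of the staged merge I have in mind. It is hypothetical user-side code, not part of litdata: `staged_merge`, `chunked`, the `group_N` folder names and the `.done` marker files are all made up for illustration, and `merge_fn` is assumed to take a list of input directories and an output directory. Whether the final merge of already merged folders gives the same result as a single merge is exactly the open question, so the sketch does not settle it.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Sequence

# Assumed shape of the merge routine: (input_dirs, output_dir) -> None.
MergeFn = Callable[[List[str], str], None]


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def staged_merge(merge_fn: MergeFn, input_dirs: Sequence[str],
                 work_dir: str, output_dir: str, group_size: int = 5) -> None:
    """Merge in groups, then merge the group outputs; finished groups are skipped on re-run."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    intermediate = []
    for idx, group in enumerate(chunked(input_dirs, group_size)):
        group_out = work / f"group_{idx}"
        marker = work / f"group_{idx}.done"  # kept outside group_out on purpose
        if not marker.exists():
            if group_out.exists():
                shutil.rmtree(group_out)  # partial output of a crashed attempt
            merge_fn(group, str(group_out))
            marker.touch()  # written only after the group merge returned
        intermediate.append(str(group_out))
    final = Path(output_dir)
    if final.exists():
        shutil.rmtree(final)  # the final step itself is not resumable here
    merge_fn(intermediate, str(final))
```

With 20 parts and `group_size=5`, a crash during the third group means only that group, the fourth group and the final merge are run again on the next call.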

Metadata

    Labels

    enhancement (New feature or request)
