Interrupt and resume merge_datasets

## 🚀 Feature

Make it possible to interrupt and resume the `merge_datasets` process.

### Motivation

When dealing with very large datasets, `ld.optimize` cannot process the whole dataset on a single machine in one go. The workflow then becomes:

split the dataset into pieces -> run `ld.optimize` on every piece (possibly in parallel) -> call `merge_datasets` to reconstruct the full dataset

`merge_datasets` can take a significant amount of time and sometimes crashes (interrupted connection, dead workers). In that situation, it would be great if re-executing `merge_datasets` resumed the merging operation instead of failing. Right now, I have to delete the partially merged output folder and restart from the beginning.

### Pitch

When `merge_datasets` crashes and I execute it again, it should scan the already created output folder and resume the merging operation instead of failing.
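
To make the requested behavior concrete, here is a minimal sketch of the "scan the output folder and skip what is already there" idea on plain files. It is not how `merge_datasets` is implemented: `resume_copy` and the `plan` mapping (destination file name to source path) are hypothetical names, and a real merge would also have to rebuild the dataset index, which this sketch ignores.

```python
import os
import shutil
from pathlib import Path
from typing import Dict, List


def resume_copy(plan: Dict[str, Path], output_dir: Path) -> List[str]:
    """Copy every planned file into output_dir, skipping finished ones.

    Hypothetical helper for illustration only; not part of litdata.
    Returns the destination names copied during this call.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for dst_name, src in plan.items():
        dst = output_dir / dst_name
        # Treat a file as done if it exists with the same size as its source.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        # Write under a temporary name first, so a crash mid-copy never
        # leaves a truncated file under the final name.
        tmp = output_dir / (dst_name + ".tmp")
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)  # rename into place once the copy is complete
        copied.append(dst_name)
    return copied
```

Calling `resume_copy` a second time with the same `plan` copies nothing, which is the property the re-execution would need.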

### Alternatives

In the meantime, would it be viable to call `merge_datasets` "recursively"? Say my dataset is split into 20 parts: I would call `merge_datasets` separately on parts 0-4, 5-9, 10-14, and 15-19, and then call `merge_datasets` again on the four resulting folders. Would that be equivalent to calling `merge_datasets` once on everything?
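
The grouped workaround could be scripted roughly as below. This is only a sketch under assumptions: `merge_in_groups`, the `.done` marker files, and the `work_dir` layout are hypothetical, and the merge function is passed in as a callable taking `(input_dirs, output_dir)`. Whether the output of one merge is a valid input to another merge is exactly the open question above, so the sketch does not settle that.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Sequence

# Assumed shape of the merge callable: (input_dirs, output_dir) -> None.
MergeFn = Callable[[List[str], str], None]


def _merge_once(inputs: List[str], out: Path, marker: Path, merge_fn: MergeFn) -> None:
    """Run merge_fn unless an earlier run already recorded success."""
    if marker.exists():
        return
    if out.exists():
        shutil.rmtree(out)  # drop partial output left behind by a crashed run
    merge_fn(inputs, str(out))
    marker.touch()  # written only after merge_fn returned without raising


def merge_in_groups(
    input_dirs: Sequence[str],
    work_dir: str,
    final_dir: str,
    merge_fn: MergeFn,
    group_size: int = 5,
) -> List[str]:
    """Merge in groups of group_size, then merge the group outputs.

    Hypothetical helper: a crash only costs the group that was in flight.
    Returns the list of intermediate group folders.
    """
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    intermediate = []
    for start in range(0, len(input_dirs), group_size):
        idx = start // group_size
        out = work / f"group_{idx}"
        # Markers live next to the group folders, not inside them, so they
        # never end up among the dataset files.
        _merge_once(list(input_dirs[start:start + group_size]), out,
                    work / f"group_{idx}.done", merge_fn)
        intermediate.append(str(out))
    _merge_once(intermediate, Path(final_dir), work / "final.done", merge_fn)
    return intermediate
```

With 20 parts and `group_size=5` this performs four group merges plus one final merge, and re-running after a crash repeats only the merges without a `.done` marker.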


Interrupt and resume merge_datasets #784
