Utility suggestion: dependency boundary

I'm not sure whether this should be here for i6_core (although I think this would be useful for all users of i6_core), or even Sisyphus itself (it probably is useful for all Sisyphus users), or i6_experiments/common. If you think this should be somewhere else, we can move the issue.

The main reason is to have a workaround for the problem that Sisyphus is slow for big pipelines (https://github.com/rwth-i6/sisyphus/issues/90). (If Sisyphus were not slow, we would probably not need this, and every single dependency could always be part of the graph.)

So, the basic idea is: when doing some neural model training experiments, some commonly reused parts of the pipeline (e.g. data preprocessing, feature extraction, alignment generation, CART, whatever) are not part of the Sisyphus graph. Instead, you have run them in a separate Sisyphus pipeline, and now you directly use the generated outputs.

Here in this issue I want to propose a more systematic approach for this which makes this more seamless, especially considering the main intention of our recipes: that it is simple to reproduce some results, both for ourselves and for outsiders who want to reproduce some result from our papers.

The idea is that it should still be possible to run the whole pipeline and that the user does not need to run separate parts of the pipeline separately.

But there are some open questions or details to be filled in, so this is now open for discussion.

---

So now to the high-level proposal; but as said, the exact API and other aspects of this are up for discussion.

Look at the hybrid NN-HMM ASR pipeline as an example, which depends on the GMM-HMM pipeline. In between, you would get objects like `RasrInitArgs`, `ReturnnRasrDataInput`, `HybridArgs`, etc. Let's say you collect all the dependencies you need to train the NN in some object, like:
```python
hybrid_nn_deps = get_all_hybrid_nn_deps()

nn_training_result = nn_training(hybrid_nn_deps)
```

So, the dependency boundary could be defined at the `hybrid_nn_deps` object.

Technically, it means, for all `tk.Path` objects somewhere in `hybrid_nn_deps`, we would replace the `creator` by some dummy which keeps the same hash as before, or just use `hash_overwrite`.
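To make the "replace the `creator` but keep the hash" step more concrete, here is a rough sketch. It does not use the real `tk.Path`: `FakePath`, its hashing rule, and `cut_creators` are hypothetical stand-ins that only mirror the `creator` and `hash_overwrite` attributes mentioned above, so treat this as an illustration of the idea and not as how Sisyphus computes hashes.

```python
from dataclasses import dataclass
from typing import Any, Optional, Set


@dataclass
class FakePath:
    """Hypothetical stand-in for tk.Path; not the real Sisyphus class."""

    path: str
    creator: Optional[object] = None
    hash_overwrite: Optional[str] = None

    def get_hash(self) -> str:
        # Assumed rule for this sketch only: an explicit overwrite wins,
        # otherwise the hash is derived from the creator and the path.
        if self.hash_overwrite is not None:
            return self.hash_overwrite
        return f"{self.creator!r}/{self.path}"


def cut_creators(obj: Any, _seen: Optional[Set[int]] = None) -> None:
    """Walk obj recursively; pin the hash and drop the creator of every FakePath."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:  # guard against reference cycles
        return
    _seen.add(id(obj))
    if isinstance(obj, FakePath):
        obj.hash_overwrite = obj.get_hash()  # freeze the hash first
        obj.creator = None  # then detach the path from the job graph
    elif isinstance(obj, dict):
        for value in obj.values():
            cut_creators(value, _seen)
    elif isinstance(obj, (list, tuple, set)):
        for value in obj:
            cut_creators(value, _seen)
    elif hasattr(obj, "__dict__"):
        for value in vars(obj).values():
            cut_creators(value, _seen)
```

After `cut_creators(hybrid_nn_deps)`, every path in this toy model reports the same hash as before but no longer references the job that produced it.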

What would the API look like? We want to avoid calling `get_all_hybrid_nn_deps`, because calling it would be slow. So, it would look something like:
```python
hybrid_nn_deps = dependency_boundary(
  func=get_all_hybrid_nn_deps,
  ...
)
```

Now, the question is, what else should there be, and how should we implement it exactly. In principle, I think everything else could be optional and automatic. But let's go through it. First, on the technical questions:

- How should we store the object? Just pickling? Or some Python code representation? This would also include the hashes.
- Where should we store the object?
- What name should the file have? This could be explicit by the user, or maybe automatic by the function name (`__qualname__` or so). For the automatic case, how exactly?
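For the automatic naming case, one possible answer could look like the following sketch. The helper name `default_cache_file`, the directory `dependency_boundaries`, and the `.pkl` suffix are all assumptions made for illustration; none of this is decided.

```python
import re
from pathlib import Path


def default_cache_file(func, cache_dir="dependency_boundaries"):
    """Derive a cache file path from the function's module and qualified name."""
    name = f"{func.__module__}.{func.__qualname__}"
    # __qualname__ can contain characters like "<locals>", so make it file-system safe.
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    return Path(cache_dir) / f"{safe}.pkl"
```

Including `__module__` avoids clashes between functions with the same name in different recipe files, at the cost of invalidating the cache when a function is moved.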

I assume then the logic is quite straightforward:

- Check if the cached object file exists, and if so, use it. Maybe make an extra check whether all its dependencies (`tk.Path`) exist, and error if something is missing.
- If cached object file does not exist:
  - Run the passed `func`.
  - In case all dependencies are there, we can create the cached object file.
    (We should wait until we have this because otherwise jobs might update their dependencies and I think the hash might change then? In any case, it feels safer.)
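The logic above could be sketched roughly as follows, assuming plain pickling as the storage format. The parameters `cache_file` and `all_outputs_exist` are hypothetical: the latter stands in for "check that every `tk.Path` inside the object exists", which is left to the caller here.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def dependency_boundary(
    func: Callable[[], Any],
    cache_file: Path,
    all_outputs_exist: Callable[[Any], bool],
) -> Any:
    """Return the cached object if available, otherwise run func and maybe cache it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as f:
            obj = pickle.load(f)
        # Extra check: the cached object is useless if its outputs are gone.
        if not all_outputs_exist(obj):
            raise FileNotFoundError(f"outputs referenced by {cache_file} are missing")
        return obj
    obj = func()  # slow path: builds the whole sub-graph
    if all_outputs_exist(obj):
        # Only write the cache once all outputs are there.
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open("wb") as f:
            pickle.dump(obj, f)
    return obj
```

In this sketch, `func` keeps being called on every run until all outputs exist, so the pipeline still works end to end from an empty work directory.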

For the user, there are some potential actions we should implement:

- Sanity check on the cached object. E.g. there might have been some changes in the meantime to the `get_all_hybrid_nn_deps` function and you want to check whether the hashes in the cached object are still correct. So basically you explicitly want to execute the whole pipeline code and any cached objects should only be used for double checking.

I'm not sure how this action or behavior would be controlled. It could be some global setting (related: https://github.com/rwth-i6/sisyphus/issues/82) or maybe some OS environment variable.
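As one possible shape for the environment variable variant, here is a small sketch. The variable name `DEPENDENCY_BOUNDARY_CHECK` and both helper functions are made up for illustration, and the hashes are modeled as a plain mapping from some path identifier to a hash string.

```python
import os
from typing import Dict, List, Mapping


def sanity_check_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True if the (hypothetical) check mode is switched on."""
    return environ.get("DEPENDENCY_BOUNDARY_CHECK", "0") == "1"


def diff_hashes(cached: Dict[str, str], fresh: Dict[str, str]) -> List[str]:
    """Return the sorted keys whose hash differs or which exist on one side only."""
    keys = cached.keys() | fresh.keys()
    return sorted(k for k in keys if cached.get(k) != fresh.get(k))
```

In check mode, the boundary would always execute `func`, collect the fresh hashes, and report a non-empty `diff_hashes` result as an error instead of silently using the cached object.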

The proposal is also compatible with `tk.import_work_directory`. When executing the pipeline config and the outputs do not exist (and neither do the cached objects), it would simply execute the whole pipeline.

