I'm not sure whether this should be here for i6_core (although I think this would be useful for all users of i6_core), or even Sisyphus itself (it probably is useful for all Sisyphus users), or i6_experiments/common. If you think this should be somewhere else, we can move the issue.
The main reason is to have a workaround for the problem that Sisyphus is slow for big pipelines (rwth-i6/sisyphus#90). (If Sisyphus were not slow, we would probably not need this, and every single dependency could always be part of the graph.)
So, the basic idea: when doing neural model training experiments, some commonly reused parts of the pipeline (e.g. data preprocessing, feature extraction, alignment generation, CART, whatever) are not part of the Sisyphus graph. Instead, you have run them in a separate Sisyphus pipeline, and now you directly use the generated outputs.
Here in this issue I want to propose a more systematic approach for this which makes it more seamless, especially considering that the main intention of our recipes is to make it simple to reproduce results, both for ourselves and for outsiders who want to reproduce some result from our papers.
The idea is that it should still be possible to run the whole pipeline, and that the user does not need to run parts of the pipeline separately.
But there are some open questions or details to be filled in, so this is now open for discussion.
So now to the high-level proposal; but as said, the exact API and other aspects of this are up for discussion.
Look at the hybrid NN-HMM ASR pipeline as an example, which depends on the GMM-HMM pipeline. In between, you would get objects like RasrInitArgs, ReturnnRasrDataInput, HybridArgs, etc. Let's say you collect all the dependencies you need to train the NN in some object, like:
hybrid_nn_deps = get_all_hybrid_nn_deps()
nn_training_result = nn_training(hybrid_nn_deps)
So, the dependency boundary could be defined at the hybrid_nn_deps object.
Technically, it means, for all tk.Path objects somewhere in hybrid_nn_deps, we would replace the creator by some dummy which keeps the same hash as before, or just use hash_overwrite.
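To make the "replace the creator but keep the hash" idea concrete, here is a minimal sketch. It does not use the real tk.Path; FakePath is a hypothetical stand-in with only the attributes mentioned above (creator, hash_overwrite), and its get_hash rule is an assumption made purely for illustration. The helper name freeze_paths is also made up.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakePath:
    """Hypothetical stand-in for tk.Path; NOT the real Sisyphus class."""

    path: str
    creator: Optional[object] = None  # the job that would produce the file
    hash_overwrite: Optional[str] = None

    def get_hash(self) -> str:
        # Assumed rule for this sketch: an explicit overwrite wins,
        # otherwise the hash depends on the creator and the relative path.
        if self.hash_overwrite is not None:
            return self.hash_overwrite
        return f"{self.creator!r}/{self.path}"


def freeze_paths(obj: Any) -> Any:
    """Recursively drop the creator of every FakePath while pinning its hash."""
    if isinstance(obj, FakePath):
        return FakePath(path=obj.path, creator=None, hash_overwrite=obj.get_hash())
    if isinstance(obj, dict):
        return {key: freeze_paths(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(freeze_paths(value) for value in obj)
    return obj  # anything else is passed through unchanged
```

After freezing, nothing in the object refers to a job anymore, so the graph behind it never needs to be constructed, but everything downstream still sees the same hashes. A real implementation would also have to walk custom objects like RasrInitArgs, which this sketch does not attempt.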
What would the API look like? We want to avoid calling get_all_hybrid_nn_deps, because calling it would be slow. So, it would look something like:
hybrid_nn_deps = dependency_boundary(
    func=get_all_hybrid_nn_deps,
    ...
)
Now, the question is, what else should there be, and how should we implement it exactly. In principle, I think everything else could be optional and automatic. But let's go through it. First, on the technical questions:
- How should we store the object? Just pickling? Or some Python code representation? This would also include the hashes.
- Where should we store the object?
- What name should the file have? This could be given explicitly by the user, or maybe derived automatically from the function name (__qualname__ or so). For the automatic case, how exactly?
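For the automatic naming, one possible answer is to combine the module name and __qualname__ of the function. This is only a sketch of one option; the helper name cache_name_for and the .pkl suffix (which assumes pickling) are my own assumptions, not a settled part of the proposal.

```python
import re
from typing import Callable


def cache_name_for(func: Callable) -> str:
    """Derive a file name from the module and qualified name of `func`."""
    full_name = f"{func.__module__}.{func.__qualname__}"
    # Nested functions have "<locals>" in their __qualname__;
    # replace anything that is not safe in a file name.
    safe_name = re.sub(r"[^A-Za-z0-9_.]+", "_", full_name)
    return safe_name + ".pkl"
```

Including the module keeps two functions with the same name in different recipe files from clashing. The downside is that moving or renaming the function silently invalidates the cache, which might be an argument for allowing an explicit name as well.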
I assume then the logic is quite straightforward:
- Check if the cached object file exists, and if so, use it. Maybe make an extra check whether all its dependencies (tk.Path) exist, and error if something is missing.
- If the cached object file does not exist:
  - Run the passed func.
  - Once all dependencies are there, we can create the cached object file. (We should wait until we have this because otherwise jobs might update their dependencies, and I think the hash might change then? In any case, it feels safer.)
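The logic above could be sketched roughly like this. It assumes pickling as the storage format (which is still an open question), and the parameters cache_file and get_paths are hypothetical additions to the dependency_boundary signature; get_paths stands in for "find all tk.Path objects inside the object" and just returns plain file names here.

```python
import os
import pickle
from typing import Any, Callable, Iterable


def dependency_boundary(
    func: Callable[[], Any],
    cache_file: str,  # hypothetical parameter: where the object is stored
    get_paths: Callable[[Any], Iterable[str]],  # hypothetical: files the object depends on
) -> Any:
    if os.path.exists(cache_file):
        # Cached object exists: use it, but check that its outputs are still there.
        with open(cache_file, "rb") as f:
            obj = pickle.load(f)
        missing = [p for p in get_paths(obj) if not os.path.exists(p)]
        if missing:
            raise FileNotFoundError(f"cached object refers to missing files: {missing}")
        return obj

    # No cached object: run the (slow) function and build the real graph.
    obj = func()
    # Only write the cache once every dependency has actually been produced.
    if all(os.path.exists(p) for p in get_paths(obj)):
        os.makedirs(os.path.dirname(cache_file) or ".", exist_ok=True)
        with open(cache_file, "wb") as f:
            pickle.dump(obj, f)
    return obj
```

In this sketch the cache is only written on a later invocation, after the jobs have finished. Until then every run still pays the cost of calling func, which matches the "wait until we have this" point above.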
For the user, there are some potential actions we should implement:
- Sanity check on the cached object. E.g. there might have been some changes in the meantime to the get_all_hybrid_nn_deps function, and you want to check whether the hashes in the cached object are still correct. So basically you explicitly want to execute the whole pipeline code, and any cached objects should only be used for double checking.
I'm not sure how this action or behavior would be controlled. It could be some global setting (related: rwth-i6/sisyphus#82) or maybe some OS environment variable.
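As one way to control this, here is a sketch of the environment-variable variant. The variable name DEPENDENCY_BOUNDARY_VERIFY is made up for illustration, and load_cached and get_hashes are hypothetical callbacks standing in for "read the cached object" and "collect the hashes inside an object".

```python
import os
from typing import Any, Callable

ENV_VAR = "DEPENDENCY_BOUNDARY_VERIFY"  # hypothetical name, not an existing setting


def resolve_with_optional_check(
    func: Callable[[], Any],
    load_cached: Callable[[], Any],
    get_hashes: Callable[[Any], Any],
) -> Any:
    cached = load_cached()
    if os.environ.get(ENV_VAR, "0") != "1":
        # Normal mode: trust the cached object and never call the slow function.
        return cached
    # Verify mode: run the whole pipeline code and use the cache only for comparison.
    fresh = func()
    if get_hashes(fresh) != get_hashes(cached):
        raise RuntimeError("hashes of the cached object do not match the pipeline code")
    return fresh
```

An environment variable has the advantage that it needs no change to the config; a global setting would be more discoverable. Either way the comparison itself would look about the same.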
The proposal is also compatible with tk.import_work_directory. When executing the pipeline config and the outputs do not exist (and neither do the cached objects), it would simply execute the whole pipeline.