Utility suggestion: dependency boundary #78
Comments
@JackTemaki @christophmluscher we had this discussion.
Two things:
Having such a fixed point is the same goal here. You basically just did it more manually. The main purpose is always that it is fully reproducible. If you update the preparation workflow, you should still have the old workflow somewhere anyway so that it stays reproducible, or not? Well, at least for doing research and publications, this is important for us. But it should also be simple, straightforward and unambiguous to reproduce. If this is not the case, it's not really reproducible, as people could get something wrong. Ideally we could provide a hash for every single number in a table, and people could double check whether they end up with the same hash.
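To make the "hash per number in a table" idea concrete, here is a minimal sketch. It assumes a result row is a plain dict of strings and numbers; `result_hash` and the example row are made up for illustration and are not part of Sisyphus or i6_experiments.

```python
import hashlib
import json


def result_hash(row: dict) -> str:
    """Short, order-independent hash of one result row (hypothetical helper)."""
    # sort_keys gives a canonical serialization, so key order does not matter.
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Made-up example row, only to show the comparison.
published = result_hash({"model": "hybrid-nn-hmm", "wer": 7.3})
reproduced = result_hash({"wer": 7.3, "model": "hybrid-nn-hmm"})
different = result_hash({"model": "hybrid-nn-hmm", "wer": 7.4})
```

Someone reproducing the result would compare their own hash against the published one instead of comparing numbers by eye.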
No, I did not try. Would this recalculate the hashes for every job or use them as they are? From the profiling I did (rwth-i6/sisyphus#90), this was the main bottleneck for me (with a fast filesystem) and way too slow. If it does not recalculate the hashes, this is basically just the same as what I propose here, or not? The only difference is whether the pickled object contains the whole graph or only part of it, but this would not really matter anyway, right? If it does recalculate the hashes, it would still be way too slow.
All hashes are stored inside the pickled object and loaded again from there. So to my knowledge it should not recalculate anything, but I never profiled it to check if this is really the case.
Yes, I think this would achieve what you want to do.
After playing around with your setup I must admit that I didn't expect your graph to be that large... even zipped I end up with a file of nearly 10 MB. If you want to replace the creator with some dummy you would also have to overwrite all functions Sisyphus might want to access. I think it would be easier to replace the Overall I think this can work, but might be tricky in some parts.
Then we also had some internal discussions at i6 involving @christophmluscher, @michelwi, @Marvin84, @ZhouW321, @SimBe195 about the general concept. We had many arguments back and forth about many aspects here, like: Do we need such How should the cached object file be named? As we want to have auto-generated Python code (and not just a pickle file), and it should be inside the Git repo, it would be somewhere next to the other code. The exact naming scheme still has to be decided but is probably not so important anymore. What format should we use? As said, auto-generated Python code is preferred. Where (what repo) should the On configuration (e.g. enabling sanity check), it was said that it should be an argument to |
Btw, just like |
Python representation and dump logic is needed for the dependency boundary logic (#78). Diff logic is useful for that as well.
There is some initial (untested) implementation now in |
@albertz please do not push into |
We already discussed this, and the conclusion was that directly pushing to Where does it say this is stable? I can mark it more explicit as not stable, work-in-progress. |
Maybe you mean the parent In any case, it should be in |
No, this was never discussed. Yes, we did discuss this personally for the If you directly want to push and use your code, then please use your user folder for this. This is what we all agreed on when starting i6_experiments.
But what has this to do with bypassing PRs? I understand that in your view the whole process is too slow, because people will take time in looking at the PR, and during this time the code is not there. But if you directly push it, people will not look at it at all, and definitely not discuss it. So please: push your code to your user folder to directly use it (and also other people can test it from there as well), but for |
The discussion was more generic, or whatever we had as arguments was not specific to So, you want to have this further discussed? With more people?
I thought that we agreed to do it this way. It's also documented that way everywhere, e.g. see the readme in the root.
I don't really understand what is not visible. You can easily see all commits for As long as you would not actively take part in the development, you can also just check the current state if you are interested. If there is some relevant new state, I will also post here in the issue, just as I did. So far it's completely untested, so if you are not interested in testing it or looking at untested code, then you can safely ignore it for now. Once there is any update on this, I will post it here, and you can get a notification for that. If you are not interested in the development, it's probably not worth it to look at it yet, until I report here that I tested it and it looks fine.
The
Why would people not look at it or not discuss it? If you want to look at it, you can look at it (does not matter whether PR or not). If you don't want to look at it, you don't look at it. So why is it relevant that it is a PR? To get notifications when there is activity? As said, I will just notify here in this issue.
Again: Strong disagree on this. Why do you even care if you don't actively join the development? Once it is ready (considered as ready by those people who develop it), you will be notified, and then you can give feedback that something should be changed, and then we simply can change it. And once then the review is over, and maybe some others tested it as well, we can mark it as stable. And then further development would go via PRs as usual. And if you want to join the development, then even more this is an argument that we don't work in a separate branch (in a PR) because this has to be in the current master such that you can directly update some other stuff like users code etc without making it too complicated (merging or rebasing in the master changes etc). I think having PR-based development for a new feature which does not touch any other existing code and which is not used by anyone except those who develop it initially does not really offer any advantage but only lots of disadvantages (making everything much more complicated and this leads to much slower development, which is already way too slow). But if you really want to discuss this further, we should do a separate discussion on this, maybe also with other people, if anyone actually really cares about that. This discussion does not really belong here into this issue. |
The code works now for my test case on the GMM pipeline (collecting all args for the hybrid NN-HMM), specifically: `orig_obj = dependency_boundary(get_orig_chris_hybrid_system_init_args, hash="YePKvlzrybBk")` (In file So I think people who are interested can start reviewing the changes now. Specifically, there is Maybe we want to move
See this example of generated code. Note there are many potential optimizations to improve the generated code further, but I'm not sure how much it is worth spending time on it. One thing I'm considering is to automatically apply
I'm not sure whether this should be here for i6_core (although I think this would be useful for all users of i6_core), or even Sisyphus itself (it probably is useful for all Sisyphus users), or i6_experiments/common. If you think this should be somewhere else, we can move the issue.
The main reason is to have a workaround for the problem that Sisyphus is slow for big pipelines (rwth-i6/sisyphus#90). (If Sisyphus were not slow, we would probably not need it, and every single dependency could always be part of the graph.)
So, the basic idea is, when doing some neural model training experiments, some commonly reused parts of the pipeline (e.g. data preprocessing, feature extraction, alignment generation, CART, whatever) are not part of the Sisyphus graph but you have done that in a separate Sisyphus pipeline and now you directly use the generated outputs.
Here in this issue I want to propose a more systematic approach for this which makes it more seamless. This matters especially given the main intention of our recipes: that it is simple to reproduce results, both for ourselves and for outsiders who want to reproduce a result from our papers.
The idea is that it should still be possible to run the whole pipeline and that the user does not need to run separate parts of the pipeline separately.
But there are some open questions or details to be filled in, so this is now open for discussion.
So now to the high-level proposal; but as said, the exact API and other aspects of this are up for discussion.
Look at the hybrid NN-HMM ASR pipeline as an example, which depends on the GMM-HMM pipeline. In between, you would get objects like `RasrInitArgs`, `ReturnnRasrDataInput`, `HybridArgs`, etc. Let's say you collect all the dependencies you need to train the NN in some object, like:
So, the dependency boundary could be defined at the `hybrid_nn_deps` object.
Technically, it means, for all `tk.Path` objects somewhere in `hybrid_nn_deps`, we would replace the `creator` by some dummy which keeps the same hash as before, or just use `hash_overwrite`.
How would the API look like? We want to avoid that `get_all_hybrid_nn_deps` is called because calling it would be slow. So, it would look sth like:
Now, the question is, what else should there be, and how should we implement it exactly. In principle, I think everything else could be optional and automatic. But let's go through it. First, on the technical questions:
`__qualname__` or so). For the automatic case, how exactly?
I assume then the logic is quite straightforward:
`tk.Path`) exists and error if sth is missing.
`func`.
(We should wait until we have this because otherwise jobs might update their dependencies and I think the hash might change then? In any case, it feels safer.)
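The creator replacement and the existence check described above can be sketched as follows. This is only an illustration under stated assumptions: `FakePath` is a toy stand-in for `tk.Path` (the real Sisyphus class has a different interface), and `map_paths`, `iter_paths`, `cut_creator` and `missing_files` are hypothetical helper names.

```python
import hashlib
import os
from dataclasses import dataclass, replace
from typing import Any, Callable, Iterator, List, Optional


@dataclass(frozen=True)
class FakePath:
    """Toy stand-in for tk.Path; NOT the Sisyphus class."""
    path: str
    creator: Optional[str] = None         # name of the job that produced the file
    hash_overwrite: Optional[str] = None  # fixed hash that replaces the computed one

    def get_hash(self) -> str:
        if self.hash_overwrite is not None:
            return self.hash_overwrite
        # Toy hash that depends on the creator, as a real path hash would.
        return hashlib.sha256(f"{self.creator}:{self.path}".encode()).hexdigest()[:12]


def map_paths(obj: Any, fn: Callable[[FakePath], FakePath]) -> Any:
    """Return a copy of a nested dict/list/tuple structure with fn applied to every path."""
    if isinstance(obj, FakePath):
        return fn(obj)
    if isinstance(obj, dict):
        return {key: map_paths(value, fn) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(map_paths(value, fn) for value in obj)
    return obj


def iter_paths(obj: Any) -> Iterator[FakePath]:
    """Yield every path found in a nested dict/list/tuple structure."""
    if isinstance(obj, FakePath):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from iter_paths(value)
    elif isinstance(obj, (list, tuple)):
        for value in obj:
            yield from iter_paths(value)


def cut_creator(p: FakePath) -> FakePath:
    """Drop the creator but pin the hash, so everything downstream hashes as before."""
    return replace(p, creator=None, hash_overwrite=p.get_hash())


def missing_files(obj: Any) -> List[str]:
    """List all referenced files that do not exist on disk."""
    return [p.path for p in iter_paths(obj) if not os.path.exists(p.path)]


# Made-up dependency object, only for illustration.
deps = {
    "features": FakePath("/nonexistent/features.bundle", creator="FeatureExtractionJob"),
    "alignments": [FakePath("/nonexistent/align.cache", creator="AlignmentJob")],
}
boundary = map_paths(deps, cut_creator)
```

After the cut, `boundary` no longer references any job, yet each path reports the same hash as in `deps`. A non-empty `missing_files(boundary)` would be the point to raise an error or fall back to running the pipeline.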
For the user, there are some potential actions we should implement:
`get_all_hybrid_nn_deps` function and you want to check whether the hashes in the cached object are still correct. So basically you explicitly want to execute the whole pipeline code and any cached objects should only be used for double checking.
I'm not sure how this action or behavior would be controlled. It could be some global setting (related: rwth-i6/sisyphus#82) or maybe some OS environment variable.
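A rough sketch of how the caching and the double-check mode could fit together. Everything here is an assumption made for illustration: the name `dependency_boundary_sketch`, the environment variable `DEP_BOUNDARY_VERIFY`, the cache file naming via `__qualname__`, and the restriction to plain Python literals (dicts, lists, strings) so that the cache file is readable Python rather than a pickle. It is not the actual i6_experiments implementation.

```python
import ast
import hashlib
import os
import pprint
import tempfile

VERIFY_ENV = "DEP_BOUNDARY_VERIFY"  # hypothetical environment variable


def short_hash(obj) -> str:
    """Toy hash of a plain-literal object (pformat sorts dict keys)."""
    return hashlib.sha256(pprint.pformat(obj).encode()).hexdigest()[:12]


def _check(obj, expected):
    if expected is not None and short_hash(obj) != expected:
        raise ValueError(f"hash mismatch: {short_hash(obj)} != {expected}")


def dependency_boundary_sketch(func, *, hash=None, cache_dir):
    """Return func() while avoiding the call when a cached object exists.

    The `hash` keyword mirrors the call shown earlier in this thread and
    shadows the builtin inside this function only.
    """
    cache_file = os.path.join(cache_dir, func.__qualname__ + ".cache.py")
    verify = os.environ.get(VERIFY_ENV) == "1"
    cached = None
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cached = ast.literal_eval(f.read())
        if not verify:
            _check(cached, hash)
            return cached  # fast path: func is never called
    obj = func()  # slow path: run the real pipeline code
    _check(obj, hash)
    if cached is not None and short_hash(cached) != short_hash(obj):
        raise ValueError("cached object no longer matches the pipeline code")
    with open(cache_file, "w") as f:
        f.write(pprint.pformat(obj) + "\n")  # auto-generated Python literal
    return obj


calls = []


def get_all_hybrid_nn_deps():
    """Toy stand-in for the slow function; records each call."""
    calls.append(1)
    return {"alignments": ["align.cache"], "features": "features.bundle"}


with tempfile.TemporaryDirectory() as tmp:
    first = dependency_boundary_sketch(get_all_hybrid_nn_deps, cache_dir=tmp)
    second = dependency_boundary_sketch(get_all_hybrid_nn_deps, cache_dir=tmp)
    calls_after_cached_run = len(calls)
    os.environ[VERIFY_ENV] = "1"  # double-check mode: execute and compare
    try:
        third = dependency_boundary_sketch(get_all_hybrid_nn_deps, cache_dir=tmp)
    finally:
        del os.environ[VERIFY_ENV]
    calls_after_verify_run = len(calls)
```

The second call is served from the cache file; with the variable set, the function is executed again and the cached object is only used for comparison. A global setting instead of an environment variable would slot into the same place where `verify` is computed.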
The proposal is also compatible with `tk.import_work_directory`. When executing the pipeline config and the outputs do not exist (and neither do the cached objects), it would simply execute the whole pipeline.