-
Notifications
You must be signed in to change notification settings - Fork 629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
'versions' directive in process #4386
Comments
I think that this is basically a duplicate of an issue that I wrote with @pditommaso back in 2018 😬 #879 But same motivation, similar syntax even (this is slightly better though, agree that the ability to have more than 1 version per process is necessary 👍🏻 ) |
Ah bother. I did search for The biggest flaw I can see right now is stealing a secret. From the example above:
|
I like the general design, still find it horrible having to run and parse all these CLI option. would not be better to rely on the version declared by |
In my personal experience, conda only contains about 25% of tools used in bioinformatics (although they are the big and important ones). |
Yup, we've gone back and forth on this a lot over the years. At the end of the day, the only thing you can really trust is to ask the tool itself. Also many pipelines (eg. nf-core) specify both conda and docker. Both can be overwritten by the end user. So if running with some random Docker image there's really no way to know what the version is. |
Fair enough. Other points address 1) cardinality: should this be collected once per process or once per task execution; 2) how do you expect this information to be consumed? |
Yes, once per task. It's easy enough to reduce the data to per-process later. Likely is to consume it for an output file ( |
My hope is to use the new MultiQC software versions YAML files for reporting. One per task is fine, MultiQC will handle collapsing the duplicates automatically. And yeah, would also be nice to go into nf-prov / other tooling 👍🏻 |
I've opened a PR #4425 with an alternative and more generic way of approaching this problem. Give it a try or add your comments about that alternative. |
Explaining a bit what I don't like about adding a "versions" directive and why I proposed this "custom traces" #4425 approach. To add a directive for collecting the versions of the tools at runtime seems a bit like an odd feature. It suggests that something might be failing in the software management system; otherwise, all the versions should be fixed and known beforehand without any doubt. However, I understand that reality can be different. Therefore, what we need is a directive to collect runtime metadata of the executed processes. This is why I decided to give the "customTraces" directive approach a try. To me, it also makes no sense to inject this metadata into a channel because it would mean mixing logics. This metadata is not necessary for the execution of the actual data flows. In other words, it should be possible to completely remove this metadata collection, and the pipeline should still function as intended (similar to how trace metrics are optionally collected). |
Another -huge- advantage of tracking versions without channels is that it won't clutter the DAG view, finally making it readable on nf-core-style pipelines |
@pditommaso on Jordi's PR:
Are you okay with a I don't really care whether the tool versions are captured through a specialized directive or a more generic directive. The ability to specify custom trace fields will be useful in the future for other reasons like fine-grained resource prediction, but the advantage of a specialized |
Taking into consideration this and the discussion for "custom traces" (#4425), I think the problem could be decouple into two main goals:
Command outputRegarding the first point, nextflow allows the execution of an arbitrary command or user script. Therefore, the main need is to remove the boilerplate that is currently needed to isolate and capture the output of commands that have special semantics e.g. version. My proposal is to add a new output type that, instead of capturing a file ( A hypothetical syntax could be: process something {
output:
command('bedtools --version')
'''
bedtools --this -that
'''
} Alternatively, it could be implemented as an optional parameter of the existing process something {
output:
stdout() // capture the standard output of the main script
stdout('bedtools --version') // capture the standard output of the command specified as argument
'''
bedtools --this -that
'''
} Collecting output across the processRegarding the second point, instead of collecting the outputs for all processes, as it's done via the mix pattern, a much better approach would be to reverse the logic, with the processes emitting the output over the same "sematic" channel. The topic channels provide an excellent fit for this use case, bringing two benefits: a. remove the The resulting syntax would be: process foo {
output:
command('bedtools --version'), topic: versions
script:
'''
bedtools --this --that
'''
}
process bar {
output:
command('bowtie --version'), topic: versions
script:
'''
bowtie --this --that
'''
}
workflow {
foo()
bar()
multiqc ( channel.topic('version'), etc )
} |
I like this alot, it's simple but effective. Just some thoughts on it:
What I'm trying to say is that overloading import tensorflow as tf; print(tf.__version__) |
New feature
Many pipelines attempt to catch the software versions running within the container. The most common method is to catch a file with the software version written to it, typically using some fun string parsing:
https://github.com/nf-core/modules/blob/72f633b541815ae9565599e5b3cf66d863736ae1/modules/nf-core/fastqc/main.nf#L38-L41
https://github.com/epi2me-labs/wf-template/blob/3edea44cdf9449093a2adc2f08bb16ec143373bf/main.nf#L20-L30
This populates the output directive with an additional channel and prevents people using the pipe operator amongst other things.
Why not get this into a separate, native directive? It could automatically populate some property of the
process
which is available to use as a string (or map of strings).Usage scenario
Anyone who wants to record their software at runtime.
Suggest implementation
Somewhere within the .command.run the .task.versions file is created:
Back in the pipeline code:
The text was updated successfully, but these errors were encountered: