
'versions' directive in process #4386

Closed
adamrtalbot opened this issue Oct 9, 2023 · 15 comments

Comments

@adamrtalbot
Collaborator

New feature

Many pipelines attempt to capture the versions of the software running within the container. The most common method is to capture a file with the software version written to it, typically using some fun string parsing:

https://github.com/nf-core/modules/blob/72f633b541815ae9565599e5b3cf66d863736ae1/modules/nf-core/fastqc/main.nf#L38-L41

https://github.com/epi2me-labs/wf-template/blob/3edea44cdf9449093a2adc2f08bb16ec143373bf/main.nf#L20-L30

This populates the output directive with an additional channel and, amongst other things, prevents people from using the pipe operator.
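As a concrete sketch of that boilerplate (with echo standing in for the real fastqc binary, since the actual invocation and output format vary per tool):

```shell
#!/usr/bin/env bash
# Typical version-capture boilerplate from module scripts: run the tool's
# --version flag and strip the surrounding text with sed.
# `echo` simulates the output of `fastqc --version` here.
fastqc_version=$(echo "FastQC v0.12.1" | sed -e "s/FastQC v//g")

# The parsed value is then written by hand into a versions file.
printf 'fastqc: %s\n' "$fastqc_version"
```

Every module repeats a variant of this, which is exactly the boilerplate a native directive could absorb.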

Why not turn this into a separate, native directive? It could automatically populate some property of the process, available for use as a string (or a map of strings).

Usage scenario

Anyone who wants to record their software versions at runtime.

Suggested implementation

process FASTQC {
    // process goes here
    versions:
    fastqc = 'fastqc --version | sed -e "s/FastQC v//g"'
    python = 'python --version'
}

Somewhere within the .command.run script, a .task.versions file is created:

fastqc: 0.12.1
python: 3.11.2
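A minimal sketch of how the wrapper might build that file, assuming it receives the declared name/command pairs (tool invocations are simulated with echo; this is not how Nextflow actually implements anything):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper step: evaluate each declared version command and
# append a "name: output" line to .task.versions.
# Real tool calls are simulated with echo here.
declare -A version_cmds=(
  [fastqc]='echo "FastQC v0.12.1" | sed -e "s/FastQC v//g"'
  [python]='echo "Python 3.11.2" | sed -e "s/Python //g"'
)

: > .task.versions
for name in "${!version_cmds[@]}"; do
  printf '%s: %s\n' "$name" "$(eval "${version_cmds[$name]}")" >> .task.versions
done

sort .task.versions   # sorted only so the display order is deterministic
```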

Back in the pipeline code:

FASTQC.versions == [ name: ${task.process}, versions: [ fastqc: "0.12.1", python: "3.11.2" ]  ]

// It's syntactic sugar around the FASTQC.out.versions channel, so can be used like so:
ch_versions = ch_versions.mix(FASTQC.versions)
@ewels
Member

ewels commented Oct 9, 2023

I think that this is basically a duplicate of an issue that I wrote with @pditommaso back in 2018 😬 #879

But same motivation, similar syntax even (this is slightly better though, agree that the ability to have more than 1 version per process is necessary 👍🏻 )

@adamrtalbot
Collaborator Author

Ah, bother. I did search for "versions", but I see that feature is slightly more generic. You could probably use this feature to implement generic info as well as versions.

The biggest flaw I can see right now is that it could be used to steal a secret. From the example above:

versions:
sneaky-secret-steal = "echo $SECRET"

@pditommaso
Member

I like the general design, but I still find it horrible having to run and parse all these CLI options. Would it not be better to rely on the versions declared by conda and spack packages?

@adamrtalbot
Collaborator Author

In my personal experience, conda only covers about 25% of the tools used in bioinformatics (although they are the big and important ones).

@ewels
Member

ewels commented Oct 9, 2023

Yup, we've gone back and forth on this a lot over the years. At the end of the day, the only thing you can really trust is to ask the tool itself. Also many pipelines (eg. nf-core) specify both conda and docker. Both can be overwritten by the end user. So if running with some random Docker image there's really no way to know what the version is.

@pditommaso
Member

Fair enough. Other points to address: 1) cardinality: should this be collected once per process or once per task execution? 2) how do you expect this information to be consumed?

@adamrtalbot
Collaborator Author

adamrtalbot commented Oct 10, 2023

Yes, once per task. It's easy enough to reduce the data to per-process later. The likely use is to consume it in an output file (collectFile), or wrap it into an output tool like nf-prov.

@ewels
Member

ewels commented Oct 11, 2023

My hope is to use the new MultiQC software versions YAML files for reporting. One per task is fine, MultiQC will handle collapsing the duplicates automatically.

And yeah, would also be nice to go into nf-prov / other tooling 👍🏻

@jordeu
Collaborator

jordeu commented Oct 18, 2023

I've opened PR #4425 with an alternative, more generic way of approaching this problem. Give it a try or add your comments about that alternative.

@jordeu
Collaborator

jordeu commented Oct 26, 2023

Explaining a bit what I don't like about adding a "versions" directive and why I proposed this "custom traces" #4425 approach.

Adding a directive for collecting the versions of the tools at runtime seems a bit like an odd feature. It suggests that something might be failing in the software management system; otherwise, all the versions would be fixed and known beforehand without any doubt.

However, I understand that reality can be different. Therefore, what we need is a directive to collect runtime metadata of the executed processes. This is why I decided to give the "customTraces" directive approach a try. To me, it also makes no sense to inject this metadata into a channel because it would mean mixing logics. This metadata is not necessary for the execution of the actual data flows. In other words, it should be possible to completely remove this metadata collection, and the pipeline should still function as intended (similar to how trace metrics are optionally collected).

@muffato

muffato commented Oct 26, 2023

To me, it also makes no sense to inject this metadata into a channel because it would mean mixing logics.

Another -huge- advantage of tracking versions without channels is that it won't clutter the DAG view, finally making it readable on nf-core-style pipelines

@bentsherman
Member

@pditommaso on Jordi's PR:

I'm not convinced we should proceed down this path. It would lead to even more do-it-yourself metadata, which the nextflow runtime would be totally unaware of. Think we should have a more controlled approach

Are you okay with a versions directive then?

I don't really care whether the tool versions are captured through a specialized directive or a more generic directive. The ability to specify custom trace fields will be useful in the future for other reasons like fine-grained resource prediction, but the advantage of a specialized versions directive is that it provides a standard way for Nextflow to understand "tool versions", which is useful not only for multiqc but also provenance reports like RO-crate.

@pditommaso
Member

pditommaso commented Nov 1, 2023

Taking into consideration this and the discussion for "custom traces" (#4425), I think the problem could be decoupled into two main goals:

  1. Provide a mechanism to run a semantic-oriented command whose output needs to be captured.
  2. Provide the ability to collect all outputs with the same semantics across multiple processes.

Command output

Regarding the first point, Nextflow allows the execution of an arbitrary command or user script. Therefore, the main need is to remove the boilerplate that is currently required to isolate and capture the output of commands that have special semantics, e.g. versions.

My proposal is to add a new output type that, instead of capturing a file (path) or an environment variable (env), allows the execution of a user-provided command, the output of which is emitted as a channel value.

A hypothetical syntax could be:

process something {
  output: 
  command('bedtools --version')

  '''
  bedtools --this -that
  '''
} 

Alternatively, it could be implemented as an optional parameter of the existing stdout output type, e.g.

process something {
  output:
  stdout()                     // capture the standard output of the main script   
  stdout('bedtools --version') // capture the standard output of the command specified as argument 

  '''
  bedtools --this -that
  '''
} 
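For intuition, the capture semantics proposed above could look roughly like this inside the task wrapper (`.command.version` is a hypothetical file name introduced here for illustration, and echo simulates both the main script and `bedtools --version`):

```shell
#!/usr/bin/env bash
# Sketch of the proposed capture: the side command's stdout goes to its own
# file, separate from the main script's stdout, so each can be emitted as a
# distinct channel value. Both commands are simulated with echo.
echo "main analysis output" > .command.out        # stdout of the main script
echo "bedtools v2.31.0"     > .command.version    # stdout of the side command

cat .command.version
```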

Collecting output across processes

Regarding the second point, instead of collecting the outputs from all processes, as is done via the mix pattern, a much better approach would be to reverse the logic, with the processes emitting their output over the same "semantic" channel.

The topic channels provide an excellent fit for this use case, bringing two benefits:

a. remove the mix pattern boilerplate code
b. improve modularization, because any process can emit to the topic without the workflow having to collect and wire its outputs explicitly

The resulting syntax would be:

process foo {
  output: 
  command('bedtools --version'), topic: versions
  script: 
  ''' 
  bedtools --this --that
  '''
} 

process bar {
  output: 
  command('bowtie --version'), topic: versions
  script: 
  ''' 
  bowtie --this --that
  '''
}

workflow {
   foo()
   bar() 
   multiqc ( channel.topic('versions'), etc )
}

@phue
Contributor

phue commented Nov 8, 2023

I like this a lot; it's simple but effective.

Just some thoughts on it:

command() in the examples above could also be bash(), because that's effectively what it is

What I'm trying to say is that overloading stdout() seems to be the more powerful choice, as it could generalise to other interpreters, e.g. using python:

import tensorflow as tf; print(tf.__version__)

@bentsherman
Member

Solved by #4459 and #4493
