
Allow to add custom traces and use them as metadata #4425

Closed
jordeu wants to merge 1 commit

Conversation

@jordeu (Collaborator) commented Oct 18, 2023

Description

This PR allows the user to add custom traces to each process. It also exposes them as workflow metadata, so the user can access them without any extra channel.

This is a more generic approach to solving the version-tracking problem described in #4386.

The idea is to allow the user to collect custom traces before running the command. The custom traces are defined as a map, where each key is the name of a custom trace and the value is a string with the Bash script to run to collect it. The output is always parsed as a string.
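
For instance, a single process could declare several custom traces in one map. This is a hypothetical sketch of the directive described above (the process name and tool commands are illustrative, not from this PR):

```nextflow
process ALIGN {
  conda "bioconda::bwa=0.7.17"

  // Hypothetical example: each key names a custom trace, each value is the
  // Bash snippet executed to collect it before the task command runs
  customTraces version_bwa:  "bwa 2>&1 | grep 'Version' | sed 's/Version: //'",
               free_disk_kb: "df -k . | tail -1 | awk '{print \$4}'"

  output:
    path 'out.txt'

  """
  echo "ALIGN" > out.txt
  """
}
```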

All the traces of the completed and resumed tasks are then available as workflow metadata at workflow.traces. If workflow.traces is used as part of a process script, it will contain the traces of all the tasks that had completed just before that process was submitted.

You can also use workflow.traces in workflow.onComplete, for example to store the custom traces in a file.

Example pipeline

See the whole pipeline here

main.nf

def parseVersions(traces) {
  // Build a YAML-like summary of all 'custom_version_*' trace fields,
  // grouped by process name
  def result = ''
  for( t in traces ) {
    def v = t.getStore().keySet()
        .findAll { it.startsWith('custom_version_') }
        .collect { "$it: ${t.getStore().get(it)}" }
    if( v ) {
      result += t.processName + ":\n    " + v.join("\n    ") + "\n"
    }
  }
  return result
}


process FASTQC {
  conda "bioconda::fastqc=0.11.9"
  customTraces version_fastqc: "fastqc --version | sed -e 's/FastQC v//g'"

  output:
    path 'output.txt'

  """
  echo "FASTQC" > output.txt
  """
}


process BAMTOOLS {
  conda "bioconda::bamtools=2.5.2"
  customTraces version_bamtools: "bamtools --version | grep -e 'bamtools' | sed 's/^.*bamtools //'"

  output:
    path 'output.txt'

  """
  echo "BAMTOOLS" > output.txt
  """
}


process MULTIQC {

  input:
    path 'fastqc.txt'
    path 'bamtools.txt'

  script:
    versions = parseVersions(workflow.traces)
    """
    cat <<-END_VERSIONS > versions.yml
${versions}
END_VERSIONS

    cat fastqc.txt bamtools.txt > result.txt
    """
}

workflow {

  FASTQC()
  BAMTOOLS()

  MULTIQC(FASTQC.out, BAMTOOLS.out)
}

workflow.onComplete {
  println "EXAMPLE VERSIONS YAML:\n${parseVersions(workflow.traces)}"   
}

Notes

  • This is still a PoC; I'm open to discussing this approach and alternatives.
  • workflow.traces is a list of the TraceRecord objects of the completed tasks.
  • Maybe, instead of exposing TraceRecord directly to the user, it would be better to expose traces through a more user-friendly custom class.

Signed-off-by: Jordi Deu-Pons <[email protected]>
@netlify bot commented Oct 18, 2023

Deploy Preview for nextflow-docs-staging ready!

🔨 Latest commit: b7b5fed
🔍 Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/652faf0aacdac30008495b77
😎 Deploy Preview: https://deploy-preview-4425--nextflow-docs-staging.netlify.app

@muffato commented Oct 23, 2023

I very much like the idea of "custom traces". They could be used to collect input sizes (for instance, the size in bytes of some files, or the number of lines).

Question about the interface workflow.traces.getStore().keySet().findAll(...). In the case of a pipeline with processes running concurrently, a process accessing that store will see different values depending on how early it runs compared to the other processes. For reproducibility, I would expect a process to only have access to the traces of its parents. To reliably have access to all traces, the reader process would have to be marked as depending on every stream of processes, to make sure it runs last. Or, if a Nextflow process is not required, should workflow.onComplete have access to them too?

@pditommaso (Member)

What would be a use case for this PR?

@jordeu (Collaborator, Author) commented Oct 26, 2023

I see two use cases:

  • Collect version info for all the tools used in the pipeline and report it in the last process with a tool like MultiQC.
  • Collect custom metrics of your processes (e.g. free disk space, input file sizes) and store them in a file in workflow.onComplete.
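
The second use case could be sketched like this. This is a hypothetical example, assuming the workflow.traces API from this PR and the custom_ key prefix used in the example pipeline above; the file name and field selection are illustrative:

```nextflow
workflow.onComplete {
  // Hypothetical sketch: write one TSV row per completed task,
  // listing its process name and all of its custom trace fields
  def rows = workflow.traces.collect { t ->
    def store  = t.getStore()
    def custom = store.keySet()
        .findAll { it.startsWith('custom_') }
        .collect { "${it}=${store.get(it)}" }
        .join('\t')
    "${t.processName}\t${custom}"
  }
  file("${workflow.launchDir}/custom-traces.tsv").text = rows.join('\n')
}
```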

@jordeu (Collaborator, Author) commented Oct 26, 2023

> Question about the interface workflow.traces.getStore().keySet().findAll(...). In the case of a pipeline with processes running concurrently, a process accessing that store will see different values depending on how early it runs compared to the other processes. For reproducibility, I would expect a process to only have access to the traces of its parents. To reliably have access to all traces, the reader process would have to be marked as depending on every stream of processes, to make sure it runs last. Or, if a Nextflow process is not required, should workflow.onComplete have access to them too?

It's true that when you use workflow.traces you get all the completed tasks, and this is not deterministic if you use it in a process that runs concurrently with others. Giving access only to upstream traces would be a nicer solution, but it doesn't seem easy to implement. I think that traces are just metadata and should not be used for anything that affects the reproducibility of a pipeline.

@bentsherman (Member)

TODO:

  • rename customTraces to trace
  • allow trace to be defined multiple times, like label, publishDir, etc.
  • remove the custom_ prefix; instead, make sure custom names don't conflict with standard trace fields
  • document the process directive
  • document workflow.trace as a metadata field that is only available in the onComplete and onError handlers
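
Under that proposal, the directive might end up looking something like this. This is a hypothetical sketch of the suggested syntax, not anything implemented in this PR; the two-argument form of trace is an assumption:

```nextflow
process FASTQC {
  conda "bioconda::fastqc=0.11.9"

  // Hypothetical: 'trace' repeated per custom trace, like label or publishDir,
  // taking a trace name and the Bash snippet that collects its value
  trace 'version_fastqc', "fastqc --version | sed -e 's/FastQC v//g'"
  trace 'free_disk_kb',   "df -k . | tail -1 | awk '{print \$4}'"

  output:
    path 'output.txt'

  """
  echo "FASTQC" > output.txt
  """
}
```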

@pditommaso (Member)
I'm not convinced we should proceed down this path. It would lead to even more do-it-yourself metadata, which the Nextflow runtime would be totally unaware of. I think we should have a more controlled approach.

@bentsherman (Member)
Let's take this discussion back to the original issue #4386

@pditommaso (Member)
I've looked more into this, and while the idea is interesting and we should pursue it to some extent, what concerns me, among other things, is that it's very fragile: a single extra newline introduced by a custom command is enough to break the trace file. Also, an invalid command can break the command wrapper script and alter the time metrics collection.

If we want to go ahead with the idea of collecting custom metadata by running third-party tools, it should be independent of the trace mechanism.

@jordeu (Collaborator, Author) commented Oct 31, 2023

> If we want to go ahead with the idea of collecting custom metadata by running third-party tools, it should be independent of the trace mechanism.

When you say independent... do you mean isolating them in a new function in the .command.run script, or something else?

@pditommaso (Member)

Yes, possibly. It should likely be executed just before or after the task command. #540

@pditommaso (Member) commented Nov 8, 2023

Closing in favour of #4459 and #4493.

@pditommaso closed this Nov 8, 2023