
Allow to add custom traces and use them as metadata #4425

Closed
jordeu wants to merge 1 commit

Conversation

@jordeu (Collaborator) commented Oct 18, 2023

Description

This PR allows the user to add custom traces to each process. It also exposes them as workflow metadata, so the user can access them without any extra channel.

This is a more generic approach to solving the version-tracking problem described in #4386.

The idea is to allow the user to collect custom traces before running the command. The custom traces are defined as a map, where each key is the name of a custom trace and the value is a string with the Bash script to run to collect it. The output is always parsed as a string.
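
For instance, a single process could declare several custom traces in one map. This is a hypothetical sketch of the directive described above (the process name and tool commands are illustrative, not from this PR):

```nextflow
process ALIGN {
  conda "bioconda::bwa=0.7.17"

  // Hypothetical example: each key names a custom trace, each value is the
  // Bash snippet executed to collect it before the task command runs
  customTraces version_bwa:  "bwa 2>&1 | grep 'Version' | sed 's/Version: //'",
               free_disk_kb: "df -k . | tail -1 | awk '{print \$4}'"

  output:
    path 'out.txt'

  """
  echo "ALIGN" > out.txt
  """
}
```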

All the traces of the completed and resumed tasks are then available as workflow metadata at workflow.traces. If workflow.traces is used as part of a process script, it will contain the traces of all the tasks that had completed just before that process was submitted.

You can also use workflow.traces in workflow.onComplete, for example to store the custom traces in a file.

Example pipeline

See the whole pipeline here

main.nf

def parseVersions(traces) {
  // Build a YAML-like summary of all 'custom_version_*' trace fields,
  // grouped by process name
  def result = ''
  for( t in traces ) {
    def v = t.getStore().keySet()
        .findAll { it.startsWith('custom_version_') }
        .collect { "$it: ${t.getStore().get(it)}" }
    if( v ) {
      result += t.processName + ":\n    " + v.join("\n    ") + "\n"
    }
  }
  return result
}


process FASTQC {
  conda "bioconda::fastqc=0.11.9"
  customTraces version_fastqc: "fastqc --version | sed -e 's/FastQC v//g'"

  output:
    path 'output.txt'

  """
  echo "FASTQC" > output.txt
  """
}


process BAMTOOLS {
  conda "bioconda::bamtools=2.5.2"
  customTraces version_bamtools: "bamtools --version | grep -e 'bamtools' | sed 's/^.*bamtools //'"

  output:
    path 'output.txt'

  """
  echo "BAMTOOLS" > output.txt
  """
}


process MULTIQC {

  input:
    path 'fastqc.txt'
    path 'bamtools.txt'

  script:
    versions = parseVersions(workflow.traces)
    """
    cat <<-END_VERSIONS > versions.yml
${versions}
END_VERSIONS

    cat fastqc.txt bamtools.txt > result.txt
    """
}

workflow {

  FASTQC()
  BAMTOOLS()

  MULTIQC(FASTQC.out, BAMTOOLS.out)
}

workflow.onComplete {
  println "EXAMPLE VERSIONS YAML:\n${parseVersions(workflow.traces)}"   
}

Notes

  • This is still a PoC; I'm open to discussing this approach and alternatives.
  • workflow.traces is a list of the TraceRecord objects of the completed tasks.
  • Maybe, instead of exposing TraceRecord directly to the user, it would be better to expose traces through a more user-friendly custom class.

Signed-off-by: Jordi Deu-Pons <[email protected]>
@netlify bot commented Oct 18, 2023

Deploy Preview for nextflow-docs-staging ready!

🔨 Latest commit: b7b5fed
🔍 Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/652faf0aacdac30008495b77
😎 Deploy Preview: https://deploy-preview-4425--nextflow-docs-staging.netlify.app

@muffato commented Oct 23, 2023

I very much like the idea of "custom traces". They could be used to collect input sizes (for instance, the size in bytes of some files, or the number of lines).

Question about the interface workflow.traces.getStore().keySet().findAll(...). In the case of a pipeline with processes running concurrently, a process accessing that store will see different values depending on how early it runs compared to the other processes. For reproducibility, I would expect a process to only have access to the traces of its parents. To reliably have access to all traces, the reader process would have to be marked as depending on every stream of processes, to make sure it runs last. Or, if a Nextflow process is not required, should workflow.onComplete have access to them too?

@pditommaso (Member)

What would be a use case for this PR?

@jordeu (Collaborator, Author) commented Oct 26, 2023

I see two use cases:

  • Collect version info for all the tools used in the pipeline and report it in the last process with a tool like MultiQC.
  • Collect custom metrics of your processes (e.g. free disk space, input file sizes) and store them in a file in workflow.onComplete.
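
The second use case could be sketched like this. This is a hypothetical example, assuming the workflow.traces API from this PR and the custom_ key prefix used in the example pipeline above; the file name and field selection are illustrative:

```nextflow
workflow.onComplete {
  // Hypothetical sketch: write one TSV row per completed task,
  // listing its process name and all of its custom trace fields
  def rows = workflow.traces.collect { t ->
    def store  = t.getStore()
    def custom = store.keySet()
        .findAll { it.startsWith('custom_') }
        .collect { "${it}=${store.get(it)}" }
        .join('\t')
    "${t.processName}\t${custom}"
  }
  file("${workflow.launchDir}/custom-traces.tsv").text = rows.join('\n')
}
```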

@jordeu (Collaborator, Author) commented Oct 26, 2023

> Question about the interface workflow.traces.getStore().keySet().findAll(...). In the case of a pipeline with processes running concurrently, a process accessing that store will see different values depending on how early it runs compared to the other processes. For reproducibility, I would expect a process to only have access to the traces of its parents. To reliably have access to all traces, the reader process would have to be marked as depending on every stream of processes, to make sure it runs last. Or, if a Nextflow process is not required, should workflow.onComplete have access to them too?

It's true that when you use workflow.traces you get all the completed tasks, and this is not deterministic if you use it in a process that runs concurrently with others. Giving access only to upstream traces would be a nicer solution, but it doesn't seem easy to implement. I think that traces are just metadata and should not be used for anything that affects the reproducibility of a pipeline.

@bentsherman (Member)

TODO:

  • rename customTraces to trace
  • allow trace to be defined multiple times, like label, publishDir, etc.
  • remove the custom_ prefix; instead, make sure custom names don't conflict with standard trace fields
  • document the process directive
  • document workflow.trace as a metadata field that is only available in the onComplete and onError handlers
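
Under that proposal, the directive might end up looking something like this. This is a hypothetical sketch of the suggested syntax, not anything implemented in this PR; the two-argument form of trace is an assumption:

```nextflow
process FASTQC {
  conda "bioconda::fastqc=0.11.9"

  // Hypothetical: 'trace' repeated per custom trace, like label or publishDir,
  // taking a trace name and the Bash snippet that collects its value
  trace 'version_fastqc', "fastqc --version | sed -e 's/FastQC v//g'"
  trace 'free_disk_kb',   "df -k . | tail -1 | awk '{print \$4}'"

  output:
    path 'output.txt'

  """
  echo "FASTQC" > output.txt
  """
}
```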

@pditommaso (Member)
I'm not convinced we should proceed down this path. It would lead to even more do-it-yourself metadata, which the Nextflow runtime would be totally unaware of. I think we should have a more controlled approach.

@bentsherman (Member)
Let's take this discussion back to the original issue #4386

@pditommaso (Member)
I've looked more into this, and while the idea is interesting and we should pursue it to some extent, what concerns me, among other things, is that it's very fragile: a single extra newline introduced by a custom command is enough to break the trace file. Also, an invalid command can break the command wrapper script and alter the time metrics collection.

If we want to go ahead with the idea of collecting custom metadata by running third-party tools, it should be independent of the trace mechanism.

@jordeu (Collaborator, Author) commented Oct 31, 2023

> If we want to go ahead with the idea of collecting custom metadata by running third-party tools, it should be independent of the trace mechanism.

When you say independent... do you mean isolating them in a new function in the .command.run script, or something else?

@pditommaso (Member)

Yes, possibly. It should likely be executed just before or after the task command. #540

@pditommaso (Member) commented Nov 8, 2023

Closing in favour of #4459 and #4493.

@pditommaso closed this Nov 8, 2023