Workflow output definitions and schema #4670
I see it like this:

**Modelling workflow outputs**

The point at 2 is the core. The goal is to provide a way to define the structure of the output produced by a pipeline execution in a concise and expressive way, and to keep it in a single place. A core pattern that needs to be addressed is how to define a semantically organised directory structure collecting some or all workflow output files, with the ability to optionally create one or more index files; that's another very common practice. For the sake of this example, let's borrow the recently introduced topic channel. This allows collecting all output files created during the pipeline execution having the same semantics (or just the same topic name). We could use this concept (syntax purely hypothetical) to write something like this
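As a purely hypothetical sketch of such a topic-based definition (the `alignments` topic name, process name, and paths are invented for illustration; none of this is released syntax):

```groovy
// hypothetical syntax -- a process tags an output with a topic...
process ALIGN {
    output:
    path('*.bam'), topic: 'alignments'

    // ...
}

// ...and a single workflow-level block maps each topic to a directory
output {
    alignments {
        path 'results/alignments'   // publish all files from the topic here
        index 'alignments.csv'      // optionally build an index file
    }
}
```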
The type information defined at 1 (i.e. the overall …
I like the idea of using the channel topics to hold workflow outputs. One thing that's missing from your example is the metadata. Once we have record types this will be easier, but for now I think we'll need some way to translate the standard tuple-with-meta-map pattern into a record for the index file, so that the metadata can be saved as columns. I think we can add some config to this output DSL to control this mapping, with some sensible default for the most common use case.
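As an illustration of that mapping (sample names and fields invented), a sensible default could flatten each `[ meta, file ]` tuple into one index row, with the meta-map keys becoming columns:

```groovy
// sketch of a default tuple-with-meta-map -> index-record mapping
workflow {
    samples = Channel.of(
        [ [id: 'sampleA', strandedness: 'forward'], file('sampleA.bam') ],
        [ [id: 'sampleB', strandedness: 'reverse'], file('sampleB.bam') ]
    )

    // each tuple becomes one index row: id, strandedness, path
    samples
        .map { meta, bam -> [*:meta, path: bam.name] }
        .view()
}
```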
I believe this needs to be inferred from the …
I feel like the …
From the chat Paolo and I had on Monday, I would say one key aspect of point 1 is that this work package to upgrade the definition of outputs at the workflow level should leave that info at module level, and ensure it is defined in such a way that it can then be processed to match the workflow-level output definition (point 2). I agree, Ben, that ideally the output types should be in the process def. For point 2 above, probably everything could then be sourced from the process def, I wonder.

However, there is also point 3 on the output schema. Again, to preserve modularity, key process-level information should be defined at module level, then gathered and processed together with knowledge from 2 to produce the workflow output schema. Where would this module-level info go? The process definition would probably become too cluttered, and that's where the …

And here comes a second key point that emerged during the Monday chat: is JSON a good format for these schemas? Do we have room to consider alternatives? (nf-core requirements?) YAML could be a more readable and approachable alternative for pipeline devs.

Question to close, going back to Ben's comments: shall we prioritise work to support types in the process definition as a prerequisite to this one? Or can we start working on the re-design of the workflow-level output definition (point 2), and defer type support to later?
I agree with that, but having a static type system in the language will require some time. My proposal is to rely on types the …
I believe the approach proposed allows us to combine modularity with overall output and schema definition.
Think I have answered above.
Honestly I would rather not deal with types at all. I thought we agreed it was a separate effort from workflow outputs. Also, if we use channel topics to facilitate the workflow outputs then we don't really need the …

The 99% use case for metadata is a tuple with a meta map as the first element. When an output is structured in this way, it is trivial to transform it into a record for the index file.
Agree, I'll open a separate ticket to track it independently.
Okay, I think I have enough clarity now to draft a prototype for the outputs DSL proposed by Paolo. I will aim to have it ready by the end of next week.
A comment on Paolo's concept above. Do we need to use the …
Well, I really like the topic approach because it's intuitive and should be easy for users to transition to it from publishDir. Of course, workflow outputs are still defined in part at the process level, but not nearly as much as before. With this outputs DSL, I could search the topic name over the project directory and get a view of the processes that push to that topic, whereas currently I have no such top-level identifier.

One issue with listing the processes is that we need to target specific outputs, not just the process. We could technically do this with the emit names, but this is much more verbose, and I'm willing to bet that many users would lean towards using the topics. Let's try the topic approach (should be easy to implement) and see how it looks with rnaseq.
Ben, thank you for articulating further, helpful for the conversation. I agree with your points, let's start with the topic approach.

I will still share an extra thought that is secondary to the first draft, but I think useful in view of 2nd-pass refinements, and to discuss the broader context.

Default output topic: what about having a workflow always implicitly defining a default output topic?

Along this line, we could define a few output internal defaults in Nextflow:
So, keeping in mind Paolo's original example from above (output index omitted):
We could provide a minimal output specification that leverages defaults:
In most production pipelines people would also resort to multiple …
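As a hypothetical illustration of such defaults (all names invented, syntax speculative), a minimal specification could reduce to little more than the topic names, with each topic published under a conventional directory:

```groovy
// hypothetical: with sensible internal defaults, an output block
// might only need to list the topics to be published
output {
    alignments { }   // default: published under results/alignments, no index
    fastq { }        // default: published under results/fastq
}
```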
Essentially a kind of implicit topic. I thought the same; in any case, let's draft the prototype using explicit topic names, and we can decide later how to refine it.
I have hit an issue with the topic approach. If a process is imported multiple times, e.g. under different aliases or by different workflows, there is no way to distinguish them by topic, because the topic mapping is defined in the process definition. For example, the …

It seems like the topic approach won't work, because it cannot be as precise as the process selectors which are used heavily with publishDir.
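A minimal sketch of the clash (process, topic, and file names invented; the `topic:` output option is the hypothetical syntax from earlier in the thread): both aliases feed the same topic, so their outputs cannot be routed to different destinations downstream.

```groovy
// modules/gunzip.nf -- the topic is fixed in the process definition
process GUNZIP {
    output:
    path('*.txt'), topic: 'unzipped'   // hypothetical topic syntax

    // ...
}

// main.nf -- two inclusions of the same module
include { GUNZIP as GUNZIP_GENOME } from './modules/gunzip'
include { GUNZIP as GUNZIP_GTF    } from './modules/gunzip'

// both aliases publish to the same 'unzipped' topic, so there is
// no way to send GUNZIP_GENOME and GUNZIP_GTF to different output dirs
```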
@marcodelapierre you might have had the right idea. Adapting the DSL example to use process selectors:

```groovy
// nextflow.config
process {
    withName: 'GUNZIP_.*|MAKE_TRANSCRIPTS_FASTA' {
        publishDir = [
            path: { params.save_reference ? "${params.outdir}/genome" : params.outdir },
            mode: params.publish_dir_mode,
            saveAs: { filename -> (filename != 'versions.yml' && params.save_reference) ? filename : null }
        ]
    }
}
```

```groovy
// main.nf
output {
    genome {
        withName 'GUNZIP_.*'
        withName 'MAKE_TRANSCRIPTS_FASTA'
        path 'genome'
        // pattern '...'
        // index { ... }
    }
}
```

We can allow multiple …
Working through rnaseq with this pattern, I have essentially re-created the YAML output schema that I showed previously, but in the DSL:

```groovy
output {
    aligner {
        path params.aligner
        withName '.*:QUANTIFY_STAR_SALMON:SALMON_QUANT'
        withName '.*:QUANTIFY_STAR_SALMON:CUSTOM_TX2GENE'
        withName '.*:QUANTIFY_STAR_SALMON:TXIMETA_TXIMPORT'
        withName '.*:QUANTIFY_STAR_SALMON:SE_.*'
        withName 'NFCORE_RNASEQ:RNASEQ:SAMTOOLS_SORT' {
            when params.save_align_intermeds || params.save_umi_intermeds
            pattern '*.bam'
        }
        withName 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON' {
            when params.save_align_intermeds || params.save_umi_intermeds
            pattern '*.bam'
        }
        withName 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON' {
            when params.save_align_intermeds || params.save_umi_intermeds
            subpath 'umitools/log'
            pattern '*.log'
        }
        // ...
    }
}
```
Tricky point. I don't love the use of …
admittedly I hadn't foreseen the topics vs aliases clash 😅 .. but glad to have given you some inspiration 😉
@pditommaso 👍 let's go ahead with the POC and evaluate the …

We do need a way to match output patterns to the appropriate processes in the workflows .. do you reckon using …
Yeah, I think we need some kind of labeling/annotation system to handle this. It may overlap with the #4510 effort.
This is not something that can be pushed off for later. nf-core pipelines will not use it if they can't differentiate between process inclusions. On top of that, specific processes have some specific config, like toggling based on a param, which cannot be expressed with topics.

The thing about noise is that rnaseq will be complicated no matter how you slice it. It's just a really big pipeline. The metric I would use instead is: what is the essential information that needs to be expressed, and what is the simplest way to express it?

So, we need to be able to capture specific outputs of specific process invocations. The pull approach would be to use process selectors and file patterns as shown above. The push approach would be... you would have to apply some annotations to the process invocation in the workflow. That strikes me as being worse than the process selectors, but I'm happy to be proven wrong.

To Paolo's point about module config, this could simplify things by breaking up the output definition into modules, just as I have done with the config files. Each subworkflow could define its own outputs (now feasible with process selectors). But I'll have to think about how to handle composition and overrides. I initially used …
I agree that's an important point for real-world pipelines, but I don't agree that it's necessary for a POC to validate the concept of …
Well, if it makes no difference for the POC and the two approaches are about the same difficulty to implement, I will use the approach that is less likely to be immediately discarded.
The …

So, I would suggest we go ahead with the POC by using …
I just submitted an issue here for … So I am wondering if the approaches proposed here would be amenable to saving the final metadata about published files into something like the manifest JSON?

On a separate note, I was also planning to investigate moving all of the … so I am also wondering if the proposed methods here would accomplish this and make it easier to put all of these in a single config file? I also noticed the discussion in #2723, which suggests that there could be major changes to the config file format, so I can't help but wonder how that might also play into this.

Due to the huge number of different files being output by pipelines, it seems like documentation always ends up needing to be written just to describe what a pipeline's output files are. Maybe there's some way that a dedicated schema definition used internally by Nextflow could be easily exported and turned into the starting point of such "pipeline output file docs"?
Yes, this is something we want to achieve with this effort. I would be interested to learn more about what metadata would be useful for your use case. Do you have any examples to share?

The main goal is to go beyond the concept of …
@pditommaso update after the team meeting. @bentsherman consulted with SciDev, and as a result he drafted a 2nd POC based on an alternate approach. He is also going to discuss the two POCs with the community during the current hackathon (19-20 March). So now we have:
Option 2 seems an excellent upgrade compared to the 1st iteration in terms of general design. Please Paolo, everyone, feel free to have a look and share feedback on it!
The second one is still far from what was proposed: …
Topics do not work here, I have already explained this
This is more of an optional add-on, it can be easily added at any time if users would find it valuable
These are minor differences that can be easily changed, I wouldn't focus on them.
Paolo, also note that in the 2nd POC …

I agree that we should discuss point 1 a couple of comments above, i.e. the means of aggregating/specifying outputs.
Pinging also @ewels in case you have any comments to share :-)
It's actually straightforward to make the channel topics flexible enough to work here. You just need an operator, e.g.:

```groovy
FOO()
FOO.out.fastq | topic('fastq')
```

This way you can send output channels of a specific process invocation to a topic, just like the …

Then we can support the topic approach in the output DSL:

```groovy
output {
    path("${params.outdir}/fastq") {
        topic 'fastq'
        // just a shortcut for...
        select Channel.topic('fastq')
    }
}
```

The way I see it, channel selectors as shown in nf-core/fetchngs#302 are the core functionality, and things like topic selectors and index file creation are optional enhancements.
Is there an example of solution 2 for rnaseq?
Agree, that could be a nice way to handle it.
It makes a lot of sense, glad to see how some pieces are coming together here ... very good team conversation! :-) And by forwarding channels into topics beforehand, for certain pipelines we would also be able to reduce congestion in the …
Following up on Paolo's feedback, @bentsherman I would suggest going ahead with the implementation and POC along these lines.
Topics add another layer for someone to learn and understand; output channels are direct and similar to patterns they will already have been using. I'm not 100% against using them, but I want to see a clear benefit over just using a channel.
Or make it optional!
@adamrtalbot agree 100%, I see channel topics as a convenience over the otherwise boring but reliable approach of channel selectors. The nice thing about channel selectors is that we're already gonna expand the workflow emits in order to make it easier to import pipelines as subworkflows. Where I think topics could be useful is to publish outputs that you wouldn't typically emit to downstream workflows, i.e. intermediate outputs.

@pditommaso not yet, but I think I will try it once I update the Nextflow implementation. It should be a good way to test the usefulness of topics.
Agree with @adamrtalbot (I think I'm just his echo most of the time) - I think topics are an unnecessary layer for the basic use case. In fact I'm a little fuzzy on them myself.
Team, we are still prototyping this feature and I believe topic channels as suggested in the latest comments should be included. Please go ahead.
@bentsherman @pditommaso a question: I was reviewing our current implementation for …

Right now (see e.g. the docs, and also Ben's example above) the way to refer to topics is with the syntax: …

As a topic is ultimately a type of channel, why not enable a more consistent and concise syntax: …

This would be quite beneficial for the user experience, also enabling people to think of a topic as just another kind of channel. (quoting)

In the end, as Ben said, using a topic channel can help publish intermediate outputs in a concise way. In doing so, I would try to keep the user experience/learning curve as smooth as reasonably possible.
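For reference, the syntax under discussion reads a topic through the channel factory, consistent with Ben's example above (the `versions` topic name here is illustrative):

```groovy
// current style: a topic is referred to via the channel factory
workflow {
    ch_versions = Channel.topic('versions')   // channel fed by any process emitting to 'versions'
    ch_versions.view()
}
```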
Broadly speaking, I believe topic channels need to be called out explicitly due to their "global" nature. However, I would consider changing the topic syntax a non-goal of this feature. Let's focus on the core functionality.
nf-core/fetchngs#275 is working with channel selectors. Topics were not needed.

Took a stab at the rnaseq POC, but it is much less straightforward overall with channel selectors. I might just try to get a rough approximation to start. Maybe do everything with topics first and then see where it can be simplified.

@marcodelapierre FWIW I like the idea of making the topic like a global variable rather than a string literal, or at the very least not allowing topic names to be dynamic. That would allow the compiler to trace the use of a topic throughout the codebase, i.e. "go to definition" / "go to references". I'll keep that in mind for the language server.
Interestingly, the channel selector is just the publish operator that many have requested (#1540), with the topic selector being syntax sugar over that. The difference here is restricting the use of this "publish op" to a dedicated output block rather than allowing it to be used arbitrarily in the workflow logic.
Spun off from #4042 (comment)
Currently, the `publishDir` directive is used to define "workflow outputs". It is a "push" model in which individual processes are responsible for deciding which of their outputs should be workflow outputs, and how to publish them. It was created in the days of DSL1, when there was no concept of workflows / subworkflows.

With DSL2 and the introduction of workflows, publishDir is an unwieldy way to define workflow outputs. For one thing, it is extremely difficult to get a full picture of a workflow's outputs because they are scattered across the process definitions. Additionally, a process could be invoked many different times by different workflows, and each workflow might have a different idea of its own outputs. All in all, it no longer makes sense to define workflow outputs in the process definition.

Instead, it should be possible to specify an output schema for a workflow, which defines the full set of published outputs, as well as metadata, in terms of the processes (and subworkflows) that the workflow invokes.

The two options discussed so far are: … (e.g. an `output` block) which can reference the outputs of subcomponents.

See also #4661 for discussion of where to put additional publish options such as mode and saveAs. The publish mode could become a config option or process directive (e.g. `publishOptions.mode`) to allow for process-specific configuration, or they could become part of the output schema.
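As a sketch of the contrast (process names and paths invented; the workflow-level `output` block is the proposal under discussion here, not a released feature):

```groovy
// push model: each process decides how to publish itself (DSL1-era publishDir)
process ALIGN {
    publishDir "${params.outdir}/bam", mode: 'copy'

    output:
    path '*.bam'

    // ...
}

// pull model (proposed): one workflow-level block collects the outputs,
// keeping the full picture of published files in a single place
output {
    bam {
        withName 'ALIGN'
        path 'bam'
    }
}
```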