Workflow output definition #275
Conversation
Signed-off-by: Ben Sherman <[email protected]>
> Clarifying: the schema added in this PR is the output equivalent of the samplesheet schema.

I had considered this schema to be defining one of the inputs/outputs of a pipeline, whereas the …
This is unique to fetchngs - I would just ignore it and pretend it doesn't exist.
It's not clear to me what this adds.
Where does this file fit in?
@evanfloden this output schema is like the …

@adamrtalbot At the very least, this output schema should be used to validate any samplesheets that are produced, and to allow external tools like Seqera Platform to inspect a workflow's expected outputs, e.g. for the purpose of chaining pipelines. What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.
Not clear to me either. fetchngs uses an exec process to do this, which I think is quite an overhead for every pipeline developer. Perhaps something like this could work: …

Although it's not clear how you go from channel contents to file contents.
I think that could work. As long as the channel emits maps (or records, once we support record types properly), generating the samplesheet is trivial.
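For what it's worth, a minimal sketch of what that could look like with today's operators, assuming the channel emits maps with a fixed set of keys (the channel name, column names, and output directory here are made up for illustration):

```nextflow
// Hypothetical sketch: write a channel of maps out as a samplesheet CSV.
// Assumes every map has the same keys, in the same order as the header row.
ch_samples
    .map { row -> row.values().join(',') }
    .collectFile(
        name: 'samplesheet.csv',
        seed: 'sample,fastq_1,fastq_2',  // assumed header columns
        newLine: true,
        storeDir: params.outdir
    )
```

The `seed` provides the header as the first chunk of the collected file; each map becomes one CSV row.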
This looks like it's going in the right direction. One thing I found awful in the current schema is the JSON Schema syntax, which is totally unreadable. I wonder if we should look into a different, more human-friendly system.
Possible alternatives: …
My biggest concern with this is how unwieldy that file is going to get when we go from defining an output schema for a very simple pipeline like fetchngs to one for rnaseq. This is why I was suggesting we try to incorporate the publishing logic and output file definition at the module/subworkflow/workflow level and then combine them somehow, rather than having one single massive file. I also suspect there will still need to be some sort of "conversion" layer or plugin that can take this output schema file and generate custom CSVs/JSONs etc. which can be used as input downstream for other pipelines. Ideally, this plugin could be invoked outside of the pipeline context.
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
I don't think that we should do this; it breaks how JSON Schema validation works. The beauty of using the standard is that very many platforms and libraries use the syntax in the same way. You have a parsed object in memory (be it …).

If we start merging samplesheet schemas inside the output schema, we can no longer use this for validation. We would have to validate the output files with subsets of the schema, and validate the list of output files with a subset of the schema. If you have to break the schema down to use it, it becomes custom and a lot less useful IMHO. Separate files are undoubtedly more verbose, but they're also much more portable. This is why the …
@pditommaso YAML is fine (and Ben's YAML conversion here is hopefully a lot easier to read), but my strong preference is to stick as close as possible to JSON Schema syntax. To clarify: JSON Schema can be written in YAML (or TOML, or really any format), as long as it's laid out with the structure and keywords of JSON Schema. The benefit of using it is that there are about a bazillion different implementations, so it just works everywhere.

In contrast, the Yamale syntax you linked to seems to be a Python tool with its own schema syntax, so every part of our toolchain would need to build its own parser and validation library for that syntax. The YAML Schema you linked to seems to still be valid JSON Schema, just in YAML format and with a couple of extra keys. That would still work with any JSON Schema implementation, so that'd be fine. But I'm not sure that we're doing anything complex enough to need those extra keywords, to be honest.
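For illustration, here is what a tiny samplesheet schema looks like when the standard JSON Schema keywords are simply serialized as YAML (the field names are made up, not taken from this PR; `format: file-path` follows the nf-core samplesheet convention):

```yaml
# Plain JSON Schema, written in YAML: any JSON Schema validator that
# accepts a parsed object can consume this after a YAML load.
type: object
properties:
  sample:
    type: string
  fastq_1:
    type: string
    format: file-path
  fastq_2:
    type: string
    format: file-path
required:
  - sample
  - fastq_1
```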
Signed-off-by: Ben Sherman <[email protected]>
|
I converted the JSON schema to YAML just to see what it looks like, and it is indeed much simpler. If JSON Schema can be used with YAML "schemas" just the same, that seems like the best approach to me, even for the …
I also added a prototype for the workflow output DSL (see nextflow-io/nextflow#4784). It allows you to define an arbitrarily nested directory structure with …

Another idea I considered was being able to select channels from the top-level workflow emits, but that is slightly more complicated to implement (and adds some boilerplate to the pipeline code), whereas I found I could get the job done with just the process outputs. I thought about having some DSL method like …
At this point, the output DSL is concerned only with mapping process outputs to a directory structure. Where output schemas could come in is as an optional refinement to describe the structure of specific files:

```nextflow
select 'SRA_TO_SAMPLESHEET', pattern: 'samplesheet.csv', schema: 'schema_samplesheet.json'
select 'SRA_TO_SAMPLESHEET', pattern: 'id_mapping.csv', schema: 'schema_mapping.json'
```

So it's still up to the user to generate the output file, and they might even be able to use the same output schema to do it (like Adam's …).

Given this example, I agree with @ewels that it makes more sense to keep the schema for each file separate. I'm imagining a nextflow command to generate some kind of global schema from this output definition (i.e. run by the pipeline developer before a version release) for use by external tools.
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
See nf-core/rnaseq#1227 for a similar prototype with rnaseq. It is not for the faint of heart.
Signed-off-by: Ben Sherman <[email protected]>
Doesn't this take away a major feature, like publishing files to a folder based on sample name? I think there are quite a few examples where some field of meta, for example, is used in the publish path.
Good point, being able to publish files as …
For what it's worth, this is another good example to drive publishing from channels rather than processes, because then the vals would be in scope. You can see that in action here: https://github.com/nf-core/fetchngs/pull/302/files
I think I have seen this pattern before, though I couldn't find an example of it in rnaseq. It is a consequence of decoupling the publishing from the task execution. We might be able to recover it in #302 by allowing the path to reference channel items, e.g. given a channel of files with metadata, publish each file to a path based on the meta id, but I'm not sure what that syntax would look like.
@adamrtalbot good point, with channel selectors we could do something like this:

```nextflow
path( "results" ) {
    select NFCORE_RNASEQ.out.bam, saveAs: { meta, bam -> "${meta.id}/bam/${bam.name}" }
}
```

The only thing is, I was imagining the selected channel would just provide paths, but if it provides tuples/records with files and metadata, it's not obvious how the file elements are pulled out of the tuple.
Isn't this how it is now? Only …
It's unusual in nf-core, but quite common elsewhere.
My thought was: capture all the contents of the channel, …
Signed-off-by: Ben Sherman <[email protected]>
Since there is a lot of support for the channel selectors, but most of the discussion is here, I closed the other PR and migrated this one to the channel selectors. I found that topics weren't really needed for fetchngs. We'll see what happens with rnaseq.
@mahesh-panchal yes, because of how process outputs are declared, Nextflow knows which tuple elements are paths and collects them accordingly. Since a generic channel doesn't have such a declaration, for now I think we can follow @adamrtalbot's suggestion and traverse whatever data structure the channel spits out.

Once we have support for record types and better type inference in the language, we'll be able to infer the incoming type and, if it's a record type, scan the type definition for path elements. That is much more robust, but not a blocker for the quick-and-dirty solution.
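A quick-and-dirty version of that traversal might look like the following sketch (the helper name is made up; this is not the actual implementation):

```nextflow
// Hypothetical sketch: recursively collect every Path from whatever
// structure a channel emits (tuples/lists, maps, or bare values).
def collectPaths(value) {
    if( value instanceof Path )
        return [ value ]
    if( value instanceof Collection )
        return value.collectMany { el -> collectPaths(el) }
    if( value instanceof Map )
        return value.values().collectMany { el -> collectPaths(el) }
    return []   // non-path scalars (meta fields, etc.) are ignored
}
```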
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
main.nf (outdated diff)
```nextflow
'fastq' {
    from 'fastq'
}
```
OK, so the first `fastq` is the path, and the second is the topic. Can it be a bit more explicit? Thinking something like this at least:
Suggested change:

```nextflow
'fastq/' {
    from 'fastq'
}
```
you can specify a trailing slash if you want
I'd like that, it makes the path more explicit, but I'd rather even have a `path` and a `topic` specified somewhere, so as not to get confused. I think it's better to be a bit more explicit.
That's fair. This case is simple enough to be confusing, but if you used a regular emit instead of a topic, or you had a more complex directory structure like in rnaseq, it would be clearer.

I did propose calling it `fromTopic` to help denote it as a topic, but Paolo was in favor of not having too many different keywords. We could revisit this.
I'm not a massive fan of topics; I think they are a little bit hidden and it's hard to identify where they originated. However, I see that there is value in them, I just wouldn't use them much myself. But the general principle here looks great!
@adamrtalbot totally fair. fetchngs is simple enough that you could do it without topics, and you are certainly free to scrap the topics in the final implementation. I used them here as an exercise, to show how to access those intermediate channels without adding them to the emits. In practice, users will be able to use any mixture of emits/topics based on their preference.
Signed-off-by: Ben Sherman <[email protected]>
```nextflow
publish:
ch_fastq                   >> 'fastq/'
ASPERA_CLI.out.md5         >> 'fastq/md5/'
SRA_FASTQ_FTP.out.md5      >> 'fastq/md5/'
SRA_RUNINFO_TO_FTP.out.tsv >> 'metadata/'
ch_versions_yml            >> 'pipeline_info/'
ch_samplesheet             >> 'samplesheet/'
ch_mappings                >> 'samplesheet/'
ch_sample_mappings_yml     >> 'samplesheet/'
```
By default, the "topic" name will be used as the publish path, which makes fetchngs really simple. No need to define any rules in the output DSL, just the base directory and publish mode, then all of these channels will be published exactly as it says.
These names don't have to be paths; they can also be arbitrary names which you would then use in the output DSL to customize publish options for that name. I'll demonstrate this with rnaseq. You can think of the names as "topics" if you want, but at this point I'm not even using topics under the hood, because they aren't necessary.
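To make that concrete, here is a sketch of how a name could be customized in the output DSL; the block structure and option names are assumptions based on the prototype, not a final API:

```nextflow
// Hypothetical sketch: each published name defaults to using the name
// itself as its publish path; a block in the output DSL can override
// options for a given name.
output {
    directory 'results'
    mode 'copy'

    'fastq' {
        path 'fastq/raw'    // override the default path (assumed option)
        mode 'link'         // per-name publish mode (assumed option)
    }
}
```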
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Folding into #312.
This PR is a prototype for defining an output schema for a Nextflow pipeline. See nextflow-io/nextflow#4669 and nextflow-io/nextflow#4670 for original discussions.
The meta workflow that we are targeting is:
fetchngs -> rnaseq -> differentialabundance
In other words, we want to eliminate the manual curation of samplesheets between each pipeline. To do this, the output schema should "mirror" the params schema: it should describe the outputs as a collection of samplesheets.
Here is the tree of pipeline outputs for the fetchngs `test` profile: …

From what I can tell, the `samplesheet.csv` contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs, and everything else is duplication.

The initial output schema basically describes this samplesheet in a similar manner to the `input_schema.json` file. This particular output schema should closely resemble the `input_schema.json` for nf-core/rnaseq.

What I'd like to do from here is collect feedback on this approach -- what else is needed to complete the output schema for this pipeline? Then we can think about how to operationalize it in Nextflow -- should Nextflow automatically generate the samplesheet from the schema? How does the schema interact with the publish mechanism? How do we collect metadata which normally can't be published directly, but only through files?
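As a starting point for that feedback, here is one way such an output schema might be laid out in YAML; the structure and field names are illustrative only, not the schema committed in this PR:

```yaml
# Hypothetical sketch: describe the pipeline outputs as a collection of
# samplesheets, mirroring the params schema. Each entry pairs a published
# file with a JSON Schema describing its rows.
outputs:
  samplesheet:
    path: samplesheet/samplesheet.csv
    format: csv
    schema:
      type: object
      properties:
        sample:  { type: string }
        fastq_1: { type: string, format: file-path }
        fastq_2: { type: string, format: file-path }
        md5_1:   { type: string }
      required: [sample, fastq_1]
```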