make workflow named outputs show up in the manifest #23

Closed
stevekm opened this issue Mar 7, 2024 · 4 comments

@stevekm
Contributor

stevekm commented Mar 7, 2024

Right now the manifest JSON output looks something like this:

{
    "pipeline": null,
    "published": [
        {
            "source": "/pipeline/work/f6/48b7d3a739069878e46051b5a7bbc4/file1.txt",
            "target": "/pipeline/output/file1.txt",
            "publishingTaskId": "16"
        },
        {
            "source": "/pipeline/work/d6/5aede3bcd70eb8ac3fff17b60c033b/file2.txt",
            "target": "/pipeline/output/file2.txt",
            "publishingTaskId": "18"
        },
 ...

However, I am able to define my pipeline's main workflow section with named outputs, like this:

// main.nf
nextflow.enable.dsl=2

include { MY_SUBWORKFLOW } from './workflows/do_things.nf'

workflow {
    main:
    samples_ch = Channel.from(file(params.samplesheet))

    MY_SUBWORKFLOW(samples_ch)

    emit:
    myfiles = MY_SUBWORKFLOW.out.allmyfiles
}

It would be really helpful if we could keep a label such as myfiles associated with the published files, maybe something like this:

{
    "pipeline": null,
    "published": [
        {
            "source": "/pipeline/work/f6/48b7d3a739069878e46051b5a7bbc4/file1.txt",
            "target": "/pipeline/output/file1.txt",
            "publishingTaskId": "16",
            "emit": "myfiles"
        },
        {
            "source": "/pipeline/work/d6/5aede3bcd70eb8ac3fff17b60c033b/file2.txt",
            "target": "/pipeline/output/file2.txt",
            "publishingTaskId": "18",
            "emit": "myfiles"
        },
 ...

This would be really helpful for downstream processing, because you could parse the manifest JSON and identify specific files. For example, if you had an emit channel for MultiQC files, multiqc_ch, you could select all the files labeled multiqc_ch and more easily pass them into some other process, such as a chained post-processing workflow.
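As a rough illustration of that downstream use, the sketch below groups published files by label. It assumes the "emit" field proposed above, which the manifest does not contain today. The function name group_published_by_emit and the sample paths are hypothetical.

```python
import json
from collections import defaultdict


def group_published_by_emit(manifest: dict) -> dict:
    """Map each emit label to the list of published target paths.

    Assumes entries in "published" may carry the proposed "emit" key;
    entries without it are grouped under None.
    """
    groups = defaultdict(list)
    for entry in manifest.get("published", []):
        groups[entry.get("emit")].append(entry["target"])
    return dict(groups)


# Hypothetical manifest in the proposed format (paths are made up).
manifest_text = """
{
    "pipeline": null,
    "published": [
        {"source": "/pipeline/work/aa/file1.txt",
         "target": "/pipeline/output/file1.txt",
         "publishingTaskId": "16",
         "emit": "myfiles"},
        {"source": "/pipeline/work/bb/file2.txt",
         "target": "/pipeline/output/file2.txt",
         "publishingTaskId": "18",
         "emit": "myfiles"},
        {"source": "/pipeline/work/cc/report.html",
         "target": "/pipeline/output/report.html",
         "publishingTaskId": "20",
         "emit": "multiqc_ch"}
    ]
}
"""

groups = group_published_by_emit(json.loads(manifest_text))
multiqc_files = groups.get("multiqc_ch", [])
```

Here multiqc_files holds only the targets labeled multiqc_ch, which is the list a post-processing step would consume.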

@pinin4fjords

I noticed that under the tasks section of the manifest JSON there is already an emit field in the outputs list for each task. However, in all my pipelines so far the value there is null. I'm not sure what it was meant to be used for, but this functionality might overlap.
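One way to check this across a manifest is to count how many task outputs have emit set. This is only a sketch: the thread does not show the layout of the tasks section, so the code assumes "tasks" is either a mapping of task ID to task record or a list of records, each with an "outputs" list. The name summarize_task_emits and the sample data are hypothetical.

```python
def summarize_task_emits(manifest: dict) -> dict:
    """Count task outputs by whether their "emit" value is set.

    Assumption: "tasks" is a dict (task id -> record) or a list of records,
    and each record has an "outputs" list of dicts with an "emit" key.
    """
    tasks = manifest.get("tasks") or {}
    records = tasks.values() if isinstance(tasks, dict) else tasks
    counts = {"set": 0, "null": 0}
    for task in records:
        for output in task.get("outputs") or []:
            key = "null" if output.get("emit") is None else "set"
            counts[key] += 1
    return counts


# Hypothetical tasks section where every emit is null, as described above.
example = {
    "tasks": {
        "16": {"outputs": [{"emit": None}, {"emit": None}]},
        "18": {"outputs": [{"emit": None}]},
    }
}
counts = summarize_task_emits(example)
```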

@bentsherman
Member

I think the original creator of nf-prov tried to associate published outputs with the process emit, but maybe they never got it to work. As long as a file is emitted by any process output channel, it can be published, but it could be emitted by multiple process outputs.

But the problem with your request is that published outputs are not related to workflow emits at all. More fundamentally, I'm not sure that the provenance manifest is the best way to facilitate the chaining of pipelines.

I think we need some kind of workflow output schema which can be easily matched to the input schema of a downstream workflow, which does not involve workflow emits at all.

Alternatively, you could write a "meta-pipeline" which imports entire pipelines as modules and chains them together with regular dataflow logic. That would use the workflow takes/emits but not the input/output schemas, which in this case would be an unnecessary extra step. I am working on a proof-of-concept for this using fetchngs+rnaseq, hope to finish it at the hackathon next week.

@pinin4fjords

Alternatively, you could write a "meta-pipeline" which imports entire pipelines as modules and chains them together with regular dataflow logic.

This should definitely be a thing. The main blockers on this (in nf-core at least) have been config-based, and @drpatelh's related plans should help.

@stevekm
Contributor Author

stevekm commented Mar 13, 2024

Honestly, I am not a big fan of writing "meta-pipelines", because it seems you would have to write one for every combination of pipelines you want to chain together.

I feel like this is the better approach:

I think we need some kind of workflow output schema which can be easily matched to the input schema of a downstream workflow

(which feels related to nextflow-io/nextflow#4670)

An idea floated elsewhere was some mechanism by which you could chain pipelines like this:

nextflow run main1.nf -output-schema-stdout ... |  nextflow run main2.nf -input-schema-stdin

The topic of "pipeline chaining" per se is likely out of scope for this issue and repo; maybe it can be moved to some other location. But if "named outputs" were available in nf-prov (or elsewhere), then at least we could more easily hack it together ourselves :)

Feel free to close this issue if you think there's a better place for the discussion, thanks.
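A minimal sketch of what such a hack might look like: collect the labeled files from the first pipeline's manifest and write them to a JSON params file for the second pipeline. It assumes the proposed "emit" field exists in the manifest. The name write_params_file and the downstream parameter input_files are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_params_file(manifest: dict, label_to_param: dict, out_path: Path) -> dict:
    """Write a params JSON mapping downstream parameter names to published files.

    label_to_param maps an emit label (proposed field) to the name of a
    parameter in the downstream pipeline (hypothetical names).
    """
    params = {name: [] for name in label_to_param.values()}
    for entry in manifest.get("published", []):
        param = label_to_param.get(entry.get("emit"))
        if param is not None:
            params[param].append(entry["target"])
    out_path.write_text(json.dumps(params, indent=2))
    return params


# Hypothetical manifest using the proposed "emit" label.
manifest = {
    "pipeline": None,
    "published": [
        {"target": "/pipeline/output/file1.txt", "publishingTaskId": "16", "emit": "myfiles"},
        {"target": "/pipeline/output/file2.txt", "publishingTaskId": "18", "emit": "myfiles"},
        {"target": "/pipeline/output/other.txt", "publishingTaskId": "19", "emit": "unused"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    params_path = Path(tmp) / "params.json"
    params = write_params_file(manifest, {"myfiles": "input_files"}, params_path)
    written = json.loads(params_path.read_text())
```

The resulting file could then be handed to the second pipeline with Nextflow's existing -params-file option, provided that pipeline accepts a list of paths for the chosen parameter.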

@bentsherman
Member

I see you have commented on nextflow-io/nextflow#4670, so let's move the discussion over there. Your feedback might help us finalize the design of the workflow output schema, which should be the easiest way to chain pipelines.


3 participants