Ideas for output data of a pipeline #4

Closed
apetkau opened this issue Aug 21, 2023 · 3 comments

apetkau commented Aug 21, 2023

1. Output JSON file

In order for pipeline results to be loaded by IRIDA Next, an output.json (or output.json.gz) file should be produced that informs IRIDA Next of which data/metadata to store.

This file can be structured like the following:

{
    "files": { ... },
    "metadata": {
        "samples": {
            ...
        }
    }
}
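
As a rough sketch of how a pipeline might produce this file, the helpers below write and read the structure with Python's standard library, choosing gzip compression when the file name ends in .gz. The function names write_output and load_output are hypothetical and not part of IRIDA Next.

```python
import gzip
import json


def write_output(data: dict, path: str) -> None:
    """Write the output structure as JSON; gzip it when the path ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)


def load_output(path: str) -> dict:
    """Read output.json or output.json.gz back into a dictionary."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Note that json.dump never emits trailing commas, so the written file is always strict JSON.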

1.1. Output files

In order to store output files within IRIDA Next, they should be listed in the files section as key/value pairs using the following structure.

{
    "files": {
        "summary": {
            "phylogenetic-tree": "tree.nwk"
        },
        "samples": {
            "SampleA": {
                "assembly_contigs": "output_file.fasta.gz",
                "assembly_quality": "quality_report.zip"
            }
        }
    }
}

The summary keyword lists output files related to all samples/data in the pipeline (e.g., a phylogenetic tree). The samples keyword lists output files associated with a particular sample (e.g., an assembled genome).

Within each of these sections, the key-value pairs let the files of an analysis be accessed by key (e.g., SampleA.assembly_contigs returns output_file.fasta.gz).
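
To illustrate key-based access, here is a minimal sketch that flattens a section into dotted keys mapped to file names. The helper name iter_files is hypothetical, and the sketch assumes that every non-dictionary value in the section is a file path.

```python
from typing import Iterator


def iter_files(section: dict, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (dotted key, file name) pairs for every file in a section."""
    for key, value in section.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested dictionary (e.g. a sample): descend and extend the key.
            yield from iter_files(value, dotted)
        else:
            # Leaf value: assumed to be a file path.
            yield dotted, value


files = {
    "summary": {"phylogenetic-tree": "tree.nwk"},
    "samples": {"SampleA": {"assembly_contigs": "output_file.fasta.gz"}},
}

print(dict(iter_files(files["samples"])))
# {'SampleA.assembly_contigs': 'output_file.fasta.gz'}
```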

1.2. Sample metadata

In order to store sample metadata, the pipeline should structure it as follows:

{
    "metadata": {
        "samples": {
            "SampleA": {
                "key1": "value1",
                "key2": {"subkey1": "value1", "subkey2": "value2"}
            },
            "SampleB": {
                "key1": "value2"
            }
        }
    }
}

2. Storage of data in IRIDA Next

2.1. Metadata

IRIDA Next will store metadata by loading the output.json file and reading the metadata.samples section. The information for each sample is stored against the matching sample in IRIDA Next (e.g., the "SampleA" key below is used to look up "SampleA" in IRIDA Next, and the contents of its JSON dictionary are merged with any existing metadata for SampleA).

"SampleA": {
    "key1": "value1",
    "key2": {"subkey1": "value1", "subkey2": "value2"}
},
"SampleB": {
    "key1": "value2"
}

There will be a parallel table that records the source of each field above:

"SampleA": {
    "key1": {
        "source": "analysis",
        "source_id": "1234"
    },
    "key2": {
        "source": "analysis",
        "source_id": "1234"
    }
},
"SampleB": {
    "key1": {
        "source": "analysis",
        "source_id": "1234"
    }
}
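
The merge and the parallel source table could be sketched as follows. This is only an illustration under several assumptions: plain dictionaries stand in for IRIDA Next's storage, the merge is shallow (a pipeline value replaces an existing value with the same top-level key), samples not yet present are created, and the function name store_metadata is hypothetical.

```python
def store_metadata(existing: dict, sources: dict, output: dict, analysis_id: str) -> None:
    """Merge metadata.samples into existing metadata and record each field's source."""
    samples = output.get("metadata", {}).get("samples", {})
    for sample, fields in samples.items():
        # Assumption: shallow merge, new values overwrite same-named keys.
        existing.setdefault(sample, {}).update(fields)
        sample_sources = sources.setdefault(sample, {})
        for key in fields:
            # Only fields written by this analysis get a source entry.
            sample_sources[key] = {"source": "analysis", "source_id": analysis_id}


existing = {"SampleA": {"key1": "old", "collected_by": "lab"}}
sources = {}
output = {
    "metadata": {
        "samples": {
            "SampleA": {
                "key1": "value1",
                "key2": {"subkey1": "value1", "subkey2": "value2"},
            },
            "SampleB": {"key1": "value2"},
        }
    }
}
store_metadata(existing, sources, output, "1234")
```

After the call, the pre-existing collected_by field on SampleA is kept and has no entry in the source table, since it did not come from this analysis.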

apetkau commented Aug 21, 2023

Idea for defining output files

One idea for defining the output files listed in the files section of output.json is to support Nextflow Tower's tower.yml file, which defines the reports to visualize: https://help.tower.nf/22.2/reports/overview/#limitations


apetkau commented Aug 21, 2023

1.1. Output files (comment)

One comment on output files: some pipelines may want output files associated with divisions other than samples. For example, wg/cgMLST clustering may want to associate output clusters (a tree and other information) with each MLST scheme passed to the pipeline, as in the pipeline phac-nml/nf-pipelines#1, which breaks up analysis results by MLST profile.

One option here is to reserve only the words "summary" and "samples" and allow any other division names. For example:

{
    "files": {
        "summary": { ... },
        "samples": { ... },
        "mlst_profiles": {
            "listeria_cgmlst": {
                "clusters": "clusters.text.gz"
            }
        }
    }
}

The clusters output file for this particular analysis pipeline could then be accessed by mlst_profiles.listeria_cgmlst.clusters.
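
A dotted lookup of this kind might be resolved as in the sketch below. The function name resolve_file is hypothetical, and the sketch assumes that keys themselves never contain a "." (file names, being values, may).

```python
def resolve_file(files: dict, dotted_key: str) -> str:
    """Follow a dotted key such as 'mlst_profiles.listeria_cgmlst.clusters'."""
    node = files
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    if not isinstance(node, str):
        # The key stopped at a division or sample rather than a file.
        raise KeyError(dotted_key)
    return node


files = {
    "summary": {},
    "samples": {},
    "mlst_profiles": {"listeria_cgmlst": {"clusters": "clusters.text.gz"}},
}

print(resolve_file(files, "mlst_profiles.listeria_cgmlst.clusters"))
# clusters.text.gz
```

Because the lookup only walks nested dictionaries, it treats "summary", "samples", and any additional division the same way.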


apetkau commented Nov 3, 2023

Added this in #7
