Ideas for output data of a pipeline #4

apetkau · 2023-08-21T15:56:04Z

1. Output JSON file

In order for pipeline results to be loaded by IRIDA Next, an output.json (or output.json.gz) file should be produced that informs IRIDA Next of which data/metadata to store.

This file can be structured like the following:

{
    "files": { ... },
    "metadata": {
        "samples": {
            ...
        }
    }
}
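As a rough illustration of how a pipeline step might produce this file, here is a minimal Python sketch. The function names `write_output` and `read_output` and the empty `example` dictionary are assumptions made for illustration; they are not part of IRIDA Next or of any pipeline. The only behaviour taken from the description above is that the file may be plain (output.json) or gzip-compressed (output.json.gz).

```python
import gzip
import json
from pathlib import Path

# Hypothetical, empty skeleton following the top-level layout described above.
example = {
    "files": {"summary": {}, "samples": {}},
    "metadata": {"samples": {}},
}


def write_output(data: dict, path: Path) -> None:
    """Write `data` as JSON, gzip-compressed when the name ends in .gz."""
    text = json.dumps(data, indent=4)
    if path.suffix == ".gz":
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        path.write_text(text, encoding="utf-8")


def read_output(path: Path) -> dict:
    """Load output.json or output.json.gz back into a dictionary."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `write_output(example, Path("output.json.gz"))` would write the compressed variant. Note that `json.dumps` never emits trailing commas, so the result is always strictly valid JSON.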

1.1. Output files

To store output files within IRIDA Next, list them in the files section as key/value pairs using the following structure:

{
    "files": {
        "summary": {
            "phylogenetic-tree": "tree.nwk"
        },
        "samples": {
            "SampleA": {
                "assembly_contigs": "output_file.fasta.gz",
                "assembly_quality": "quality_report.zip"
            }
        }
    }
}

The summary keyword lists output files related to all samples/data in the pipeline (e.g., a phylogenetic tree). The samples keyword lists output files associated with a particular sample (e.g., an assembled genome).

Within each of these sections, every file is stored as a key/value pair so that it can be retrieved for an analysis by its key (e.g., SampleA.assembly_contigs returns output_file.fasta.gz).
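The key-based access described above could look roughly like the following Python sketch. The helper names `sample_file` and `summary_file` are hypothetical and only illustrate the lookup; how IRIDA Next actually exposes these files is not specified here.

```python
# Example "files" section, using the file names from the structure above.
output = {
    "files": {
        "summary": {"phylogenetic-tree": "tree.nwk"},
        "samples": {
            "SampleA": {"assembly_contigs": "output_file.fasta.gz"},
        },
    }
}


def sample_file(output: dict, sample: str, key: str) -> str:
    """Return the file registered under `key` for one sample."""
    return output["files"]["samples"][sample][key]


def summary_file(output: dict, key: str) -> str:
    """Return a pipeline-wide file registered under `key`."""
    return output["files"]["summary"][key]
```

With this data, `sample_file(output, "SampleA", "assembly_contigs")` gives `"output_file.fasta.gz"`, and an unknown sample or key raises `KeyError`.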

1.2. Sample metadata

To store sample metadata, the pipeline should structure it as follows:

{
    "metadata": {
        "samples": {
            "SampleA": {
                "key1": "value1",
                "key2": {"subkey1": "value1", "subkey2": "value2"}
            },
            "SampleB": {
                "key1": "value2"
            }
        }
    }
}

2. Storage of data in IRIDA Next

2.1. Metadata

Metadata will be stored in IRIDA Next by loading the output.json file and reading its metadata.samples section. The information for each entry will be stored with the matching sample (e.g., the "SampleA" key below will be used to look up "SampleA" in IRIDA Next, and the contents of its JSON dictionary will be merged with any existing metadata for SampleA).

"SampleA": {
    "key1": "value1",
    "key2": {"subkey1": "value1", "subkey2": "value2"}
},
"SampleB": {
    "key1": "value2"
}

A parallel table will store metadata about the source of each field above:

"SampleA": {
    "key1": {
        "source": "analysis",
        "source_id": "1234"
    },
    "key2": {
        "source": "analysis",
        "source_id": "1234"
    }
},
"SampleB": {
    "key1": {
        "source": "analysis",
        "source_id": "1234"
    }
}
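The merge and the parallel source table could be sketched in Python as below. This is only an illustration under stated assumptions: `store_metadata` is a hypothetical name, the merge is assumed to be shallow (a new top-level key replaces an existing key of the same name), and plain dictionaries stand in for whatever storage IRIDA Next actually uses. The `"organism"` field is made-up pre-existing metadata.

```python
def store_metadata(existing: dict, sources: dict, output: dict, analysis_id: str) -> None:
    """Merge metadata.samples into `existing` and record where each field came from.

    Assumption: shallow merge, so incoming top-level keys overwrite existing ones.
    """
    for sample, fields in output["metadata"]["samples"].items():
        # Merge the new fields with any metadata already held for this sample.
        existing.setdefault(sample, {}).update(fields)
        # Record the source of every field written by this analysis.
        sample_sources = sources.setdefault(sample, {})
        for key in fields:
            sample_sources[key] = {"source": "analysis", "source_id": analysis_id}


output = {
    "metadata": {
        "samples": {
            "SampleA": {
                "key1": "value1",
                "key2": {"subkey1": "value1", "subkey2": "value2"},
            },
            "SampleB": {"key1": "value2"},
        }
    }
}

existing = {"SampleA": {"organism": "hypothetical existing value"}}
sources = {}
store_metadata(existing, sources, output, "1234")
```

After the call, `existing["SampleA"]` holds the old `organism` field alongside `key1` and `key2`, while `sources` only lists the fields that this analysis wrote.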


apetkau · 2023-08-21T16:15:48Z

Idea for defining output files

One idea for defining the output files listed in the files section of output.json is to support Nextflow Tower's tower.yml file, which defines the reports to visualize: https://help.tower.nf/22.2/reports/overview/#limitations

apetkau · 2023-08-21T18:37:03Z

1.1. Output files (comment)

One comment on output files: some pipelines may want output files associated with divisions other than samples. For example, wg/cgMLST clustering may want to associate output clusters (a tree and other information) with each MLST scheme passed to the pipeline, as in the pipeline at phac-nml/nf-pipelines#1, which breaks up analysis results by MLST profile.

One option here is to reserve only the words "summary" and "samples", and allow any other divisions alongside them. For example:

{
    "files": {
        "summary": { ... },
        "samples": { ... },
        "mlst_profiles": {
            "listeria_cgmlst": {
                "clusters": "clusters.text.gz"
            }
        }
    }
}

The clusters output file for this particular analysis pipeline could then be accessed by mlst_profiles.listeria_cgmlst.clusters.
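A dotted path such as this one could be resolved against the files section with a small helper like the Python sketch below. The function name `resolve` is hypothetical, and the sketch assumes that keys never contain a literal "." character, which the proposal above does not state.

```python
def resolve(files: dict, path: str):
    """Walk the nested `files` dictionary following a dot-separated path.

    Assumption: no key contains a "." itself.
    """
    node = files
    for part in path.split("."):
        node = node[part]  # raises KeyError if a division or key is missing
    return node


# Only the custom division from the example above is filled in here.
files = {
    "mlst_profiles": {
        "listeria_cgmlst": {"clusters": "clusters.text.gz"},
    },
}

clusters = resolve(files, "mlst_profiles.listeria_cgmlst.clusters")
```

Here `clusters` is `"clusters.text.gz"`. The same helper would work unchanged for the reserved "summary" and "samples" divisions, since it only relies on the nesting.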

apetkau · 2023-11-03T15:29:10Z

Added this in #7

apetkau mentioned this issue Aug 29, 2023

Ideas for integration of pipelines with IRIDA Next #5

Closed

apetkau mentioned this issue Nov 3, 2023

Updated readme with output structure #7

Merged

apetkau closed this as completed Nov 3, 2023

apetkau added the irida-next-integration and output labels Jan 17, 2024
