
Add nomenclature creation pipeline #1

Open
apetkau opened this issue Aug 19, 2023 · 1 comment
Labels
pipeline An issue describing a pipeline

Comments

@apetkau
Member

apetkau commented Aug 19, 2023

1. Purpose

The nomenclature creation pipeline will generate a nomenclature from a collection of (cg/wg)MLST allelic profiles. It will make use of https://github.com/phac-nml/genomic_address_service.

2. Input

2.1. Allelic profiles

The main input will be a collection of allelic profiles passed as a CSV file via the --input parameter. This CSV file will be structured as follows:

profilesheet.csv:

id profiles_format allele_profiles
profile_identifier1 csv /path/to/listeria.allele_profiles
profile_identifier2 parquet /path/to/salmonella.allele_profiles

The following will be valid fields for the input.

  • id: An identifier for the allelic profiles, could be the (cg/wg)MLST scheme name, or some other identifier.
  • profiles_format: The format of the profiles. One of csv or parquet (could be auto-detected from file extension or data as well).
  • allele_profiles: The allele profiles file, as either a CSV file, or a parquet file.
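The field descriptions above could be validated when loading the profilesheet. A minimal sketch in Python, assuming the `read_profilesheet` name and extension-based format auto-detection are illustrative choices rather than part of the pipeline:

```python
import csv
from pathlib import Path

VALID_FORMATS = {"csv", "parquet"}

def read_profilesheet(path):
    """Parse profilesheet.csv into a list of row dicts, validating each field.

    If profiles_format is empty, fall back to the file extension of
    allele_profiles (the auto-detection mentioned above).
    """
    rows = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            fmt = row.get("profiles_format") or Path(row["allele_profiles"]).suffix.lstrip(".")
            if fmt not in VALID_FORMATS:
                raise ValueError(f"unsupported profiles_format: {fmt!r}")
            rows.append({"id": row["id"],
                         "profiles_format": fmt,
                         "allele_profiles": row["allele_profiles"]})
    return rows
```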

2.1.1. Allele profiles (CSV)

The following example format will be used for the allele profiles for the CSV format (both uncompressed and gzipped files will be supported).

id loci1 loci2 ... lociN
SampleA be76 af5d ce78 d877a
ID10 af5d be76 ? d877a

Missing data will be represented as one of: ?, 0, -, or a space.
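Since several markers are accepted for missing data, normalizing them to a single canonical token simplifies downstream comparison. A small sketch (the `normalize_allele` helper and the choice of `-` as the canonical token are assumptions, not part of the spec):

```python
# Accepted missing-data markers from the spec: ?, 0, -, or a space.
# A stripped empty string covers the "space" case.
MISSING = {"?", "0", "-", ""}

def normalize_allele(call):
    """Map any accepted missing-data marker to a single canonical token."""
    call = call.strip()
    return "-" if call in MISSING else call
```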

Other format structures are described in

3. Steps

3.1. Deduplication

If the allele profiles referenced by the profilesheet.csv are identified by samples, then deduplication may be required to collapse samples with identical profiles.

3.1.1. Input

The input is an allele profile file like the following.

id loci1 loci2 ... lociN
SampleA be76 af5d ce78 d877a
SampleB be76 af5d ce78 d877a

3.1.2. Output

The output consists of a deduplicated profiles file and a mapping back to the original profiles.

profiles.deduplicated.csv

id loci1 loci2 ... lociN
123abc be76 af5d ce78 d877a

profiles.samples.json:

{
    "123abc": ["SampleA", "SampleB"]
}
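The deduplication step above could be sketched as follows. Hashing the profile to produce the deduplicated identifier (as the `123abc`-style ids suggest) is an assumption here; the actual id scheme is not specified:

```python
import hashlib

def deduplicate(rows, loci):
    """Collapse samples with identical allele profiles.

    rows: list of dicts mapping 'id' plus each locus name to an allele call.
    Returns (deduplicated rows, mapping of deduplicated id -> original sample ids),
    corresponding to profiles.deduplicated.csv and profiles.samples.json.
    """
    mapping = {}
    deduplicated = []
    for row in rows:
        profile = tuple(row[locus] for locus in loci)
        # Short hash of the profile serves as the deduplicated identifier.
        key = hashlib.sha256("|".join(profile).encode()).hexdigest()[:6]
        if key not in mapping:
            mapping[key] = []
            deduplicated.append({"id": key, **dict(zip(loci, profile))})
        mapping[key].append(row["id"])
    return deduplicated, mapping
```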

3.2. Construction of distance matrix

For every entry in the profilesheet.csv, a separate distance matrix will be constructed. This will make use of https://github.com/phac-nml/profile_dists.
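The distance computation itself is delegated to profile_dists; for intuition, a Hamming-style pairwise distance over allelic profiles can be sketched as below. Skipping loci with missing data in either profile is one common convention, assumed here rather than taken from profile_dists:

```python
def hamming_distance(p1, p2, missing="-"):
    """Count loci where two profiles differ; positions with missing data
    in either profile are skipped (one common convention)."""
    return sum(1 for a, b in zip(p1, p2)
               if a != missing and b != missing and a != b)

def distance_matrix(profiles):
    """Build a symmetric pairwise distance matrix over a dict of id -> profile tuple."""
    ids = sorted(profiles)
    return {q: {r: hamming_distance(profiles[q], profiles[r]) for r in ids}
            for q in ids}
```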

3.3. Creation of nomenclature data

For the created distance matrices, a collection of nomenclature data will be created using https://github.com/phac-nml/genomic_address_service.

3.4. Creation of output metadata (output.json)

This step creates the output.json file from the nomenclature file.

4. Output

The following output will be provided. This will be communicated via an output.json file with the following overall structure:

{
    "files": { ... },
    "metadata": { ... }
}
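Assembling this structure is straightforward; a minimal sketch (the `build_output_json` function name is illustrative, not part of the pipeline):

```python
import json

def build_output_json(file_sections, sample_metadata):
    """Assemble the top-level output.json structure: a 'files' section keyed
    by profile identifier and a 'metadata' section keyed by sample."""
    return {"files": file_sections,
            "metadata": {"samples": sample_metadata}}
```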

4.1. Output files

The output.json data for files (the "files" section defined above) will look like:

{
    "profile_identifier1": {
        "distances": "identifier.distances.text",
        "thresholds": "identifier.thresholds.json",
        "clusters": "identifier.clusters.text",
        "tree": "identifier.tree.newick",
        "run": "identifier.run.json"
    },
    "profile_identifier2": { ... }
}

Where "identifier1" is derived from the identifiers in the `profilessheet.csv".

The output files consist of (output of https://github.com/phac-nml/genomic_address_service):

  1. ${identifier}.distances.{text|parquet} - Three column file of [query_id, ref_id, distance]
  2. ${identifier}.thresholds.json - JSON formatted mapping of columns to distance thresholds
  3. ${identifier}.clusters.{text|parquet} - Either symmetric distance matrix or three column file of [query_id, ref_id, distance]
  4. ${identifier}.tree.newick - Newick formatted dendrogram of the linkage matrix produced by SciPy
  5. ${identifier}.run.json - Contains logging information for the run including parameters, newick tree, and threshold mapping info

Here ${identifier} is derived from the input profilesheet.csv.
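Generating the per-identifier filename mapping from the list above can be sketched as follows (the `output_files_for` helper and its `fmt` parameter are illustrative assumptions):

```python
def output_files_for(identifier, fmt="text"):
    """Map output keys to the per-identifier filenames listed above.

    fmt selects between the text and parquet variants where both exist.
    """
    return {
        "distances": f"{identifier}.distances.{fmt}",
        "thresholds": f"{identifier}.thresholds.json",
        "clusters": f"{identifier}.clusters.{fmt}",
        "tree": f"{identifier}.tree.newick",
        "run": f"{identifier}.run.json",
    }
```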

4.2. Output metadata

The following metadata will be provided:

{
    "files": { ... },

    "metadata": {
        "samples": {
            "SampleA": {
                "listeria_cgmlst": {
                    "address": "1.2.3"
                }
            },
            "SampleB": {
                "salmonella_cgmlst": {
                    "address": "5.9.4"
                }
            }
        }
    }
}

The idea is that each sample's metadata will be stored under its identifier ("SampleX") in data storage, and could then be accessed under a key path such as listeria_cgmlst.address.
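That key-path access can be sketched as a small lookup over the metadata section (the `address_for` helper name is an illustrative assumption):

```python
def address_for(metadata, sample, scheme):
    """Look up the hierarchical address for one sample under one scheme,
    returning None when either key is absent."""
    return (metadata.get("samples", {})
                    .get(sample, {})
                    .get(scheme, {})
                    .get("address"))
```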

Question: Would it be better to store the deduplicated profile identifier mapped to a sample here, which will be a smaller set of data, and handle expanding to individual samples elsewhere?

5. Integration of data with IRIDA Next

In order for IRIDA Next to load results, it will look for the output.json file as described in Section 4.

5.1. Storing files

Anything under the files section will be stored in IRIDA associated with the analysis pipeline execution. These will be accessible by the key in the files section, for example clusters will give the file identifier.clusters.text.

5.2. Storing sample metadata

Sample metadata will be loaded up and associated with samples. For every sample identified in the metadata.samples section, the associated metadata will be stored.

{
    "SampleA": {
        "listeria_cgmlst": {
            "address": "1.2.3"
        }
    }
}

In IRIDA Next, there will be a parallel table that stores pipeline execution metadata for each field. For example:

{
    "SampleA": {
        "listeria_cgmlst": {
            "source": "analysis",
            "source_id": "1234"
        }
    }
}
@apetkau added the pipeline label Aug 19, 2023
@apetkau
Member Author

apetkau commented Aug 20, 2023

Initial implementation of pipeline here https://github.com/apetkau/nf-core-genomicnomenclature

You can run the pipeline tests with (assuming you have Nextflow and Docker installed):

nextflow run apetkau/nf-core-genomicnomenclature -profile docker,test -r dev -latest --outdir results
