1. Purpose
The nomenclature creation pipeline will generate a nomenclature from a collection of (cg/wg)MLST allelic profiles. It will make use of https://github.com/phac-nml/genomic_address_service.
2. Input
2.1. Allelic profiles
The main input will be a collection of allelic profiles, passed as a CSV file through the `--input` parameter. This CSV file will be structured as follows:
profilesheet.csv:

| id | profiles_format | allele_profiles |
| --- | --- | --- |
| profile_identifier1 | csv | /path/to/listeria.allele_profiles |
| profile_identifier2 | parquet | /path/to/salmonella.allele_profiles |

The following will be valid fields for the input.
id: An identifier for the allelic profiles, could be the (cg/wg)MLST scheme name, or some other identifier.
profiles_format: The format of the profiles. One of `csv` or `parquet` (could be auto-detected from file extension or data as well).
allele_profiles: The allele profiles file, as either a CSV file or a parquet file.
2.1.1. Allele profiles (CSV)
The following example format will be used for the allele profiles for the CSV format (both uncompressed and gzipped files will be supported).

| id | loci1 | loci2 | ... | lociN |
| --- | --- | --- | --- | --- |
| SampleA | be76 | af5d | ce78 | d877a |
| ID10 | af5d | be76 | ? | d877a |

Missing data will be represented as: ?, 0, - or space.
Other format structures described in
3. Steps
3.1. Deduplication
If the allele profiles referenced by the profilesheet.csv are identified by samples, then deduplication may be required to collapse samples with identical profiles.
3.1.1. Input
The input is an allele profile file like the following.

| id | loci1 | loci2 | ... | lociN |
| --- | --- | --- | --- | --- |
| SampleA | be76 | af5d | ce78 | d877a |
| SampleB | be76 | af5d | ce78 | d877a |

3.1.2. Output
The output consists of a deduplicated profiles file and a mapping back to the original profiles.
profiles.deduplicated.csv
profiles.samples.json:
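As a rough illustration of the deduplication step, the sketch below collapses samples with identical allele profiles and records which samples share each profile. It is only a sketch under stated assumptions, not the pipeline's implementation: the `deduplicate` function, the generated `profile_N` identifiers, and the choice to normalize every missing-data token to `?` before comparing are all hypothetical.

```python
import csv
import io
import json

# Tokens the spec lists as missing data; a blank cell is treated the same way.
MISSING = {"?", "0", "-", ""}


def deduplicate(rows):
    """Collapse samples with identical allele profiles.

    rows: iterable of lists, starting with a header row (id, locus names...),
    followed by one row per sample.
    Returns (deduplicated_rows, mapping), where mapping links each generated
    profile id (hypothetical naming) to the samples that share that profile.
    """
    rows = iter(rows)
    header = next(rows)
    profile_ids = {}  # normalized profile tuple -> generated profile id
    mapping = {}      # generated profile id -> list of sample ids
    deduplicated = [header]
    for sample_id, *alleles in rows:
        # Assumption: all missing-data tokens are equivalent when comparing.
        key = tuple("?" if a.strip() in MISSING else a.strip() for a in alleles)
        if key not in profile_ids:
            profile_id = f"profile_{len(profile_ids) + 1}"
            profile_ids[key] = profile_id
            mapping[profile_id] = []
            deduplicated.append([profile_id, *key])
        mapping[profile_ids[key]].append(sample_id)
    return deduplicated, mapping


example = """id,loci1,loci2,loci3,loci4
SampleA,be76,af5d,ce78,d877a
SampleB,be76,af5d,ce78,d877a
ID10,af5d,be76,?,d877a
"""
dedup_rows, sample_map = deduplicate(csv.reader(io.StringIO(example)))
print(json.dumps(sample_map, indent=2))
```

Here SampleA and SampleB end up under one profile identifier and ID10 under another. Whether profiles that differ only in missing loci should also be merged is a separate design decision that this sketch does not attempt to settle.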
3.2. Construction of distance matrix
For every entry in the profilesheet.csv, a separate distance matrix will be constructed. This will make use of https://github.com/phac-nml/profile_dists
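To make the idea of a distance matrix over allelic profiles concrete, here is a minimal sketch using a Hamming-style distance (the number of loci at which two profiles differ). It does not call profile_dists and is not a description of that tool's behavior. The function names are hypothetical, and skipping loci where either profile has missing data is an assumption.

```python
from itertools import combinations

# Missing-data tokens as listed in the input description.
MISSING = {"?", "0", "-", ""}


def hamming(a, b):
    """Count loci where both profiles have a call and the alleles differ."""
    return sum(
        1
        for x, y in zip(a, b)
        if x not in MISSING and y not in MISSING and x != y
    )


def distance_matrix(profiles):
    """Build a symmetric matrix from a dict of id -> list of allele ids.

    All profiles are assumed to list their loci in the same order.
    """
    ids = list(profiles)
    matrix = {i: {j: 0 for j in ids} for i in ids}
    for i, j in combinations(ids, 2):
        d = hamming(profiles[i], profiles[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


profiles = {
    "SampleA": ["be76", "af5d", "ce78", "d877a"],
    "ID10": ["af5d", "be76", "?", "d877a"],
}
matrix = distance_matrix(profiles)
print(matrix["SampleA"]["ID10"])
```

With the two example profiles, the first two loci differ, the third is skipped because one call is missing, and the fourth matches, so the distance is 2.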
3.3. Creation of nomenclature data
For the created distance matrices, a collection of nomenclature data will be created using https://github.com/phac-nml/genomic_address_service
3.4. Creation of output metadata (`output.json`)
This step creates the `output.json` file from the nomenclature file.
4. Output
The following output will be provided. This will be communicated with an `output.json` file with the following larger structure:
4.1. Output files
The `output.json` data for files (the "files" section defined above) will look like:
Where "identifier1" is derived from the identifiers in the `profilesheet.csv`.
The output files consist of (output of https://github.com/phac-nml/genomic_address_service):
Here `${identifier}` is derived from the input `profilesheet.csv`.
4.2. Output metadata
The following metadata will be provided:
The idea is that every sample will have stored the metadata under "SampleX" in data storage, which could then be accessed under listeria_cgmlst.address.
Question: Would it be better to store the deduplicated profile identifier mapped to a sample here, which will be a smaller set of data, and handle expanding to individual samples elsewhere?
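One way to picture the option raised in the question is to keep nomenclature results keyed by deduplicated profile identifier and expand them to individual samples in a separate step. The sketch below is hypothetical throughout: the `expand_to_samples` helper, the `profile_N` identifiers, and the address values are placeholders, not output from genomic_address_service.

```python
def expand_to_samples(addresses, sample_map, key):
    """Return {sample_id: {key: address}} for every sample behind each profile.

    addresses: deduplicated profile id -> address string
    sample_map: deduplicated profile id -> list of sample ids
    key: metadata field name under which the address is stored
    """
    return {
        sample: {key: addresses[profile_id]}
        for profile_id, samples in sample_map.items()
        for sample in samples
    }


# Placeholder values for illustration only.
addresses = {"profile_1": "1.1.1", "profile_2": "1.2.1"}
sample_map = {"profile_1": ["SampleA", "SampleB"], "profile_2": ["ID10"]}

per_sample = expand_to_samples(addresses, sample_map, "listeria_cgmlst.address")
print(per_sample["SampleB"])
```

Under this arrangement the pipeline would only need to emit one record per distinct profile, and the per-sample view could be rebuilt wherever it is needed.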
5. Integration of data with IRIDA Next
In order for IRIDA Next to load results, it will look for the `output.json` file as described in Section 4.
5.1. Storing files
Anything under the `files` section will be stored in IRIDA associated with the analysis pipeline execution. These will be accessible by the key in the `files` section, for example `clusters` will give the file `identifier.clusters.text`.
5.2. Storing sample metadata
Sample metadata will be loaded up and associated with samples. For every sample identified in the `metadata.samples` section, the associated metadata will be stored.
In IRIDA Next, there will be a parallel table that stores pipeline execution metadata for each field. For example: