Updated pathotyping and toxin typing database by adding protein sequences to each record and adding subunit coordinates and intergenic region of shiga toxin records
kbessonov1984 committed Oct 3, 2024
1 parent b420a54 commit 776c2fb
Showing 3 changed files with 3,166 additions and 625 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -246,8 +246,7 @@ The Shiga toxin subtyping module supports typing of the *`stx1`* and *`stx2`* ge

Currently the database supports 4 *`stx1`* subtypes: *`stx1a`*, *`stx1c`*, *`stx1d`* and stx1e and 15 *`stx2`* subtypes: *`stx2a`*, *`stx2b`*, *`stx2c`*, *`stx2d`*, *`stx2e`*, *`stx2f`*, *`stx2g`* ,*`stx2h`*, *`stx2i`*, *`stx2j`*, *`stx2k`*,*`stx2l`*, *`stx2m`*, *`stx2n`*, *`stx2o`*.

-The input sequences are queried against the *`stx1`* and *`stx2`* markers via BLASTN and top hits are being reported separated by the `;` symbol. The module supports the multi-copy `stx` gene presence by taking into account the genomic `stx` location attributes for each `stx` subtype (i.e. gene coordinates, contig location, overlap with other `stx` hits). The multi-copy `stx` gene reporting is not exhaustive (not all hits are being reported). That is if multiple `stx` hits are found in the input, the highest quality longest hit per each `stx` subtype is being reported (i.e. the hit with the highest `bitscore`). The `StxSubtypes` field lists all UNIQUE `stx` subtypes such as `stx2e;stx2k` even if their genomic locations overlap or are identical due to truncated incomplete `stx` alleles. The `StxContigNames` and `StxCoordinates` lists all contig names and corresponding genomic coordinates for each listed `stx` type in the `StxSubtypes` field according to the alphabetical order.
-
+The input sequences are queried against the *`stx1`* and *`stx2`* markers via BLASTN and top hits are being reported separated by the `;` symbol. The module supports the multi-copy `stx` gene presence by taking into account the genomic `stx` location attributes for each `stx` subtype (i.e. gene coordinates, contig location, overlap with other `stx` hits). The multi-copy `stx` gene reporting is not exhaustive (not all hits are being reported). That is if multiple `stx` hits are found in the input, the highest quality hit(s) per each non-overlapping `stx` gene range is being reported (i.e. single or multiple top hits are possible with the highest identical `bitscore` value as some hits could not be resolved due to sequence truncation). For example, if several `stx` allele hits have identical `bitscore` in a given `stx` gene range, all such hits are being reported. Note that the `StxSubtypes` field lists only UNIQUE `stx` subtypes for the entire input sample such as `stx2e;stx2k` even if their genomic locations overlap or are identical due to truncated incomplete `stx` allele signatures. The `StxContigNames` and `StxCoordinates` lists all contig names and corresponding genomic coordinates for each listed `stx` type in the `StxSubtypes` field according to the alphabetical order. This allows to easily spot `stx` subtypes with the same genomic coordinates. Finally, these fields allow to better understand `stx` alleles context/function and spot truncated alleles while providing genomic location context.
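
The selection rule described in the updated paragraph (keep the top-`bitscore` hit or tied hits per non-overlapping `stx` gene range, then report unique subtypes alphabetically) might be sketched roughly as follows. This is an illustrative sketch only, not the tool's actual implementation: the `Hit` class, the function names, the overlap test, and the exact formatting of the contig and coordinate fields are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Hit:
    """Hypothetical container for one BLASTN hit against an stx marker."""
    subtype: str
    contig: str
    start: int
    end: int
    bitscore: float


def _best(cluster):
    # Keep every hit that shares the highest bitscore in this gene range.
    top = max(h.bitscore for h in cluster)
    return [h for h in cluster if h.bitscore == top]


def top_hits_per_range(hits):
    """Group overlapping hits per contig and keep the tied top hit(s) of each group."""
    kept = []
    ordered = sorted(hits, key=lambda h: (h.contig, h.start, h.end))
    for _, contig_hits in groupby(ordered, key=lambda h: h.contig):
        cluster, cluster_end = [], None
        for hit in contig_hits:
            # Assumption: a hit starting past the current range opens a new range.
            if cluster and hit.start > cluster_end:
                kept.extend(_best(cluster))
                cluster = []
            cluster.append(hit)
            cluster_end = hit.end if len(cluster) == 1 else max(cluster_end, hit.end)
        kept.extend(_best(cluster))
    return kept


def summarize(hits):
    """Build ';'-separated fields, ordered alphabetically by subtype."""
    kept = sorted(top_hits_per_range(hits), key=lambda h: (h.subtype, h.contig, h.start))
    return {
        "StxSubtypes": ";".join(sorted({h.subtype for h in kept})),
        "StxContigNames": ";".join(h.contig for h in kept),
        "StxCoordinates": ";".join(f"{h.start}-{h.end}" for h in kept),
    }


# Made-up hits: two tied alleles and one weaker hit share a range on contig1.
example = [
    Hit("stx2e", "contig1", 100, 1300, 2200.0),
    Hit("stx2k", "contig1", 100, 1300, 2200.0),
    Hit("stx2a", "contig1", 150, 1200, 1900.0),
    Hit("stx1a", "contig2", 5000, 6200, 2100.0),
]
summary = summarize(example)
```

In this made-up input, `stx2e` and `stx2k` tie on `bitscore` within the same range, so both are kept and appear with identical coordinates, while the lower-scoring `stx2a` hit in that range is dropped.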

### Quality Control (QC) module
To provide an easier interpretation of the results and typing metrics, following QC codes were developed.