
3. Considerations and Underlying Assumptions

Rauf Salamzade edited this page Dec 18, 2021 · 2 revisions

On this page we list some of the important factors to consider prior to running the suite. Manual examination of segments, and quality assessment of their presence within the genomes of individual samples, should be performed prior to drawing conclusions!

Recently Developed Alternative Approaches for Finding Conserved Segments

Related methods for finding highly conserved and contiguous sequences have recently been described in two studies:

In Arevalo et al. 2019, the authors use such segments to infer ecological relationships between genomes. They utilize the MUGSI suite to perform multi-genome whole genome alignment (WGA). While MUGSI offers significant boosts in efficiency compared to some alternative options for WGA, such WGA approaches are largely limited in their ability to scale to large numbers (>100) of genomes/scaffolds without extensive memory requirements.

In Evans et al. 2020, the authors looked for identical >= 5 kb segments conserved in multiple genera from a single hospital network and assumed such segments represented localized spread. Their method relied on pairwise BLASTn analysis followed by single-linkage clustering. Critically, their approach could not identify clear boundaries of conserved segments found in more than 2 samples/scaffolds.

Additionally, neither approach includes functionality to account for sequence circularity or for the breaks in continuity inherent to FASTA formatting.

Appropriate Window Size and Step for Sliding Window Analysis

It is absolutely critical to choose the window size and step wisely for the sliding window analysis in delineateSegmentsOnReference.py to avoid false positive detection of conserved segments due to repetitive multi-copy elements. While window size doesn't impact the ability to detect whether an individual window is conserved, it does impact whether adjacent windows can be appropriately combined when delineating conserved segments. Thus, larger window sizes and smaller (more granular) steps for the sliding window analysis will result in more reliable detection of conserved segments.

For our study (Salamzade et al 2021), we used a 10 kb window size with a 100 bp step, because:

  1. The vast majority (98.7%) of HSPs with identity > 99% were shorter than 10 kb; thus, sequences of 10 kb and longer were outliers and hypothesized to share a recent ancestral origin.

  2. Similarly, windows of 10 kb are extremely unlikely to occur multiple times in the same scaffold/genome.

  3. It allowed for the capture of whole operons or large transposable elements and their surrounding contexts, such as Tn4401.
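The sliding window scheme described above can be sketched as follows. This is a minimal illustration and not the implementation in delineateSegmentsOnReference.py: the function names window_starts and merge_conserved_runs are hypothetical, and the per-window conservation calls are assumed to be supplied as a list of booleans from some upstream comparison. Merging only runs of consecutive conserved windows (rather than any windows whose coordinates overlap) is an assumption made here so that a single non-conserved window is enough to break a segment.

```python
from typing import List, Tuple


def window_starts(seq_len: int, window: int = 10_000, step: int = 100) -> List[int]:
    """Return 0-based start coordinates of every full-length window
    on a linear sequence of length seq_len."""
    if seq_len < window:
        return []
    return list(range(0, seq_len - window + 1, step))


def merge_conserved_runs(
    starts: List[int], conserved: List[bool], window: int = 10_000
) -> List[Tuple[int, int]]:
    """Combine runs of consecutive conserved windows into segments.

    Assumes step <= window, so consecutive windows overlap or abut.
    Returned coordinates are 0-based and end-exclusive.
    """
    segments: List[Tuple[int, int]] = []
    run_start = None
    run_end = None
    for start, is_conserved in zip(starts, conserved):
        if is_conserved:
            if run_start is None:
                run_start = start  # a new run of conserved windows begins
            run_end = start + window
        elif run_start is not None:
            # A non-conserved window closes the current run.
            segments.append((run_start, run_end))
            run_start = None
    if run_start is not None:
        segments.append((run_start, run_end))
    return segments
```

With the 10 kb window and 100 bp step used in our study, a 50 kb linear scaffold yields 401 windows under this scheme. A smaller step means a repeat-induced break is localized more precisely, at the cost of more windows to evaluate.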

Impact of Input Assembly Quality

The more fragmented the input assemblies and scaffolds are, the more likely it is that conserved segments will be missed or predicted to be shorter than they actually are. We ran ConSequences on a set of Illumina-only draft genome assemblies for 12 strains and, separately, on high-quality polished reference assemblies for the same strains, created by hybrid assembly of Illumina and Nanopore sequencing data. The number and size of detected conserved segments greatly benefited from using the higher quality assemblies.

It is also important to recognize that circular sequences are linearized when represented in FASTA format. To account for this, delineateSegmentsOnReference.py allows the user to specify whether the reference scaffold or alternate scaffolds represent complete circular molecules. Short circular molecules can be fully assembled even in draft genome assemblies! We thus also developed and included the program findCircularScaffolds.py for predicting whether scaffolds represent complete circular plasmids, inspired by methods described previously by Jørgensen et al. 2014 and Kothari et al. 2018.
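To illustrate why linearization matters for a sliding window scan, the sketch below generates windows on a circular molecule so that windows running past the end of the FASTA record wrap around to its start. This is only one possible way to handle circularity and is not taken from delineateSegmentsOnReference.py; the function name circular_windows is hypothetical.

```python
from typing import Iterator, Tuple


def circular_windows(seq: str, window: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for windows on a circular molecule.

    The linear FASTA sequence is extended with its own first window - 1
    bases so that windows starting near the end wrap around the
    arbitrary linearization point.
    """
    n = len(seq)
    if n < window:
        return
    extended = seq + seq[: window - 1]
    for start in range(0, n, step):
        yield start, extended[start : start + window]
```

On the toy sequence "ABCDEFGH" with a window of 4 and a step of 2, a linear scan would stop at "EFGH", whereas this circular scan also yields "GHAB", the window spanning the linearization break.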

Considerations for Finding Segment Instances in a Sample's Readset

As mentioned, querySegmentsInRawReads.py only predicts that a segment is present by checking that the k-mers comprising the core of the segment's MSA are found in the sample's read set. This essentially checks whether an assembly path is possible for the segment in the sample's read set. Predictions can be complicated by the fact that not all k-mers along the reference MSA are accounted for, such as those relating to small indels in the MSA.
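The idea of checking core k-mers against a read set can be sketched as below. This is a simplified illustration under stated assumptions, not the logic of querySegmentsInRawReads.py: "core" is assumed here to mean k-mers shared by every known instance of the segment, k-mers are treated as canonical (strand-independent), reads are plain strings, and all function names are hypothetical. A real read set would also call for abundance thresholds to guard against sequencing errors.

```python
from typing import Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """Return the set of canonical k-mers in seq (the lexicographically
    smaller of each k-mer and its reverse complement)."""
    seq = seq.upper()
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        rc = kmer.translate(_COMPLEMENT)[::-1]
        result.add(min(kmer, rc))
    return result


def core_kmers(instances: Iterable[str], k: int) -> Set[str]:
    """k-mers present in every instance of the segment (assumed meaning of 'core')."""
    sets = [canonical_kmers(s, k) for s in instances]
    return set.intersection(*sets) if sets else set()


def segment_supported(core: Set[str], reads: Iterable[str], k: int) -> bool:
    """True if every core k-mer is observed at least once in the reads."""
    if not core:
        return False  # nothing to test, so make no positive call
    remaining = set(core)
    for read in reads:
        remaining -= canonical_kmers(read, k)
        if not remaining:
            return True
    return False
```

Because only k-mers shared by all instances are tested, k-mers overlapping a small indel that differs between instances drop out of the core set, which is the source of the complication noted above.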

In our study (Salamzade et al 2021), we predicted 52 new instances across 44 geographic signatures. However, after manually examining how segments align to the draft genome assemblies of the samples with new instances, we felt confident in including only those instances that delineateSegmentsOnReference.py had missed in assemblies, either because the segment was on a scaffold end (assembly fragmentation) or because it was located within a chromosomal scaffold; chromosomal scaffolds were not included in our initial scan for conserved segments.

Why Pilon and ARIBA Are Not Ideal for Predicting Whether Conserved and Contiguous Segments Are Present in a Sample's Readset

Before developing the k-mer based methodology for predicting the presence of conserved segments directly in a sample's readset, we attempted to use available informatics tools we thought could do the job. Namely, we tried to use Pilon to align readsets to conserved segments and ARIBA to try targeted, mini de novo assembly of conserved segments from readsets. However, both of these approaches had difficulty accurately predicting whether segments were contiguously present in a sample, in particular segments featuring short IS elements found in multiple copies throughout isolate genomes. In other words, they did a great job at predicting whether all the parts of a segment were present but struggled to determine whether the segment was contiguously present in the genome of a sample. Pilon was extremely informative; however, its philosophy is conservative for the task at hand: it is designed to test whether the conserved segment (used as a reference) could have potential structural errors, rather than to directly report whether the conserved segment can be assembled from a sample's readset.

The k-mer methodology maintains its accuracy when dealing with rare segments, often representative of unique gene arrangements. It roots out false positive detection of conserved segments by searching for core k-mers spanning unique junctions between common DNA fragments, often originating from MGEs. For instance, in Salamzade et al. 2021, we identified six carbapenemase carrying multi-species geographic signatures, one of which included an overlap between the Tn4401 transposon carrying the blaKPC carbapenemase and an independent Tn5403 insertion sequence. Because the k-mers spanning this overlap between the two insertion elements are core to all instances of the conserved signature identified from assembly analysis, new instances of the segment identified in raw reads must also feature these k-mers. Using the k-mer method, we identified only 12 samples as carrying this signature across the 600+ samples in our study and the full SRA database (until Dec 2016), including 5 samples which were not initially used to identify the signature from assemblies. To further investigate this signature, we also constructed completed hybrid genomic assemblies using ONT technologies and were able to validate that all twelve samples indeed featured the signature.
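The role of junction-spanning k-mers can be illustrated with a small sketch. Given two adjacent elements, the k-mers that contain bases from both sides of their boundary exist only when the elements are arranged in that specific order, so their presence in reads is evidence for the arrangement rather than for the elements individually. The function name junction_kmers and the toy sequences are hypothetical and do not represent Tn4401 or Tn5403.

```python
from typing import List


def junction_kmers(left: str, right: str, k: int) -> List[str]:
    """Return k-mers of left + right that include at least one base
    from each side of the junction."""
    joined = left + right
    j = len(left)  # index of the first base belonging to `right`
    first = max(0, j - k + 1)           # earliest start still reaching into `right`
    last = min(j - 1, len(joined) - k)  # latest start still beginning in `left`
    return [joined[i : i + k] for i in range(first, last + 1)]


# Toy example: neither junction k-mer occurs within either element alone.
left_element = "AAAA"
right_element = "CCCC"
spanning = junction_kmers(left_element, right_element, 3)
```

Here spanning holds "AAC" and "ACC". A sample carrying both elements at separate loci would contain all k-mers internal to each element but neither of these two.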