Tutor (Neri's) version of Rima Sghayer's lab project.
The repo contains the scripts (mostly Jupyter notebooks) needed to recreate the distribution and statistical analyses of RNA secondary structure features across several RNA virus groups.
The notebooks can be run locally (we recommend JupyterLab, which makes it easier to switch between notebooks) or via Google Colab (note that some paths will need to be adjusted, and that the environment and some of the dependencies will need to be re-installed on each startup).
For more information, take a look at this conference poster: Slides-link, or contact us directly.
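
When switching between a local run and Google Colab, the path adjustments can be kept in one place. The snippet below is only a sketch of that idea, not code from the notebooks: `running_in_colab`, `raw_data_dir`, the `/content/repo_clone` location, and the `example_group` name are all hypothetical placeholders. Only the `RAW_Data/<virus_group>/` layout comes from this repo.

```python
import sys
from pathlib import Path


def running_in_colab() -> bool:
    """Heuristic: the google.colab package is already imported inside a Colab runtime."""
    return "google.colab" in sys.modules


def raw_data_dir(virus_group, local_root=".", colab_root="/content/repo_clone", in_colab=None):
    """Return the RAW_Data subfolder for one virus group.

    colab_root is a made-up example location; point it at wherever the
    repo was cloned or mounted in your Colab session.
    """
    if in_colab is None:
        in_colab = running_in_colab()
    root = Path(colab_root if in_colab else local_root)
    return root / "RAW_Data" / virus_group


# "example_group" is a placeholder, not a real folder name from this repo.
print(raw_data_dir("example_group"))
```

Passing `in_colab=True` or `in_colab=False` explicitly overrides the detection, which can be handy when testing the paths.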
- Preprocessing.ipynb - Downloads and installs the dependencies, sets up the working environment, and fetches sequence data for the selected virus groups from the RVMT project's FTP hub. The last code block runs `RNAfold` on the length-filtered contigs. This step may take some time, so we recommend using the pre-generated files instead (in this repo, see the /RAW_Data/<virus_group>/ subfolders).
- Data_extraction.ipynb - Sources the `Contig` class and some helper functions to parse the DBN files using the `forgi` library. The last code block iterates over the different virus groups and dumps a pickled version of the object for each contig.
- statistical analysis.ipynb - Focuses on statistical tests, distribution analysis, data exploration, and visualization of the extracted features in the different virus groups. The visualizations produced in this step help in understanding the statistical patterns and in interpreting the test results.
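
To give a feel for what the data extraction step involves, here is a minimal pure-Python sketch of reading dot-bracket (DBN) records and pickling the result. It is not the repo's `Contig` class and does not use `forgi`; the function names and the dictionary layout are made up for illustration. It assumes each record has three lines (a `>` header, the sequence, and the structure followed by the free energy in parentheses), which is how `RNAfold` prints results for FASTA input, but the pre-generated files may differ.

```python
import pickle
import re

# Matches a trailing energy such as "( -1.20)" or "(-12.30)".
ENERGY_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\)\s*$")


def pair_table(structure):
    """Entry i holds the 0-based partner of position i, or -1 if unpaired."""
    partners = [-1] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return partners


def parse_dbn_text(text):
    """Parse header/sequence/structure triplets into plain dictionaries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    i = 0
    while i + 2 < len(lines):
        if not lines[i].startswith(">"):
            i += 1  # skip anything that is not a record header
            continue
        fold_line = lines[i + 2]
        structure = fold_line.split()[0]
        energy = ENERGY_RE.search(fold_line)
        partners = pair_table(structure)
        n_paired = sum(1 for p in partners if p != -1)
        records.append({
            "name": lines[i][1:].strip(),
            "sequence": lines[i + 1],
            "structure": structure,
            "energy": float(energy.group(1)) if energy else None,
            "n_base_pairs": n_paired // 2,
            "paired_fraction": n_paired / len(structure) if structure else 0.0,
        })
        i += 3
    return records


demo = ">demo_contig\nGGGAAAUCCC\n(((....))) ( -1.20)\n"
records = parse_dbn_text(demo)
blob = pickle.dumps(records)  # bytes that could be written to a .pkl file
print(records[0]["n_base_pairs"], records[0]["paired_fraction"])  # 3 0.6
```

Plain dictionaries are used here so the pickled bytes can be loaded without importing any custom class. Only `(` and `)` are treated as pairing characters, so pseudoknot brackets would be counted as unpaired in this sketch.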
- Python libraries:
- GNU parallel - Needed to run multiple parallel instances of `RNAfold` instead of a single instance using the internal threading option (`--jobs=0`).
- bbtools/bbmap - Optional, not in use yet. May be added later for better pre-processing stats (via `bbstats.sh`) and to add the option of using a length cutoff >= the group median instead of xx% of the longest contig length (which may help with viral groups that have highly variable genome lengths).
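
The difference between the two length cutoffs mentioned for bbtools/bbmap can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the contig lengths are invented, and the `0.5` fraction is an arbitrary stand-in for the unspecified xx% value, not the setting used in the notebooks.

```python
from statistics import median


def cutoff_fraction_of_longest(lengths, fraction):
    """Cutoff defined as a fraction of the longest contig in the group."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return max(lengths) * fraction


def cutoff_group_median(lengths):
    """Cutoff defined as the median contig length of the group."""
    return median(lengths)


def keep_long_contigs(lengths, cutoff):
    """Keep contigs whose length is at least the cutoff."""
    return [n for n in lengths if n >= cutoff]


# Invented lengths for a group with highly variable genome sizes.
lengths = [1000, 1200, 1500, 9000, 30000]

by_fraction = keep_long_contigs(lengths, cutoff_fraction_of_longest(lengths, 0.5))
by_median = keep_long_contigs(lengths, cutoff_group_median(lengths))

print(by_fraction)  # [30000]
print(by_median)    # [1500, 9000, 30000]
```

With one very long contig in the group, the fraction-based cutoff discards almost everything, while the median-based cutoff always keeps roughly the longer half of the group.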