RNAfold_virus

Tutor (Neri's) version of Rima Sghayer's lab project.
The repo contains the scripts (mostly as Jupyter notebooks) needed to recreate the distribution and statistical analyses of RNA secondary structure features of several RNA virus groups.
The notebooks can be run locally (we recommend JupyterLab, which makes it easier to switch between notebooks) or via Google Colab (note that some paths will need to be adjusted, and that the environment and some of the dependencies will need to be reinstalled on each startup).
For more information, take a look at this conference poster: Slides-link, or contact us directly.

Order of execution

Preprocessing.ipynb - Downloads and installs the dependencies, sets up the working environment, and fetches sequence data for selected virus groups from the RVMT project's FTP hub. The last code block runs RNAfold on the length-filtered contigs; this step may take some time, so we recommend using the pre-generated files instead (in this repo, see the /RAW_Data/<virus_group>/ subfolders).
Data_extraction.ipynb - Sources the Contig class and some helper functions to parse the DBN (dot-bracket notation) files using the forgi library. The last code block iterates over the virus groups and dumps a pickled version of the object for each contig.
statistical analysis.ipynb - Focuses on statistical tests, distribution analysis, data exploration, and visualization of the extracted features in the different virus groups. The visualizations produced in this step help in understanding the statistical patterns and in interpreting the test results.
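
To give a rough idea of what the extraction step works with, here is a minimal sketch that parses one RNAfold record and derives a few simple features from its dot-bracket string. It uses only the Python standard library; it is not the Contig class and does not use forgi, and the function names, the feature names, and the example record (including its energy value) are made up for illustration.

```python
import pickle


def parse_dot_bracket(structure):
    """Return sorted (i, j) base pairs (0-based) from a dot-bracket string."""
    stack, pairs = [], []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {idx}")
            pairs.append((stack.pop(), idx))
        elif char != ".":
            raise ValueError(f"unexpected character {char!r} at position {idx}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def parse_rnafold_record(lines):
    """Parse one three-line record: header, sequence, structure plus energy."""
    header, sequence, last = (line.strip() for line in lines)
    # The structure line ends with the free energy in parentheses.
    structure, energy = last.rsplit(" (", 1)
    structure = structure.strip()
    pairs = parse_dot_bracket(structure)
    # A hairpin loop is closed by a pair that encloses only unpaired bases.
    hairpins = sum(
        1 for i, j in pairs
        if j - i > 1 and set(structure[i + 1:j]) == {"."}
    )
    return {
        "name": header.lstrip(">"),
        "length": len(sequence),
        "structure": structure,
        "mfe": float(energy.rstrip(")")),
        "paired_fraction": 2 * len(pairs) / len(structure),
        "hairpins": hairpins,
    }


# Hypothetical record in the layout RNAfold prints for FASTA input;
# the sequence, structure, and energy are invented, not real output.
example = [
    ">contig_1",
    "GGGAAACCCAAAGGGGAAAACCCCA",
    "(((...)))...((((....)))). (-5.20)",
]

record = parse_rnafold_record(example)
# Round-trip through pickle, mirroring the per-contig dump described above.
restored = pickle.loads(pickle.dumps(record))
print(record["hairpins"], record["paired_fraction"], restored == record)
```

forgi provides a much richer decomposition (stems, multiloops, interior loops); the sketch only shows the shape of the data flow from a DBN record to a pickled per-contig object.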

Dependencies

ViennaRNA
seqkit
Python libraries:
- itables
- pingouin
- matplotlib
- pandas
- scipy
- numpy
- joblib
- multiprocessing
GNU parallel - Needed to run multiple parallel instances of RNAfold instead of a single instance using the internal threading option (--jobs=0).
bbtools/bbmap - Optional, not in use yet. May be added later for better pre-processing stats (via bbstats.sh) and to add the option of using a length cutoff >= the median of the group instead of xx% of the longest contig length (this may help with viral groups with highly variable genome length).
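
The difference between the two length cutoffs mentioned above can be sketched in a few lines of plain Python. This is an illustration only, not the code used in Preprocessing.ipynb: the function names are hypothetical, and the 0.8 fraction is an arbitrary placeholder, since the actual percentage is not stated here.

```python
from statistics import median


def length_cutoff(lengths, mode="fraction", fraction=0.8):
    """Return the minimum contig length to keep.

    mode="fraction": a fraction of the longest contig (0.8 is a placeholder).
    mode="median":   the median contig length of the group.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    if mode == "fraction":
        return fraction * max(lengths)
    if mode == "median":
        return median(lengths)
    raise ValueError(f"unknown mode: {mode!r}")


def filter_by_length(contigs, mode="fraction", fraction=0.8):
    """Keep contigs (name -> sequence) whose length is >= the cutoff."""
    cutoff = length_cutoff([len(seq) for seq in contigs.values()], mode, fraction)
    return {name: seq for name, seq in contigs.items() if len(seq) >= cutoff}


# Hypothetical group with highly variable genome lengths.
contigs = {
    "a": "A" * 100,
    "b": "A" * 900,
    "c": "A" * 1000,
    "d": "A" * 3000,
}

kept_fraction = sorted(filter_by_length(contigs, mode="fraction"))
kept_median = sorted(filter_by_length(contigs, mode="median"))
print(kept_fraction)  # ['d']
print(kept_median)    # ['c', 'd']
```

With one unusually long contig in the group, the fraction-based cutoff discards almost everything, while the median-based cutoff keeps the upper half of the group.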

UriNeri/RNAfold_virus_Rima
