Current version: 1.0.1-dev
A simple Snakemake workflow to perform network analyses using the Python Geneset Network Analysis (PyGNA) package.
- Viola Fanfani, [email protected] (lead developer)
- Giovanni Stracquadanio, [email protected]
If you simply want to use this workflow, download and extract the latest release.
In any case, if you use this workflow in a paper, please cite PyGNA as follows:
Configure the workflow according to your needs by editing the file config.yaml.
A template config file is provided as config_temp.yaml.
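One way to start from the template is to copy it to config.yaml and edit the copy. The sketch below assumes the template is meant to be used this way; init_config is a hypothetical helper name, not part of the workflow, and it refuses to overwrite an existing config.yaml.

```shell
# Hypothetical helper (not part of the workflow): create config.yaml from the
# template, leaving any existing config.yaml untouched.
init_config() {
    template="${1:-config_temp.yaml}"
    target="${2:-config.yaml}"
    if [ ! -f "$template" ]; then
        echo "template not found: $template" >&2
        return 1
    fi
    if [ -f "$target" ]; then
        # Do not clobber a configuration the user has already edited.
        echo "keeping existing $target" >&2
        return 0
    fi
    cp "$template" "$target"
}
```

Calling init_config from the workflow folder and then editing config.yaml keeps the template itself unchanged for later reference.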
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using $N cores, or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Results are stored in the results folder.
We provide a full pipeline for benchmarking the GNT and GNA methods on SBM-generated data.
First check the config_sbm.yaml config file and set the desired parameters; the whole pipeline can then be run with:
snakemake --snakefile Snakefile_sbm --configfile config_sbm.yaml --cores 1
For the HDN simulations, please refer to the snakefile from the previous paper, Snakefile_paper_old.
The entire pipeline for the processing of cancer data can be run as follows
Please note: the scripts/tcga-download.R script downloads and processes the RNAseq dataset. This step is time- and memory-consuming and, more importantly, we have found it difficult to replicate the exact environment and TCGA version being installed (many issues of this kind have been raised on the TCGAbiolinks GitHub repo). For reproducibility, we provide the differential expression tables already preprocessed. We generated this data with R=4 and TCGAbiolinks v2.14.0. Use the --use-conda flag to install the environment with the same parameters.
The following steps should let you run the paper pipelines:
- First download the data folder from Zenodo.
- The pipeline uses the data folder path; you can either:
  A. add the data folder inside the workflow-pygna folder, or
  B. change the relative path in the config file.
- Check the config_paper.yaml configuration file. It includes the number of permutations and cores parameters; tweak them as needed (for the moment we have set them to 3 and 1 so that you can quickly check that all results are generated).
- As we provide intermediate files whose generation can be very time-consuming (differential expression results and SP/RWR matrices), run
snakemake --snakefile Snakefile_paper --configfile config_paper.yaml -t
to update the timestamps of all files and avoid recreating them (snakemake would otherwise try to rerun the pipeline rules if the files have been recently modified).
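Before touching or running anything, it can help to confirm that the data folder is where the pipeline expects it. The following is only a sketch: the folder name data, its location, and the helper name check_data_dir are assumptions, since the actual path is whatever your config file points to.

```shell
# Hypothetical preflight check (not part of the workflow). The default folder
# name "data" is an assumption; pass the path used in your config file instead.
check_data_dir() {
    data_dir="${1:-data}"
    if [ ! -d "$data_dir" ]; then
        echo "data folder not found: $data_dir" >&2
        return 1
    fi
    # An empty folder usually means the downloaded archive was not extracted.
    if [ -z "$(ls -A "$data_dir")" ]; then
        echo "data folder is empty: $data_dir" >&2
        return 1
    fi
    echo "using data folder: $data_dir"
}
```

A typical use would be to chain it with the touch step, for example check_data_dir data && snakemake --snakefile Snakefile_paper --configfile config_paper.yaml -t, so that the touch only happens when the folder is present.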
We provide a Snakefile to replicate the results of the paper:
snakemake --snakefile Snakefile_paper --configfile config_paper.yaml --use-conda --cores $N
- single_geneset: with this suffix we refer to all the results for the high-throughput experiments taken from TCGAbiolinks (Fig. 2).
  Please note: the scripts/tcga_rnaseq.R script downloads and processes the BLCA RNAseq dataset. This step is time- and memory-consuming and, more importantly, we have found it difficult to replicate the exact environment and TCGA version being installed (many issues of this kind have been raised on the TCGAbiolinks GitHub repo). For reproducibility, we provide the blca_diffexp.csv file, which contains the full differential expression results. We generated this data with R=3.6 and TCGAbiolinks v2.14.0. Use the --use-conda flag to install the environment with the same parameters.
- multi: refers to the analysis of multiple genesets from the Bailey et al. paper (Fig. 4).
The following steps should let you run the paper pipelines:
- First download the data folder from Zenodo.
- The pipeline uses the data folder path; you can either:
  A. add the data folder inside the workflow-pygna folder, or
  B. change the relative path in the config file.
- Check the config_paper_single.yaml and config_paper_multi.yaml configuration files. They include the number of permutations and cores parameters; tweak them as needed (for the moment we have set them to 3 and 1 so that you can quickly check that all results are generated).
- As we provide intermediate files whose generation can be very time-consuming (differential expression results and SP/RWR matrices), run
snakemake --snakefile Snakefile_paper <analysis>_all --configfile config_paper_<analysis>.yaml -t
with <analysis> being one of single or multi, to update the timestamps of all files and avoid recreating them (snakemake would otherwise try to rerun the pipeline rules if the files have been recently modified).
- To obtain all the results for the single geneset (skip the first command if you want a full regeneration of all files):
snakemake --snakefile Snakefile_paper single_all --configfile config_paper_single.yaml -t
snakemake --snakefile Snakefile_paper single_all --configfile config_paper_single.yaml --use-conda
- To obtain the results for the multi geneset:
snakemake --snakefile Snakefile_paper multi_all --configfile config_paper_multi.yaml -t
snakemake --snakefile Snakefile_paper multi_all --configfile config_paper_multi.yaml
- To obtain the results for the HDN simulations:
snakemake --snakefile Snakefile_paper hdn_all --configfile config_paper_hdn.yaml
- To obtain the results for the SBM simulations:
snakemake --snakefile Snakefile_paper sbm_all --configfile config_paper_sbm.yaml
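The touch-then-run sequence for the single and multi analyses follows the same pattern, so it can be wrapped in a small function. This is a sketch under stated assumptions: run_analysis is a hypothetical helper name, and only the target and config file naming shown in the commands above is used. Extra flags such as --use-conda are forwarded to the second (real) run only.

```shell
# Hypothetical helper (not part of the workflow): touch the provided
# intermediate files, then run the <analysis>_all target with its config file.
run_analysis() {
    if [ "$#" -lt 1 ]; then
        echo "usage: run_analysis single|multi [extra snakemake flags]" >&2
        return 1
    fi
    analysis="$1"
    shift
    case "$analysis" in
        single|multi) ;;
        *) echo "unknown analysis: $analysis (expected single or multi)" >&2
           return 1 ;;
    esac
    snakefile="Snakefile_paper"
    config="config_paper_${analysis}.yaml"
    # -t only updates file timestamps, so provided intermediates are not rebuilt.
    snakemake --snakefile "$snakefile" "${analysis}_all" --configfile "$config" -t || return 1
    # Remaining arguments (for example --use-conda) go to the actual run.
    snakemake --snakefile "$snakefile" "${analysis}_all" --configfile "$config" "$@"
}
```

For example, run_analysis single --use-conda issues the two single geneset commands in order, and run_analysis multi issues the two multi geneset commands. To regenerate every file from scratch, run the second command directly instead of using the helper.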