This repository is open sourced as specified in the LICENSE file. It is Apache License 2.0. For additional information, please check LICENSE.
CROPSR is a python tool designed for genome-wide gRNA design and evaluation for CRISPR experiments, with special focus on complex genomes such as those found in energy-producing crops. CROPSR is a product of the DOE Center for Advanced Bioenergy and Bioproducts Innovation (CABBI).
Please cite the following when utilizing CROPSR:
Müller Paul, H., Istanto, D.D., Heldenbrand, J. et al. CROPSR: an automated platform for complex genome-wide CRISPR gRNA design and validation. BMC Bioinformatics 23, 74 (2022). https://doi.org/10.1186/s12859-022-04593-2.
Table of Contents
CROPSR does not require a separate Python environment for dependency management, as there are few dependencies and the code is updated to their current versions. Any required changes will be made to maintain compatibility with current versions of these dependencies. CROPSR is intended to be used on Python 3.7 or newer (newest stable recommended).
- Python version: 3.7 or newer
- Python libraries:
- pandas
- argparse
- If you have both Python 2.7 and Python 3 installed (e.g. Ubuntu 18.10 or older, MacOS Catalina or older), use pip3 to install the libraries to the correct path for Python 3:
$ pip3 install pandas argparse
- If you have only Python 3 installed, or Python 3 is your default (e.g. Ubuntu 20.04 or newer, MacOS Big Sur or newer), the default instalation with :
$ pip install pandas argparse
To perform a full genome analysis, the following two files are required:
- Fasta file containing whole genome sequence
- GFF file containing genome functional annotation
Imoprtant note for genomes downloaded from Phytozome:
If your genome comes from Phytozome, please make sure to also download the annotation_info.txt file. This is important because Phytozome GFF files contain a reference for a location within a separate file, so their genome browser can display the functional annotation as a separate layer. Without this file, CROPSR will output a database with no functional annotation when processing a Phytozome genome.
CROPSR was developed as a CLI software, and requires a basic understanding of bash (or equivalent).
-
Download the project folder.
$ git clone [email protected]:cabbi-bio/CROPSR.git
-
Navigate to the project folder.
$ cd CROPSR
-
Place all input files on a folder (This does not need to be the CROPSR folder, as long as all input files are in the same folder. This is especially important for genomes downloaded from Phytozome).
-
Running CROPSR (if Python 3 is set as your default
python
path, replace where it sayspython3
): You can get the CROPSR help prompt by enteringpython3 CROPSR.py -h
will return the list of arguments below:$ python3 CROPSR.py -h
usage: CROPSR.py [-h] -f F -g G -p [-o O] [-l] [-L] [--cas9] [-v] optional arguments: -h, --help show this help message and exit -f , --fasta F [required] path to input file in FASTA format -g , --gff G path to input file in GFF format -p , --phytozome path to input annotation info file in TXT format, default = None -o , --output O path to output file, default = data.csv -l , --length length of the gRNA sequence, default = 20 -L , --flanking length of flanking region for verification, default = 200 --cas9 specifies that design will be made for the Cas9 CRISPR system -v, --verbose prints visual indicators for each iteration
Flag Description -h, --help Quits the code and opens the help prompt -f, --fasta Path to the FASTA file ( *.fasta
,*.fa
) containing the genome sequence (always required)-g, --gff Path to the GFF file ( *.gff
,*.gff3
) containing the functional annotation (always required)-p, --phytozome Path to the annotation_info.txt
file containing functional annotation (required for phytozome genomes)-o, --output Path to save the output database file, including file name. The default is data.csv
at the working directory-l, --length Desired length of the gRNA sequence. The default value is 20
, and this should not be changed unless required by a non-standard Cas protein. Changing this value otherwise may cause the experiment to fail-L, --flanking Desired length of flanking region for designing primers for PCR validation. The default value is 200
bases upstream and downstream of the cutsite--cas9 Type of CRISPR system for the experiment. At least one CRISPR system is required. (Currently only Cas9
is available, but other systems may be implemented in a future version)-v, --verbose Enables verbose mode and prints notes at several points of the process. Enable for debugging
After completion, if verbose
is enabled, a prompt will appear to inform the user that The output file has been generated at example_data/cropsr_output.csv
. No temporary files are generated during the analysis.
CROPSR outputs a CSV (comma separated values) file by default. This file type was chosen due to the ease of handling, including importing it into the database manager of the user's preference. An option to output as a JSON following MongoDB formatting is also provided, requiring the pymongo
library as an additional dependency.
Click here for MongoDB dependency instructions
- If you have both Python 2.7 and Python 3 installed (e.g. Ubuntu 18.10 or older, MacOS Catalina or older), use pip3 to install the libraries to the correct path for Python 3:
$ pip3 install pymongo
- If you have only Python 3 installed, or Python 3 is your default (e.g. Ubuntu 20.04 or newer, MacOS Big Sur or newer), the default instalation with :
$ pip install pymongo
Invalid/Unactionable gRNA Sequences
In the case that an invalid/unactionable gRNA sequence is generated from the genome FASTA file (i.e. possibly due to sequencing inaccuracies) the sequence will be stored in the output file with an "on_site_score" of -1, such that these sequences can be referenced by the user if desired, but will be excluded from the candidate gRNA database if the user queries limiting (0 ≤ on_site_score ≤ 1).
An example data set is provided to serve as a tutorial. This data set is comprised of the first chromosome of Saccharomyces cerevisiae, and includes both a FASTA
and GFF
files. The entire process will be described below.
The example data set is provided in a folder named sample_data within the contents of this Git repository. To follow along with this tutorial, no additional data should be required.
The data structure of the repository is represented below:
+-- README.md
+-- LICENCE.md
+-- CROPSR.py
+-- cropsr_functions.py
+-- prmrdsgn2.py
+-- sample_data/
+-- sample_genome.fa
+-- sample_genome.gff
+-- .DS_Store
+-- .gitignore
Click here to preview FASTA file sample_genome.fa
>Chr01
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
TGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCAC
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGG
TCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTcccaaat
attgtataaCTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
AATATTACAGAAAAATCCCCACAAAAATCacctaaacataaaaatattctacttttcaacaataataCATAAACATATTG
GCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAATATTGCAATTTGCTTGAACGGATGCTATTT
CAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGC
AATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCCATTCCCATATGCTAACCGCA
ATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGAC
GATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATgtcaaataattttacgGTAATATAACTTATCAGCGG
CGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCC
CTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGA
GGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGC
ATATTCTATACGGCCCGACGCGACGCgccaaaaaatgaaaaacgAAGCAGCGactcatttttatttaagGACAAAGGTTG
CGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACA
ATAGTGTagaagtttctttcttatgTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGgtctat
aatattaatatacatttatataaTCTACGGTATttatatcatcaaaaaaaagtagtttttttattttattttgttcgtta
attttcaatttctatGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGAT
AGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGC
AATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCAGCAACTTCGACTGGGTAGGTTTCAGTTGGGT
GGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACT
CTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAA
TTTGACCattgtatttgttttgtttgttaGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAA
GGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAG
TTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAA
TCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAA
TGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGacatttcctttttcggtcaaaaagaat
atccGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCG
GAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTT
GTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATAT
CAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTG
AACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTCTT
AAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTTTTCCAAACAATC
TTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACtgccaaatttttcttgctcATTTATAATGATT
GATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGAtgttaaaaaattattattttcttcataaagAAG
CTTTCAAGATATAAGATACGAAATAGGGGTTGATAATTGCATGACAGTAGCTTTAgatcaaaaaggaaagcaTGGAGGGA
AACAGTAAACAGTGAAAATTCTCTTGAGAACCAAAGTAAACCTTCATTGAAGAGCTTCcttaaaaaatttagaaTCTCCC
ATGTCAACGGGTTTCCATACCTCCCCAGCATCATacatcttttttcaaagaaactTCAAATGCCTCTTTTATGCAAGGGG
CAAAATCCTGAAATGACTTAAACTTAGCAGTttcgtcttttttcaaagagaatggttgaagaagaattgtttTGGACGCT
TATTGACAATCTGTTGCATTGATAAAGTACCTACTATCCCAGACTATATTTGTATACAAGTACAAAATTAGGTTTGTTGA
AACAACTTTCCGATCATTGGTGCCCGTATCTGATGTTTTTTTAGTAATTTCTTTGTAAATACAGGGAGTTGTTTCGAAAG
CTTATGAGAAAAATACATGAATGACAGGTAAAAATATTGGCTCGAAAAAGAGGacaaaaagagaaatcaTAAATGAGTAA
ACCCACTTGCTGGACATTATCCAGTAAAGGCTTGGTAGTAACCATAATATTACCCAGGTACGAAACGCTAAGAACTTGAA
AGACTCATAAAACTTCCAGGTTAAgctatttttgaaaatattctgaGGTAAAAGCCATTAAGGTCCAGATAACCAAGGGA
...
Click here to preview GFF file sample_genome.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-3-1
##sequence-region Chr01 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
Chr01 RefSeq region 1 230218 . + . ID=Chr01:1..230218;Dbxref=taxon:559292;Name=I;chromosome=I;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=S288C
Chr01 RefSeq telomere 1 801 . - . ID=id-Chr01:1..801;Dbxref=SGD:S000028862;Note=TEL01L%3B Telomeric region on the left arm of Chromosome I%3B composed of an X element core sequence%2C X element combinatorial repeats%2C and a short terminal stretch of telomeric repeats;gbkey=telomere
Chr01 RefSeq origin_of_replication 707 776 . + . ID=id-Chr01:707..776;Dbxref=SGD:S000121252;Note=ARS102%3B Autonomously Replicating Sequence;gbkey=rep_origin
Chr01 RefSeq gene 1807 2169 . - . ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;end_range=2169,.;gbkey=Gene;gene=PAU8;gene_biotype=protein_coding;locus_tag=YAL068C;partial=true;start_range=.,1807
Chr01 RefSeq mRNA 1807 2169 . - . ID=rna-NM_001180043.1;Parent=gene-YAL068C;Dbxref=GeneID:851229,Genbank:NM_001180043.1;Name=NM_001180043.1;end_range=2169,.;gbkey=mRNA;gene=PAU8;locus_tag=YAL068C;partial=true;product=seripauperin PAU8;start_range=.,1807;transcript_id=NM_001180043.1
Chr01 RefSeq exon 1807 2169 . - . ID=exon-NM_001180043.1-1;Parent=rna-NM_001180043.1;Dbxref=GeneID:851229,Genbank:NM_001180043.1;end_range=2169,.;gbkey=mRNA;gene=PAU8;locus_tag=YAL068C;partial=true;product=seripauperin PAU8;start_range=.,1807;transcript_id=NM_001180043.1
Chr01 RefSeq CDS 1807 2169 . - 0 ID=cds-NP_009332.1;Parent=rna-NM_001180043.1;Dbxref=SGD:S000002142,GeneID:851229,Genbank:NP_009332.1;Name=NP_009332.1;Note=hypothetical protein%3B member of the seripauperin multigene family encoded mainly in subtelomeric regions;experiment=EXISTENCE:mutant phenotype:GO:0030437 ascospore formation [PMID:12586695],EXISTENCE:mutant phenotype:GO:0045944 positive regulation of transcription by RNA polymerase II [PMID:12586695];gbkey=CDS;gene=PAU8;locus_tag=YAL068C;product=seripauperin PAU8;protein_id=NP_009332.1
Chr01 RefSeq gene 2480 2707 . + . ID=gene-YAL067W-A;Dbxref=GeneID:1466426;Name=YAL067W-A;end_range=2707,.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YAL067W-A;partial=true;start_range=.,2480
Chr01 RefSeq mRNA 2480 2707 . + . ID=rna-NM_001184582.1;Parent=gene-YAL067W-A;Dbxref=GeneID:1466426,Genbank:NM_001184582.1;Name=NM_001184582.1;end_range=2707,.;gbkey=mRNA;locus_tag=YAL067W-A;partial=true;product=uncharacterized protein;start_range=.,2480;transcript_id=NM_001184582.1
Chr01 RefSeq exon 2480 2707 . + . ID=exon-NM_001184582.1-1;Parent=rna-NM_001184582.1;Dbxref=GeneID:1466426,Genbank:NM_001184582.1;end_range=2707,.;gbkey=mRNA;locus_tag=YAL067W-A;partial=true;product=uncharacterized protein;start_range=.,2480;transcript_id=NM_001184582.1
Chr01 RefSeq CDS 2480 2707 . + 0 ID=cds-NP_878038.1;Parent=rna-NM_001184582.1;Dbxref=SGD:S000028593,GeneID:1466426,Genbank:NP_878038.1;Name=NP_878038.1;Note=hypothetical protein%3B identified by gene-trapping%2C microarray-based expression analysis%2C and genome-wide homology searching;gbkey=CDS;locus_tag=YAL067W-A;product=uncharacterized protein;protein_id=NP_878038.1
Chr01 RefSeq gene 7235 9016 . - . ID=gene-YAL067C;Dbxref=GeneID:851230;Name=SEO1;end_range=9016,.;gbkey=Gene;gene=SEO1;gene_biotype=protein_coding;locus_tag=YAL067C;partial=true;start_range=.,7235
...
-
Navigate to the project folder.
$ cd CROPSR
-
Run CROPSR with the sample data (code is available below).
$ python3 CROPSR.py -f sample_data/sample_genome.fa -g sample_data/sample_genome.gff -o sample_data/sample_genome_output.csv --cas9 -v
Note that the
verbose
flag was left on. This will cause the terminal to print notifications during the process, however, it means you will not be able to utilize the terminal window until it is finished. If you close the terminal window while the process is running, it will cause an interruption.Click here to learn how to run this process in the background
$ python3 CROPSR.py -f sample_data/sample_genome.fa -g sample_data/sample_genome.gff -o sample_data/sample_genome_output.csv --cas9 &
In this variation, the
verbose
flag was removed, and a&
was added at the end of the command. This will free your terminal window to perform other tasks or be closed. This is the recommended approach when running real data, as the process may take more than a day to finish. Make sure the computer remains powered on for the entirety of the process. -
While CROPSR is running (with the
verbose
flag active), you should see the following appear in your terminal:################################################################################ ## ## ## ## ## .o88b. d8888b. .d88b. d8888b. .d8888. d8888b. ## ## d8P Y8 88 `8D .8P Y8. 88 `8D 88' YP 88 `8D ## ## 8P 88oobY' 88 88 88oodD' `8bo. 88oobY' ## ## 8b 88`8b 88 88 88ººº `Y8b. 88`8b ## ## Y8b d8 88 `88. `8b d8' 88 db 8D 88 `88. ## ## `Y88P' 88 YD `Y88P' 88 `8888Y' 88 YD ## ## ## ## ## ################################################################################ U.S. Dept. of Energy's Center for Advanced Bioenergy and Bioproducts Innovation University of Illinois at Urbana-Champaign You are currently utilizing the following settings: CROPSR version: 1.11b Path to genome file in FASTA format: sample_data/sample_genome.fa Path to output file: sample_data/output.csv Length of the gRNA sequence: 20 Length of flanking region for verification: 200 Number of available CPUs: 12 Path to annotation file in GFF format: /sample_data/sample_genome.gff3 Path to annotation_info file in TXT format: None Designing for CRISPR system: Streptococcus pyogenes Cas9 True Genome file sample_data/sample_genome.fa successfully imported formatting genome Genome file sample_data/sample_genome.fa successfully formatted The genome was successfully converted to a dictionary Annotation file sample_data/sample_genome.gff successfully imported Annotation database successfully generated Initiating PAM site detection. Please wait, this may take a while... 17314 Cas9 PAM sites were found on Chr01 The output file has been generated at sample_data/output.csv
-
After the process is complete, you should have access to the generated output file in
CSV
format.The folder structure should be similar to what is represented below:
+-- README.md +-- LICENCE.md +-- CROPSR.py +-- cropsr_functions.py +-- prmrdsgn2.py +-- sample_data/ +-- sample_genome.fa +-- sample_genome.gff +-- sample_genome_output.csv +-- .DS_Store +-- .gitignore
Click here for a preview of sample_genome_output.csv
crispr_id,crispr_sys,sequence,long_sequence,chromosome,start_pos,end_pos,cutsite,strand,on_site_score,features A01NW7FGPN,cas9,GGUUAGAUUAGGGCUGUGUU,GCCAGGGUUAGAUUAGGGCUGUGUUAGGGU,Chr01,77,97,94,+,0.0388536288320188,,completed A01QLYYDXZ,cas9,GUGCGUACGUAAAAUCAGUA,UCCGUGUGCGUACGUAAAAUCAGUAUACAA,Chr01,411,431,428,+,0.5402860690510768,,completed A01RVDM36X,cas9,GGAGUGAAGUGGAAUCUGAG,GCCAUGGAGUGAAGUGGAAUCUGAGAGUAG,Chr01,471,491,488,+,0.6091206773441749,,completed A011BCL3O8,cas9,GCAUAAUGAUGUGAGUGCAU,GCCGUGCAUAAUGAUGUGAGUGCAUUUGGU,Chr01,521,541,538,+,0.05499563386100348,,completed A01J39N6WJ,cas9,UGAGGCAAGUGCCGUGCAUA,ACCGCUGAGGCAAGUGCCGUGCAUAAUGAU,Chr01,536,556,553,+,0.1199299810564199,,completed A01AG1OHUM,cas9,AUGAGAUAUAGAUAUCAAAA,GCCGAAUGAGAUAUAGAUAUCAAAAUGUGG,Chr01,605,625,622,+,0.7178898545540503,,completed A01HBGMKT6,cas9,CGAAUGAGAUAUAGAUAUCA,ACCGCCGAAUGAGAUAUAGAUAUCAAAAUG,Chr01,608,628,625,+,0.16003821718975292,,completed A01LPJZIOI,cas9,UAUGUUUAUGataataacaa,GCCAAUAUGUUUAUGataataacaactttt,Chr01,777,797,794,+,0.8846303027545066,,completed A019RQ1MNC,cas9,AAGCCAAUAUGUUUAUGata,ACCACAAGCCAAUAUGUUUAUGataataac,Chr01,784,804,801,+,0.2709374516526445,,completed ...
This work was funded by the DOE Center for Advanced Bioenergy and Bioproducts Innovation (U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under Award Number DE-SC0018420). Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the U.S. Department of Energy.