Build Database (sh)
This is the most important script in this repository: it orchestrates the complete database construction process. Everything needed to understand what the script consists of, and what each step or function does, is explained in detail below.
- `DB_TYPES`: An array containing all database types that should be processed by this script.
- `DB_SOURCES`: An array containing all database URLs that should be processed by this script. The $i$'th URL in this array should correspond with the $i$'th database type provided in the `DB_TYPES` array.
- `PEPTIDE_MIN_LENGTH`: The minimum length (inclusive) for tryptic peptides.
- `PEPTIDE_MAX_LENGTH`: The maximum length (inclusive) for tryptic peptides.
- `OUTPUT_DIR`: Folder in which the final TSV-files should be stored.
- `INTDIR`: Folder in which intermediate TSV-files should be stored (these are large, will be written once and read multiple times).
- `TEMP_DIR`: Folder in which temporary files, required by this script, can be stored.
- `JAVA_MEM`: The maximum amount of memory a single Java process may use. Note that up to two Java processes can run simultaneously.
- `CMD_SORT`: The particular Unix `sort` command that should be used (including relevant options).
- `CMD_GZIP`: The particular pipe compression command that should be used (including relevant options).
- `ENTREZ_BATCH_SIZE`: The size of requests that should be used to communicate with Entrez.
- `TAXON_URL`: URL of an NCBI taxon dump that adheres to the file format described here.
- `EC_CLASS_URL`: URL of a file listing EC-numbers on the `class`, `subclass` and `subsubclass` level (including their associated names). Must adhere to the file format described here.
- `EC_NUMBER_URL`: URL of a file listing EC-numbers on the deepest level of the ontology. Must adhere to the file format described here.
- `GO_TERM_URL`: URL of a file with all GO-terms.
- `INTERPRO_URL`: URL of a file with all InterPro-entries.
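As an illustration, some of the variables above could be set as shown here. Every value (database type names, URLs, sizes, sort and gzip options) is a hypothetical placeholder and not the script's actual default. The helper `check_db_config` is also an assumption, added only to show the pairing rule between `DB_TYPES` and `DB_SOURCES`.

```shell
#!/usr/bin/env bash
# Hypothetical example values; adjust them to your own setup.
DB_TYPES=("swissprot" "trembl")
DB_SOURCES=(
    "https://example.org/uniprot_sprot.xml.gz"
    "https://example.org/uniprot_trembl.xml.gz"
)
PEPTIDE_MIN_LENGTH=5
PEPTIDE_MAX_LENGTH=50
JAVA_MEM="2g"
CMD_SORT="sort --buffer-size=80% --parallel=4"
CMD_GZIP="gzip -"

# Hypothetical helper: the i'th URL must belong to the i'th database type,
# so both arrays need the same number of elements.
check_db_config() {
    if [ "${#DB_TYPES[@]}" -ne "${#DB_SOURCES[@]}" ]; then
        echo "DB_TYPES and DB_SOURCES differ in length" >&2
        return 1
    fi
    local i
    for i in "${!DB_TYPES[@]}"; do
        echo "${DB_TYPES[$i]} <- ${DB_SOURCES[$i]}"
    done
}
```

Calling `check_db_config` before starting a build would catch a mismatch between the two arrays early, before any download starts.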
None
Downloads a dump containing all NCBI taxon identifiers (including their associated names and ranks) and converts it into two final output tables (`taxons.tsv.gz` and `lineages.tsv.gz`).
- `taxons.tsv.gz` (final output file)
- `lineages.tsv.gz` (final output file)
None
This function checks whether a valid Unipept index already exists for each of the provided database types (and URLs). If not, a new index is created for each new database.
To detect whether the current version of a database is outdated, the function checks the ETag that is present on the database source URL.
For each of the input database types (and URLs), the function will create a matching reusable Unipept Database Index folder.
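The ETag comparison described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names `extract_etag` and `index_is_outdated` and the idea of keeping the last seen ETag in a small cache file are assumptions. To keep the example runnable offline, the HTTP headers are read from stdin; in practice they could come from something like `curl --head --silent --location "$url"`.

```shell
# Extract the ETag value from HTTP response headers read on stdin.
# Header names are case-insensitive, so the match is done on a lowercased copy.
extract_etag() {
    tr -d '\r' | awk 'tolower($0) ~ /^etag:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Succeeds (exit status 0) when no cached ETag exists or when it differs
# from the current one, i.e. when the index should be rebuilt.
index_is_outdated() {
    local current_etag="$1" cache_file="$2"
    [ ! -f "$cache_file" ] && return 0
    [ "$(cat "$cache_file")" != "$current_etag" ]
}
```

After a successful rebuild, the current ETag would be written to the cache file so that the next run can skip an unchanged database.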
This function simply chains `filter_sources_by_taxa` and `create_most_tables` together.
This function reproduces on `stdout` all proteins from the input Unipept Database Index that are associated with one of the taxa listed in the `TAXA` variable.
Note that proteins associated with a taxon that is a child of one of the taxa in `TAXA` are also kept in the output.
- `stdout`: Produces a TSV on `stdout` containing one protein per line, according to the format described here.
- `stdin`: Reads a TSV from `stdin` containing one protein per line, according to the format described here.
- `taxons.tsv.gz` (final output file)
This function reads in all protein data from `stdin` and passes it on to the helper program listed above.
After parsing the TSV from the input, `TaxonsUniprots2Tables.jar` generates a collection of compressed TSV-files that are used further along in this script.
The most important of these is the `peptides.tsv.gz` file, which contains all tryptic peptides resulting from the in-silico tryptic digest performed by `TaxonsUniprots2Tables.jar`.
- `uniprot_entries.tsv.gz` (final output file)
- `ec_cross_references.tsv.gz` (final output file)
- `go_cross_references.tsv.gz` (final output file)
- `interpro_cross_references.tsv.gz` (final output file)
- `peptides.tsv.gz` (intermediary file)
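To give an idea of what an in-silico tryptic digest involves, the sketch below splits protein sequences with the commonly used trypsin rule: cleave after K or R, unless the next residue is P. Whether `TaxonsUniprots2Tables.jar` applies exactly this rule is an assumption, since its internals are not described here. The function name `tryptic_digest` is hypothetical, and its two arguments play the role of `PEPTIDE_MIN_LENGTH` and `PEPTIDE_MAX_LENGTH` (both inclusive).

```shell
# Read one protein sequence per line on stdin and print its tryptic peptides
# whose length lies between min_len and max_len (both inclusive).
tryptic_digest() {
    local min_len="$1" max_len="$2"
    awk -v min="$min_len" -v max="$max_len" '
    {
        seq = $0
        n = length(seq)
        start = 1
        for (i = 1; i <= n; i++) {
            c = substr(seq, i, 1)
            nxt = substr(seq, i + 1, 1)
            # Cleave at the end of the protein, or after K/R not followed by P.
            if (i == n || ((c == "K" || c == "R") && nxt != "P")) {
                len = i - start + 1
                if (len >= min && len <= max) print substr(seq, start, len)
                start = i + 1
            }
        }
    }'
}
```

For example, `printf 'AAKPGRLLLLKMM\n' | tryptic_digest 5 50` keeps the K before P intact and drops the trailing fragment `MM` because it is shorter than the minimum length.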
- `peptides.tsv.gz` (intermediary file)
- `uniprot_entries.tsv.gz` (final output file)
None
This function starts by creating two FIFO files and writing data to them:
- `peptides_eq`: By reading in the intermediary file `peptides.tsv.gz` and extracting only the equalized peptide sequence and the associated UniProt entry ID, we end up with this data for `peptides_eq` (each line consists of the UniProt entry ID and the equalized peptide sequence):
000000000001 GGLSVPGPMGPSGPR
000000000001 GLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPR
000000000001 GPPGPPGK
- `entries_eq`: By reading in the final output file `uniprot_entries.tsv.gz` and extracting only the UniProt entry ID and the associated NCBI taxon ID, we end up with this data for `entries_eq` (each line consists of the UniProt entry ID and the NCBI taxon ID):
000000000001 2546662
000000000002 2546656
000000000003 2591764
000000000004 2546663
000000000005 2587412
- The third step in the implementation of this function joins both of the files generated above by UniProt entry ID (effectively coupling equalized peptide sequences and NCBI taxon IDs) and then sorts the output by peptide sequence. The final result is written to the intermediary file `aa_sequence_taxon_equalized.tsv.gz`:
AAAAA 1063
AAAAA 243160
AAAAA 271848
AAAAA 272560
AAAAA 31716
AAAAA 32037
- `aa_sequence_taxon_equalized.tsv.gz` (intermediary file)
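The FIFO-based join described for this function can be sketched as below. This is an illustration under assumptions, not the script's own code: the function name `join_pepts_and_entries` is hypothetical, and for simplicity the two inputs are plain two-column TSV files already sorted on the UniProt entry ID, whereas the real script derives them from `peptides.tsv.gz` and `uniprot_entries.tsv.gz`.

```shell
# Join peptides and entries on the UniProt entry ID (first column of both
# inputs) and sort the result by peptide sequence. Both inputs must already
# be sorted on the entry ID, as `join` requires.
join_pepts_and_entries() {
    local peptides_file="$1"   # lines: <entry ID> TAB <peptide sequence>
    local entries_file="$2"    # lines: <entry ID> TAB <NCBI taxon ID>
    local tab fifo_dir
    tab="$(printf '\t')"
    fifo_dir="$(mktemp -d)"
    mkfifo "$fifo_dir/peptides_eq" "$fifo_dir/entries_eq"

    # The writers run in the background and block until `join` opens the FIFOs.
    cat "$peptides_file" > "$fifo_dir/peptides_eq" &
    cat "$entries_file" > "$fifo_dir/entries_eq" &

    # -o 1.2,2.2 keeps only the peptide sequence and the taxon ID.
    LC_ALL=C join -t "$tab" -o 1.2,2.2 \
        "$fifo_dir/peptides_eq" "$fifo_dir/entries_eq" | LC_ALL=C sort
    wait
    rm -rf "$fifo_dir"
}
```

Using FIFOs means neither extracted column set has to be written to disk in full; only the joined and sorted result is stored.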
- `peptides.tsv.gz` (intermediary file)
- `uniprot_entries.tsv.gz` (final output file)
None
This function is very similar to `join_equalized_pepts_and_entries`, but joins the original peptide sequences (where I and L are considered to be different), instead of the equalized peptide sequences, with NCBI taxon IDs.
- `aa_sequence_taxon_original.tsv.gz` (intermediary file)
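The difference between original and equalized peptides comes down to whether isoleucine (I) and leucine (L) are distinguished. Assuming that equalization replaces every I with L (which matches the sample data on this page, where two peptides differing only in I versus L share one equalized form), the relationship can be illustrated with a hypothetical helper:

```shell
# Hypothetical helper: equalize peptides read from stdin by replacing
# every isoleucine (I) with leucine (L).
equalize_peptides() {
    tr 'I' 'L'
}
```

Two original peptides that differ only in I versus L therefore collapse into a single equalized peptide, which is why the equalized table can contain fewer distinct sequences than the original one.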
- `aa_sequence_taxon_equalized.tsv.gz` (intermediary file)
- `aa_sequence_taxon_original.tsv.gz` (intermediary file)
None
This function starts by creating two FIFO files and writing data to them:
- `equalized`: By reading in the intermediary file `aa_sequence_taxon_equalized.tsv.gz`, this function extracts the first column (containing the peptide sequences), sorts it and writes only the unique equalized peptides to `equalized`:
AAAAA
AAAAAA
AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
AAAAAAAAAAAAAAAAGATCLER
AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
AAAAAAAAAAAAAAAASAGGK
AAAAAAAAAAAAAAAGAAGK
AAAAAAAAAAAAAAAGAGAGAK
- `original`: By reading in the intermediary file `aa_sequence_taxon_original.tsv.gz`, this function extracts the first column (containing the peptide sequences), sorts it and writes only the unique original peptides to `original`.
- Finally, both temporary FIFO files are combined and only the unique sequences (over the two files) are kept, alphabetically sorted, numbered and written to the intermediary file `sequences.tsv.gz`:
1 AAAAA
2 AAAAAA
3 AAAAAAAAA
4 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
5 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
6 AAAAAAAAAAAAAAAAGATCLER
7 AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
8 AAAAAAAAAAAAAAAASAGGK
- `sequences.tsv.gz` (intermediary file)
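The deduplication and numbering steps above can be sketched with standard tools. The function names `unique_peptides` and `number_unique_sequences` are hypothetical, and regular files stand in for the two FIFOs; the real script additionally decompresses its inputs and compresses its output.

```shell
# Print the unique values of the first (peptide) column of a TSV on stdin.
unique_peptides() {
    cut -f 1 | LC_ALL=C sort -u
}

# Merge two already sorted lists of unique peptides, drop duplicates across
# both lists, and prefix every remaining sequence with a 1-based ID.
number_unique_sequences() {
    local equalized_file="$1" original_file="$2"
    LC_ALL=C sort -m -u "$equalized_file" "$original_file" \
        | awk -v OFS='\t' '{ print NR, $0 }'
}
```

Because both inputs are already sorted, `sort -m` only has to merge them, which avoids sorting the combined list from scratch.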