Build Database (sh)

Pieter Verschaffelt edited this page Mar 22, 2023 · 8 revisions

This is the most important script in this repository: it orchestrates the complete database construction process. Everything needed to understand what this script consists of, and what each step or function does, is explained in detail below.

Variables

Sources

  • DB_TYPES: An array containing all database types that should be processed by this script.
  • DB_SOURCES: An array containing all database URLs that should be processed by this script. The $i$'th URL in this array should correspond with the $i$'th database type provided in the DB_TYPES array.
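The index-wise pairing of these two arrays can be sketched as follows. This is only an illustration, assuming Bash arrays: the database type names, the `example.org` URLs, and the `pair_sources` helper are hypothetical and do not come from the script itself.

```shell
#!/usr/bin/env bash
# Hypothetical values, for illustration only.
DB_TYPES=("swissprot" "trembl")
DB_SOURCES=("https://example.org/swissprot.xml.gz" "https://example.org/trembl.xml.gz")

# Hypothetical helper: prints one "type<TAB>url" line per database,
# pairing the i'th type with the i'th URL.
pair_sources() {
    if [ "${#DB_TYPES[@]}" -ne "${#DB_SOURCES[@]}" ]; then
        echo "DB_TYPES and DB_SOURCES must have the same length" >&2
        return 1
    fi
    local i
    for i in "${!DB_TYPES[@]}"; do
        printf '%s\t%s\n' "${DB_TYPES[$i]}" "${DB_SOURCES[$i]}"
    done
}

pair_sources
```

Checking the lengths up front makes a mismatch between the two arrays fail early instead of silently pairing a type with an empty URL.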

Tryptic digest

  • PEPTIDE_MIN_LENGTH: The minimum length (inclusive) for tryptic peptides.
  • PEPTIDE_MAX_LENGTH: The maximum length (inclusive) for tryptic peptides.
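To make the inclusive bounds concrete, here is a small sketch of a length filter. In the actual pipeline the digest and the filtering happen inside the Java helper program. The `filter_by_length` function and the values 5 and 50 are assumptions chosen for illustration.

```shell
# Hypothetical bounds, for illustration only.
PEPTIDE_MIN_LENGTH=5
PEPTIDE_MAX_LENGTH=50

# Hypothetical helper: keeps only the lines (one peptide per line) whose
# length lies within [PEPTIDE_MIN_LENGTH, PEPTIDE_MAX_LENGTH], bounds included.
filter_by_length() {
    awk -v min="$PEPTIDE_MIN_LENGTH" -v max="$PEPTIDE_MAX_LENGTH" \
        'length($0) >= min && length($0) <= max'
}

# A peptide of exactly PEPTIDE_MIN_LENGTH residues is kept; a shorter one is dropped.
printf 'AAAA\nAAAAA\nGPPGPPGK\n' | filter_by_length
```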

File storage locations

  • OUTPUT_DIR: Folder in which the final TSV-files should be stored.
  • INTDIR: Folder in which intermediate TSV-files should be stored (these are large, will be written once and read multiple times).
  • TEMP_DIR: Folder in which temporary files, required by this script, can be stored.

System and memory configuration

  • JAVA_MEM: The maximum amount of memory a single Java process is allowed to use. Note that up to two Java processes can be executed simultaneously.
  • CMD_SORT: The particular Unix sort command that should be used (including relevant options).
  • CMD_GZIP: The particular pipe compression command that should be used (including relevant options).
  • ENTREZ_BATCH_SIZE: The size of requests that should be used to communicate with Entrez.

Resources

  • TAXON_URL: URL of an NCBI taxon dump that adheres to the file format described here.
  • EC_CLASS_URL: URL of a file with a listing of EC-numbers on the class, subclass and subsubclass level (including their associated name). Must adhere to the file format described here.
  • EC_NUMBER_URL: URL of a file with a listing of EC-numbers on the deepest level of the ontology. Must adhere to the file format described here.
  • GO_TERM_URL: URL of a file with all GO-terms.
  • INTERPRO_URL: URL of a file with all InterPro-entries.

Functions

create_taxon_tables

Input files

Helper scripts

None

Implementation

Downloads a dump containing all NCBI taxa identifiers (including associated names and ranks) and converts these to two final output tables (taxons.tsv.gz and lineages.tsv.gz).

Output

download_and_convert_all_sources

Input files

None

Helper scripts

Implementation

This function will check if a valid Unipept index already exists for each of the provided database types (and URLs). If this is not the case, a new index will be created for each new database.

The function checks the ETag (E-Tag) header that is returned for the database source URL in order to detect whether the cached version of the database is outdated.
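An ETag-based freshness check could look roughly like the sketch below. It is not the script's actual implementation: the function names `extract_etag` and `index_outdated`, and the idea of storing the previous ETag in a file next to the index, are assumptions made for this example.

```shell
# Hypothetical helper: reads HTTP response headers on stdin and prints the
# value of the first ETag header (header names are case-insensitive).
extract_etag() {
    tr -d '\r' | awk 'tolower($1) == "etag:" { print $2; exit }'
}

# Hypothetical helper: succeeds (exit status 0) when the index should be
# rebuilt, i.e. no ETag was stored yet or the stored one differs from the
# remote one.
index_outdated() {
    stored_etag_file="$1"
    remote_etag="$2"
    [ ! -f "$stored_etag_file" ] && return 0
    [ "$(cat "$stored_etag_file")" != "$remote_etag" ]
}

# Possible usage (performs a network request, so it is left commented out;
# $url and $index_dir are placeholders):
# remote="$(curl --silent --head --location "$url" | extract_etag)"
# if index_outdated "$index_dir/etag" "$remote"; then echo "rebuild needed"; fi
```

Comparing ETags only requires a header request, so an unchanged database never has to be downloaded again just to find out that it is unchanged.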

Output

For each of the input database types (and URLs), the function will create a matching reusable Unipept Database Index folder.

create_tables_and_filter

This function simply chains filter_sources_by_taxa and create_most_tables together.

filter_sources_by_taxa

Input files

Helper scripts

Implementation

This function writes to stdout all proteins from the input Unipept Database Index that are associated with one of the taxa listed in the TAXA variable. Note that proteins associated with a taxon that is a child of one of the taxa in TAXA are also kept in the output.

Output

  • stdout: Produces a TSV on stdout containing one protein per line, according to the format described here.

create_most_tables

Input files

  • stdin: Reads a TSV from stdin containing one protein per line, according to the format described here.
  • taxons.tsv.gz (final output file)

Helper scripts

Implementation

This function reads all protein data from stdin and passes it on to the helper program listed above. After parsing the TSV from the input, TaxonsUniprots2Tables.jar generates a collection of compressed TSV-files that are used further along in this script. The most important of these is the peptides.tsv.gz file, which contains all tryptic peptides that result from the in-silico tryptic digest performed by TaxonsUniprots2Tables.jar.

Output files

join_equalized_pepts_and_entries

Input files

Helper scripts

None

Implementation

This function starts by creating two FIFO files and writing data to them:

  1. peptides_eq: The intermediary file peptides.tsv.gz is read, and only the equalized peptide sequence and the associated UniProt entry ID are extracted. Each line of peptides_eq consists of the UniProt entry ID and the equalized peptide sequence:
000000000001	GGLSVPGPMGPSGPR
000000000001	GLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPR
000000000001	GPPGPPGK
  2. entries_eq: The final output file uniprot_entries.tsv.gz is read, and only the UniProt entry ID and the associated NCBI taxon ID are extracted. Each line of entries_eq consists of the UniProt entry ID and the NCBI taxon ID:
000000000001	2546662
000000000002	2546656
000000000003	2591764
000000000004	2546663
000000000005	2587412
  3. The third step joins both of the files generated above by UniProt entry ID (effectively coupling equalized peptide sequences and NCBI taxon IDs) and then sorts the output by peptide sequence. The final result is written to the intermediary file aa_sequence_taxon_equalized.tsv.gz:
AAAAA	1063
AAAAA	243160
AAAAA	271848
AAAAA	272560
AAAAA	31716
AAAAA	32037
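The three steps above can be sketched with standard tools as follows. This is a minimal illustration, not the script's actual code: the toy rows are modelled on the samples shown above, the temporary directory and the `result` variable are assumptions, and the real function reads its data from the compressed TSV-files instead of `printf`.

```shell
workdir="$(mktemp -d)"
mkfifo "$workdir/peptides_eq" "$workdir/entries_eq"

# Producers run in the background and block until join opens the FIFOs.
# Both inputs are already sorted by UniProt entry ID, as join requires.
printf '000000000001\tGGLSVPGPMGPSGPR\n000000000001\tGPPGPPGK\n000000000002\tGPPGPPGK\n' \
    > "$workdir/peptides_eq" &
printf '000000000001\t2546662\n000000000002\t2546656\n' \
    > "$workdir/entries_eq" &

TAB="$(printf '\t')"
result="$workdir/aa_sequence_taxon_equalized.tsv.gz"

# Join on the first column (entry ID), keep only "peptide<TAB>taxon ID",
# sort the result and compress it.
LC_ALL=C join -t "$TAB" -o 1.2,2.2 "$workdir/peptides_eq" "$workdir/entries_eq" \
    | LC_ALL=C sort \
    | gzip -c > "$result"
wait

gzip -dc "$result"
# Clean up afterwards with: rm -rf "$workdir"
```

Using FIFOs means the extracted columns never have to be written to disk as separate files; `join` consumes them while they are being produced.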

Output files

join_original_pepts_and_entries

Input files

Helper scripts

None

Implementation

This function is very similar to join_equalized_pepts_and_entries, but joins the original peptide sequences (where I and L are considered to be different) with NCBI taxa IDs (instead of the equalized peptide sequences).
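The difference between original and equalized sequences can be illustrated with a one-line transformation. In the actual pipeline the equalized sequences are already present in peptides.tsv.gz. The `equalize` helper below is hypothetical, and it assumes that equalization replaces every I with L, which is consistent with the sample sequences on this page.

```shell
# Hypothetical helper: equalizes peptides read from stdin by replacing
# every isoleucine (I) with leucine (L).
equalize() {
    tr 'I' 'L'
}

# The original peptide ...SAISPGSK and its variant ...SALSPGSK both map
# to the same equalized sequence.
printf 'AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK\n' | equalize
```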

Output files

number_sequences

Input files

Helper scripts

None

Implementation

This function starts by creating two FIFO files and writing data to them:

  1. equalized: The intermediary file aa_sequence_taxon_equalized.tsv.gz is read, and the first column (containing the peptide sequences) is extracted and sorted. Only the unique equalized peptides are written to equalized:
AAAAA
AAAAAA
AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
AAAAAAAAAAAAAAAAGATCLER
AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
AAAAAAAAAAAAAAAASAGGK
AAAAAAAAAAAAAAAGAAGK
AAAAAAAAAAAAAAAGAGAGAK
  2. original: The intermediary file aa_sequence_taxon_original.tsv.gz is read, and the first column (containing the peptide sequences) is extracted and sorted. Only the unique original peptides are written to original.

  3. Finally, both temporary FIFO files are combined. Only the unique sequences (over the two files) are kept, sorted alphabetically, numbered, and written to the intermediary file sequences.tsv.gz:

1	AAAAA
2	AAAAAA
3	AAAAAAAAA
4	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
5	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
6	AAAAAAAAAAAAAAAAGATCLER
7	AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
8	AAAAAAAAAAAAAAAASAGGK
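The merge-and-number step can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the toy sequences, the temporary directory, and the `sequences` variable are made up for the example, and the real function fills the FIFOs from the two aa_sequence_taxon files.

```shell
numdir="$(mktemp -d)"
mkfifo "$numdir/equalized" "$numdir/original"

# Each producer writes an already sorted, duplicate-free list of peptides.
printf 'AAAAA\nAAAAAA\nSALSPGSK\n' > "$numdir/equalized" &
printf 'AAAAA\nSAISPGSK\nSALSPGSK\n' > "$numdir/original" &

sequences="$numdir/sequences.tsv.gz"

# Merge the two sorted inputs, drop duplicates across both, prefix every
# sequence with its line number and compress the result.
LC_ALL=C sort -m -u "$numdir/equalized" "$numdir/original" \
    | awk -v OFS='\t' '{ print NR, $0 }' \
    | gzip -c > "$sequences"
wait

gzip -dc "$sequences"
# Clean up afterwards with: rm -rf "$numdir"
```

Because both inputs are already sorted, `sort -m` only has to merge them, which avoids a full re-sort of the combined data.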

Output files

sequences.tsv.gz (intermediary file)