Home
Welcome to the Unipept Database wiki. This repository contains all code that orchestrates the construction and structure of the Unipept Database, a peptide-centric database that is derived from the UniProtKB resource and that ultimately powers the Unipept metaproteomics analysis platform (see https://unipept.ugent.be).
The construction of the Unipept Database is performed by invoking the `build_database.sh` script. This script resides in the `scripts` folder of this repository and can be started from the command line on a server, on your local machine, or in a Docker container.
Example implementations for such a Docker container can be found in this repository.
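As a minimal sketch, running the script could look like this (the repository URL and the absence of arguments are assumptions; consult the script itself for the exact arguments it expects):

```sh
# Hypothetical invocation from the repository root; the exact arguments
# the script expects are documented in the script itself.
git clone https://github.com/unipept/unipept-database.git
cd unipept-database
./scripts/build_database.sh
```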
On completion, the `build_database.sh` script will produce a list of compressed TSV files that contain all data that subsequently needs to be fed into a relational database management system (such as MySQL or PostgreSQL).
All information on this wiki serves as an extensive reference for the output format of each of these files, and how the database construction process works internally.
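As an illustration of that last step, a gzipped TSV file could be streamed into MySQL as follows (a minimal sketch, assuming a database `unipept` with a matching `taxons` table already exists and that the server permits `LOAD DATA LOCAL INFILE`):

```sh
# Hypothetical sketch: stream a compressed TSV file straight into MySQL.
# Assumes a 'unipept' database with a matching 'taxons' table, and a server
# that allows LOAD DATA LOCAL INFILE.
zcat taxons.tsv.gz | mysql --local-infile=1 unipept -e \
  "LOAD DATA LOCAL INFILE '/dev/stdin' INTO TABLE taxons"
```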
The construction process for the Unipept Database requires and produces a lot of different files, which fall into two categories: intermediate files and final output files.
Since `build_database.sh` is a shell script that performs a very complex task, we have developed a set of helper scripts (written in either Java or JavaScript) that are invoked by the main script and that each have a very specific function. Below you can find a list of all of the helper scripts (which reside in the `scripts/helper_scripts` folder), the input they require, and the output they produce; a sketch of how such a helper is typically invoked follows the list.
- `LineagesSequencesTaxons2LCAs.jar`
- `NamesNodes2TaxonsLineages.jar`
- `TaxonsUniprots2Tables.jar`
- `XmlToTabConverter.jar`
- `FunctionalAnalysisPeptides.js`
- `TaxaByChunk.js`
- `WriteToChunk.js`
- `filter_taxa.sh`
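As a sketch, the main script might wrap one of these JARs as follows (the positional arguments shown here are hypothetical placeholders, not the JAR's actual interface; the real invocation lives in `build_database.sh`):

```sh
# Hypothetical sketch: invoke a helper JAR with the configured memory limit.
# The placeholder arguments names.dmp and nodes.dmp are assumptions; the
# actual arguments of each JAR are defined in build_database.sh itself.
java -Xmx"$JAVA_MEM" -jar scripts/helper_scripts/NamesNodes2TaxonsLineages.jar \
  names.dmp nodes.dmp
```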
The most important script in this repository, `build_database.sh`, orchestrates the complete database construction process and consists of a series of complex steps, which are all described in detail below.
- `PEPTIDE_MIN_LENGTH`: The minimum length (inclusive) for tryptic peptides.
- `PEPTIDE_MAX_LENGTH`: The maximum length (inclusive) for tryptic peptides.
- `TABDIR`: Folder in which the final TSV files should be stored.
- `INTDIR`: Folder in which intermediate TSV files should be stored (these are large, and will be written once but read multiple times).
- `JAVA_MEM`: The maximum amount of memory that a single Java process is allowed to use. Note that up to two Java processes can be executed simultaneously.
- `CMD_SORT`: The particular Unix `sort` command that should be used (including relevant options).
- `CMD_GZIP`: The particular pipe compression command that should be used (including relevant options).
- `ENTREZ_BATCH_SIZE`: The size of the requests that are used to communicate with Entrez.
- `TAXON_URL`: URL of an NCBI taxon dump that adheres to the file format described here.
- `EC_CLASS_URL`: URL of a file listing all EC numbers and their associated names. Must adhere to the file format described here.
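As a sketch, a configuration using these variables might look like this (all values below are illustrative assumptions, not defaults from the repository):

```sh
# Hypothetical configuration; adjust every value to your own environment.
PEPTIDE_MIN_LENGTH=5
PEPTIDE_MAX_LENGTH=50
TABDIR="./data/tables"               # final TSV files
INTDIR="./data/intermediate"         # large intermediate TSV files
JAVA_MEM="6g"                        # per Java process; two may run at once
CMD_SORT="sort --parallel=4 -S 2G"   # Unix sort with relevant options
CMD_GZIP="gzip -"                    # any pipe compression command works
ENTREZ_BATCH_SIZE=1000
TAXON_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip"
EC_CLASS_URL="https://ftp.expasy.org/databases/enzyme/enzclass.txt"
```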
- URL used for downloading the NCBI taxonomy ZIP dump.
Downloads a dump containing all NCBI taxa identifiers (including their associated names and ranks) and converts these into two final output tables (`taxons.tsv.gz` and `lineages.tsv.gz`); a sketch of the download step follows the output list below.
- `taxons.tsv.gz` (final output file)
- `lineages.tsv.gz` (final output file)
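A minimal sketch of the download step (assuming `TAXON_URL` points to an NCBI taxdump ZIP such as https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip):

```sh
# Hypothetical sketch: fetch the NCBI taxonomy dump and extract the two
# files that are typically needed to build the taxon and lineage tables.
curl -sSL "$TAXON_URL" -o taxdmp.zip
unzip -o taxdmp.zip names.dmp nodes.dmp
```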
This function checks whether a valid Unipept index already exists for each of the provided database types (and URLs). If this is not the case, a new index is created for each new database.
The function checks the cache ETag that is present on the database source URL in order to detect whether the current version of the database is outdated.
For each of the input database types (and URLs), the function will create a matching, reusable Unipept Database Index folder.
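A minimal sketch of such an ETag-based freshness check (the `DB_URL` and `INDEX_DIR` variables are hypothetical placeholders):

```sh
# Hypothetical sketch: compare the server's current ETag with a cached one
# to decide whether the existing index is still up to date.
remote_etag=$(curl -sI "$DB_URL" | grep -i '^etag:' | awk '{print $2}' | tr -d '\r')
cached_etag=$(cat "$INDEX_DIR/.etag" 2>/dev/null || true)
if [ "$remote_etag" != "$cached_etag" ]; then
  echo "Cached index is outdated; a new index will be built."
fi
```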