Pieter Verschaffelt edited this page Mar 21, 2023 · 16 revisions

Welcome to the Unipept Database wiki. This repository contains all code that orchestrates the construction and structure of the Unipept Database, a peptide-centric database derived from the UniProtKB resource, which ultimately powers the Unipept metaproteomics analysis platform (see https://unipept.ugent.be).

Database construction

The construction of the Unipept Database is performed by invoking the build_database.sh script. This script resides in the scripts folder of this repository and can be started from the command line on a server, on your local machine or in a Docker container. Example implementations for such a Docker container can be found in this repository.

On completion, the build_database.sh script produces a set of compressed TSV files containing all data that subsequently needs to be loaded into a relational database management system (such as MySQL or PostgreSQL). This wiki serves as an extensive reference for the output format of each of these files and for how the database construction process works internally.
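
As a quick sanity check, the compressed TSV files can be inspected from the shell before loading them into a database. The file name and columns below are made-up placeholders; the real column layouts are documented on the file-format pages of this wiki.

```shell
# Create a tiny stand-in for one of the generated files (placeholder data):
printf '1\troot\tno rank\n2\tBacteria\tsuperkingdom\n' | gzip > taxons.tsv.gz

# Peek at the first rows and a single column without decompressing to disk:
zcat taxons.tsv.gz | head -n 2
zcat taxons.tsv.gz | cut -f 2
```

Loading such a file into MySQL would typically decompress the stream and feed it to a `LOAD DATA` statement.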

File formats

The construction process for the Unipept Database requires and produces many different files, which fall into two categories:

Helper scripts

Since build_database.sh is a shell script with a very complex task, we have developed a set of helper scripts (written in Java or JavaScript) that are invoked by the main script and that each perform one very specific function. Below you can find a list of all helper scripts (residing in the scripts/helper_scripts folder), together with the input they require and the output they produce.

Overview of build_database.sh

The most important script in this repository is build_database.sh, which orchestrates the complete database construction process. It consists of a series of complex steps, each of which is described in detail below.

Variables

Tryptic digest

  • PEPTIDE_MIN_LENGTH: The minimum length (inclusive) for tryptic peptides.
  • PEPTIDE_MAX_LENGTH: The maximum length (inclusive) for tryptic peptides.
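
To illustrate what these bounds control: a tryptic digest splits a protein after every K or R that is not followed by P, and only peptides whose length falls within [PEPTIDE_MIN_LENGTH, PEPTIDE_MAX_LENGTH] are kept. The awk sketch below is illustrative only (the real digest is performed by a helper script), and the example sequence is arbitrary.

```shell
PEPTIDE_MIN_LENGTH=5
PEPTIDE_MAX_LENGTH=50

# Cleave after K or R unless the next residue is P, then filter on length:
echo "MKWVTFISLLLLFSSAYSRGVFR" \
  | awk -v min="$PEPTIDE_MIN_LENGTH" -v max="$PEPTIDE_MAX_LENGTH" '{
  pep = ""
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    pep = pep c
    nxt = substr($0, i + 1, 1)
    if ((c == "K" || c == "R") && nxt != "P" && nxt != "") {
      if (length(pep) >= min && length(pep) <= max) print pep
      pep = ""
    }
  }
  if (length(pep) >= min && length(pep) <= max) print pep
}'
# Prints only WVTFISLLLLFSSAYSR; the peptides MK and GVFR fall below the
# minimum length and are discarded.
```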

File storage locations

  • TABDIR: Folder in which the final TSV files should be stored.
  • INTDIR: Folder in which intermediate TSV files should be stored (these files are large; they are written once and read multiple times).

System and memory configuration

  • JAVA_MEM: The maximum amount of memory a single Java process is allowed to use. Note that up to two Java processes can run simultaneously.
  • CMD_SORT: The particular Unix sort command that should be used (including relevant options).
  • CMD_GZIP: The particular pipe compression command that should be used (including relevant options).
  • ENTREZ_BATCH_SIZE: The number of records to request per batch when communicating with Entrez.
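
Since CMD_SORT and CMD_GZIP are substituted directly into pipelines, they can be tuned to the machine the build runs on. The values below are illustrative assumptions, not the script's defaults.

```shell
# Hypothetical settings; e.g. pigz or a larger sort buffer could be used instead.
CMD_SORT="sort --parallel=4"
CMD_GZIP="gzip -"

# The variables are expanded unquoted so that their options take effect:
printf 'b\na\nc\n' | $CMD_SORT | $CMD_GZIP > sorted.tsv.gz
zcat sorted.tsv.gz
# Prints a, b and c, one per line.
```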

Resources

  • TAXON_URL: URL of an NCBI taxon dump that adheres to the file format described here.
  • EC_CLASS_URL: URL for a file listing all EC numbers and their associated names. Must adhere to the file format described here.
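
For illustration, these variables could point at the public NCBI and ExPASy locations below. Treat these URLs as examples rather than the script's defaults.

```shell
# Example values (assumptions, not necessarily the defaults used by build_database.sh):
TAXON_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.zip"
EC_CLASS_URL="https://ftp.expasy.org/databases/enzyme/enzclass.txt"
```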

Functions

create_taxon_tables

Input / requirements
  • URL used for downloading the NCBI taxonomy ZIP dump.
Implementation

Downloads a dump containing all NCBI taxa identifiers (including associated names and ranks) and converts these to two final output tables (taxons.tsv.gz and lineages.tsv.gz).
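
The dump follows the NCBI taxdump conventions, in which fields are separated by "<tab>|<tab>" and each line ends in "<tab>|". The actual conversion is performed by a Java helper script; the snippet below only sketches what reading such a file looks like, using a fabricated two-line stand-in for names.dmp.

```shell
# Two fabricated names.dmp lines: taxon id | name | unique name | name class
printf '1\t|\tall\t|\t\t|\tsynonym\t|\n1\t|\troot\t|\t\t|\tscientific name\t|\n' > names.dmp

# Strip the trailing field terminator, split on the "<tab>|<tab>" separator
# and keep only the scientific names:
sed 's/\t|$//' names.dmp \
  | awk -F '\t\\|\t' '$4 == "scientific name" { print $1 "\t" $2 }'
# Prints the single record: 1<tab>root
```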

Output

The taxons.tsv.gz and lineages.tsv.gz tables mentioned above, stored in the TABDIR folder.

download_and_convert_all_sources

Input / requirements
  • The database types and corresponding source URLs that should be indexed.
Implementation

This function checks whether a valid Unipept index already exists for each of the provided database types (and URLs). If not, a new index is created for that database.

The function checks the HTTP ETag reported by the database source URL to detect whether the currently cached version of the database is outdated.
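
A minimal sketch of such an ETag comparison, assuming the index folder stores the ETag of the download it was built from. The folder layout and variable names here are made up, and the remote ETag would in reality be parsed from a HEAD request (e.g. `curl -sI "$SOURCE_URL"`).

```shell
cache_dir="index/swissprot"           # hypothetical index folder
mkdir -p "$cache_dir"
echo '"abc123"' > "$cache_dir/.etag"  # ETag stored when the index was built

remote_etag='"abc123"'                # in reality: parsed from a curl -sI response
if [ "$remote_etag" = "$(cat "$cache_dir/.etag")" ]; then
  echo "index is up to date, reusing it"
else
  echo "source changed, rebuilding index"
fi
# Prints: index is up to date, reusing it
```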

Output

For each of the input database types (and URLs), the function will create a matching reusable Unipept Database Index folder.
