Pieter Verschaffelt edited this page Mar 21, 2023 · 16 revisions

Welcome to the Unipept Database wiki. This repository contains all code that orchestrates the construction and structure of the Unipept Database, a peptide-centric database derived from the UniProtKB resource, which ultimately powers the Unipept metaproteomics analysis platform (see https://unipept.ugent.be).

Database construction

The construction of the Unipept Database is performed by invoking the build_database.sh script. This script resides in the scripts folder of this repository and can be started from the command line on a server, on your local machine or in a Docker container. Example implementations for such a Docker container can be found in this repository.

On completion, the build_database.sh script produces a set of compressed TSV files containing all data that subsequently needs to be loaded into a relational database management system (such as MySQL or PostgreSQL). This wiki serves as an extensive reference for the output format of each of these files and for how the database construction process works internally.
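
As a quick sanity check, the compressed TSV files can be inspected from the shell before loading them into a database. The file name and columns below are made-up placeholders; the real column layouts are documented on the file-format pages of this wiki.

```shell
# Create a tiny stand-in for one of the generated files (placeholder data):
printf '1\troot\tno rank\n2\tBacteria\tsuperkingdom\n' | gzip > taxons.tsv.gz

# Peek at the first rows and a single column without decompressing to disk:
zcat taxons.tsv.gz | head -n 2
zcat taxons.tsv.gz | cut -f 2
```

Loading such a file into MySQL would typically decompress the stream and feed it to a `LOAD DATA` statement.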

File formats

The construction process for the Unipept Database requires and produces many different files, which fall into two categories:

Helper scripts

Since build_database.sh is a shell script with a very complex task, we have developed a set of helper scripts (written in Java or JavaScript) that are invoked by the main script and that each perform one very specific function. Below you can find a list of all helper scripts (residing in the scripts/helper_scripts folder), together with the input they require and the output they produce.

Overview of build_database.sh

The most important script in this repository is build_database.sh, which orchestrates the complete database construction process. It consists of a series of complex steps, each of which is described in detail below.

Variables

Tryptic digest

  • PEPTIDE_MIN_LENGTH: The minimum length (inclusive) for tryptic peptides.
  • PEPTIDE_MAX_LENGTH: The maximum length (inclusive) for tryptic peptides.
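
To illustrate what these bounds control: a tryptic digest splits a protein after every K or R that is not followed by P, and only peptides whose length falls within [PEPTIDE_MIN_LENGTH, PEPTIDE_MAX_LENGTH] are kept. The awk sketch below is illustrative only (the real digest is performed by a helper script), and the example sequence is arbitrary.

```shell
PEPTIDE_MIN_LENGTH=5
PEPTIDE_MAX_LENGTH=50

# Cleave after K or R unless the next residue is P, then filter on length:
echo "MKWVTFISLLLLFSSAYSRGVFR" \
  | awk -v min="$PEPTIDE_MIN_LENGTH" -v max="$PEPTIDE_MAX_LENGTH" '{
  pep = ""
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    pep = pep c
    nxt = substr($0, i + 1, 1)
    if ((c == "K" || c == "R") && nxt != "P" && nxt != "") {
      if (length(pep) >= min && length(pep) <= max) print pep
      pep = ""
    }
  }
  if (length(pep) >= min && length(pep) <= max) print pep
}'
# Prints only WVTFISLLLLFSSAYSR; the peptides MK and GVFR fall below the
# minimum length and are discarded.
```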

File storage locations

  • TABDIR: Folder in which the final TSV files should be stored.
  • INTDIR: Folder in which intermediate TSV files should be stored (these files are large; they are written once and read multiple times).

System and memory configuration

  • JAVA_MEM: The maximum amount of memory a single Java process is allowed to use. Note that up to two Java processes can run simultaneously.
  • CMD_SORT: The particular Unix sort command that should be used (including relevant options).
  • CMD_GZIP: The particular pipe compression command that should be used (including relevant options).
  • ENTREZ_BATCH_SIZE: The number of records to request per batch when communicating with Entrez.
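
Since CMD_SORT and CMD_GZIP are substituted directly into pipelines, they can be tuned to the machine the build runs on. The values below are illustrative assumptions, not the script's defaults.

```shell
# Hypothetical settings; e.g. pigz or a larger sort buffer could be used instead.
CMD_SORT="sort --parallel=4"
CMD_GZIP="gzip -"

# The variables are expanded unquoted so that their options take effect:
printf 'b\na\nc\n' | $CMD_SORT | $CMD_GZIP > sorted.tsv.gz
zcat sorted.tsv.gz
# Prints a, b and c, one per line.
```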

Resources

  • TAXON_URL: URL of an NCBI taxon dump that adheres to the file format described here.
  • EC_CLASS_URL: URL for a file listing all EC numbers and their associated names. Must adhere to the file format described here.
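
For illustration, these variables could point at the public NCBI and ExPASy locations below. Treat these URLs as examples rather than the script's defaults.

```shell
# Example values (assumptions, not necessarily the defaults used by build_database.sh):
TAXON_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.zip"
EC_CLASS_URL="https://ftp.expasy.org/databases/enzyme/enzclass.txt"
```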

Functions

create_taxon_tables

Input / requirements
  • URL used for downloading the NCBI taxonomy ZIP dump.
Implementation

Downloads a dump containing all NCBI taxa identifiers (including associated names and ranks) and converts these to two final output tables (taxons.tsv.gz and lineages.tsv.gz).
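
The dump follows the NCBI taxdump conventions, in which fields are separated by "<tab>|<tab>" and each line ends in "<tab>|". The actual conversion is performed by a Java helper script; the snippet below only sketches what reading such a file looks like, using a fabricated two-line stand-in for names.dmp.

```shell
# Two fabricated names.dmp lines: taxon id | name | unique name | name class
printf '1\t|\tall\t|\t\t|\tsynonym\t|\n1\t|\troot\t|\t\t|\tscientific name\t|\n' > names.dmp

# Strip the trailing field terminator, split on the "<tab>|<tab>" separator
# and keep only the scientific names:
sed 's/\t|$//' names.dmp \
  | awk -F '\t\\|\t' '$4 == "scientific name" { print $1 "\t" $2 }'
# Prints the single record: 1<tab>root
```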

Output

The taxons.tsv.gz and lineages.tsv.gz tables mentioned above, stored in the TABDIR folder.

download_and_convert_all_sources

Input / requirements
  • The database types and corresponding source URLs that should be indexed.
Implementation

This function checks whether a valid Unipept index already exists for each of the provided database types (and URLs). If not, a new index is created for that database.

The function checks the HTTP ETag reported by the database source URL to detect whether the currently cached version of the database is outdated.
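
A minimal sketch of such an ETag comparison, assuming the index folder stores the ETag of the download it was built from. The folder layout and variable names here are made up, and the remote ETag would in reality be parsed from a HEAD request (e.g. `curl -sI "$SOURCE_URL"`).

```shell
cache_dir="index/swissprot"           # hypothetical index folder
mkdir -p "$cache_dir"
echo '"abc123"' > "$cache_dir/.etag"  # ETag stored when the index was built

remote_etag='"abc123"'                # in reality: parsed from a curl -sI response
if [ "$remote_etag" = "$(cat "$cache_dir/.etag")" ]; then
  echo "index is up to date, reusing it"
else
  echo "source changed, rebuilding index"
fi
# Prints: index is up to date, reusing it
```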

Output

For each of the input database types (and URLs), the function will create a matching reusable Unipept Database Index folder.
