cleanup of the suffix-array schemas
tibvdm committed Aug 6, 2024
1 parent 5d1340c commit b64e684
Showing 4 changed files with 8 additions and 410 deletions.
151 changes: 4 additions & 147 deletions schemas_suffix_array/headers.md
@@ -2,159 +2,16 @@
Headers
=======

Taxons
------
- ***id***: The taxon's identifier, as assigned by the NCBI. An
integer number. This column may contain gaps.
- ***name***: The taxon's name.
- ***rank***: The taxonomic rank of this taxon. Unranked taxa have
the `no rank` value. The value should be any of:
* no rank
* superkingdom
* kingdom
* subkingdom
* superphylum
* phylum
* subphylum
* superclass
* class
* subclass
* superorder
* order
* suborder
* infraorder
* superfamily
* family
* subfamily
* tribe
* subtribe
* genus
* subgenus
* species group
* species subgroup
* species
* subspecies
* strain
* varietas
* forma
- ***parent***: The taxon id of the parent. Refers to another entry
in this table (or itself, in case of the root taxon).

EC numbers
----------
- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***code***: The EC number in the form of x.x.x.x.
- ***name***: The full name of this EC number.

GO terms
----------
- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***code***: The go term itself
- ***namespace***: The namespace: 'biological process', 'molecular function' or 'cellular component'
- ***name***: The full name (description) of this go term.

Lineages
--------

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***taxon id***: Refers to the taxon whose lineage this row
describes.

Next is a number of taxonomic rank fields, each referring to a taxon
by id. The code is as follows:
- If the taxon has a valid ancestor with this rank, the value is
equal to that ancestor's id.
- If the taxon has an invalid ancestor with this rank, the value is
the negated value of that ancestor's id.
- If the taxon has no ancestor with this rank but has an ancestor
with lower rank (further from root), the value is either \N (null)
if that lower ancestor is valid, or -1, otherwise.
- If the taxon has no ancestor with this rank and no ancestors of
lower rank, the value is \N.

These fields appear in the same order as the values of the taxonomic
rank were mentioned before.
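
The encoding rules above can be sketched as a small decoder. This is only an illustrative sketch, not code from the repository: the names `LineageCell` and `decode_lineage_cell` are hypothetical, `\N` is assumed to arrive as Python `None`, and `-1` is assumed to always be the sentinel rather than a negated id (that is, taxon 1 is assumed never to appear in a ranked column).

```python
from typing import NamedTuple, Optional


class LineageCell(NamedTuple):
    """Decoded content of one rank column (hypothetical helper type)."""
    taxon_id: Optional[int]  # ancestor id, or None if no ancestor at this rank
    status: str              # "valid", "invalid", "missing" or "missing_below_invalid"


def decode_lineage_cell(raw: Optional[int]) -> LineageCell:
    """Interpret one rank column of a lineages row.

    Assumes NULL (\\N) is passed as None and that -1 is only ever
    used as the sentinel value, never as a negated taxon id.
    """
    if raw is None:
        # No ancestor with this rank; any lower ancestor is valid,
        # or there is no lower ancestor at all.
        return LineageCell(None, "missing")
    if raw == -1:
        # No ancestor with this rank, and the lower ancestor is invalid.
        return LineageCell(None, "missing_below_invalid")
    if raw < 0:
        # Invalid ancestor: the stored value is the negated id.
        return LineageCell(-raw, "invalid")
    # Valid ancestor: the stored value is the id itself.
    return LineageCell(raw, "valid")
```

For example, a stored value of `-9606` would decode to taxon id `9606` with status `"invalid"`, while a `NULL` cell decodes to no taxon at all.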

Sequences (or Sequences_compressed)
---------

Contains the tryptic peptides. This table may be stored in compressed form as
`sequences_compressed`; in that case the view `sequences` decompresses the data.

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***sequence***: An Amino Acid sequence, more precisely a tryptic
peptide.
- ***original lca***: A lowest common ancestor in case we did not
equate the I and L amino acids.
- ***lca***: The lowest common ancestor of all proteins containing
this tryptic peptide. Refers to the taxon table.
- ***original fa***: A JSON summary of the functional annotations
in case we did not equate the I and L amino acids.
- ***fa***: The JSON summary of the functional annotations of all
proteins containing this tryptic peptide. Refers to the taxon table.

The JSON summary has 2 fields:

- `num`: Showing statistics about the found annotations
- `all`: The total number of matched proteins
- `EC` : The number of matched proteins with ≥ 1 EC annotation
- `GO` : The number of matched proteins with ≥ 1 GO annotation

EC Cross References
-------------------

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***uniprot entry id***: Which uniprot entry we are referencing.
- ***ec number code***: An EC reference of the uniprot entry.

EMBL Cross References
---------------------

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***uniprot entry id***: Refers to the uniprot entry this is the
EMBL reference for.
- ***protein id***: The EMBL protein id.
- ***sequence id***: The EMBL Sequence id.

GO Cross References
-------------------

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***uniprot entry id***: Which uniprot entry we are referencing.
- ***go term code***: A GO reference of the uniprot entry.

Peptides
--------

Links the sequences back to the proteins they were cut from.

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***sequence id***: Refers to a sequence in the sequences table.
- ***original sequence id***: Refers to the same sequence, without
the I and L equated.
- ***uniprot entry id***: Refers to the protein these tryptic
peptides were digested from.

RefSeq Cross References
-----------------------

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***uniprot entry id***: Refers to the uniprot entry this is the
refseq reference for.
- ***protein id***: The RefSeq protein id.
- ***sequence id***: The RefSeq Sequence id.

Uniprot Entries
---------------

The proteins parsed from uniprot.

- ***id***: A self-assigned id. Integral, incremental, no gaps.
- ***uniprot id***: The uniprot accession number of this protein. Not
a Number.
- ***uniprot_accession_number***: The uniprot accession number of this protein. Not a Number.
- ***version***: The version of this protein.
- ***taxon id***: Refers to the taxon this protein was gained from.
- ***type***: Either swissprot or trembl, depending on the source of
the data.
- ***type***: Either swissprot or trembl, depending on the source of the data.
- ***name***: Uniprot assigned name of the protein.
- ***sequence***: The Amino Acid sequence of this protein.
- ***protein***: The Amino Acid sequence of this protein.
- ***fa***: Compressed binary data representing the functional annotations of a protein
38 changes: 0 additions & 38 deletions schemas_suffix_array/sample_data.sql

This file was deleted.

219 changes: 4 additions & 215 deletions schemas_suffix_array/structure.sql
@@ -5,27 +5,6 @@ SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL';
CREATE SCHEMA IF NOT EXISTS `unipept` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci ;
USE `unipept` ;

-- -----------------------------------------------------
-- Table `unipept`.`taxons`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`taxons` (
`id` MEDIUMINT UNSIGNED NOT NULL ,
`name` VARCHAR(120) NOT NULL ,
`rank` ENUM('no rank', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'superclass', 'class', 'subclass', 'superorder', 'order', 'suborder', 'infraorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'strain', 'varietas', 'forma' ) NULL DEFAULT NULL ,
`parent_id` MEDIUMINT UNSIGNED NULL DEFAULT NULL ,
`valid_taxon` BIT NOT NULL DEFAULT 1 ,
PRIMARY KEY (`id`) ,
INDEX `fk_taxon_taxon` (`parent_id` ASC) ,
CONSTRAINT `fk_taxon_taxon`
FOREIGN KEY (`parent_id` )
REFERENCES `unipept`.`taxons` (`id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`uniprot_entries`
-- -----------------------------------------------------
@@ -36,202 +15,12 @@ CREATE TABLE IF NOT EXISTS `unipept`.`uniprot_entries` (
`taxon_id` MEDIUMINT UNSIGNED NOT NULL,
`type` ENUM('swissprot', 'trembl') NOT NULL,
`name` VARCHAR(150) NOT NULL,
`protein` TEXT NOT NULL,
`protein` TEXT NOT NULL ,
`fa` TEXT NOT NULL ,
PRIMARY KEY (`id`),
INDEX `fk_uniprot_entries_taxons_idx` (`taxon_id` ASC),
UNIQUE INDEX `idx_uniprot_entries_accession` (`uniprot_accession_number` ASC),
CONSTRAINT `fk_uniprot_entries_taxons`
FOREIGN KEY (`taxon_id`)
REFERENCES `unipept`.`taxons` (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = ascii
COLLATE = ascii_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`ec_numbers`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`ec_numbers` (
`id` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
`code` VARCHAR(15) NOT NULL,
`name` VARCHAR(140) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `ec_number_UNIQUE` (`code` ASC))
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`go_terms`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`go_terms` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`code` VARCHAR(15) NOT NULL,
`namespace` ENUM('biological process', 'molecular function', 'cellular component') NOT NULL,
`name` VARCHAR(200) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `uidx_code` (`code` ASC))
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`interpro`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`interpro_entries` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
`code` VARCHAR(9) NOT NULL,
`category` VARCHAR(32) NOT NULL,
`name` VARCHAR(160) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `idx_interpro_code` (`code` ASC))
ENGINE = InnoDB
DEFAULT CHARACTER SET = ascii
COLLATE = ascii_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`lineages`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`lineages` (
`taxon_id` MEDIUMINT UNSIGNED NOT NULL,
`superkingdom` MEDIUMINT NULL DEFAULT NULL,
`kingdom` MEDIUMINT NULL DEFAULT NULL,
`subkingdom` MEDIUMINT NULL DEFAULT NULL,
`superphylum` MEDIUMINT NULL DEFAULT NULL,
`phylum` MEDIUMINT NULL DEFAULT NULL,
`subphylum` MEDIUMINT NULL DEFAULT NULL,
`superclass` MEDIUMINT NULL DEFAULT NULL,
`class` MEDIUMINT NULL DEFAULT NULL,
`subclass` MEDIUMINT NULL DEFAULT NULL,
`superorder` MEDIUMINT NULL DEFAULT NULL,
`order` MEDIUMINT NULL DEFAULT NULL,
`suborder` MEDIUMINT NULL DEFAULT NULL,
`infraorder` MEDIUMINT NULL DEFAULT NULL,
`superfamily` MEDIUMINT NULL DEFAULT NULL,
`family` MEDIUMINT NULL DEFAULT NULL,
`subfamily` MEDIUMINT NULL DEFAULT NULL,
`tribe` MEDIUMINT NULL DEFAULT NULL,
`subtribe` MEDIUMINT NULL DEFAULT NULL,
`genus` MEDIUMINT NULL DEFAULT NULL,
`subgenus` MEDIUMINT NULL DEFAULT NULL,
`species_group` MEDIUMINT NULL DEFAULT NULL,
`species_subgroup` MEDIUMINT NULL DEFAULT NULL,
`species` MEDIUMINT NULL DEFAULT NULL,
`subspecies` MEDIUMINT NULL DEFAULT NULL,
`strain` MEDIUMINT NULL DEFAULT NULL,
`varietas` MEDIUMINT NULL DEFAULT NULL,
`forma` MEDIUMINT NULL DEFAULT NULL,
PRIMARY KEY (`taxon_id`),
INDEX `fk_lineages_taxons_idx` (`taxon_id` ASC),
CONSTRAINT `fk_lineages_taxons`
FOREIGN KEY (`taxon_id`)
REFERENCES `unipept`.`taxons` (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = ascii
COLLATE = ascii_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`datasets`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`datasets` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
`environment` VARCHAR(160) NULL ,
`reference` VARCHAR(500) NULL ,
`url` VARCHAR(200) NULL ,
`project_website` VARCHAR(200) NULL ,
PRIMARY KEY (`id`) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`dataset_items`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`dataset_items` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
`dataset_id` INT UNSIGNED NULL ,
`name` VARCHAR(160) NULL ,
`data` MEDIUMTEXT CHARACTER SET 'ascii' COLLATE 'ascii_general_ci' NOT NULL ,
`order` INT NULL ,
PRIMARY KEY (`id`) ,
INDEX `fk_dataset_items_datasets` (`dataset_id` ASC) ,
CONSTRAINT `fk_dataset_items_datasets`
FOREIGN KEY (`dataset_id` )
REFERENCES `unipept`.`datasets` (`id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;

-- -----------------------------------------------------
-- Table `unipept`.`go_cross_references`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`go_cross_references` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`uniprot_entry_id` INT UNSIGNED NOT NULL,
`go_term_code` VARCHAR(15) NOT NULL,
PRIMARY KEY (`id`),
INDEX `fk_go_reference_uniprot_entries` (`uniprot_entry_id` ASC),
INDEX `fk_go_cross_reference_go_terms_idx` (`go_term_code` ASC),
CONSTRAINT `fk_go_cross_reference_uniprot_entries`
FOREIGN KEY (`uniprot_entry_id`)
REFERENCES `unipept`.`uniprot_entries` (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_go_cross_reference_go_terms`
FOREIGN KEY (`go_term_code`)
REFERENCES `unipept`.`go_terms` (`code`)
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = ascii
COLLATE = ascii_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`ec_cross_references`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`ec_cross_references` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`uniprot_entry_id` INT UNSIGNED NOT NULL,
`ec_number_code` VARCHAR(15) NOT NULL,
PRIMARY KEY (`id`),
INDEX `fk_ec_reference_uniprot_entries` (`uniprot_entry_id` ASC),
INDEX `fk_ec_cross_reference_ec_numbers_idx` (`ec_number_code` ASC),
CONSTRAINT `fk_ec_cross_reference_uniprot_entries`
FOREIGN KEY (`uniprot_entry_id`)
REFERENCES `unipept`.`uniprot_entries` (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_ec_cross_reference_ec_numbers`
FOREIGN KEY (`ec_number_code`)
REFERENCES `unipept`.`ec_numbers` (`code`)
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = ascii
COLLATE = ascii_general_ci;


-- -----------------------------------------------------
-- Table `unipept`.`interpro_cross_references`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `unipept`.`interpro_cross_references` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
`uniprot_entry_id` INT UNSIGNED NOT NULL ,
`interpro_entry_code` VARCHAR(9) NOT NULL ,
PRIMARY KEY (`id`),
INDEX `fk_interpro_reference_uniprot_entries` (`uniprot_entry_id` ASC))
UNIQUE INDEX `idx_uniprot_entries_accession` (`uniprot_accession_number` ASC)
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = ascii
COLLATE = ascii_general_ci;