Unipept Database Index

Pieter Verschaffelt edited this page Mar 21, 2023 · 2 revisions

When constructing a targeted protein reference database, the construction process must extract only those proteins that are associated with a specific list of organisms. We designed the Unipept Database Index so that the proteins for the requested taxa can be extracted efficiently.

A Unipept Database Index is a folder containing a set of zipped TSV files, referred to below as chunks. The proteins are sorted by taxon ID and split across these chunks so that all files are approximately the same size.
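Because the proteins are sorted by taxon ID, each chunk covers one contiguous range of taxon IDs, so a lookup only needs to open the chunks whose range contains a requested taxon. The sketch below illustrates that idea. How the range of a chunk is actually recorded is not described on this page, so the `chunk_ranges` list of `(first_taxon_id, last_taxon_id)` pairs and the function name `chunks_for_taxa` are assumptions made for illustration only.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def chunks_for_taxa(
    chunk_ranges: List[Tuple[int, int]], taxa: Iterable[int]
) -> List[int]:
    """Return the indices of the chunks that may contain a requested taxon.

    chunk_ranges is a hypothetical list with one (first_taxon_id,
    last_taxon_id) pair per chunk, in chunk order.
    """
    sorted_taxa = sorted(set(taxa))
    selected = []
    for index, (first, last) in enumerate(chunk_ranges):
        # Smallest requested taxon ID that is >= the first ID in this chunk.
        pos = bisect_left(sorted_taxa, first)
        # The chunk is relevant if that taxon ID also falls before its end.
        if pos < len(sorted_taxa) and sorted_taxa[pos] <= last:
            selected.append(index)
    return selected
```

Note that a single taxon can straddle a chunk boundary, which is why both bounds of a range are treated as inclusive here.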

Chunk structure

Every chunk has the same structure. Each line describes one protein, as in the following example:

P19871  MTNRLQGKVALVTGGASGVGLEVVKLLLGEGAKVAFSDINEAAGQQLAAELGERSMFVRHDVSSEADWTLVMAAVQRRLGTLNVLVNNAGILLPGDMETGRLEDFSRLLKINTESVFIGCQQGIAAMKETGGSIINMASVSSWLPIEQYAGYSASKAAVSALTRAAALSCRKQGYAIRVNSIHPDGIYTPMMQASLPKGVSKEMVLHDPKLNRAGRAYMPERIAQLVLFLASDESSVMSGSELHADNSILGMGL   3-beta-hydroxysteroid dehydrogenase     121      1.1.1.51        GO:0035410;GO:0047045;GO:0047035;GO:0008202     IPR036291;IPR020904;IPR002347   swissprot       285
Q06191  MTINATVKEAGFRPASRISSIGVSEILKIGARAAAMKREGKPVIILGAGEPDFDTPDHVKQAASDAIHRGETKYTALDGTPELKKAIREKFQRENGLAYELDEITVATGAKQILFNAMMASLDPGDEVVIPTPYWTSYSDIVQICEGKPILIACDASSGFRLTAQKLEAAITPRTRWVLLNSPSNPSGAAYSAADYRPLLDVLLKHPHVWLLVDDMYEHIVYDAFRFVTPARLEPGLKDRTLTVNGVSKAYAMTGWRIGYAGGPRALIKAMAVVQSQATSCPSSVSQAASVAALNGPQDFLKERTESFQRRRNLVVNGLNAIEGLDCRVPEGAFYTFSGCAGVARRVTPSGKRIESDTDFCAYLLEDSHVAVVPGSAFGLSPYFRISYATSEAELKEALERISAACKRLS        Aspartate aminotransferase      96       2.6.1.1 GO:0005737;GO:0004069;GO:0030170;GO:0009058     IPR004839;IPR004838;IPR015424;IPR015421;IPR015422       swissprot       382
P84887  MRWLDKFGESLSRSVAHKTSRRSVLRSVGKLMVGSAFVLPVLPVARAAGGGGSSSGADHISLNPDLANEDEVNSCDYWRHCAVDGFLCSCCGGTTTTCPPGSTPSPISWIGTCHNPHDGKDYLISYHDCCGKTACGRCQCNTQTRERPGYEFFLHNDVNWCMANENSTFHCTTSVLVGLAKN   Aralkylamine dehydrogenase light chain  77      1.4.9.2 GO:0042597;GO:0030058;GO:0030059;GO:0009308     IPR016008;IPR036560;IPR013504;IPR006311  swissprot       511

Each line of a chunk thus contains the following tab-separated columns, in this order:

  • Protein identifier: UniProt Accession number.
  • Protein sequence
  • Protein name
  • Version (entry): Version number of this entry as reported by UniProt
  • Associated EC-numbers: Semicolon-separated list of EC-numbers associated with this protein.
  • Associated GO-terms: Semicolon-separated list of GO-terms associated with this protein.
  • Associated InterPro-entries: Semicolon-separated list of InterPro-entries associated with this protein.
  • Database type name: Either SwissProt or TrEMBL (written in lowercase in the example above: swissprot).
  • Associated taxon ID: NCBI ID of the taxon associated with this protein.
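
The column list above can be turned into a small parser. The following is a minimal sketch, assuming the columns are separated by tab characters and that an annotation column may be empty; the names `ChunkRecord`, `split_list` and `parse_chunk_line` are hypothetical and not part of Unipept.

```python
from typing import List, NamedTuple


class ChunkRecord(NamedTuple):
    """One line of a chunk, with the columns in the order listed above."""
    accession: str
    sequence: str
    name: str
    version: int
    ec_numbers: List[str]
    go_terms: List[str]
    interpro_entries: List[str]
    database_type: str
    taxon_id: int


def split_list(field: str) -> List[str]:
    """Split a semicolon-separated column; an empty column gives an empty list."""
    return [item for item in field.strip().split(";") if item]


def parse_chunk_line(line: str) -> ChunkRecord:
    """Parse one tab-separated chunk line into a ChunkRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    accession, sequence, name, version, ec, go, interpro, db_type, taxon = fields
    return ChunkRecord(
        accession=accession.strip(),
        sequence=sequence.strip(),
        name=name.strip(),
        version=int(version),
        ec_numbers=split_list(ec),
        go_terms=split_list(go),
        interpro_entries=split_list(interpro),
        database_type=db_type.strip(),
        taxon_id=int(taxon),
    )
```

With such a parser, keeping only the proteins of the requested taxa comes down to checking `record.taxon_id` against the set of requested taxon IDs for each parsed line.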