Unipept Database Index
Pieter Verschaffelt edited this page Mar 21, 2023 · 2 revisions
When a targeted protein reference database is constructed, the construction process needs to extract only those proteins that are associated with a specific list of organisms. To make this extraction efficient, we have designed the Unipept Database Index.
A Unipept Database Index is a folder containing a set of zipped TSV files, referred to below as chunks. The proteins in the index are sorted by taxon ID and split over chunks of approximately equal size, so each chunk covers a contiguous range of taxon IDs.
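Because the proteins are sorted by taxon ID before they are split, only the chunks whose taxon ID range contains a requested taxon need to be read. The following is a minimal sketch of that idea, not the actual Unipept implementation. It assumes that the lowest and highest taxon ID of each chunk are known; the `chunk_ranges` list and the `chunks_to_read` function are hypothetical names introduced for illustration, and how the real index records these ranges is not described on this page.

```python
from bisect import bisect_left


def chunks_to_read(chunk_ranges, wanted_taxa):
    """Return the indices of the chunks that may contain a wanted taxon.

    chunk_ranges: list of (lowest_taxon_id, highest_taxon_id) per chunk,
                  in chunk order (hypothetical input, see lead-in).
    wanted_taxa:  iterable of requested NCBI taxon IDs.
    """
    wanted = sorted(set(wanted_taxa))
    selected = []
    for index, (low, high) in enumerate(chunk_ranges):
        # Position of the first wanted taxon ID that is >= low.
        pos = bisect_left(wanted, low)
        # The chunk is relevant if that taxon ID also lies at or below high.
        if pos < len(wanted) and wanted[pos] <= high:
            selected.append(index)
    return selected


# Illustrative ranges only. Chunks are split by size, so a single taxon
# (285 here) may straddle two neighbouring chunks.
ranges = [(1, 95), (96, 285), (285, 511), (512, 9606)]
print(chunks_to_read(ranges, [285, 9606]))  # [1, 2, 3]
```

All other chunks can be skipped without being decompressed, which is where the sorted layout saves work.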
Every chunk follows the same structure and contains lines of the format in the following example:
P19871 MTNRLQGKVALVTGGASGVGLEVVKLLLGEGAKVAFSDINEAAGQQLAAELGERSMFVRHDVSSEADWTLVMAAVQRRLGTLNVLVNNAGILLPGDMETGRLEDFSRLLKINTESVFIGCQQGIAAMKETGGSIINMASVSSWLPIEQYAGYSASKAAVSALTRAAALSCRKQGYAIRVNSIHPDGIYTPMMQASLPKGVSKEMVLHDPKLNRAGRAYMPERIAQLVLFLASDESSVMSGSELHADNSILGMGL 3-beta-hydroxysteroid dehydrogenase 121 1.1.1.51 GO:0035410;GO:0047045;GO:0047035;GO:0008202 IPR036291;IPR020904;IPR002347 swissprot 285
Q06191 MTINATVKEAGFRPASRISSIGVSEILKIGARAAAMKREGKPVIILGAGEPDFDTPDHVKQAASDAIHRGETKYTALDGTPELKKAIREKFQRENGLAYELDEITVATGAKQILFNAMMASLDPGDEVVIPTPYWTSYSDIVQICEGKPILIACDASSGFRLTAQKLEAAITPRTRWVLLNSPSNPSGAAYSAADYRPLLDVLLKHPHVWLLVDDMYEHIVYDAFRFVTPARLEPGLKDRTLTVNGVSKAYAMTGWRIGYAGGPRALIKAMAVVQSQATSCPSSVSQAASVAALNGPQDFLKERTESFQRRRNLVVNGLNAIEGLDCRVPEGAFYTFSGCAGVARRVTPSGKRIESDTDFCAYLLEDSHVAVVPGSAFGLSPYFRISYATSEAELKEALERISAACKRLS Aspartate aminotransferase 96 2.6.1.1 GO:0005737;GO:0004069;GO:0030170;GO:0009058 IPR004839;IPR004838;IPR015424;IPR015421;IPR015422 swissprot 382
P84887 MRWLDKFGESLSRSVAHKTSRRSVLRSVGKLMVGSAFVLPVLPVARAAGGGGSSSGADHISLNPDLANEDEVNSCDYWRHCAVDGFLCSCCGGTTTTCPPGSTPSPISWIGTCHNPHDGKDYLISYHDCCGKTACGRCQCNTQTRERPGYEFFLHNDVNWCMANENSTFHCTTSVLVGLAKN Aralkylamine dehydrogenase light chain 77 1.4.9.2 GO:0042597;GO:0030058;GO:0030059;GO:0009308 IPR016008;IPR036560;IPR013504;IPR006311 swissprot 511
Each line of a chunk thus contains the following tab-separated columns, in this order:
- Protein identifier: UniProt accession number.
- Protein sequence: Amino acid sequence of the protein.
- Protein name: Name of the protein.
- Version (entry): Version number of this entry as reported by UniProt
- Associated EC-numbers: Semicolon-separated list of EC-numbers associated with this protein.
- Associated GO-terms: Semicolon-separated list of GO-terms associated with this protein.
- Associated InterPro-entries: Semicolon-separated list of InterPro-entries associated with this protein.
- Database type name: Either SwissProt or TrEMBL (written in lowercase in the example above, e.g. swissprot).
- Associated taxon ID: NCBI ID of the taxon associated with this protein.
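The column layout above can be sketched as a small parser. This is a minimal illustration that assumes the columns are separated by tabs and appear in the order listed; `ProteinRecord`, `parse_line` and `filter_by_taxa` are hypothetical names, not part of Unipept. The sequence in the sample line is truncated for brevity.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    accession: str
    sequence: str
    name: str
    version: int
    ec_numbers: list
    go_terms: list
    interpro_entries: list
    database: str
    taxon_id: int


def split_list(field):
    """Split a semicolon-separated field, dropping empty items."""
    return [item for item in field.split(";") if item]


def parse_line(line):
    """Parse one tab-separated chunk line into a ProteinRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    accession, sequence, name, version, ec, go, interpro, database, taxon = fields
    return ProteinRecord(
        accession=accession,
        sequence=sequence,
        name=name,
        version=int(version),
        ec_numbers=split_list(ec),
        go_terms=split_list(go),
        interpro_entries=split_list(interpro),
        database=database,
        taxon_id=int(taxon),
    )


def filter_by_taxa(lines, wanted_taxa):
    """Yield only the records whose taxon ID is in wanted_taxa."""
    wanted = set(wanted_taxa)
    for line in lines:
        record = parse_line(line)
        if record.taxon_id in wanted:
            yield record


# Sample line based on the first example above (sequence truncated).
sample = "\t".join([
    "P19871", "MTNRLQGKVALVTGG", "3-beta-hydroxysteroid dehydrogenase", "121",
    "1.1.1.51", "GO:0035410;GO:0047045;GO:0047035;GO:0008202",
    "IPR036291;IPR020904;IPR002347", "swissprot", "285",
])
record = parse_line(sample)
print(record.accession, record.taxon_id, record.go_terms[0])  # P19871 285 GO:0035410
```

`filter_by_taxa` accepts any iterable of lines, so it can be fed from a decompressed chunk opened as a text stream.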