Build Database (sh)
This is the most important script in this repository: it orchestrates the complete database construction process. Everything needed to understand what this script consists of, and what each step or function does, is explained in detail below.
- DB_TYPES: An array containing all database types that should be processed by this script.
- DB_SOURCES: An array containing all database URLs that should be processed by this script. The $i$-th URL in this array should correspond to the $i$-th database type provided in the DB_TYPES array.
- PEPTIDE_MIN_LENGTH: The minimum length (inclusive) for tryptic peptides.
- PEPTIDE_MAX_LENGTH: The maximum length (inclusive) for tryptic peptides.
- OUTPUT_DIR: Folder in which the final TSV files should be stored.
- INTDIR: Folder in which intermediate TSV files should be stored (these are large, will be written once and read multiple times).
- TEMP_DIR: Folder in which temporary files, required by this script, can be stored.
- JAVA_MEM: The maximum amount of memory that a single Java process is allowed to use. Note that up to two Java processes can be executed simultaneously.
- CMD_SORT: The particular Unix sort command that should be used (including relevant options).
- CMD_GZIP: The particular pipe compression command that should be used (including relevant options).
- ENTREZ_BATCH_SIZE: The size of the request batches that should be used to communicate with Entrez.
- TAXON_URL: URL of an NCBI taxon dump that adheres to the file format described here.
- EC_CLASS_URL: URL of a file with a listing of EC numbers on the class, subclass and subsubclass level (including their associated name). Must adhere to the file format described here.
- EC_NUMBER_URL: URL of a file with a listing of EC numbers on the deepest level of the ontology. Must adhere to the file format described here.
- GO_TERM_URL: URL of a file with all GO terms.
- INTERPRO_URL: URL of a file with all InterPro entries.
None
Downloads a dump containing all NCBI taxon identifiers (including associated names and ranks) and converts it into two final output tables (taxons.tsv.gz and lineages.tsv.gz).
- taxons.tsv.gz (final output file)
- lineages.tsv.gz (final output file)
None
This function will check if a valid Unipept index already exists for each of the provided database types (and URLs). If this is not the case, a new index will be created for each new database.
The function checks the ETag that is returned by the database source URL in order to detect whether the cached version of the database is outdated.
For each of the input database types (and URLs), the function will create a matching reusable Unipept Database Index folder.
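To illustrate the ETag check described above, the sketch below extracts the ETag from a set of raw HTTP response headers and compares it with a previously stored value. The function names extract_etag and index_is_outdated are hypothetical and are not taken from the script itself; how the real script stores the cached value is an open assumption here.

```sh
# Print the ETag value found in raw HTTP response headers read from stdin.
# Header names are case-insensitive, and header lines end in CR LF.
extract_etag() {
    tr -d '\r' | awk 'tolower($1) == "etag:" { print $2; exit }'
}

# Succeed (exit status 0) when the cached ETag ($1) differs from the
# current ETag ($2), meaning that the existing index should be rebuilt.
index_is_outdated() {
    [ "$1" != "$2" ]
}
```

The headers could, for example, come from `curl --silent --head "$url" | extract_etag`, where `$url` stands for one of the entries in DB_SOURCES.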
This function simply chains filter_sources_by_taxa and create_most_tables together.
This function reproduces on stdout all proteins from the input Unipept Database Index that are associated with one of the taxa provided in the TAXA variable.
Note that proteins associated with a taxon that is a child of one of the taxa in TAXA are also kept in the output.
- stdout: Produces a TSV on stdout containing one protein per line, according to the format described here.
- stdin: Reads a TSV from stdin containing one protein per line, according to the format described here.
- taxons.tsv.gz (final output file)
This function reads in all protein data from stdin and passes it on to the helper program listed above.
After parsing the TSV from the input, TaxonsUniprots2Tables.jar generates a collection of compressed TSV files that are used further along in this script.
The most important of these is the peptides.tsv.gz file, which contains all tryptic peptides resulting from the in-silico tryptic digest performed by TaxonsUniprots2Tables.jar.
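To make the idea of an in-silico tryptic digest more concrete, here is a minimal sketch. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P) and applies an inclusive length filter like PEPTIDE_MIN_LENGTH and PEPTIDE_MAX_LENGTH; the exact rules implemented in TaxonsUniprots2Tables.jar are not documented here and may differ. The function names digest and equalize are hypothetical.

```sh
# Split each protein sequence on stdin into tryptic peptides.
# $1: minimum peptide length (inclusive), $2: maximum length (inclusive).
digest() {
    awk -v min="$1" -v max="$2" '{
        pep = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            nxt = substr($0, i + 1, 1)
            pep = pep c
            # Cut at the end of the protein, or after K/R when not followed by P.
            if (i == n || ((c == "K" || c == "R") && nxt != "P")) {
                if (length(pep) >= min && length(pep) <= max) print pep
                pep = ""
            }
        }
    }'
}

# Equalize peptides by treating isoleucine (I) as leucine (L).
equalize() {
    tr 'I' 'L'
}
```

For instance, `printf 'AAAKRPGGR\n' | digest 5 50` keeps only RPGGR, because AAAK is shorter than the minimum length and the R in front of P is not a cleavage site.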
- uniprot_entries.tsv.gz (final output file)
- ec_cross_references.tsv.gz (final output file)
- go_cross_references.tsv.gz (final output file)
- interpro_cross_references.tsv.gz (final output file)
- peptides.tsv.gz (intermediary file)
- peptides.tsv.gz (intermediary file)
- uniprot_entries.tsv.gz (final output file)
None
This function starts by creating two FIFO files and writing data to them:
- peptides_eq: By reading the intermediary file peptides.tsv.gz and extracting only the equalized peptide sequence and the associated UniProt entry ID, we end up with this data for peptides_eq (each line consists of the UniProt entry ID and the equalized peptide sequence):
000000000001 GGLSVPGPMGPSGPR
000000000001 GLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPR
000000000001 GPPGPPGK
- entries_eq: By reading the final output file uniprot_entries.tsv.gz and extracting only the UniProt entry ID and the associated NCBI taxon ID, we end up with this data for entries_eq (each line consists of the UniProt entry ID and the NCBI taxon ID):
000000000001 2546662
000000000002 2546656
000000000003 2591764
000000000004 2546663
000000000005 2587412
- The third step in the implementation of this function joins both of the files generated above by UniProt entry ID (effectively coupling equalized peptide sequences and NCBI taxon IDs) and then sorts the output by peptide sequence. The final result is written to the intermediary file aa_sequence_taxon_equalized.tsv.gz:
AAAAA 1063
AAAAA 243160
AAAAA 271848
AAAAA 272560
AAAAA 31716
AAAAA 32037
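The three steps above can be sketched with two FIFOs and the standard join and sort utilities. This is a simplified illustration with made-up miniature data; the helper name join_peptides_with_taxa is hypothetical, and the real script uses CMD_SORT and compresses its output.

```sh
# Join two inputs that are sorted by UniProt entry ID and emit
# "peptide<TAB>taxon ID" lines. The whole-line sort orders by peptide
# sequence first; ties are ordered by taxon ID as text.
# $1: (entry ID, peptide) pairs, $2: (entry ID, taxon ID) pairs.
join_peptides_with_taxa() {
    tab=$(printf '\t')
    LC_ALL=C join -t "$tab" -o 1.2,2.2 "$1" "$2" | LC_ALL=C sort
}

# Demonstration: two background writers feed the FIFOs while join reads them.
workdir=$(mktemp -d)
mkfifo "$workdir/peptides_eq" "$workdir/entries_eq"
printf '000000000001\tGGLSVPGPMGPSGPR\n000000000001\tGPPGPPGK\n000000000002\tAAAAA\n' > "$workdir/peptides_eq" &
printf '000000000001\t2546662\n000000000002\t2546656\n' > "$workdir/entries_eq" &
result=$(join_peptides_with_taxa "$workdir/peptides_eq" "$workdir/entries_eq")
wait
rm -rf "$workdir"
printf '%s\n' "$result"
```

Using FIFOs avoids writing the two extracted columns to disk: join consumes both streams while they are being produced.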
- aa_sequence_taxon_equalized.tsv.gz (intermediary file)
- peptides.tsv.gz (intermediary file)
- uniprot_entries.tsv.gz (final output file)
None
This function is very similar to join_equalized_pepts_and_entries, but joins the original peptide sequences (where I and L are considered to be different) with NCBI taxon IDs (instead of the equalized peptide sequences).
- aa_sequence_taxon_original.tsv.gz (intermediary file)
- aa_sequence_taxon_equalized.tsv.gz (intermediary file)
- aa_sequence_taxon_original.tsv.gz (intermediary file)
None
This function starts by creating two FIFO files and writing data to them:
- equalized: By reading the intermediary file aa_sequence_taxon_equalized.tsv.gz, this function extracts the first column (containing the peptide sequences), sorts it, and writes only the unique equalized peptides to equalized:
AAAAA
AAAAAA
AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
AAAAAAAAAAAAAAAAGATCLER
AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
AAAAAAAAAAAAAAAASAGGK
AAAAAAAAAAAAAAAGAAGK
AAAAAAAAAAAAAAAGAGAGAK
- original: By reading the intermediary file aa_sequence_taxon_original.tsv.gz, this function extracts the first column (containing the peptide sequences), sorts it, and writes only the unique original peptides to original.
- Finally, both temporary FIFO files are combined and only the unique sequences (over the two files) are kept, sorted alphabetically, numbered, and written to the intermediary file sequences.tsv.gz:
1 AAAAA
2 AAAAAA
3 AAAAAAAAA
4 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
5 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
6 AAAAAAAAAAAAAAAAGATCLER
7 AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
8 AAAAAAAAAAAAAAAASAGGK
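The final combination step can be sketched as follows. This is a minimal illustration, assuming two inputs with one peptide per line; the function name number_unique_sequences is hypothetical, and the real script reads from FIFOs and compresses the result.

```sh
# Merge two peptide lists, keep each distinct sequence once, sort the
# result in byte order and prefix every line with a running number.
# $1 and $2: files (or FIFOs) containing one peptide sequence per line.
number_unique_sequences() {
    LC_ALL=C sort -u "$1" "$2" | awk '{ print NR "\t" $0 }'
}
```

A sequence that contains no I or L is identical in both inputs and therefore receives a single ID, while an original peptide containing I and its equalized counterpart containing L each receive their own ID.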
- sequences.tsv.gz (intermediary file)
- sequences.tsv.gz (intermediary file)
- aa_sequence_taxon_equalized.tsv.gz (intermediary file)
- lineages.tsv.gz (final output file)
- The function starts by joining the intermediary sequences.tsv.gz and aa_sequence_taxon_equalized.tsv.gz files by peptide sequence:
1 1063
1 243160
1 271848
1 272560
1 31716
1 320372
1 320373
1 320388
1 320389
1 331271
- Next, the data above is sent via stdin to the LineagesSequencesTaxons2LCAs.jar application for further processing. Since most peptides occur multiple times in the input of this application, it can collect and aggregate all taxon IDs that belong to one peptide and compute the lowest common ancestor for each peptide. The final output is sent to stdout.
- Finally, the output of step 2 is compressed and written to a new intermediary file LCAs_equalized.tsv.gz:
1 1
2 87882
3 272568
5 7227
6 9606
7 502779
8 314146
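The aggregation performed in step 2 can be illustrated with a deliberately simplified sketch. It assumes a hypothetical lineage file in which every line holds a taxon ID, a tab, and the space-separated path from the root down to that taxon; the real lineages.tsv.gz uses a different layout, and LineagesSequencesTaxons2LCAs.jar may apply additional rules that are not covered here. The function name compute_lcas is likewise hypothetical.

```sh
# Compute a plain lowest common ancestor per sequence ID.
# $1:    hypothetical lineage file (taxon ID, TAB, root-to-taxon path).
# stdin: "sequence ID<TAB>taxon ID" lines, grouped by sequence ID.
compute_lcas() {
    awk -F'\t' '
        # Print the deepest taxon shared by all lineages of the current group.
        function emit() { if (current != "") print current "\t" prefix[depth] }
        BEGIN { current = "" }
        # First input (the lineage file): remember the path of every taxon.
        NR == FNR { path[$1] = $2; next }
        # A new sequence ID starts: flush the previous group.
        $1 != current {
            emit()
            current = $1
            depth = split(path[$2], prefix, " ")
            next
        }
        # Same sequence ID: shrink the common prefix of the lineages seen so far.
        {
            n = split(path[$2], other, " ")
            if (n < depth) depth = n
            for (i = 1; i <= depth; i++)
                if (prefix[i] != other[i]) { depth = i - 1; break }
        }
        END { emit() }
    ' "$1" -
}
```

The LCA of a set of taxa is then the last element of the longest path prefix shared by all their lineages, which is why peptides that occur in many unrelated organisms end up at the root (taxon ID 1).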
- LCAs_equalized.tsv.gz (intermediary file)
- sequences.tsv.gz (intermediary file)
- aa_sequence_taxon_original.tsv.gz (intermediary file)
- lineages.tsv.gz (final output file)
This function works identically to calculate_equalized_lcas.
- LCAs_original.tsv.gz (intermediary file)
- peptides.tsv.gz (intermediary file)
- sequences.tsv.gz (intermediary file)
None
- First, this function reads the peptides.tsv.gz file and sorts it by the equalized peptide sequence (2nd column).
- Secondly, once the sorting is complete, the peptides.tsv.gz file is joined with the sequences.tsv.gz file on the peptide sequence, which produces the following result (compressed and written to the file peptides_by_equalized.tsv.gz):
11161754 1 AAAAA 470096 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161773 1 AAAAA 470097 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161792 1 AAAAA 470098 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161811 1 AAAAA 470099 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
- peptides_by_equalized.tsv.gz (intermediary file)
- peptides_by_equalized.tsv.gz (intermediary file)
- sequences.tsv.gz (intermediary file)
None
The implementation of this function is completely analogous to that of substitute_equalized_aas.
- peptides_by_original.tsv.gz (intermediary file)
- peptides_by_equalized.tsv.gz (intermediary file)
- The function starts by creating a new FIFO file peptides_eq that consists of the following data:
1 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1 GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
peptides_eq thus contains lines that each consist of a sequence ID and its associated list of functional annotations.
- Most sequence IDs appear more than once in peptides_eq, which means that all functional annotations sharing a sequence ID belong together and need to be aggregated by the helper script FunctionalAnalysisPeptides.js. This script produces a new intermediary file FAs_equalized.tsv.gz that looks like this:
1 {"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
2 {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
3 {"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}}
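The counting at the heart of this aggregation can be sketched in a few lines. This is a simplified stand-in for FunctionalAnalysisPeptides.js: it only counts how often each annotation occurs per sequence ID and prints one line per (sequence ID, annotation) pair, instead of building the JSON object with the "num" totals shown above. The function name aggregate_annotations is hypothetical.

```sh
# Count annotation occurrences per sequence ID.
# stdin:  "sequence ID<TAB>annotation;annotation;..." lines.
# stdout: "sequence ID<TAB>annotation<TAB>count" lines, sorted in byte order.
aggregate_annotations() {
    awk -F'\t' '
        {
            n = split($2, terms, ";")
            for (i = 1; i <= n; i++) count[$1 "\t" terms[i]]++
        }
        END { for (key in count) print key "\t" count[key] }
    ' | LC_ALL=C sort
}
```

The counts correspond to the values in the "data" object of the JSON output: an annotation that appears on 14 of the input lines for sequence ID 1 gets the value 14.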
- FAs_equalized.tsv.gz (intermediary file)
- peptides_by_original.tsv.gz (intermediary file)
The implementation of this function is completely analogous to that of calculate_equalized_fas.
- FAs_original.tsv.gz (intermediary file)
None
Simply sorts all peptides by peptide ID.
- peptides.tsv.gz (final output file)
- LCAs_original.tsv.gz (intermediary file)
- LCAs_equalized.tsv.gz (intermediary file)
- FAs_original.tsv.gz (intermediary file)
- FAs_equalized.tsv.gz (intermediary file)
- sequences.tsv.gz (intermediary file)
None
This function starts by creating four different FIFO files.
- olcas: This temporary FIFO file contains a mapping between peptide sequence IDs and LCA IDs for the case in which the amino acids I and L are considered to be different:
000000000001 1
000000000002 87882
000000000003 272568
000000000004 7227
000000000006 9606
000000000007 502779
- elcas: Similar to the olcas FIFO, but for the case in which I and L are considered equal:
000000000001 1
000000000002 87882
000000000003 272568
000000000005 7227
000000000006 9606
000000000007 502779
- ofas: This temporary FIFO file contains a mapping between peptide sequence IDs and a functional annotations object (for the case in which I and L are considered to be different):
000000000001 {"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002 {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
- efas: Similar to the ofas FIFO, but for the case in which I and L are considered equal:
000000000001 {"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002 {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
- The fifth step in this function performs most of the work and joins all FIFOs created above.
- We start with the intermediary sequences.tsv.gz file we generated previously:
000000000001 AAAAA
000000000002 AAAAAA
000000000003 AAAAAAAAA
000000000004 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
000000000005 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
000000000006 AAAAAAAAAAAAAAAAGATCLER
000000000007 AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
- During the first join operation, we join the sequences with the olcas FIFO, which generates the following data (the LCA of the original peptide sequence is now added to the file):
000000000001 AAAAA 1
000000000002 AAAAAA 87882
000000000003 AAAAAAAAA 272568
000000000004 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK 7227
000000000005 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK \N
- The second join operation does something similar and adds the LCA of the equalized peptide sequence to the file:
000000000001 AAAAA 1 1
000000000002 AAAAAA 87882 87882
000000000003 AAAAAAAAA 272568 272568
000000000004 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK 7227 \N
000000000005 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK \N 7227
000000000006 AAAAAAAAAAAAAAAAGATCLER 9606 9606
- The third join adds the functional annotations of the original peptide sequence:
000000000001 AAAAA 1 1 {"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002 AAAAAA 87882 87882 {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
- And finally, the fourth join adds the functional annotations of the equalized peptide sequence:
000000000001 AAAAA 1 1 {"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}} {"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002 AAAAAA 87882 87882 {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}} {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
The final result (as can be seen above) is compressed and written to the final output file sequences.tsv.gz.
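The chain of joins described above can be sketched with the standard join utility. The example below is a reduced illustration with made-up data that only chains the first two joins (olcas and elcas); the functional annotation joins would follow the same pattern. It assumes that missing values are filled with \N, as in the sample output above, and uses regular files where the real script uses FIFOs.

```sh
tab=$(printf '\t')
workdir=$(mktemp -d)

# Made-up miniature inputs, all sorted by (zero-padded) sequence ID.
printf '000000000001\tAAAAA\n000000000002\tAAAAAA\n000000000003\tAAAAIK\n' > "$workdir/sequences"
printf '000000000001\t1\n000000000003\t7227\n' > "$workdir/olcas"
printf '000000000001\t1\n000000000002\t87882\n' > "$workdir/elcas"

# -a 1 keeps sequences without a match (left outer join); -e fills the
# missing column with \N; -o fixes the output columns so that -e applies.
joined=$(
    LC_ALL=C join -t "$tab" -a 1 -e '\N' -o 1.1,1.2,2.2 \
        "$workdir/sequences" "$workdir/olcas" |
    LC_ALL=C join -t "$tab" -a 1 -e '\N' -o 1.1,1.2,1.3,2.2 \
        - "$workdir/elcas"
)
rm -rf "$workdir"
printf '%s\n' "$joined"
```

Zero-padding the sequence IDs keeps their textual order identical to their numeric order, which join relies on because it expects both inputs to be sorted on the join field.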
- sequences.tsv.gz (final output file)