Build Database (sh)

This is the most important script in this repository: it orchestrates the complete database construction process. Everything needed to understand what the script consists of, and what each step or function does, is explained in detail below.

Variables

Sources

DB_TYPES: An array containing all database types that should be processed by this script.
DB_SOURCES: An array containing all database URLs that should be processed by this script. The i-th URL in this array should correspond to the i-th database type in the DB_TYPES array.
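
As a minimal sketch of how two such parallel arrays could be walked in Bash, the loop below pairs each database type with its URL by index. The type names and the example.org URLs are placeholders chosen for illustration; they are not values taken from the script.

```bash
# Hypothetical values, for illustration only.
DB_TYPES=("swissprot" "trembl")
DB_SOURCES=("https://example.org/swissprot.xml.gz" "https://example.org/trembl.xml.gz")

# Both arrays must have the same length, otherwise the pairing is ambiguous.
if [ "${#DB_TYPES[@]}" -ne "${#DB_SOURCES[@]}" ]; then
  echo "DB_TYPES and DB_SOURCES must have the same length" >&2
  exit 1
fi

# Walk the indices of DB_TYPES and look up the URL at the same position.
pairs=""
for i in "${!DB_TYPES[@]}"; do
  pairs+="${DB_TYPES[$i]} <- ${DB_SOURCES[$i]}"$'\n'
done
printf '%s' "$pairs"
```

Checking the lengths up front is a design choice of this sketch: a mismatch would otherwise only surface as an empty URL halfway through the build.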

Tryptic digest

PEPTIDE_MIN_LENGTH: The minimum length (inclusive) for tryptic peptides.
PEPTIDE_MAX_LENGTH: The maximum length (inclusive) for tryptic peptides.

File storage locations

OUTPUT_DIR: Folder in which the final TSV-files should be stored.
INTDIR: Folder in which intermediate TSV-files should be stored (these are large, will be written once and read multiple times).
TEMP_DIR: Folder in which temporary files, required by this script, can be stored.

System and memory configuration

JAVA_MEM: The maximum amount of memory that a single Java process is allowed to use. Note that up to two Java processes can be executed simultaneously.
CMD_SORT: The particular Unix sort command that should be used (including relevant options).
CMD_GZIP: The particular pipe compression command that should be used (including relevant options).
ENTREZ_BATCH_SIZE: The size of requests that should be used to communicate with Entrez.

Resources

TAXON_URL: URL of an NCBI taxon dump that adheres to the file format described here.
EC_CLASS_URL: URL of a file with a listing of EC-numbers on the class, subclass and subsubclass level (including their associated name). Must adhere to the file format described here.
EC_NUMBER_URL: URL of a file with a listing of EC-numbers on the deepest level of the ontology. Must adhere to the file format described here.
GO_TERM_URL: URL of a file with all GO-terms.
INTERPRO_URL: URL of a file with all InterPro-entries.

Functions

`create_taxon_tables`

Input files

taxdmp.zip
- names.dmp
- nodes.dmp

Helper scripts

None

Implementation

Downloads a dump containing all NCBI taxa identifiers (including associated names and ranks) and converts these to two final output tables (taxons.tsv.gz and lineages.tsv.gz).

Output

taxons.tsv.gz (final output file)
lineages.tsv.gz (final output file)

`download_and_convert_all_sources`

Input files

None

Helper scripts

Implementation

This function will check if a valid Unipept index already exists for each of the provided database types (and URLs). If this is not the case, a new index will be created for each new database.

The function checks the cache E-Tag that is present on the database source URL in order to detect if the current version of the database is outdated.
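
The E-Tag comparison could look roughly like the sketch below. The function names `extract_etag` and `index_is_outdated`, and the idea of keeping the last seen E-Tag in a small file next to the index, are assumptions made for this example and may differ from what the script does.

```bash
# Print the ETag value found in HTTP response headers read from stdin.
extract_etag() {
  tr -d '\r' | awk 'tolower($1) == "etag:" { print $2; exit }'
}

# Succeed (exit status 0) when no E-Tag was stored yet, or when the
# stored E-Tag differs from the remote one.
index_is_outdated() {
  stored_etag_file=$1
  remote_etag=$2
  [ ! -f "$stored_etag_file" ] || [ "$(cat "$stored_etag_file")" != "$remote_etag" ]
}

# A real run would fetch the headers first, for example:
#   remote=$(curl --silent --head "$url" | extract_etag)
# Here a canned response is used so the sketch runs without network access.
sample_headers=$(printf 'HTTP/1.1 200 OK\r\nContent-Type: application/gzip\r\nETag: "abc123"\r\n')
remote=$(printf '%s\n' "$sample_headers" | extract_etag)

etag_file=$(mktemp)
printf '%s\n' '"old-version"' > "$etag_file"
if index_is_outdated "$etag_file" "$remote"; then
  # The index would be rebuilt here; afterwards the new E-Tag is remembered.
  printf '%s\n' "$remote" > "$etag_file"
fi
```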

Output

For each of the input database types (and URLs), the function will create a matching reusable Unipept Database Index folder.

`create_tables_and_filter`

This function simply chains filter_sources_by_taxa and create_most_tables together.

`filter_sources_by_taxa`

Input files

Unipept Database Index

Helper scripts

filter_taxa.sh

Implementation

This function writes to stdout all proteins from the input Unipept Database Index that are associated with one of the taxa listed in the TAXA variable. Proteins associated with a taxon that descends from one of the taxa in TAXA are also kept in the output.
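
To illustrate the "descendants are kept too" rule, here is a small sketch with awk. It is not the logic of filter_taxa.sh: the lineage file layout (taxon ID, then its root-to-taxon path separated by spaces), the protein layout (accession, taxon ID) and the comma-separated TAXA value are all assumptions made for this example.

```bash
tmp=$(mktemp -d)
# Hypothetical lineage file: taxon ID <TAB> path from the root down to that taxon.
printf '9606\t1 2759 9606\n10090\t1 2759 10090\n562\t1 2 562\n' > "$tmp/lineages.tsv"
# Hypothetical protein file: accession <TAB> taxon ID.
printf 'P1\t9606\nP2\t562\nP3\t10090\n' > "$tmp/proteins.tsv"

TAXA="2759"   # keep everything at or below taxon 2759

kept=$(awk -F '\t' -v taxa="$TAXA" '
  BEGIN { n = split(taxa, t, ","); for (i = 1; i <= n; i++) wanted[t[i]] = 1 }
  # First file: a taxon is kept when any taxon on its path is wanted.
  NR == FNR {
    m = split($2, path, " ")
    for (i = 1; i <= m; i++) if (path[i] in wanted) keep[$1] = 1
    next
  }
  # Second file: print only proteins whose taxon is kept.
  ($2 in keep)
' "$tmp/lineages.tsv" "$tmp/proteins.tsv")
rm -r "$tmp"
printf '%s\n' "$kept"
```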

Output

stdout: Produces a TSV on stdout containing one protein per line, according to the format described here.

`create_most_tables`

Input files

stdin: Reads a TSV from stdin containing one protein per line, according to the format described here.
taxons.tsv.gz (final output file)

Helper scripts

TaxonsUniprots2Tables.jar

Implementation

This function reads all protein data from stdin and passes it on to the helper program listed above. After parsing the TSV input, TaxonsUniprots2Tables.jar generates a collection of compressed TSV-files that are used further along in this script. The most important of these is the peptides.tsv.gz file, which lists all tryptic peptides produced by the in-silico tryptic digest that TaxonsUniprots2Tables.jar performs.
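
The digest itself happens inside the Java helper, but the idea can be sketched in a few lines of awk. The sketch assumes the common trypsin rule (cleave after K or R, unless the next residue is P) and applies the inclusive length bounds described for PEPTIDE_MIN_LENGTH and PEPTIDE_MAX_LENGTH; the function name `digest`, the concrete bounds and the sample protein are made up for illustration.

```bash
PEPTIDE_MIN_LENGTH=5    # illustrative value
PEPTIDE_MAX_LENGTH=50   # illustrative value

# Read one protein sequence per line on stdin, print its tryptic peptides.
digest() {
  awk -v min="$PEPTIDE_MIN_LENGTH" -v max="$PEPTIDE_MAX_LENGTH" '
  {
    n = length($0); start = 1
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1); next_c = substr($0, i + 1, 1)
      # Cleave at the end of the protein, or after K/R when no P follows.
      if (i == n || ((c == "K" || c == "R") && next_c != "P")) {
        len = i - start + 1
        if (len >= min && len <= max) print substr($0, start, len)
        start = i + 1
      }
    }
  }'
}

peptides=$(printf 'MKAAAAAKRPGGGGKLLLLLR\n' | digest)
printf '%s\n' "$peptides"
```

In this run the fragment `MK` is dropped because it is shorter than the minimum length, and no cut is made between `R` and `P`.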

Output files

uniprot_entries.tsv.gz (final output file)
ec_cross_references.tsv.gz (final output file)
go_cross_references.tsv.gz (final output file)
interpro_cross_references.tsv.gz (final output file)
peptides.tsv.gz (intermediary file)

`join_equalized_pepts_and_entries`

Input files

peptides.tsv.gz (intermediary file)
uniprot_entries.tsv.gz (final output file)

Helper scripts

None

Implementation

This function starts by creating two FIFO files and writing data to them:

peptides_eq: By reading in the intermediary file peptides.tsv.gz and extracting only the equalized peptide sequence and the associated UniProt entry ID, we end up with this data for peptides_eq (each line consists of the UniProt entry ID and equalized peptide sequence):

000000000001	GGLSVPGPMGPSGPR
000000000001	GLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPR
000000000001	GPPGPPGK

entries_eq: By reading in the final output file uniprot_entries.tsv.gz and extracting only the UniProt entry ID and the associated NCBI taxon ID, we end up with this data for entries_eq (each line consists of the UniProt entry ID and the NCBI taxon ID):

000000000001	2546662
000000000002	2546656
000000000003	2591764
000000000004	2546663
000000000005	2587412

The third step in the implementation of this function joins both of the files generated above by UniProt entry ID (effectively coupling equalized peptide sequences and NCBI taxon IDs) and then sorts the output by peptide sequence. The final result is written to the intermediary file aa_sequence_taxon_equalized.tsv.gz:

AAAAA	1063
AAAAA	243160
AAAAA	271848
AAAAA	272560
AAAAA	31716
AAAAA	32037
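
The three steps above can be sketched with two named pipes and the standard `join` and `sort` tools. This is a simplified illustration on a few hand-written lines: the real function reads from the compressed files and uses the sort command configured in CMD_SORT, and the temporary directory used here is an assumption of the sketch.

```bash
tab=$(printf '\t')
tmp=$(mktemp -d)
mkfifo "$tmp/peptides_eq" "$tmp/entries_eq"

# Writers run in the background; each blocks until join opens its FIFO.
printf '000000000001\tGGLSVPGPMGPSGPR\n000000000001\tGPPGPPGK\n000000000002\tAAAAA\n' > "$tmp/peptides_eq" &
printf '000000000001\t2546662\n000000000002\t2546656\n' > "$tmp/entries_eq" &

# Join on the UniProt entry ID (column 1 of both inputs), keep only the
# peptide sequence and the taxon ID, then sort by peptide sequence.
joined=$(LC_ALL=C join -t "$tab" -o 1.2,2.2 "$tmp/peptides_eq" "$tmp/entries_eq" | LC_ALL=C sort)
wait
rm -r "$tmp"
printf '%s\n' "$joined"
```

Using FIFOs instead of regular files avoids writing the two extracted columns to disk, which matters when the inputs are as large as the intermediate files described here.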

Output files

aa_sequence_taxon_equalized.tsv.gz (intermediary file)

`join_original_pepts_and_entries`

Input files

peptides.tsv.gz (intermediary file)
uniprot_entries.tsv.gz (final output file)

Helper scripts

None

Implementation

This function is very similar to join_equalized_pepts_and_entries, but joins the original peptide sequences (where I and L are considered to be different) with NCBI taxa IDs (instead of the equalized peptide sequences).

Output files

aa_sequence_taxon_original.tsv.gz (intermediary file)

`number_sequences`

Input files

aa_sequence_taxon_equalized.tsv.gz (intermediary file)
aa_sequence_taxon_original.tsv.gz (intermediary file)

Helper scripts

None

Implementation

This function starts by creating two FIFO files and writing data to them:

equalized: By reading in the intermediary file aa_sequence_taxon_equalized.tsv.gz, this function extracts the first column (containing the peptide sequences), sorts the sequences and writes only the unique equalized peptides to equalized.

AAAAA
AAAAAA
AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
AAAAAAAAAAAAAAAAGATCLER
AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
AAAAAAAAAAAAAAAASAGGK
AAAAAAAAAAAAAAAGAAGK
AAAAAAAAAAAAAAAGAGAGAK

original: By reading in the intermediary file aa_sequence_taxon_original.tsv.gz, this function extracts the first column (containing the peptide sequences), sorts the sequences and writes only the unique original peptides to original.

Finally, both of these temporary FIFO files are combined. Only the unique sequences (over the two files) are kept, sorted alphabetically, numbered and written to the intermediary file sequences.tsv.gz:

1	AAAAA
2	AAAAAA
3	AAAAAAAAA
4	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
5	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
6	AAAAAAAAAAAAAAAAGATCLER
7	AAAAAAAAAAAAAAAAGVGGMGELGVNGEK
8	AAAAAAAAAAAAAAAASAGGK
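
A compact way to picture the merge-and-number step is shown below. It assumes both FIFOs already deliver sorted, unique sequences, so a merge (`sort -m`) with duplicate removal is enough; the shortened peptide sequences and the temporary directory are invented for the example.

```bash
tmp=$(mktemp -d)
mkfifo "$tmp/equalized" "$tmp/original"

# Both inputs are already sorted and free of duplicates.
printf 'AAAAA\nAAAAAA\nSALSPGSK\n' > "$tmp/equalized" &
printf 'AAAAA\nSAISPGSK\nSALSPGSK\n' > "$tmp/original" &

# Merge the two sorted streams, drop duplicates, and prefix a running number.
numbered=$(LC_ALL=C sort -m -u "$tmp/equalized" "$tmp/original" |
  awk '{ printf "%d\t%s\n", NR, $0 }')
wait
rm -r "$tmp"
printf '%s\n' "$numbered"
```

Note how `SAISPGSK` only occurs in the original list, yet still receives its own number, just like sequences 4 and 5 in the listing above.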

Output files

sequences.tsv.gz (intermediary file)

`calculate_equalized_lcas`

Input files

sequences.tsv.gz (intermediary file)
aa_sequence_taxon_equalized.tsv.gz (intermediary file)
lineages.tsv.gz (final output file)

Helper scripts

LineagesSequencesTaxons2LCAs.jar

Implementation

The function starts by joining the intermediary sequences.tsv.gz and aa_sequence_taxon_equalized.tsv.gz files by peptide sequence.

Next, the joined data is sent via stdin to the LineagesSequencesTaxons2LCAs.jar application for further processing. Since most peptides occur multiple times in this input, the application can collect and aggregate all taxon IDs that belong to one peptide and compute the lowest common ancestor for each peptide. Its output is sent to stdout.

Finally, this output is compressed and written to a new intermediary file LCAs_equalized.tsv.gz.
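
The aggregation idea can be sketched in awk as "the deepest taxon shared by all lineages of a peptide". This is a deliberately naive version: the real helper is a Java program and may apply additional rules, and the lineage layout used here (taxon ID, then its root-to-taxon path separated by spaces) is an assumption of the sketch, not the format of lineages.tsv.gz.

```bash
tmp=$(mktemp -d)
# Hypothetical lineage file: taxon ID <TAB> path from the root down to that taxon.
printf '9606\t1 2759 9606\n10090\t1 2759 10090\n562\t1 2 562\n' > "$tmp/lineages.tsv"
# Peptide/taxon pairs, grouped by peptide (as they would be after sorting).
printf 'AAAAA\t9606\nAAAAA\t10090\nAAAAA\t562\nGPPGPPGK\t9606\nGPPGPPGK\t10090\n' > "$tmp/pairs.tsv"

lcas=$(awk -F '\t' '
  # First file: remember the path of every taxon.
  NR == FNR { path[$1] = $2; next }
  # Second file: a new peptide closes the previous group.
  $1 != current {
    if (current != "") print current "\t" lca
    current = $1
    n = split(path[$2], common, " ")
    lca = common[n]
    next
  }
  # Same peptide: shrink the shared prefix of the paths seen so far.
  {
    m = split(path[$2], other, " ")
    k = 0
    while (k < n && k < m && common[k + 1] == other[k + 1]) k++
    n = k
    lca = common[n]
  }
  END { if (current != "") print current "\t" lca }
' "$tmp/lineages.tsv" "$tmp/pairs.tsv")
rm -r "$tmp"
printf '%s\n' "$lcas"
```

Because the input is grouped by peptide, only the current shared path has to be kept in memory, which is what makes a streaming approach feasible for input of this size.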

Output files

LCAs_equalized.tsv.gz (intermediary file)

`calculate_original_lcas`

Input files

sequences.tsv.gz (intermediary file)
aa_sequence_taxon_original.tsv.gz (intermediary file)
lineages.tsv.gz (final output file)

Helper scripts

LineagesSequencesTaxons2LCAs.jar

Implementation

This function works identically to calculate_equalized_lcas, but reads aa_sequence_taxon_original.tsv.gz and writes its result to LCAs_original.tsv.gz.

Output files

LCAs_original.tsv.gz (intermediary file)

`substitute_equalized_aas`

Input files

peptides.tsv.gz (intermediary file)
sequences.tsv.gz (intermediary file)

Helper scripts

None

Implementation

First, this function reads the peptides.tsv.gz file and sorts it by the equalized peptide sequence (2nd column).
Second, once the sorting is complete, the sorted peptides are joined with the sequences.tsv.gz file on the peptide sequence. This produces the following result, which is compressed and written to the file peptides_by_equalized.tsv.gz:

11161754	1	AAAAA	470096	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161773	1	AAAAA	470097	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161792	1	AAAAA	470098	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161811	1	AAAAA	470099	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
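
A reduced version of this sort-then-join step is sketched below. The column layout of the peptides file (peptide ID, equalized sequence, original sequence, UniProt entry ID, annotations) is inferred from the description and the sample output above and should be treated as an assumption; the rows themselves are invented.

```bash
tab=$(printf '\t')
tmp=$(mktemp -d)
# Assumed layout: peptide ID, equalized seq, original seq, entry ID, annotations.
printf '1\tGPPGPPGK\tGPPGPPGK\t7\tGO:0004477\n2\tAAAAA\tAAAAA\t8\tEC:1.5.1.5\n3\tAAAAA\tAAAAA\t9\tGO:0000105\n' > "$tmp/peptides.tsv"
# Sequence ID, sequence (already sorted by sequence).
printf '1\tAAAAA\n2\tGPPGPPGK\n' > "$tmp/sequences.tsv"

# Sort by the equalized sequence, then join on it and emit the sequence ID
# (2.1) in the place where the equalized sequence used to be.
peptides_by_equalized=$(
  LC_ALL=C sort -t "$tab" -k 2,2 "$tmp/peptides.tsv" |
    LC_ALL=C join -t "$tab" -1 2 -2 2 -o 1.1,2.1,1.3,1.4,1.5 - "$tmp/sequences.tsv"
)
rm -r "$tmp"
printf '%s\n' "$peptides_by_equalized"
```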

Output files

peptides_by_equalized.tsv.gz (intermediary file)

`substitute_original_aas`

Input files

peptides_by_equalized.tsv.gz (intermediary file)
sequences.tsv.gz (intermediary file)

Helper scripts

None

Implementation

The implementation of this function is analogous to that of substitute_equalized_aas.

Output files

peptides_by_original.tsv.gz (intermediary file)

`calculate_equalized_fas`

Input files

peptides_by_equalized.tsv.gz (intermediary file)

Helper scripts

FunctionalAnalysisPeptides.js

Implementation

The function starts by creating a new FIFO file peptides_eq that consists of the following data:

1	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
1	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631

peptides_eq thus contains a list of lines consisting of sequence IDs associated with a list of functional annotations.

Most sequence IDs appear more than once in peptides_eq. All functional annotations that share a sequence ID belong together and are aggregated by the helper script FunctionalAnalysisPeptides.js. This script produces a new intermediary file FAs_equalized.tsv.gz that looks like this:

1	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
2	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
3	{"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}}
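
The counting that sits at the heart of this aggregation can be approximated with awk. The sketch below only counts how often each annotation occurs per sequence ID and prints one line per (sequence ID, annotation) pair; it does not reproduce the JSON layout or the "num" totals that FunctionalAnalysisPeptides.js writes, and the input rows are invented.

```bash
counts=$(printf '1\tGO:0004477;EC:1.5.1.5\n1\tGO:0004477\n2\tIPR:IPR000672\n' |
  awk -F '\t' '
    {
      # Split the annotation list and count each term per sequence ID.
      n = split($2, terms, ";")
      for (i = 1; i <= n; i++) count[$1 "\t" terms[i]]++
    }
    END { for (key in count) print key "\t" count[key] }
  ' | LC_ALL=C sort)
printf '%s\n' "$counts"
```

The final `sort` is there because awk does not guarantee any iteration order for `for (key in count)`.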

Output files

FAs_equalized.tsv.gz (intermediary file)

`calculate_original_fas`

Input files

peptides_by_original.tsv.gz (intermediary file)

Helper scripts

FunctionalAnalysisPeptides.js

Implementation

The implementation of this function is analogous to that of calculate_equalized_fas.

Output files

FAs_original.tsv.gz (intermediary file)

`sort_peptides`

Input files

peptides_by_original.tsv.gz (intermediary file)

Helper scripts

None

Implementation

Simply sorts all peptides by peptide ID.

Output files

peptides.tsv.gz (final output file)

`create_sequence_table`

Input files

LCAs_original.tsv.gz (intermediary file)
LCAs_equalized.tsv.gz (intermediary file)
FAs_original.tsv.gz (intermediary file)
FAs_equalized.tsv.gz (intermediary file)
sequences.tsv.gz (intermediary file)

Helper scripts

None

Implementation

The function starts by creating four different FIFO files.

olcas: This temporary FIFO file contains a mapping between peptide sequence IDs and LCA IDs in the case that the amino acids I and L are considered to be different:

000000000001	1
000000000002	87882
000000000003	272568
000000000004	7227
000000000006	9606
000000000007	502779

elcas: Similar to the olcas FIFO, but for the case that I and L are considered equal:

000000000001	1
000000000002	87882
000000000003	272568
000000000005	7227
000000000006	9606
000000000007	502779

ofas: This temporary FIFO file contains a mapping between peptide sequence IDs and a functional annotations object (in the case that I and L are considered to be different):

000000000001	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}

efas: Similar to the ofas FIFO, but for the case that I and L are considered equal:

000000000001	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}

The fifth step in this function performs most of the work and joins all FIFOs created above.

We start with the intermediary sequences.tsv.gz file we generated previously:

000000000001	AAAAA
000000000002	AAAAAA
000000000003	AAAAAAAAA
000000000004	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
000000000005	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
000000000006	AAAAAAAAAAAAAAAAGATCLER
000000000007	AAAAAAAAAAAAAAAAGVGGMGELGVNGEK

During the first join operation, we join together the sequences and the olcas FIFO which generates the following data (the LCA of the original peptide sequence is now added to the file):

000000000001	AAAAA	1
000000000002	AAAAAA	87882
000000000003	AAAAAAAAA	272568
000000000004	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK	7227
000000000005	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK	\N

The second join operation does something similar and adds the LCA of the equalized peptide sequence to the file:

000000000001	AAAAA	1	1
000000000002	AAAAAA	87882	87882
000000000003	AAAAAAAAA	272568	272568
000000000004	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK	7227	\N
000000000005	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK	\N	7227
000000000006	AAAAAAAAAAAAAAAAGATCLER	9606	9606

The third join adds the functional annotations of the original peptide sequence:

000000000001	AAAAA	1	1	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002	AAAAAA	87882	87882	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}

And finally, the fourth join adds the functional annotations of the equalized peptide sequence:

000000000001	AAAAA	1	1	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
000000000002	AAAAAA	87882	87882	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}

The final result (as can be seen above) is compressed and written to the final output file sequences.tsv.gz.
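
The chain of joins described above behaves like a series of left joins in which a missing value is filled with `\N`. A minimal sketch of the first two joins, using `join -a 1 -e` on regular files instead of FIFOs and shortened, made-up sequences, could look like this:

```bash
tab=$(printf '\t')
tmp=$(mktemp -d)
printf '000000000001\tAAAAA\n000000000002\tAAAAAA\n000000000004\tSAISPGSK\n000000000005\tSALSPGSK\n' > "$tmp/sequences"
printf '000000000001\t1\n000000000002\t87882\n000000000004\t7227\n' > "$tmp/olcas"
printf '000000000001\t1\n000000000002\t87882\n000000000005\t7227\n' > "$tmp/elcas"

# -a 1 keeps sequences without a match; -e fills the missing field with \N.
table=$(
  LC_ALL=C join -t "$tab" -a 1 -e '\N' -o 1.1,1.2,2.2 "$tmp/sequences" "$tmp/olcas" |
    LC_ALL=C join -t "$tab" -a 1 -e '\N' -o 1.1,1.2,1.3,2.2 - "$tmp/elcas"
)
rm -r "$tmp"
printf '%s\n' "$table"
```

The zero-padded IDs matter here: `join` expects its inputs to be sorted as text on the join field, and padding makes the text order agree with the numeric order. The two joins for the functional annotations would follow the same pattern with one more output column each.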

Output files

sequences.tsv.gz (final output file)