Intermediary files

Pieter Verschaffelt edited this page Mar 22, 2023 · 3 revisions

This page lists all intermediary files that are generated by the Unipept database construction script. Most of these files are gzip-compressed TSV files, each representing a specific step in the construction of the peptide databases.

FAs_equalized.tsv.gz

Every line in this file corresponds to a mapping between a peptide and a JSON-object containing all functional annotations associated with this peptide (assuming that the amino acids I and L are equal).

Example

1	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
2	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
3	{"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}}

Columns

  1. peptide_id: The ID of the peptide that this aggregation of functional annotations is associated with.
  2. functional_annotations: A JSON-object that describes all functional annotations that this peptide is associated with, including a count value. This count reports how many proteins (in which the original peptide occurs) are associated with the specific function.
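
As an illustration of the layout described above, the following minimal sketch parses one line of this file and groups the counts by annotation namespace. The function name `parse_fa_line` is hypothetical, and the sample line is an abbreviated version of the third example row; this is not code taken from the construction script.

```python
import json
from collections import defaultdict


def parse_fa_line(line):
    """Split one TSV line into its peptide ID, summary totals and grouped counts."""
    peptide_id, payload = line.rstrip("\n").split("\t", 1)
    annotations = json.loads(payload)

    grouped = defaultdict(dict)
    for term, count in annotations["data"].items():
        # Every key starts with its namespace: "GO:", "EC:" or "IPR:".
        namespace = term.split(":", 1)[0]
        grouped[namespace][term] = count

    return int(peptide_id), annotations["num"], dict(grouped)


# Abbreviated sample row (peptide 3 from the example above).
line = '3\t{"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"EC:":1,"IPR:IPR000307":1}}'
peptide_id, totals, grouped = parse_fa_line(line)
```

Note that the example rows contain an `EC:` key without a number; the sketch keeps it as is rather than guessing at its meaning.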

FAs_original.tsv.gz

Every line in this file corresponds to a mapping between a peptide and a JSON-object containing all functional annotations associated with this peptide (assuming that the amino acids I and L are not equal).

Example

1	{"num":{"all":17,"EC":17,"GO":17,"IPR":17},"data":{"GO:0004477":14,"GO:0004488":14,"GO:0000105":14,"GO:0009086":14,"GO:0006164":14,"GO:0035999":14,"EC:1.5.1.5":14,"EC:3.5.4.9":14,"IPR:IPR046346":14,"IPR:IPR036291":14,"IPR:IPR000672":14,"IPR:IPR020630":14,"IPR:IPR020867":14,"IPR:IPR020631":14,"GO:0033201":1,"GO:0004373":1,"GO:0009011":1,"GO:0005978":1,"EC:2.4.1.21":1,"IPR:IPR001296":1,"IPR:IPR011835":1,"IPR:IPR013534":1,"GO:0044219":1,"GO:0039617":1,"GO:0003677":2,"GO:0005525":1,"GO:0003723":1,"GO:0005198":1,"GO:0046740":1,"EC:":2,"IPR:IPR003181":1,"IPR:IPR003182":1,"IPR:IPR029053":1,"GO:0000428":1,"GO:0001216":1,"GO:0016779":1,"GO:0016987":1,"GO:0006352":1,"GO:0009399":1,"IPR:IPR000394":1,"IPR:IPR007046":1,"IPR:IPR007634":1,"IPR:IPR038709":1}}
2	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}

Columns

  1. peptide_id: The ID of the peptide that this aggregation of functional annotations is associated with.
  2. functional_annotations: A JSON-object that describes all functional annotations that this peptide is associated with, including a count value. This count reports how many proteins (in which the original peptide occurs) are associated with the specific function.

LCAs_equalized.tsv.gz

Every line in this file maps a peptide to its lowest common ancestor (LCA). The taxa of all proteins in which the peptide occurs are collected, after which the lowest common ancestor of these taxa is calculated and linked to the peptide. The amino acids I and L are assumed to be equal for the peptides in this file.

Example

1	1
2	87882
3	272568
5	7227
6	9606
7	502779

Columns

  1. peptide_id: The ID of the peptide that this LCA is associated with.
  2. lca: The NCBI taxon ID for the lowest common ancestor of this peptide.
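
The idea behind the LCA calculation can be sketched as follows, assuming that the lineage of every taxon is available as a list of taxon IDs ordered from the root downwards. The function `lowest_common_ancestor` is hypothetical, and all taxon IDs in the example other than the root (1) are made up. The actual script may apply additional rules that this page does not describe.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon ID shared by all lineages (each listed root first)."""
    if not lineages:
        raise ValueError("at least one lineage is required")

    lca = None
    # Walk down all lineages in parallel and stop at the first disagreement.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


# Toy lineages with made-up IDs: two taxa that share the ancestor 10.
shared = lowest_common_ancestor([[1, 10, 100], [1, 10, 101]])
# Two taxa that only share the root.
root_only = lowest_common_ancestor([[1, 10, 100], [1, 11]])
```

A peptide that occurs in proteins from unrelated organisms ends up at the root, which is consistent with the first example row above (peptide 1 mapped to taxon 1).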

LCAs_original.tsv.gz

Every line in this file maps a peptide to its lowest common ancestor (LCA). The taxa of all proteins in which the peptide occurs are collected, after which the lowest common ancestor of these taxa is calculated and linked to the peptide. The amino acids I and L are assumed not to be equal for the peptides in this file.

Example

1	1
2	87882
3	272568
4	7227
6	9606

Columns

  1. peptide_id: The ID of the peptide that this LCA is associated with.
  2. lca: The NCBI taxon ID for the lowest common ancestor of this peptide.

aa_sequence_taxon_equalized.tsv.gz

Each line in this file corresponds to a mapping between a tryptic peptide sequence and the ID of the UniProt entry that it originated from. One peptide is typically linked to multiple UniProt entries (since a tryptic peptide is typically found in more than one protein). This file assumes that the amino acids I and L are equal (thus a peptide AALI will also be matched with a protein that contains AALL or AAII).

Example

AAAAA	1063
AAAAA	243160
AAAAA	271848
AAAAA	272560
AAAAA	31716
AAAAA	320372
AAAAA	320373

Columns

  1. peptide_sequence: The tryptic peptide sequence as previously digested from a protein.
  2. uniprot_entry_id: ID of the UniProt entry from which the tryptic peptide sequence originated.
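
The effect of equating I and L can be illustrated with a short sketch. The helper names `equalize` and `group_by_equalized` are hypothetical, and the IDs in the sample rows are made up; the snippet only shows why AALI, AALL and AAII end up under the same key.

```python
from collections import defaultdict


def equalize(peptide):
    """Replace every isoleucine (I) by leucine (L)."""
    return peptide.replace("I", "L")


def group_by_equalized(rows):
    """Collect the IDs in the second column under the equalized peptide sequence."""
    grouped = defaultdict(set)
    for peptide, entry_id in rows:
        grouped[equalize(peptide)].add(entry_id)
    return dict(grouped)


# Made-up rows: three spellings that differ only in I versus L.
rows = [("AALI", 10), ("AALL", 20), ("AAII", 30), ("AAAAA", 40)]
grouped = group_by_equalized(rows)
```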

aa_sequence_taxon_original.tsv.gz

Each line in this file corresponds to a mapping between a tryptic peptide sequence and the ID of the UniProt entry that it originated from. One peptide is typically linked to multiple UniProt entries (since a tryptic peptide is typically found in more than one protein). This file assumes that the amino acids I and L are not equal.

Example

AAAAA	1063
AAAAA	243160
AAAAA	271848
AAAAA	272560
AAAAA	31716
AAAAA	320372

Columns

  1. peptide_sequence: The tryptic peptide sequence as previously digested from a protein.
  2. uniprot_entry_id: ID of the UniProt entry from which the tryptic peptide sequence originated.

peptides.tsv.gz

This file contains all tryptic peptides that result from an in-silico tryptic digest of all input proteins. Each line holds both the original sequence and the equalized sequence (in which every I is replaced by L).

Example

1	GGLSVPGPMGPSGPR	GGISVPGPMGPSGPR	1	GO:0005615;EC:;IPR:IPR008160
2	GLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPR	GLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPR	1	GO:0005615;EC:;IPR:IPR008160
3	GPPGPPGK	GPPGPPGK	1	GO:0005615;EC:;IPR:IPR008160
4	NGDDGEAGKPGRPGER	NGDDGEAGKPGRPGER	1	GO:0005615;EC:;IPR:IPR008160
5	GPPGPQGAR	GPPGPQGAR	1	GO:0005615;EC:;IPR:IPR008160

Columns

  1. id: A temporary ID used to identify each of these lines.
  2. equalized_peptide_sequence: A version of the peptide sequence in which all I's are replaced by L.
  3. original_peptide_sequence: The original tryptic peptide sequence (no replacements have been made here).
  4. uniprot_entry_id: ID of the UniProt entry from which this digested tryptic peptide sequence originates.
  5. functional_annotations: A list of all functional annotations of the protein from which this tryptic peptide originates (this is thus restricted to one protein here!).
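
A minimal sketch of how rows with these five columns could be produced is shown below. It assumes the common trypsin rule (cleave after K or R unless the next residue is P) and a peptide length window of 5 to 50 residues; both are assumptions, since this page does not state the exact rules used by the script. The function names and the input protein (a concatenation of two peptides from the example above) are hypothetical.

```python
import re


def tryptic_digest(protein, min_len=5, max_len=50):
    """Cleave after K or R unless followed by P; keep fragments within the length window."""
    # Splitting on a zero-width pattern requires Python 3.7 or newer.
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    return [f for f in fragments if min_len <= len(f) <= max_len]


def peptide_rows(protein, uniprot_entry_id, annotations, first_id=1):
    """Build (id, equalized, original, uniprot_entry_id, annotations) tuples."""
    rows = []
    for offset, original in enumerate(tryptic_digest(protein)):
        equalized = original.replace("I", "L")
        rows.append((first_id + offset, equalized, original, uniprot_entry_id, annotations))
    return rows


# Hypothetical protein built from two peptides of the example above.
rows = peptide_rows("GGISVPGPMGPSGPRGPPGPPGK", 1, "GO:0005615;EC:;IPR:IPR008160")
```

Every row carries the annotations of a single protein, which matches the restriction noted for the last column.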

peptides_by_equalized.tsv.gz

Example

11161754	1	AAAAA	470096	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161773	1	AAAAA	470097	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161792	1	AAAAA	470098	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161811	1	AAAAA	470099	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631

Columns

  1. id: A temporary ID used to identify each of these lines.
  2. equalized_sequence_id: ID of this peptide sequence (where I and L are considered to be equal), as used in the sequences.tsv.gz file.
  3. original_sequence: Original tryptic peptide sequence (where I and L are considered to be different).
  4. uniprot_entry_id: ID of the UniProt entry from which this peptide originates.
  5. functional_annotations: List of functional annotations associated with this peptide, for this UniProt entry.
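
Because this file holds one row per peptide and UniProt entry, the per-annotation counts found in the `data` object of FAs_equalized.tsv.gz can be understood as a tally over these rows. The sketch below illustrates that tally under this assumption; `count_annotations` is a hypothetical name, the sample rows are abbreviated, and the `num` totals are not reproduced because this page does not describe how they are derived.

```python
from collections import Counter, defaultdict


def count_annotations(rows):
    """Count, per sequence ID, in how many rows (proteins) each annotation occurs."""
    counts = defaultdict(Counter)
    for sequence_id, annotations in rows:
        # A set ensures an annotation is counted at most once per protein.
        terms = {term for term in annotations.split(";") if term}
        counts[sequence_id].update(terms)
    return counts


# Abbreviated sample: (equalized_sequence_id, functional_annotations) pairs.
rows = [
    (1, "GO:0004477;EC:1.5.1.5"),
    (1, "GO:0004477;EC:1.5.1.5"),
    (1, "GO:0005737;EC:"),
]
counts = count_annotations(rows)
```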

peptides_by_original.tsv.gz

Example

11161754	1	1	470096	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161773	1	1	470097	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161792	1	1	470098	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631
11161811	1	1	470099	GO:0004477;GO:0004488;GO:0000105;GO:0009086;GO:0006164;GO:0035999;EC:1.5.1.5;EC:3.5.4.9;IPR:IPR046346;IPR:IPR036291;IPR:IPR000672;IPR:IPR020630;IPR:IPR020867;IPR:IPR020631

Columns

  1. id: A temporary ID used to identify each of these lines.
  2. equalized_sequence_id: ID of this peptide sequence (where I and L are considered to be equal), as used in the sequences.tsv.gz file.
  3. original_sequence_id: ID of this peptide sequence (where I and L are considered not to be equal), as used in the sequences.tsv.gz file.
  4. uniprot_entry_id: ID of the UniProt entry from which this peptide originates.
  5. functional_annotations: List of functional annotations associated with this peptide, for this UniProt entry.

sequences.tsv.gz

This file contains a list of all tryptic peptides that result from an in-silico tryptic digest of the provided input databases. A unique identifier is generated for each of these sequences, and other files refer to sequences by this identifier.

Example

1	AAAAA
2	AAAAAA
3	AAAAAAAAA
4	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK
5	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSALSPGSK
6	AAAAAAAAAAAAAAAAGATCLER
7	AAAAAAAAAAAAAAAAGVGGMGELGVNGEK

Columns

  1. sequence_id: A unique identifier for this tryptic peptide sequence.
  2. peptide_sequence: The tryptic peptide sequence itself.
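
Since other files refer to peptides by `sequence_id`, a lookup table is often the first thing needed when inspecting them. The following sketch loads such a table with the Python standard library; `load_sequences` is a hypothetical helper, and the demonstration writes a tiny file with two rows from the example above instead of reading a real build output.

```python
import csv
import gzip
import os
import tempfile


def load_sequences(path):
    """Read a gzip-compressed two-column TSV into a {sequence_id: sequence} dict."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return {int(sequence_id): sequence for sequence_id, sequence in reader}


# Demonstration on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sequences.tsv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("1\tAAAAA\n2\tAAAAAA\n")
    sequences = load_sequences(path)
```

For a full build this table may be too large to hold in memory, in which case streaming the file line by line is the safer choice.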