3. Download files
reMap requires a set of object files to run the core commands, along with test samples to train and predict pathways. The test samples can be used either to train or to test the reMap model. Please download these files from XXX. Once you have downloaded the reMap_materials.zip file, unzip it and make sure you obtain the two folders model/ and dataset/, as depicted below:
Note: This tree structure for the directory was generated using the tree command in the terminal (on Linux) and in the command prompt (on Windows).
reMap_materials/
├── model/
│   ├── reMap.pkl
│   ├── hin.pkl
│   ├── pathway2vec_embeddings.npz
│   ├── phi.npz
│   └── sigma.npz
└── dataset/
    ├── biocyc205_tier23_9255_[X, B, y, species].pkl
    ├── biocyc21_[X, B, y, species].pkl
    ├── golden_X.pkl, golden_B.pkl, golden_y.pkl
    ├── cami_X.pkl, cami_B.pkl, cami_y.pkl
    ├── centroid.npz
    ├── features.npz
    ├── rho.npz
    ├── pathway_group.pkl
    ├── idxvocab.pkl
    ├── vocab.pkl
    └── ...
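A quick way to confirm that the archive was unpacked as expected is to check for the two top-level folders. The sketch below is only an illustration: the helper name `missing_folders` is hypothetical and not part of reMap, and it assumes the unzipped directory is called reMap_materials and sits in the current working directory.

```python
from pathlib import Path

# Folder names taken from the tree above; the helper itself is hypothetical.
EXPECTED_DIRS = ("model", "dataset")


def missing_folders(root):
    """Return the expected sub-folders that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the archive was unzipped into ./reMap_materials
    missing = missing_folders("reMap_materials")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Folder layout looks complete.")
```

The same check could be extended to individual files, but the tree above ends with "..." so the full file list is not known from this document.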
A short description of the contents of the above folders is given below.
In the model/ folder, a pre-trained model is provided to predict metabolic pathways using the datasets described in the dataset/ section.
File | Description | Size |
---|---|---|
reMap.pkl | A pretrained model generated using biocyc205_tier23_9255_Xe.pkl and biocyc205_tier23_9255_y.pkl data with 90 pathway communities and 100 enzyme communities. | 105MB |
hin.pkl | A sample heterogeneous information network. | 10.5MB |
pathway2vec_embeddings.npz | A matrix file containing a sample of embeddings using RUST-norm. The rows (22593) correspond to the pathway, enzyme, and compound embeddings and the columns (128) represent the features. These features can be generated using pathway2vec. | 11.6MB |
Here, we show you a visual depiction of some of the object files to help deepen your understanding.
The pathway2vec_embeddings.npz file is a matrix file corresponding to the embeddings of pathways, EC numbers, and compounds. These features are generated using pathway2vec. For example, after including pathway and EC numbers from "biocyc.pkl" in the first column and excluding compounds, the table can be seen as:
Pathway and EC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
L-valine biosynthesis | 0.089106 | 0.092924 | 0.089035 | 0.101823 | 0.072792 | 0.083173 | 0.096259 | 0.064823 | 0.071481 | 0.094392 |
methylquercetin biosynthesis | 0.112329 | 0.075717 | 0.087717 | 0.094391 | 0.081035 | 0.074514 | 0.095572 | 0.072581 | 0.068458 | 0.096449 |
cyanide degradation | 0.073566 | 0.094817 | 0.087664 | 0.099661 | 0.089182 | 0.103727 | 0.093147 | 0.093047 | 0.083330 | 0.095017 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
EC-1.1.1.10 | 0.095318 | 0.094138 | 0.097567 | 0.087115 | 0.084483 | 0.098668 | 0.078173 | 0.091465 | 0.086675 | 0.086497 |
EC-1.1.1.100 | 0.047987 | 0.096748 | 0.092529 | 0.092395 | 0.116745 | 0.092556 | 0.106274 | 0.107414 | 0.079025 | 0.098948 |
EC-1.1.1.101 | 0.090137 | 0.085566 | 0.087589 | 0.089496 | 0.082936 | 0.088855 | 0.083835 | 0.091411 | 0.085721 | 0.090588 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
This is a matrix file corresponding to the communities of pathways. Rows correspond to pathway indices and columns represent community indices. For example, after including pathways from "biocyc.pkl" in the first column, the table can be seen as:
Pathway | 0 | 2 | 6 | 10 | 14 | 19 | 25 | 26 | 81 | 82 |
---|---|---|---|---|---|---|---|---|---|---|
L-valine biosynthesis | 0.000000e+00 | 0.000000 | 0.014566 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.381181 | 0.000000e+00 | 0.000000 |
L-arginine degradation VI (arginase 2 pathway) | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.268884 |
cyclopropane fatty acid (CFA) biosynthesis | 0.000000e+00 | 1.945544 | 0.000000 | 1.159008e-36 | 5.421540e-71 | 1.427154e-200 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 |
palmitate biosynthesis II (bacteria and plants) | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.476554 | 5.870750e-308 | 0.000000 |
jasmonoyl-amino acid conjugates biosynthesis I | 1.485445e-316 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.700195 | 0.000000 | 0.000000e+00 | 0.000000 |
It can be seen that the L-valine biosynthesis pathway is most likely to be grouped under the community indexed by 26 (the highest value).
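The "most likely community" reading above amounts to taking the column with the largest value in a row. The following minimal sketch reproduces that for the L-valine biosynthesis row, using only the ten community columns shown in the table (the full matrix has more columns). The function name `most_likely_community` is an illustrative assumption, not a reMap API.

```python
# Community indices (column headers) and scores copied from the
# "L-valine biosynthesis" row of the table above.
community_ids = [0, 2, 6, 10, 14, 19, 25, 26, 81, 82]
scores = [0.0, 0.0, 0.014566, 0.0, 0.0, 0.0, 0.0, 0.381181, 0.0, 0.0]


def most_likely_community(ids, row):
    """Return the community index whose score is largest in `row`."""
    best = max(range(len(row)), key=lambda i: row[i])
    return ids[best]


print(most_likely_community(community_ids, scores))  # 26
```

The same function applies unchanged to a row of the enzyme community matrix.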
This is a matrix file corresponding to the communities of enzymes (represented by EC number indices). Rows correspond to EC number indices and columns represent community indices. For example, after including EC numbers from "biocyc.pkl" in the first column, the table can be seen as:
EC | 8 | 23 | 27 | 34 | 54 | 58 | 62 | 72 | 82 | 98 |
---|---|---|---|---|---|---|---|---|---|---|
EC-1.1.1.100 | 0.000000 | 0.00000 | 1.725058e-110 | 0.000000 | 0.000000 | 0.000000 | 0.213211 | 0.000000 | 0.000000 | 0.129857 |
EC-1.1.1.101 | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 0.095322 | 0.179075 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
EC-1.1.1.102 | 0.010917 | 0.08915 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
EC-1.1.1.103 | 0.000000 | 0.00000 | 3.676552e-02 | 0.017895 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
EC-1.1.1.105 | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.401225 | 0.004426 | 0.000000 |
It can be seen that the 3-oxoacyl-[acyl-carrier-protein] reductase (EC-1.1.1.100) enzyme is most likely to be grouped under the community indexed by 62 (the highest value).
In the dataset/ folder, 26 data files are provided for training, prediction, and evaluation of metabolic pathways, either with the pre-trained reMap model (e.g., "reMap.pkl") or with a newly trained model. The data fall into three types: 1) pathway training data, 2) pathway test data, and 3) other necessary data items.
The following files can be used to train reMap. BioCyc (v20.5) tier 2 and 3 PGDBs were processed using prepBioCyc.
File | Description | Size |
---|---|---|
biocyc205_tier23_9255_X.pkl | A matrix file of 9255 organisms, whose information is extracted from BioCyc (v20.5) tier 2 and 3 PGDBs. The columns (3650) represent EC number indices and hold integer values indicating the abundance of each EC for that organism. | 25.4MB |
biocyc205_tier23_9255_B.pkl | A +1/-1 matrix indicating the presence/absence of group indices (200 entries) for each of the 9255 organisms. | 19.5MB |
biocyc205_tier23_9255_y.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9255 organisms. | 63.3MB |
biocyc205_tier23_9255_species.pkl | Metadata in tuple format (folder id, taxa id, species), extracted from BioCyc (v20.5) tier 2 and 3 PGDBs. | 6.35MB |
biocyc21_X.pkl | A matrix file of 9429 organisms, whose information is extracted from BioCyc (v21) tier 2 and 3 PGDBs. The columns (3650) represent EC number indices and hold integer values indicating the abundance of each EC for that organism. | 25.8MB |
biocyc21_B.pkl | A +1/-1 matrix indicating the presence/absence of group indices (200 entries) for each of the 9429 organisms. | 19.5MB |
biocyc21_y.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9429 organisms. | 63.3MB |
biocyc21_species.pkl | Metadata in tuple format (folder id, taxa id, species), extracted from BioCyc (v21) tier 2 and 3 PGDBs. | 6.35MB |
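Because the X, B, y, and species files describe the same organisms, their row counts should agree, and their column counts should match the sizes listed in the table (3650, 200, and 2526). The sketch below shows one way to check this after loading. It is an assumption-laden illustration: `check_alignment` is a hypothetical helper, and the matrices are treated as plain lists of rows so the example runs without the data files. If the unpickled objects turn out to be array-like, their `.shape` attribute would serve the same purpose.

```python
def shape(matrix):
    """Return (rows, columns) for a matrix stored as a list of rows."""
    return len(matrix), (len(matrix[0]) if matrix else 0)


def check_alignment(X, B, y, species, n_ec=3650, n_groups=200, n_pathways=2526):
    """Return a list of problems found; an empty list means the inputs agree."""
    problems = []
    n = len(species)  # one metadata tuple per organism
    for name, matrix, width in (("X", X, n_ec), ("B", B, n_groups), ("y", y, n_pathways)):
        rows, cols = shape(matrix)
        if rows != n:
            problems.append(f"{name} has {rows} rows, expected {n}")
        if cols != width:
            problems.append(f"{name} has {cols} columns, expected {width}")
    return problems


# Tiny made-up stand-ins (2 organisms, 3 ECs, 2 groups, 4 pathways).
X = [[0, 1, 0], [1, 0, 0]]
B = [[1, -1], [-1, 1]]
y = [[1, 0, 0, 1], [0, 0, 1, 1]]
species = [("folder1", "TAX-1", "species a"), ("folder2", "TAX-2", "species b")]

print(check_alignment(X, B, y, species, n_ec=3, n_groups=2, n_pathways=4))  # []
```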
The following table depicts biocyc205_tier23_9255_X.pkl (after including taxa and species information from "biocyc205_tier23_9255_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents species name. The remaining columns represent EC number indices.
Taxa | Species | EC-1.1.1.10 | EC-1.1.1.101 | EC-1.1.1.102 | EC-6.4.1.4 | EC-6.4.1.5 | EC-6.4.1.6 | EC-6.4.1.7 | EC-6.4.1.8 | EC-6.4.1.b | EC-6.5.1.8 | EC-6.6.1.1 | EC-6.6.1.2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TAX-887700 | Acetobacter aceti | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
TAX-1048834 | Alicyclobacillus acidocaldarius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
TAX-521098 | Alicyclobacillus acidocaldarius | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
TAX-1089447 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The following table depicts biocyc205_tier23_9255_B.pkl (after including taxa and species information from "biocyc205_tier23_9255_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents species name. The remaining columns represent pathway group indices.
Taxa | Species | EC-1.1.1.10 | EC-1.1.1.101 | EC-1.1.1.102 | EC-6.4.1.4 | EC-6.4.1.5 | EC-6.4.1.6 | EC-6.4.1.7 | EC-6.4.1.8 | EC-6.4.1.b | EC-6.5.1.8 | EC-6.6.1.1 | EC-6.6.1.2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TAX-887700 | Acetobacter aceti | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
TAX-1048834 | Alicyclobacillus acidocaldarius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
TAX-521098 | Alicyclobacillus acidocaldarius | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
TAX-1089447 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The following table depicts biocyc205_tier23_9255_y.pkl (after including taxa and species information from "biocyc205_tier23_9255_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents organism information. The remaining columns represent the pathway indices.
Taxa | Species | L-valine biosynthesis | L-arginine degradation VI (arginase 2 pathway) | cyclopropane fatty acid (CFA) biosynthesis | palmitate biosynthesis II (bacteria and plants) | jasmonoyl-amino acid conjugates biosynthesis I | pyridoxal 5'-phosphate salvage I | adenosine deoxyribonucleotides de novo biosynthesis |
---|---|---|---|---|---|---|---|---|
TAX-887700 | Acetobacter aceti | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
TAX-1048834 | Alicyclobacillus acidocaldarius | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
TAX-521098 | Alicyclobacillus acidocaldarius | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
TAX-1089447 | Aggregatibacter actinomycetemcomitans | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
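Reading a row of the y matrix means pairing each 0/1 entry with its pathway name and keeping the entries equal to 1. The short sketch below does this for the TAX-887700 (Acetobacter aceti) row, restricted to the seven pathway columns shown in the table. The helper name `present_pathways` is hypothetical.

```python
# Pathway names (column headers) shown in the table above.
pathways = [
    "L-valine biosynthesis",
    "L-arginine degradation VI (arginase 2 pathway)",
    "cyclopropane fatty acid (CFA) biosynthesis",
    "palmitate biosynthesis II (bacteria and plants)",
    "jasmonoyl-amino acid conjugates biosynthesis I",
    "pyridoxal 5'-phosphate salvage I",
    "adenosine deoxyribonucleotides de novo biosynthesis",
]

# Row for TAX-887700 (Acetobacter aceti), copied from the table.
row = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]


def present_pathways(names, indicator_row):
    """Return the pathway names whose indicator equals 1."""
    return [name for name, flag in zip(names, indicator_row) if flag == 1]


for name in present_pathways(pathways, row):
    print(name)
```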
The following data can be used to perform pathway prediction and to evaluate the pre-trained reMap model. Please see the mlLGPR repository for how to obtain and preprocess the data below.
Files | Description | Size |
---|---|---|
golden_X.pkl, golden_B.pkl, golden_y.pkl | This is the Golden dataset in a matrix format where rows correspond to AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc, respectively. Columns for "*X.pkl", "*B.pkl", and "*y.pkl" correspond to 3650 EC number indices, 200 group indices, and 2526 pathway indices. | 154KB |
cami_X.pkl, cami_B.pkl, cami_y.pkl | These files correspond to the CAMI low complexity data with the rows representing 40 species. Columns for "*X.pkl", "*B.pkl", and "*y.pkl" correspond to 3650 EC number indices, 200 group indices, and 2526 pathway indices. | 396KB |
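Evaluating predictions against a "*y.pkl" file comes down to comparing two binary rows per organism. This document does not state which metrics reMap reports, so the sketch below is only an assumption: it computes per-organism precision, recall, and F1 on made-up toy rows, and `precision_recall_f1` is a hypothetical helper rather than a reMap function.

```python
def precision_recall_f1(true_row, pred_row):
    """Compute precision, recall, and F1 for one pair of binary rows."""
    pairs = list(zip(true_row, pred_row))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted pathways
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted but absent
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy data for illustration only, not taken from the golden or CAMI sets.
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 0, 1]
print(precision_recall_f1(y_true, y_pred))
```

Averaging these values over all rows gives a sample-averaged score for a whole dataset.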
reMap requires additional data items for training and transformation.
Files | Description | Size |
---|---|---|
centroid.npz | A matrix file representing the pathway-enzyme association with possible missing links due to white noise. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC number indices) in the remaining columns. The file representation is similar to pathway2ec.pkl. | 79.1KB |
features.npz | A binary matrix file representing the pathway-pathway interaction. It contains 2526 pathway indices shown in the first column with their interactions (2526 pathway indices) in the remaining columns. | 76.8KB |
rho.npz | A binary matrix file representing the enzyme-enzyme interaction. It contains 3650 enzymes (represented as EC number indices) shown in the first column with their interactions (3650 EC number indices) in the remaining columns. | 152KB |
pathway_group.pkl | A matrix file representing the pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. | 3.42MB |
idxvocab.pkl | A matrix file representing the enzyme features (represented as EC number indices). It contains 3650 EC number indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. | 4.95MB |
vocab.pkl | A matrix file representing the enzyme features (represented as EC number indices). It contains 3650 EC number indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. | 4.95MB |
The A.pkl file is a binary matrix of pathway-pathway interactions whose entries indicate whether pairs of pathways are adjacent. For example, in the table below, the palmitate biosynthesis II (bacteria and plants) pathway is adjacent to the palmitoleate biosynthesis II (plants and bacteria) pathway but is not linked to any of the other pathways shown in the table.
Pathway | jasmonic acid biosynthesis | L-proline degradation | palmitoleate biosynthesis II (plants and bacteria) | adenosine ribonucleotides de novo biosynthesis | L-leucine biosynthesis | phosphopantothenate biosynthesis I | L-alanine biosynthesis I | 3-methylbutanol biosynthesis (engineered) | peramine biosynthesis |
---|---|---|---|---|---|---|---|---|---|
L-valine biosynthesis | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
L-arginine degradation VI (arginase 2 pathway) | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
palmitate biosynthesis II (bacteria and plants) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
jasmonoyl-amino acid conjugates biosynthesis I | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
adenosine deoxyribonucleotides de novo biosynthesis | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
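Looking up the neighbours of a pathway in such a binary matrix means collecting the column names where its row holds a 1. The sketch below uses two rows copied from the table and only the nine columns shown there. The dictionary layout and the helper name `neighbours` are assumptions made for illustration and do not reflect how the file is actually stored.

```python
# Column headers of the table above.
columns = [
    "jasmonic acid biosynthesis",
    "L-proline degradation",
    "palmitoleate biosynthesis II (plants and bacteria)",
    "adenosine ribonucleotides de novo biosynthesis",
    "L-leucine biosynthesis",
    "phosphopantothenate biosynthesis I",
    "L-alanine biosynthesis I",
    "3-methylbutanol biosynthesis (engineered)",
    "peramine biosynthesis",
]

# Two rows copied from the table, keyed by pathway name.
adjacency = {
    "L-valine biosynthesis": [0, 0, 0, 0, 1, 1, 1, 1, 0],
    "palmitate biosynthesis II (bacteria and plants)": [0, 0, 1, 0, 0, 0, 0, 0, 0],
}


def neighbours(pathway):
    """Return the column pathways marked as adjacent (entry 1) to `pathway`."""
    return [name for name, flag in zip(columns, adjacency[pathway]) if flag == 1]


print(neighbours("palmitate biosynthesis II (bacteria and plants)"))
```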
The B.pkl file is a binary matrix of enzyme-enzyme interactions whose entries indicate whether pairs of enzymes (represented by EC number indices) are adjacent. For example, in the table below, the all-trans-retinol dehydrogenase (NAD+) (EC-1.1.1.105) is adjacent to the retinal dehydrogenase (EC-1.2.1.36) but is not linked to any of the other enzymes shown in the table.
EC | EC-1.2.1.36 | EC-2.3.1.179 | EC-2.3.1.86 | EC-4.2.1.59 | EC-2.3.1.15 | EC-2.3.1.29 | EC-2.3.1.42 | EC-2.3.1.50 | EC-2.3.1.51 | EC-2.5.1.26 | EC-2.7.1.91 |
---|---|---|---|---|---|---|---|---|---|---|---|
EC-1.1.1.100 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
EC-1.1.1.101 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
EC-1.1.1.102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
EC-1.1.1.103 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
EC-1.1.1.105 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |