3. Download files

Overview

reMap requires a set of object files to run its core commands, along with sample data for training the model and predicting pathways. The sample data can be used either to train or to test the reMap model. Please download these files from XXX. Once you have downloaded the reMap_materials.zip file, unzip it and confirm that it contains two folders, model/ and dataset/, as depicted below:

Note: This directory tree was generated using the tree command, which is available in the terminal (on Linux) and in the command prompt (on Windows).

reMap_materials/
├── model/
│   ├── reMap.pkl
│   ├── hin.pkl
│   ├── pathway2vec_embeddings.npz
│   ├── phi.npz
│   └── sigma.npz
└── dataset/
    ├── biocyc205_tier23_9255_[X, B, y, species].pkl
    ├── biocyc21_[X, B, y, species].pkl
    ├── golden_X.pkl, golden_B.pkl, golden_y.pkl
    ├── cami_X.pkl, cami_B.pkl, cami_y.pkl
    ├── centroid.npz
    ├── features.npz
    ├── rho.npz
    ├── pathway_group.pkl
    ├── idxvocab.pkl
    ├── vocab.pkl
    └── ...
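As a quick sanity check after unzipping, the layout above can be verified with a few lines of Python. This is only a minimal sketch: the helper name `missing_entries` is hypothetical, the list covers only the single-file entries named in the tree (the dataset/ listing is abbreviated there), and the root path is an assumption about where you unzipped the archive.

```python
from pathlib import Path

# Single-file entries copied from the directory tree above.
# The tree abbreviates dataset/, so this list is not exhaustive.
EXPECTED = [
    "model/reMap.pkl",
    "model/hin.pkl",
    "model/pathway2vec_embeddings.npz",
    "model/phi.npz",
    "model/sigma.npz",
    "dataset/centroid.npz",
    "dataset/features.npz",
    "dataset/rho.npz",
    "dataset/pathway_group.pkl",
    "dataset/idxvocab.pkl",
    "dataset/vocab.pkl",
]


def missing_entries(root):
    """Return the expected relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# "reMap_materials" is assumed to be the unzipped folder; adjust as needed.
missing = missing_entries("reMap_materials")
print("All listed files found." if not missing else f"Missing: {missing}")
```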

A short description of the contents of the above folders is given below.

model/

In this folder, a pre-trained model is provided to predict metabolic pathways using the datasets described in the dataset/ section.

File	Description	Size
reMap.pkl	A pretrained model generated using biocyc205_tier23_9255_Xe.pkl and biocyc205_tier23_9255_y.pkl data. This model was trained using SOAP with supplementary pathway information.	70.5MB
hin.pkl	A sample heterogeneous information network.	10.0MB
pathway2vec_embeddings.npz	A matrix file containing a sample of embeddings using RUST-norm. The rows (22593) correspond to the pathway, enzyme, and compound embeddings and the columns (128) represent the features. These features can be generated using pathway2vec.	11.0MB
sigma.npz	A matrix file representing the group-group covariance of size 200. This data was obtained using SOAP with supplementary pathway information.	312KB
phi.npz	A matrix file representing the distribution of pathways over groups. The rows (200) correspond to the group indices and columns (2526) represent the pathway indices. This data was obtained using SOAP with supplementary pathway information.	3.85MB
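Because the internal layout of the .npz files is not documented here, it can help to inspect an archive before using it. The sketch below lists the array names and shapes stored in any .npz file using NumPy; `describe_npz` is a hypothetical helper, and the demo file, its key `arr`, and its 4 x 128 shape are placeholders rather than the contents of the real files.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Map each array name stored in an .npz archive to its shape."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Demo on a small stand-in archive (placeholder key and shape).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npz")
    np.savez(demo_path, arr=np.zeros((4, 128)))
    print(describe_npz(demo_path))
```

If the listing shows names such as `data`, `indices`, and `indptr`, the file was probably written with `scipy.sparse.save_npz`, in which case `scipy.sparse.load_npz` is the matching loader.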

The tables below show excerpts of some of these object files to illustrate their contents.

pathway2vec_embeddings.npz

pathway2vec_embeddings.npz is a matrix file containing the embeddings of pathways, EC numbers, and compounds. These features are generated using pathway2vec. For example, after adding the pathway names and EC numbers from "biocyc.pkl" as the first column and excluding compounds, the first ten feature columns look as follows:

Pathway and EC	1	2	3	4	5	6	7	8	9	10
L-valine biosynthesis	0.089106	0.092924	0.089035	0.101823	0.072792	0.083173	0.096259	0.064823	0.071481	0.094392
methylquercetin biosynthesis	0.112329	0.075717	0.087717	0.094391	0.081035	0.074514	0.095572	0.072581	0.068458	0.096449
cyanide degradation	0.073566	0.094817	0.087664	0.099661	0.089182	0.103727	0.093147	0.093047	0.083330	0.095017
...	...	...	...	...	...	...	...	...	...	...
EC-1.1.1.10	0.095318	0.094138	0.097567	0.087115	0.084483	0.098668	0.078173	0.091465	0.086675	0.086497
EC-1.1.1.100	0.047987	0.096748	0.092529	0.092395	0.116745	0.092556	0.106274	0.107414	0.079025	0.098948
EC-1.1.1.101	0.090137	0.085566	0.087589	0.089496	0.082936	0.088855	0.083835	0.091411	0.085721	0.090588
...	...	...	...	...	...	...	...	...	...	...

phi.npz

This is a matrix file corresponding to the distribution of pathways over groups. Rows correspond to group indices and columns represent pathway indices. For example, the table can be seen as:

Pathway Group Indices	5-aminoimidazole ribonucleotide biosynthesis II	vitamin E biosynthesis (tocopherols)	spermine and spermidine degradation III	biotin biosynthesis from 8-amino-7-oxononanoate I	mixed acid fermentation	L-glutamate degradation II	chlorosalicylate degradation	L-malate degradation II	pyruvate fermentation to acetate II	acetoin degradation
0	1.429221e-07	3.524164e-02	1.607106e-07	1.533687e-07	1.528739e-07	1.512877e-07	1.170707e-07	1.470176e-01	1.524868e-07	1.453987e-07
1	6.455780e-08	6.944885e-08	6.996916e-08	5.668292e-01	5.653213e-08	5.686886e-08	6.443087e-08	6.507242e-08	3.832408e-01	6.670976e-08
2	1.367094e-07	1.453373e-07	1.535185e-07	1.362638e-07	1.316325e-07	1.209321e-07	1.535094e-07	1.433702e-07	1.200712e-07	4.520255e-01
3	6.353614e-01	3.844192e-08	3.191540e-08	3.330884e-08	3.522579e-01	3.217529e-08	3.045689e-08	3.111319e-08	2.996089e-08	3.107714e-08
4	7.128483e-08	7.209904e-08	6.580494e-08	7.720455e-08	6.120852e-08	6.733139e-01	7.705214e-08	7.127394e-08	7.585665e-08	7.150897e-08

It can be seen that the 5-aminoimidazole ribonucleotide biosynthesis II pathway is most likely to be associated with the group indexed by 3 (the highest value).
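The same reading of the table can be reproduced programmatically by taking, for each pathway column, the group (row) with the largest value. The sketch below uses only the 5 x 10 excerpt shown above, so its answers apply to that excerpt and not to all 200 groups; the function name `most_likely_group` is hypothetical.

```python
# Excerpt of phi copied from the table above:
# 5 groups (rows) x 10 pathways (columns).
PATHWAYS = [
    "5-aminoimidazole ribonucleotide biosynthesis II",
    "vitamin E biosynthesis (tocopherols)",
    "spermine and spermidine degradation III",
    "biotin biosynthesis from 8-amino-7-oxononanoate I",
    "mixed acid fermentation",
    "L-glutamate degradation II",
    "chlorosalicylate degradation",
    "L-malate degradation II",
    "pyruvate fermentation to acetate II",
    "acetoin degradation",
]
PHI = [
    [1.429221e-07, 3.524164e-02, 1.607106e-07, 1.533687e-07, 1.528739e-07,
     1.512877e-07, 1.170707e-07, 1.470176e-01, 1.524868e-07, 1.453987e-07],
    [6.455780e-08, 6.944885e-08, 6.996916e-08, 5.668292e-01, 5.653213e-08,
     5.686886e-08, 6.443087e-08, 6.507242e-08, 3.832408e-01, 6.670976e-08],
    [1.367094e-07, 1.453373e-07, 1.535185e-07, 1.362638e-07, 1.316325e-07,
     1.209321e-07, 1.535094e-07, 1.433702e-07, 1.200712e-07, 4.520255e-01],
    [6.353614e-01, 3.844192e-08, 3.191540e-08, 3.330884e-08, 3.522579e-01,
     3.217529e-08, 3.045689e-08, 3.111319e-08, 2.996089e-08, 3.107714e-08],
    [7.128483e-08, 7.209904e-08, 6.580494e-08, 7.720455e-08, 6.120852e-08,
     6.733139e-01, 7.705214e-08, 7.127394e-08, 7.585665e-08, 7.150897e-08],
]


def most_likely_group(phi, pathway_index):
    """Return the row (group) index holding the largest value in a column."""
    return max(range(len(phi)), key=lambda group: phi[group][pathway_index])


for column, name in enumerate(PATHWAYS):
    print(f"{name}: group {most_likely_group(PHI, column)}")
```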

dataset/

In this folder, 20 data files are provided for training reMap and for predicting and evaluating metabolic pathways, either with the pre-trained reMap model (e.g., "reMap.pkl") or with a newly trained model. The data fall into three types: 1) pathway training data, 2) pathway test data, and 3) other necessary data items.

1. Pathway training data

The following files can be used to train reMap. Biocyc tier 2 and 3 PGDBs were processed using prepBioCyc.

File	Description	Size
biocyc205_tier23_9255_X.pkl	A matrix file of 9255 organisms, whose information is extracted from Biocyc (v20.5) tier 2 and 3 PGDBs. The columns (3650) represent EC number indices and hold integer values indicating the abundance of each EC for that organism.	25.4MB
biocyc205_tier23_9255_B.pkl	A +1/-1 matrix indicating the presence/absence of group indices (200 entries) for each of the 9255 organisms.	19.5MB
biocyc205_tier23_9255_y.pkl	A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9255 organisms.	63.3MB
biocyc205_tier23_9255_species.pkl	File metadata in tuple format (folder id, taxa id, species), extracted from Biocyc (v20.5) tier 2 and 3 PGDBs.	6.35MB
biocyc21_X.pkl	A matrix file of 9429 organisms, whose information is extracted from Biocyc (v21) tier 2 and 3 PGDBs. The columns (3650) represent EC number indices and hold integer values indicating the abundance of each EC for that organism.	25.8MB
biocyc21_B.pkl	A +1/-1 matrix indicating the presence/absence of group indices (200 entries) for each of the 9429 organisms.	19.5MB
biocyc21_y.pkl	A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9429 organisms.	63.3MB
biocyc21_species.pkl	File metadata in tuple format (folder id, taxa id, species), extracted from Biocyc (v21) tier 2 and 3 PGDBs.	6.35MB
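The four files of a training set share a prefix and should describe the same organisms row for row. The following is a minimal sketch, assuming the files are ordinary Python pickles that follow the `<prefix>_<X|B|y|species>.pkl` naming shown in the table; `load_split`, `row_count`, and `check_aligned` are hypothetical helper names, not part of reMap.

```python
import pickle
from pathlib import Path

SUFFIXES = ("X", "B", "y", "species")


def load_split(dataset_dir, prefix):
    """Load <prefix>_X.pkl, _B.pkl, _y.pkl and _species.pkl into a dict."""
    parts = {}
    for suffix in SUFFIXES:
        path = Path(dataset_dir) / f"{prefix}_{suffix}.pkl"
        with open(path, "rb") as handle:
            # Only unpickle files obtained from a source you trust.
            parts[suffix] = pickle.load(handle)
    return parts


def row_count(obj):
    """Rows of a matrix-like object (via .shape) or of a plain sequence."""
    shape = getattr(obj, "shape", None)
    return shape[0] if shape is not None else len(obj)


def check_aligned(parts):
    """Return per-file row counts, raising ValueError if they differ."""
    counts = {name: row_count(obj) for name, obj in parts.items()}
    if len(set(counts.values())) != 1:
        raise ValueError(f"row counts differ: {counts}")
    return counts
```

For example, `check_aligned(load_split("reMap_materials/dataset", "biocyc21"))` would be expected to report 9429 rows for every part if the files match the descriptions above. Unpickling may also require the libraries that created the objects (for instance NumPy or SciPy) to be installed.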

The following table depicts biocyc21_X.pkl (after including taxa and species information from "biocyc21_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents species name. The remaining columns represent EC numbers.

Taxa	Species	EC-1.1.1.101	EC-6.6.1.1	EC-6.6.1.2
TAX-887700	Acetobacter aceti	1.0	1.0	1.0
TAX-1048834	Alicyclobacillus acidocaldarius	1.0	1.0	0.0
TAX-521098	Alicyclobacillus acidocaldarius	0.0	1.0	0.0
TAX-1035194	Aggregatibacter actinomycetemcomitans	1.0	0.0	0.0
TAX-1089447	Aggregatibacter actinomycetemcomitans	1.0	0.0	0.0

The following table depicts biocyc21_B.pkl (after including taxa and species information from "biocyc21_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents species name. The remaining columns represent pathway group indices.

Taxa	Species	0	1	2	3	4	5	6	7	8	9
TAX-887700	Acetobacter aceti	-1.0	-1.0	-1.0	-1.0	-1.0	1.0	1.0	1.0	-1.0	-1.0
TAX-1048834	Alicyclobacillus acidocaldarius	-1.0	-1.0	1.0	1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0
TAX-521098	Alicyclobacillus acidocaldarius	1.0	-1.0	-1.0	1.0	-1.0	1.0	-1.0	-1.0	-1.0	-1.0
TAX-1035194	Aggregatibacter actinomycetemcomitans	-1.0	-1.0	1.0	1.0	-1.0	-1.0	1.0	-1.0	-1.0	1.0
TAX-1089447	Aggregatibacter actinomycetemcomitans	-1.0	-1.0	1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	1.0
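To make the +1/-1 encoding concrete, the sketch below lists the group indices marked as present (+1) for each organism in the excerpt above, and converts a row to the 0/1 convention used by the pathway matrix. It uses only the ten group columns shown in the table; the helper names `present_groups` and `to_binary` are hypothetical.

```python
# Rows copied from the biocyc21_B.pkl excerpt above (group indices 0-9 only).
B_EXCERPT = {
    "TAX-887700": [-1, -1, -1, -1, -1, 1, 1, 1, -1, -1],
    "TAX-1048834": [-1, -1, 1, 1, -1, -1, -1, -1, -1, -1],
    "TAX-521098": [1, -1, -1, 1, -1, 1, -1, -1, -1, -1],
    "TAX-1035194": [-1, -1, 1, 1, -1, -1, 1, -1, -1, 1],
    "TAX-1089447": [-1, -1, 1, -1, -1, -1, -1, -1, -1, 1],
}


def present_groups(row):
    """Return the indices whose entry is +1 (group present)."""
    return [index for index, value in enumerate(row) if value > 0]


def to_binary(row):
    """Convert a +1/-1 row into a 1/0 row."""
    return [1 if value > 0 else 0 for value in row]


for taxa, row in B_EXCERPT.items():
    print(taxa, present_groups(row))
```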

The following table depicts biocyc21_y.pkl (after including taxa and species information from "biocyc21_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents species name. The remaining columns represent the pathways.

Taxa	Species	L-valine biosynthesis	L-arginine degradation VI (arginase 2 pathway)	cyclopropane fatty acid (CFA) biosynthesis	palmitate biosynthesis II (bacteria and plants)	pyridoxal 5'-phosphate salvage I	adenosine deoxyribonucleotides de novo biosynthesis
TAX-887700	Acetobacter aceti	1.0	0.0	1.0	1.0	1.0	1.0
TAX-1048834	Alicyclobacillus acidocaldarius	1.0	1.0	0.0	1.0	0.0	1.0
TAX-521098	Alicyclobacillus acidocaldarius	1.0	1.0	0.0	1.0	0.0	1.0
TAX-1035194	Aggregatibacter actinomycetemcomitans	0.0	0.0	0.0	1.0	1.0	1.0
TAX-1089447	Aggregatibacter actinomycetemcomitans	1.0	0.0	0.0	1.0	1.0	1.0

2. Pathway test data

The following data can be used to run pathway prediction with the pre-trained reMap model and to evaluate it. Please see the mlLGPR repository and Advanced usage for how to obtain and preprocess the data below.

Files	Description	Size
golden_X.pkl, golden_B.pkl, golden_y.pkl	The Golden dataset in matrix format, where the rows correspond to AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc, respectively. The columns of the "*_X.pkl", "*_B.pkl", and "*_y.pkl" files correspond to 3650 EC number indices, 200 group indices, and 2526 pathway indices, respectively.	154KB
cami_X.pkl, cami_B.pkl, cami_y.pkl	The CAMI low complexity data, with the rows representing 40 species. The columns of the "*_X.pkl", "*_B.pkl", and "*_y.pkl" files correspond to 3650 EC number indices, 200 group indices, and 2526 pathway indices, respectively.	396KB

3. Other necessary data items

reMap requires additional data items for training and transformation.

Files	Description	Size
centroid.npz	A matrix file representing the centroids of the 200 groups, each of size 128.	79.1KB
features.npz	A matrix file representing pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to `pathway2vec_embeddings.npz`.	1.23MB
rho.npz	A matrix file representing the group-group correlations of size 200.	312KB
pathway_group.pkl	A binary matrix indicating the association of bag indices (200 entries) in rows with pathway indices (2526 entries) in columns.	3.85MB
idxvocab.pkl	A file representing the pathway indices.	19.8KB
vocab.pkl	A dictionary file representing pathway indices as keys and MetaCyc pathway ids as values.	52.5KB
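One natural use of vocab.pkl is to turn a binary pathway row (as in the "*_y.pkl" files) back into pathway ids. The sketch below assumes that column j of the pathway matrix corresponds to key j of the dictionary, which is not confirmed here. The function name `decode_pathways` and the ids in `toy_vocab` are placeholders, not real MetaCyc ids.

```python
def decode_pathways(y_row, vocab):
    """Return the vocab values for every column whose entry is nonzero."""
    return [vocab[index] for index, flag in enumerate(y_row) if flag]


# Placeholder stand-in: the real vocab.pkl is described as mapping
# pathway indices to MetaCyc pathway ids.
toy_vocab = {0: "ID-A", 1: "ID-B", 2: "ID-C"}
print(decode_pathways([1.0, 0.0, 1.0], toy_vocab))
```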
