4. Tutorial on pathway prediction

Abdurrahman Abul-Basher edited this page Jun 3, 2021 · 78 revisions

Overview

leADS predicts pathways in both organismal and multi-organismal genomes. This tutorial walks you through the basic steps of pathway prediction using either your own input data or the test data we provide. Once the input (in the format specified below), the trained model, and the other required files are in place, leADS generates a detailed pathway report that can be used for further analysis.

Note: Make sure to put the leADS source code (see Installing leADS) into the same directory as the downloaded files, as explained in Download files. Also, create a folder named result in the same leADS_materials/ directory. The final folder structure should look like this:

leADS_materials/
├── objectset/
│   └── ...
├── model/
│   └── ...
├── dataset/
│   └── ...
├── result/
│   └── ...
└── leADS/
    └── ...
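
The layout above can also be checked from a script. The following is a minimal sketch and not part of leADS: the helper name `ensure_layout` is hypothetical, and only the folder names come from the structure shown above. It creates result/ if needed and reports any other folder that is still missing.

```python
from pathlib import Path

# Folder names taken from the structure shown above.
EXPECTED_DIRS = ("objectset", "model", "dataset", "result", "leADS")


def ensure_layout(root):
    """Create result/ if needed and return the names of missing folders.

    Hypothetical helper (not part of leADS). Only result/ is created,
    because the other folders must be filled from the downloads.
    """
    root = Path(root)
    (root / "result").mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `ensure_layout("/path/to/leADS_materials")` returns an empty list once all five folders are present.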

For all experiments, open a terminal, navigate to the src folder in the leADS directory, and run the commands from there. To display the leADS running options, use: python main.py --help. The help text should be self-explanatory.

Input:

The input for pathway predictions is either a .pf file generated directly by MetaPathways v2 or a .pkl file generated after following the steps under Advanced usage. One can also use the files provided by us.

Files required for predictions:

In addition to the input data, some of the object files listed here are also required to carry out a successful run. The required object files include:

  • biocyc.pkl
  • pathway2ec.pkl
  • pathway2ec_idx.pkl
  • hin.pkl
  • pathway2vec_embeddings.npz
  • leADS.pkl
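
Before starting a run, it can help to confirm that these files are in place. The sketch below is illustrative only: the function name `missing_required_files` is hypothetical, and the folder assignment (object files under objectset/, leADS.pkl under model/) is an assumption based on the --ospath and --mdpath descriptions in this tutorial.

```python
from pathlib import Path

# File names from the list above. Their locations are an assumption:
# object files under objectset/, the trained model under model/.
REQUIRED_FILES = {
    "objectset": [
        "biocyc.pkl",
        "pathway2ec.pkl",
        "pathway2ec_idx.pkl",
        "hin.pkl",
        "pathway2vec_embeddings.npz",
    ],
    "model": ["leADS.pkl"],
}


def missing_required_files(root):
    """Return relative paths of required files not found under root."""
    root = Path(root)
    return [
        f"{folder}/{name}"
        for folder, names in REQUIRED_FILES.items()
        for name in names
        if not (root / folder / name).is_file()
    ]
```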

Command:

The basic command looks like this:

python main.py \
--predict \
--pred-labels \
--pathway-report \
--parse-pf \
--soft-voting \
--object-name "biocyc.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_Xe.pkl" \
--file-name "[input (or save) file name]" \
--model-name "leADS" \
--dsfolder "[input dataset folder]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--batch 50 \
--num-jobs 2

Note:

  1. When running the command with --parse-pf, DATANAME must be the same in both arguments: --X-name "[DATANAME]_Xe.pkl" and --file-name "[DATANAME]".
  2. Use the --parse-pf flag only if the input to leADS is one or more .pf files.
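
One way to keep the two DATANAME values in sync is to build the command from a single variable. The sketch below is a hypothetical helper, not part of leADS; it only uses flags that appear in this tutorial and leaves the path arguments (--ospath, --dspath, --mdpath, --rspath) at whatever defaults main.py defines.

```python
import shlex


def build_predict_command(dataname, dsfolder=None, parse_pf=False, num_jobs=2):
    """Assemble the leADS prediction command as a list of arguments.

    Hypothetical helper (not part of leADS). --X-name and --file-name
    are both derived from `dataname`, so they cannot drift apart.
    """
    cmd = ["python", "main.py", "--predict", "--pred-labels", "--pathway-report"]
    if parse_pf:
        # Only appropriate when the input is one or more .pf files.
        cmd.append("--parse-pf")
    cmd += [
        "--soft-voting",
        "--object-name", "biocyc.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", f"{dataname}_Xe.pkl",
        "--file-name", dataname,
        "--model-name", "leADS",
    ]
    if dsfolder is not None:
        cmd += ["--dsfolder", dsfolder]
    cmd += ["--num-jobs", str(num_jobs)]
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line; run it from the src folder.
    print(shlex.join(build_predict_command("three_ecoli", "three_ecoli", parse_pf=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.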

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Default value |
| --- | --- | --- |
| --predict | Predict the bags_labels distribution from inputs using leADS | False |
| --pred-labels | Predict labels for the input | False |
| --pathway-report | Generate a detailed pathway report for each instance | False |
| --parse-pf | Boolean flag indicating whether to parse PathoLogic format (.pf) files from a given folder | False |
| --soft-voting | Boolean flag indicating whether to predict labels based on the calibrated sums of the predicted probabilities from an ensemble | False |
| --object-name | The preprocessed MetaCyc database file used for predictions | biocyc.pkl |
| --pathway2ec-name | The matrix file representing the pathway-EC association | pathway2ec.pkl |
| --pathway2ec-idx-name | The pathway2ec association indices file | pathway2ec_idx.pkl |
| --hin-name | The heterogeneous information network file | hin.pkl |
| --features-name | The features corresponding to ECs and pathways | pathway2vec_embeddings.npz |
| --X-name | The name of the input file to be used for predictions | [DATANAME]_Xe.pkl |
| --file-name | The name of the preprocessed input files (without extension) | [input (or save) file name] |
| --model-name | The name of the model, excluding any extension | leADS |
| --dsfolder | The dataset folder name | [input dataset folder] |
| --ospath | The path to the data object that contains extracted information from the MetaCyc database (e.g. biocyc.pkl) | Outside source code |
| --dspath | The path to the datasets | Outside source code |
| --mdpath | The path to the pre-trained model (e.g. leADS.pkl) | Outside source code |
| --rspath | The path to store results | Outside source code |
| --batch | Batch size | 50 |
| --num-jobs | The number of parallel workers | 2 |

Note: The default values can be changed in the main.py file before running the command to suit user requirements.

Output:

The output files generated after running the command are:

With the --parse-pf flag (see Example 1):

| File | Description |
| --- | --- |
| [DATANAME]_Xa.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the concatenated abundance features as columns |
| [DATANAME]_Xc.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the concatenated coverage features as columns |
| [DATANAME]_Xe.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the concatenated EC features as columns |
| [DATANAME]_Xea.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the concatenated EC and abundance features as columns |
| [DATANAME]_Xec.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the concatenated EC and coverage features as columns |
| [DATANAME]_Xm.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the concatenated EC, abundance, and coverage features as columns |
| [DATANAME]_Xp.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the instances transformed to EC as columns |
| [DATANAME]_labels_leads.pkl | A file (stored in the "dspath" location) containing a list of pathway names for each sample |
| [DATANAME]_y_leads.pkl | A matrix file (stored in the "dspath" location) with organism IDs as rows and the pathway indices as columns |
| pathway_report.tsv | Pathway report files (tab-separated values) listing all predicted pathways, their associated scores, abundance values, and the mapped EC numbers. One file is stored per organism ID, in that organism's folder under the "rspath" location |

Without the --parse-pf flag (see Example 2):

| File | Description |
| --- | --- |
| [DATANAME]_y_leads.pkl | A file (stored in the "dspath" location) with organism IDs as rows and the pathway indices as columns |
| [DATANAME]_labels_leads.pkl | A file (stored in the "dspath" location) containing a list of pathway names for each sample |
| pathway_report.tsv | Pathway report files (tab-separated values) listing all predicted pathways, their associated scores, abundance values, and the mapped EC numbers. One file is stored per organism ID, in that organism's folder under the "rspath" location |

Without the --pathway-report and --parse-pf flags (see Example 3):

| File | Description |
| --- | --- |
| [DATANAME]_y_leads.pkl | A file (stored in the "dspath" location) with organism IDs as rows and the pathway indices as columns |
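
The three cases above can be summarized in a few lines of code, which is handy for checking that a run produced everything it should. This is a sketch under stated assumptions: `expected_outputs` is a hypothetical helper, not part of leADS, and it covers only the three flag combinations documented in the tables. Any other combination raises an error rather than guessing.

```python
# Feature-file suffixes from the --parse-pf table above.
FEATURE_SUFFIXES = ("Xa", "Xc", "Xe", "Xea", "Xec", "Xm", "Xp")


def expected_outputs(dataname, parse_pf=False, pathway_report=True):
    """Return the output file names documented for a flag combination.

    Hypothetical helper (not part of leADS). pathway_report.tsv is
    written once per organism, so it appears here only by name.
    """
    if parse_pf and not pathway_report:
        raise ValueError("this flag combination is not covered by the tables")
    files = []
    if parse_pf:
        files += [f"{dataname}_{suffix}.pkl" for suffix in FEATURE_SUFFIXES]
    if pathway_report:
        files.append(f"{dataname}_labels_leads.pkl")
    files.append(f"{dataname}_y_leads.pkl")
    if pathway_report:
        files.append("pathway_report.tsv")
    return files
```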

Examples

Example 1:

To predict outputs and compile a pathway report from the "three_ecoli" data, generated by MetaPathways v2, using a pre-trained model ("leADS.pkl"), execute the following command:

python main.py --predict --pred-labels --pathway-report --parse-pf --soft-voting --object-name "biocyc.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --pathway2ec-name "pathway2ec.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "three_ecoli_Xe.pkl" --file-name "three_ecoli" --model-name "leADS" --dsfolder "three_ecoli" --num-jobs 2

As you can see, the name three_ecoli is the same in both arguments: --X-name "three_ecoli_Xe.pkl" and --file-name "three_ecoli".

After running the command, the output will be saved to the dataset/ and result/ folders. Since the --parse-pf flag is used in this example, all the feature files described in the table above are generated. The tree structure of the folder with the outputs will look like this:

leADS_materials/
├── objectset/
│   └── ...
├── model/
│   └── ...
├── dataset/
│   ├── three_ecoli_ids.pkl
│   ├── three_ecoli_X.pkl
│   ├── three_ecoli_Xa.pkl
│   ├── three_ecoli_Xc.pkl
│   ├── three_ecoli_Xe.pkl
│   ├── three_ecoli_Xea.pkl
│   ├── three_ecoli_Xec.pkl
│   ├── three_ecoli_Xm.pkl
│   ├── three_ecoli_Xp.pkl
│   ├── three_ecoli_labels_leads.pkl
│   ├── three_ecoli_y_leads.pkl
│   └── ...
├── result/
│   ├── three_ecoli
│   │   ├── CFT073
│   │   │   └── pathway_report.tsv
│   │   ├── EDL933
│   │   │   └── pathway_report.tsv
│   │   └── MG1655
│   │       └── pathway_report.tsv
│   └── ...
└── leADS/
    └── ...

A sample of the pathway_report.tsv file generated in the result/ directory for E. coli EDL933 is shown below. Each predicted pathway (one per row) has an associated score, a Predicted value, an abundance value, a coverage value, and the enzymes (EC numbers) that map to that pathway (shown in the columns).

| FrameID | Name | Score | Predicted | Abundance | Coverage | MappedEC |
| --- | --- | --- | --- | --- | --- | --- |
| VALSYN-PWY | L-valine biosynthesis | 1 | 1 | 9 | 1 | EC-1.1.1.86, EC-2.2.1.6, EC-2.6.1.42, EC-4.2.1.9 |
| PWY0-541 | cyclopropane fatty acid (CFA) biosynthesis | 1 | 1 | 1 | 1 | EC-2.1.1.79 |
| PWY-5971 | palmitate biosynthesis II (bacteria and plants) | 1 | 1 | 1.24 | 0.45 | EC-1.3.1.9, EC-2.3.1.41, EC-2.3.1.85, EC-6.2.1.3 |
| PLPSAL-PWY | pyridoxal 5'-phosphate salvage I | 1 | 1 | 1.16 | 1 | EC-1.4.3.5, EC-2.7.1.35 |
| PWY-7227 | adenosine deoxyribonucleotides de novo biosynthesis | 0.80 | 1 | 5 | 1 | EC-1.17.4.1, EC-2.7.4.6 |
| GLYOXYLATE-BYPASS | glyoxylate cycle | 1 | 1 | 5 | 0.80 | EC-1.1.1.37, EC-2.3.3.16, EC-2.3.3.9, EC-4.1.3.1 |
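
For further analysis, the report can be read with Python's standard csv module. The sketch below is illustrative rather than part of leADS. It assumes that the file starts with a header line using the column names shown above and that MappedEC holds a comma-separated list; the function name `read_pathway_report` and the `min_score` parameter are hypothetical.

```python
import csv
import io


def read_pathway_report(handle, min_score=0.0):
    """Return report rows whose Score is at least min_score.

    Assumes a tab-separated file with a header line containing the
    columns FrameID, Name, Score, Predicted, Abundance, Coverage, MappedEC.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["Score"]) >= min_score:
            # Split the comma-separated EC list into individual entries.
            row["MappedEC"] = [
                ec.strip() for ec in row["MappedEC"].split(",") if ec.strip()
            ]
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Two rows copied from the sample above, joined with tabs.
    sample = (
        "FrameID\tName\tScore\tPredicted\tAbundance\tCoverage\tMappedEC\n"
        "PWY0-541\tcyclopropane fatty acid (CFA) biosynthesis\t1\t1\t1\t1\tEC-2.1.1.79\n"
        "PWY-7227\tadenosine deoxyribonucleotides de novo biosynthesis"
        "\t0.80\t1\t5\t1\tEC-1.17.4.1, EC-2.7.4.6\n"
    )
    for row in read_pathway_report(io.StringIO(sample), min_score=0.9):
        print(row["FrameID"], row["MappedEC"])
```

To read a real report, pass an open file instead of the in-memory sample, for example `open(path, newline="")`.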

Example 2:

To predict outputs and compile a pathway report from a dataset (e.g. "symbionts_Xe.pkl") using a pre-trained model ("leADS.pkl"), execute the following command:

python main.py --predict --pred-labels --pathway-report --soft-voting --object-name "biocyc.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --pathway2ec-name "pathway2ec.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "symbionts_Xe.pkl" --file-name "symbionts" --model-name "leADS" --num-jobs 2

After running the command, the output will be saved to the dataset/ and result/ folders. The output files described in the table above are generated. The tree structure of the folder with the outputs will look like this:

leADS_materials/
├── objectset/
│   └── ...
├── model/
│   └── ...
├── dataset/
│   ├── symbionts_labels_leads.pkl
│   ├── symbionts_y_leads.pkl
│   └── ...
├── result/
│   ├── symbionts
│   │   ├── 0
│   │   │   └── pathway_report.tsv
│   │   ├── 1
│   │   │   └── pathway_report.tsv
│   │   └── 2
│   │       └── pathway_report.tsv
│   └── ...
└── leADS/
    └── ...

The pathway_report.tsv file generated in the result/ directory has the same structure as the one shown under Example 1.
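
Because each sample gets its own folder, collecting all reports for a dataset amounts to a one-level directory search. The following is a minimal sketch, assuming the result/[dataset]/[sample]/pathway_report.tsv layout shown in the trees above; the helper name `find_pathway_reports` is hypothetical and not part of leADS.

```python
from pathlib import Path


def find_pathway_reports(rspath, dataset):
    """Map each sample folder name to the path of its pathway_report.tsv.

    Assumes the layout <rspath>/<dataset>/<sample>/pathway_report.tsv.
    Sample folders without a report are skipped.
    """
    base = Path(rspath) / dataset
    return {
        report.parent.name: report
        for report in sorted(base.glob("*/pathway_report.tsv"))
    }
```

For the symbionts example, the keys of the returned dictionary would be the sample folder names "0", "1", and "2".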

Example 3:

To predict outputs from a dataset (e.g. "cami_Xe.pkl") without the --pathway-report and --parse-pf flags, using a pre-trained model ("leADS.pkl"), execute the following command:

python main.py --predict --pred-labels --soft-voting --X-name "cami_Xe.pkl" --file-name "cami" --model-name "leADS" --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. The output file described in the table above is generated. The tree structure for the folder with the output will look like this:

leADS_materials/
├── objectset/
│   └── ...
├── model/
│   └── ...
├── dataset/
│   ├── cami_y_leads.pkl
│   └── ...
├── result/
│   └── ...
└── leADS/
    └── ...
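
To take a first look at a generated .pkl file such as cami_y_leads.pkl, something like the following may be enough. It is a sketch under assumptions: that the file is a standard Python pickle, and that any package needed to rebuild the stored object (for example a matrix library) is installed. The helper name `describe_pickle` is hypothetical.

```python
import pickle


def describe_pickle(path):
    """Load a pickle file and return its type name and shape, if it has one.

    Assumes a standard Python pickle. The shape is None for objects
    without a `shape` attribute, such as plain lists.
    """
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    return type(obj).__name__, getattr(obj, "shape", None)
```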