4. Tutorial on pathway prediction
leADS is used to accurately predict pathways in both organismal and multi-organismal genomes. This tutorial is meant to walk you through the basic steps of pathway prediction using either your own input data or the test data provided by us. Once the input (in the specified format below), trained model, and other required files are provided, a detailed pathway report is generated that can be used for further downstream analysis.
Note: Make sure to put the source code leADS/ (see Installing leADS) into the same directory as explained in the Download files section. Also, create a folder result/ in the same leADS_materials/ directory.
The final structure of the folder should look like this:
leADS_materials/
├── objectset/
│ └── ...
├── model/
│ └── ...
├── dataset/
│ └── ...
├── result/
│ └── ...
└── leADS/
└── ...
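The layout above can be checked before running anything. The following is a minimal sketch, not part of leADS: the helper names `missing_dirs` and `ensure_result_dir` are hypothetical, and the only assumption is that the five sub-folders are named exactly as in the tree.

```python
from pathlib import Path

# Sub-folders named in the tutorial's directory tree.
EXPECTED_DIRS = ("objectset", "model", "dataset", "result", "leADS")


def missing_dirs(root):
    """Return the expected sub-folders that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def ensure_result_dir(root):
    """Create `result/` under `root` if it is absent and return its path."""
    result = Path(root) / "result"
    result.mkdir(parents=True, exist_ok=True)
    return result
```

Calling `ensure_result_dir("leADS_materials")` followed by `missing_dirs("leADS_materials")` would list whatever still needs to be downloaded or created.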
For all experiments, use a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows) to navigate to the src/ folder in the leADS/ directory, and then run the commands as shown in the Examples section.
To display leADS' running options, use: python main.py --help. The help output should be self-explanatory.
The input for pathway predictions is either a .pf file generated directly by MetaPathways v2 or a .pkl file generated after following the steps under Advanced usage. One can also use the files provided by us.
In addition to the input data, some of the object files listed here are also required to carry out a successful run. The required object files include:
- biocyc.pkl
- pathway2ec.pkl
- pathway2ec_idx.pkl
- hin.pkl
- pathway2vec_embeddings.npz
- leADS.pkl
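A quick pre-flight check for the files listed above might look like the sketch below. It is an illustration only: `missing_files` is a hypothetical helper, and it assumes that leADS.pkl sits in the model directory (as the --mdpath description suggests) while the remaining files sit in the object files directory.

```python
from pathlib import Path

# Assumed to live in the object files directory (--ospath).
OBJECT_FILES = (
    "biocyc.pkl",
    "pathway2ec.pkl",
    "pathway2ec_idx.pkl",
    "hin.pkl",
    "pathway2vec_embeddings.npz",
)
# Assumed to live in the model directory (--mdpath).
MODEL_FILES = ("leADS.pkl",)


def missing_files(ospath, mdpath):
    """Return the paths of required files that are not present."""
    expected = [Path(ospath) / name for name in OBJECT_FILES]
    expected += [Path(mdpath) / name for name in MODEL_FILES]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list from `missing_files("leADS_materials/objectset", "leADS_materials/model")` indicates that all six files were found under these assumptions.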
The basic command is represented below. Do not use this to run pathway predictions. This command is only a representation of all the flags used. See Examples below on how to carry out pathway prediction.
python main.py \
--predict \
--pred-labels \
--pathway-report \
--parse-pf \
--soft-voting \
--object-name "biocyc.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_Xe.pkl" \
--file-name "[input (or save) file name]" \
--model-name "leADS" \
--dsfolder "[input dataset folder]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--batch 50 \
--num-jobs 2
Note:
- Keep in mind that while running the command with --parse-pf, DATANAME in both arguments --X-name "[DATANAME]_Xe.pkl" and --file-name "[DATANAME]" should be the same.
- Use the --parse-pf flag only if the input to leADS is a .pf file(s).
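One way to keep the two names from drifting apart is to derive both from a single value. The sketch below is a hypothetical wrapper, not part of leADS: `build_predict_args` only assembles flags that appear in this tutorial, and it treats a supplied dataset folder as the signal that the input is in .pf format.

```python
def build_predict_args(dataname, dsfolder=None, num_jobs=2):
    """Assemble a leADS prediction command as a list of arguments.

    Both --X-name and --file-name are derived from `dataname`, so they
    cannot disagree. --parse-pf is added only when `dsfolder` is given.
    """
    args = ["python", "main.py", "--predict", "--pred-labels",
            "--pathway-report", "--soft-voting"]
    if dsfolder is not None:
        # Only meaningful when the input consists of .pf files.
        args += ["--parse-pf", "--dsfolder", dsfolder]
    args += [
        "--object-name", "biocyc.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", f"{dataname}_Xe.pkl",
        "--file-name", dataname,
        "--model-name", "leADS",
        "--num-jobs", str(num_jobs),
    ]
    return args
```

The resulting list could be passed to `subprocess.run(args, cwd=...)` with the working directory set to the src/ folder.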
The table below summarizes all the command-line arguments that are specific to this framework:
Argument name | Description | Value |
---|---|---|
--predict | Predicting bags_labels distribution from inputs using leADS | False |
--pred-labels | Predicting labels in input | False |
--pathway-report | Generates a detailed pathway report for each instance | False |
--parse-pf | Boolean variable implying whether to parse PathoLogic format files (.pf) from a given folder | False |
--soft-voting | Boolean variable indicating whether to predict labels based on the calibrated sums of the predicted probabilities from an ensemble | False |
--object-name | The preprocessed MetaCyc database file used for predictions | biocyc.pkl |
--pathway2ec-name | The matrix file representing Pathway-EC association | pathway2ec.pkl |
--pathway2ec-idx-name | The pathway2ec association indices file | pathway2ec_idx.pkl |
--hin-name | The heterogeneous information network file | hin.pkl |
--features-name | The features corresponding to ECs and pathways | pathway2vec_embeddings.npz |
--X-name | The input file name to be provided for predictions | [DATANAME]_Xe.pkl |
--file-name | The names of input preprocessed files (without extension) | [input (or save) file name] |
--model-name | The name of the model excluding any **EXTENSION** | leADS |
--dsfolder | The dataset folder name | [Input dataset folder] |
--ospath | The path to the data object that contains extracted information from the MetaCyc database (e.g. biocyc.pkl) | Outside source code |
--dspath | The path to the datasets | Outside source code |
--mdpath | The path to the pre-trained model (e.g. leADS.pkl) | Outside source code |
--rspath | The path to store results | Outside source code |
--batch | Batch size | 50 |
--num-jobs | The number of parallel workers | 2 |
The output files generated after running the command are:
With the --parse-pf flag (see Example 1):
File | Description |
---|---|
[FILENAME]_Xa.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and abundance features. |
[FILENAME]_Xc.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and coverage features. |
[FILENAME]_Xe.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and embeddings. |
[FILENAME]_Xea.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and abundance features. |
[FILENAME]_Xec.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and coverage features. |
[FILENAME]_Xm.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, abundance, and coverage features. |
[FILENAME]_Xp.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and their embeddings in columns. |
[FILENAME]_y_leads.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each organism (or multi-organisms). |
[FILENAME]_labels_leads.pkl | A file (stored in the "dspath" location) representing a list of pathway names for each organism (or multi-organisms). |
pathway_report.tsv | Pathway report files (tab-separated values) indicating all predicted pathways, their associated scores, abundance values, and the mapped EC numbers. One file is stored per organism ID, each in its own folder in the "rspath" location. |
Without the --parse-pf flag (see Example 2):
File | Description |
---|---|
[FILENAME]_y_leads.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each organism (or multi-organisms). |
[FILENAME]_labels_leads.pkl | A file (stored in the "dspath" location) representing a list of pathway names for each organism (or multi-organisms). |
pathway_report.tsv | Pathway report files (tab-separated values) indicating all predicted pathways, their associated scores, abundance values, and the mapped Enzyme Commission (EC) numbers. One file is stored per organism ID, each in its own folder in the "rspath" location. |
Without the --pathway-report and --parse-pf flags (see Example 3):
File | Description |
---|---|
[FILENAME]_y_leads.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each organism (or multi-organisms). |
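For downstream analysis, the binary matrix can be turned into a list of predicted pathway indices per sample. The sketch below is hedged: the tutorial does not state how the matrix is serialized, so it assumes the pickle holds either a row-iterable of 0/1 values or an object exposing `toarray()` (as SciPy sparse matrices do). The helper names are hypothetical. Only unpickle files you produced yourself or trust.

```python
import pickle


def load_pickle(path):
    """Load a pickled object such as [FILENAME]_y_leads.pkl."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predicted_indices(y):
    """Return, for each row, the column indices whose entry is non-zero."""
    if hasattr(y, "toarray"):
        # Assumed sparse input: densify before iterating over rows.
        y = y.toarray()
    return [[j for j, flag in enumerate(row) if flag] for row in y]
```

Mapping these indices back to pathway names would additionally require the pathway index order used by the model, which is not covered here.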
Example 1: To produce outputs and compile a pathway report from the "three_ecoli" data, generated by MetaPathways v2, using a pre-trained model ("leADS.pkl"), execute the following command:
python main.py --predict --pred-labels --pathway-report --parse-pf --soft-voting --object-name "biocyc.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --pathway2ec-name "pathway2ec.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "three_ecoli_Xe.pkl" --file-name "three_ecoli" --model-name "leADS" --dsfolder "three_ecoli" --num-jobs 2
Upon executing this command, "three_ecoli_Xe.pkl" (along with the other feature files) will be produced. Note that the same name, **three_ecoli**, is used in both arguments: --X-name "three_ecoli_Xe.pkl" and --file-name "three_ecoli".
After running the command, the output will be saved to the dataset/ and result/ folders. Since the --parse-pf flag is used in this example, all the feature files described in the table above are generated. The tree structure for the folder with the outputs will look like this:
leADS_materials/
├── objectset/
│ └── ...
├── model/
│ └── ...
├── dataset/
│ ├── three_ecoli_ids.pkl
│ ├── three_ecoli_X.pkl
│ ├── three_ecoli_Xa.pkl
│ ├── three_ecoli_Xc.pkl
│ ├── three_ecoli_Xe.pkl
│ ├── three_ecoli_Xea.pkl
│ ├── three_ecoli_Xec.pkl
│ ├── three_ecoli_Xm.pkl
│ ├── three_ecoli_Xp.pkl
│ ├── three_ecoli_labels_leads.pkl
│ ├── three_ecoli_y_leads.pkl
│ └── ...
├── result/
│ ├── three_ecoli
│ │ ├── CFT073
│ │ │ └── pathway_report.tsv
│ │ ├── EDL933
│ │ │ └── pathway_report.tsv
│ │ └── MG1655
│ │ └── pathway_report.tsv
│ └── ...
└── leADS/
└── ...
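Given the result/ layout above, the per-sample reports can be located programmatically. This is a small sketch under the assumption that each sample folder sits directly below result/[dataset name] and contains a file named pathway_report.tsv; `find_reports` is a hypothetical helper.

```python
from pathlib import Path


def find_reports(rspath, dataset):
    """Map each sample folder name to its pathway_report.tsv path."""
    base = Path(rspath) / dataset
    if not base.is_dir():
        return {}
    return {
        report.parent.name: report
        for report in sorted(base.glob("*/pathway_report.tsv"))
    }
```

For the example above, `find_reports("leADS_materials/result", "three_ecoli")` would be expected to return entries keyed by CFT073, EDL933, and MG1655.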
An excerpt of the pathway_report.tsv file generated in the result/ directory for E. coli EDL933 is shown below. Each predicted pathway (identified in columns 1 and 2) has an associated score, a prediction flag, an abundance value, a coverage value, and the EC numbers that map to that particular pathway (shown in columns 3 to 7).
FrameID | Name | Score | Predicted | Abundance | Coverage | MappedEC |
---|---|---|---|---|---|---|
VALSYN-PWY | L-valine biosynthesis | 1 | 1 | 9 | 1 | EC-1.1.1.86, EC-2.2.1.6, EC-2.6.1.42, EC-4.2.1.9 |
PWY0-541 | cyclopropane fatty acid (CFA) biosynthesis | 1 | 1 | 1 | 1 | EC-2.1.1.79 |
PWY-5971 | palmitate biosynthesis II (bacteria and plants) | 1 | 1 | 1.24 | 0.45 | EC-1.3.1.9, EC-2.3.1.41, EC-2.3.1.85, EC-6.2.1.3 |
PLPSAL-PWY | pyridoxal 5'-phosphate salvage I | 1 | 1 | 1.16 | 1 | EC-1.4.3.5, EC-2.7.1.35 |
PWY-7227 | adenosine deoxyribonucleotides de novo biosynthesis | 0.80 | 1 | 5 | 1 | EC-1.17.4.1, EC-2.7.4.6 |
GLYOXYLATE-BYPASS | glyoxylate cycle | 1 | 1 | 5 | 0.80 | EC-1.1.1.37, EC-2.3.3.16, EC-2.3.3.9, EC-4.1.3.1 |
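A report like this can be filtered with the standard library. The sketch below assumes the file starts with a header row whose column names match the table above (FrameID, Name, Score, Predicted, Abundance, Coverage, MappedEC) and that MappedEC is a comma-separated list; the real file may differ, and `read_report` is a hypothetical helper.

```python
import csv


def read_report(handle, min_score=0.0):
    """Read a tab-separated pathway report and keep rows scoring at least `min_score`.

    `handle` is any open text file object. MappedEC is split into a list.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["Score"]) < min_score:
            continue
        row["MappedEC"] = [
            ec.strip() for ec in row["MappedEC"].split(",") if ec.strip()
        ]
        rows.append(row)
    return rows
```

A typical call would open the file with `open(path, newline="")` and pass the handle in, for instance with `min_score=0.9` to keep only high-scoring pathways.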
Example 2: To produce outputs (without generating all the feature files) and compile a pathway report from a preprocessed dataset (e.g. "three_ecoli_Xe.pkl" in the above example or "symbionts_Xe.pkl") using a pre-trained model ("leADS.pkl"), execute the following command:
python main.py --predict --pred-labels --pathway-report --soft-voting --object-name "biocyc.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --pathway2ec-name "pathway2ec.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "symbionts_Xe.pkl" --file-name "symbionts" --model-name "leADS" --num-jobs 2
After running the command, the output will be saved to the dataset/ and result/ folders. The output files described in the table above are generated. The tree structure for the folder with the outputs will look like this:
leADS_materials/
├── objectset/
│ └── ...
├── model/
│ └── ...
├── dataset/
│ ├── symbionts_labels_leads.pkl
│ ├── symbionts_y_leads.pkl
│ └── ...
├── result/
│ ├── symbionts
│ │ ├── 0
│ │ │ └── pathway_report.tsv
│ │ ├── 1
│ │ │ └── pathway_report.tsv
│ │ └── 2
│ │ └── pathway_report.tsv
│ └── ...
└── leADS/
└── ...
The pathway_report.tsv file generated in the result/ directory has the same structure as the one shown under Example 1.
Example 3: To produce outputs from a dataset (e.g. "cami_Xe.pkl") without the --pathway-report and --parse-pf flags using a pre-trained model ("leADS.pkl"), execute the following command:
python main.py --predict --pred-labels --soft-voting --X-name "cami_Xe.pkl" --file-name "cami" --model-name "leADS" --num-jobs 2
After running the command, the output will be saved to the dataset/ folder. The output file described in the table above is generated. The tree structure for the folder with the output will look like this:
leADS_materials/
├── objectset/
│ └── ...
├── model/
│ └── ...
├── dataset/
│ ├── cami_y_leads.pkl
│ └── ...
├── result/
│ └── ...
└── leADS/
└── ...