In this study we analyse the effect that padding amino acid sequences has on the performance of deep learning models, using a hierarchical EC number prediction task. We compare three architectures (only_denses, 1_conv, stack_conv) to check whether the choice of architecture changes the effect of padding.
The data used in this study consist of all the reviewed protein sequences of the taxonomy Archaea from UniProt version 2019_05.
The project is almost entirely coded in Python 3.6.7. R version 3.4.4 has been used for building the explanatory linear models (notebooks 07 and 09).
The findings of this analysis are explained in the following article: Lopez-del Rio, A., Martin, M., Perera-Lluna, A. et al. Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci Rep 10, 14634 (2020). https://doi.org/10.1038/s41598-020-71450-8.
- Functions are defined in the src/ folder (.py and .R files)
- The analysis workflow is run through Jupyter Notebooks in the notebooks/ folder. Each file name is prefixed with a number that indicates its position in the execution order.
- raw_data/ folder contains the CSV file with reviewed UniProt entries for Archaea.
- data/ is the folder where intermediate and final results will be stored when running the notebooks.
- 00_creating_data.ipynb
- 01_pre_statistics.ipynb
- 02_task1_training.ipynb
- 03_task1_comparison.ipynb
- 04_task2_training.ipynb
- 05_task2_comparison.ipynb
- 06_comparing_architectures.ipynb
- 07_linear_models_metrics.ipynb
- 08_activations_sequences_pca.ipynb
- 09_linear_models_pcs.ipynb
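
Since the numeric prefix encodes the execution order, the order can be recovered programmatically. The following is a minimal sketch, not part of the repository: the helper name `execution_order` is hypothetical, and the file names are simply those listed above.

```python
import re

# Notebook names as listed above, deliberately shuffled.
notebooks = [
    "08_activations_sequences_pca.ipynb",
    "00_creating_data.ipynb",
    "03_task1_comparison.ipynb",
    "01_pre_statistics.ipynb",
    "09_linear_models_pcs.ipynb",
    "02_task1_training.ipynb",
    "05_task2_comparison.ipynb",
    "04_task2_training.ipynb",
    "07_linear_models_metrics.ipynb",
    "06_comparing_architectures.ipynb",
]


def execution_order(names):
    """Sort notebook file names by their leading numeric prefix (hypothetical helper)."""

    def prefix(name):
        match = re.match(r"(\d+)_", name)
        if match is None:
            raise ValueError(f"no numeric prefix in {name!r}")
        return int(match.group(1))

    return sorted(names, key=prefix)


ordered = execution_order(notebooks)
for name in ordered:
    print(name)
```

Some notebooks need manual edits before they are run, so this sketch only lists the order and does not execute anything.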
The runs have been executed on the following hardware from the B2SLab (Universitat Politecnica de Catalunya):
- tob: 8 threads, 32GB RAM, NVIDIA TITAN Xp GeForce GTX 1070
- lapsus: 12 threads, 32GB RAM, 2 x NVIDIA GeForce GTX 1070
- After cloning the repository, change the absPath variable at the beginning of each notebook and src file to the correct local path.
- Notebooks 02-04 analyse only one architecture per run. Before running a notebook, uncomment the block of variables/parameters in the second cell that corresponds to the architecture under study and comment out the rest.
- In src/Target, all the different types of padding tested in this study are implemented.
- Running notebooks 02-05 and 08 without a GPU may take considerable time and is not recommended.
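
To illustrate what padding means here, the sketch below integer-encodes an amino acid sequence and pads it to a fixed length in three different positions. It is an illustration under stated assumptions, not the code in src/Target: the function names, the mode names ("pre", "post", "both"), the residue-to-integer mapping and the use of 0 as the padding value are all hypothetical, and the padding types actually tested in the study may differ.

```python
# Assumption: the 20 standard residues mapped to 1..20, with 0 reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
PAD = 0


def encode(sequence):
    """Map each residue to its integer code; raises KeyError on unknown residues."""
    return [AA_TO_INT[aa] for aa in sequence]


def pad(encoded, max_len, mode="post"):
    """Pad an encoded sequence with PAD up to max_len (hypothetical helper).

    mode="post": padding after the sequence
    mode="pre":  padding before the sequence
    mode="both": padding split between both ends (extra element on the right)
    """
    if len(encoded) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(encoded)
    if mode == "post":
        return encoded + [PAD] * n_pad
    if mode == "pre":
        return [PAD] * n_pad + encoded
    if mode == "both":
        left = n_pad // 2
        return [PAD] * left + encoded + [PAD] * (n_pad - left)
    raise ValueError(f"unknown padding mode: {mode!r}")


example = encode("MKV")
for mode in ("pre", "post", "both"):
    print(mode, pad(example, 6, mode))
```

All three modes produce inputs of the same length with the same residues; only the position of the padding changes, which is the factor this study analyses.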