Source code of the paper Camele & Hasperué, Performance analysis of the Survival-SVM classifier applied to gene expression databases, CACIC (2023).
To run the code you need to install the dependencies:
- Create a virtual env (only once):
python3 -m venv venv
- Activate the virtual env (every time you use the project):
source venv/bin/activate
- To deactivate the virtual env just run:
deactivate
- Install all the dependencies:
pip install -r requirements.txt
The dataset used is Breast Invasive Carcinoma (TCGA, PanCancer Atlas), which is listed on the cBioPortal datasets page. The files data_clinical_patient.txt and data_RNA_Seq_v2_mRNA_median_Zscores.txt must be placed in the Datasets folder.
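As an illustration of how those files could be read, here is a minimal sketch assuming pandas is available (the project's actual loader in utils.py may differ). cBioPortal clinical files are tab-separated with leading metadata lines prefixed by `#`; the column names below follow cBioPortal conventions but the excerpt itself is made up:

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the layout of data_clinical_patient.txt:
# '#'-prefixed metadata lines, then a tab-separated header and data rows.
clinical_txt = (
    "#Patient Identifier\tOverall Survival Status\tOverall Survival (Months)\n"
    "PATIENT_ID\tOS_STATUS\tOS_MONTHS\n"
    "TCGA-A1-A0SB\t0:LIVING\t47.6\n"
    "TCGA-A1-A0SD\t1:DECEASED\t13.8\n"
)

# The real file would be read the same way from the Datasets folder, e.g.
# pd.read_csv("Datasets/data_clinical_patient.txt", sep="\t", comment="#")
clinical = pd.read_csv(io.StringIO(clinical_txt), sep="\t", comment="#")
print(clinical.shape)
```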
The main code is in the core.py file, where the metaheuristic is executed and the results are reported and saved in a CSV file in the Results folder.
The utils.py file contains data-import functions, dataset label-column binarization, preprocessing, and other useful functions.
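The exact implementation in utils.py is not shown here, but label-column binarization for cBioPortal clinical data typically means mapping the OS_STATUS strings ("0:LIVING" / "1:DECEASED") to a boolean event indicator. A hedged sketch (the function name is hypothetical):

```python
def binarize_os_status(status: str) -> bool:
    """Map cBioPortal's OS_STATUS strings to a boolean event indicator:
    True for an observed death ('1:DECEASED'), False for censored ('0:LIVING')."""
    return status.strip().upper().startswith("1")

# Survival models expect (event, time) label pairs, so the status column
# must be turned into booleans before fitting.
labels = ["0:LIVING", "1:DECEASED", "0:LIVING"]
events = [binarize_os_status(s) for s in labels]
print(events)  # [False, True, False]
```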
The metaheuristic algorithm and its variants can be found in the metaheuristics.py file.
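To give a feel for the structure without reproducing the paper's actual algorithm, here is a generic metaheuristic skeleton for feature selection: candidate feature subsets are proposed as bit masks, evaluated with a fitness function, and the best one is kept. The variants in metaheuristics.py follow the same evaluate-and-keep loop, but this random-search stand-in is illustrative only:

```python
import random

def random_search_feature_selection(n_features, fitness, iters=50, seed=0):
    """Generic skeleton: sample random feature subsets (bit masks), score
    each with `fitness`, and return the best mask found. A stand-in for
    the actual metaheuristic, not the algorithm from the paper."""
    rng = random.Random(seed)
    best_mask, best_score = None, float("-inf")
    for _ in range(iters):
        mask = [rng.random() < 0.5 for _ in range(n_features)]
        if not any(mask):
            continue  # skip empty feature subsets
        score = fitness(mask)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score

# Toy fitness: reward subsets containing features 0 and 2, penalize size.
toy_fitness = lambda m: (m[0] + m[2]) - 0.1 * sum(m)
mask, score = random_search_feature_selection(5, toy_fitness)
```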
The times.py file contains the code to evaluate how long the execution takes with different numbers of features. plot_times.py contains the code to plot these times using the JSON file generated by the get_logs_metrics.py script.
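A timing script of this kind boils down to running the same workload with different feature counts and recording wall-clock time. The sketch below is a hedged illustration (the function and the toy workload are hypothetical, not the contents of times.py):

```python
import json
import time

def time_execution(workload, feature_counts):
    """Run `workload(n)` for each feature count and record elapsed
    wall-clock seconds, the kind of data a timing script could dump
    to JSON for later plotting."""
    timings = {}
    for n in feature_counts:
        start = time.perf_counter()
        workload(n)
        timings[n] = time.perf_counter() - start
    return timings

# Toy workload standing in for training a survival model on n features.
toy = lambda n: sum(i * i for i in range(n * 1000))
timings = time_execution(toy, [10, 50, 100])
print(json.dumps({str(k): round(v, 4) for k, v in timings.items()}))
```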
The get_logs_metrics.py script is useful to create a JSON file from log files (see the commands to run the scripts below) and get the metrics to be plotted in the plot_times.py script.
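Turning raw log files into plottable JSON usually amounts to scanning lines with a regular expression and collecting the matched metrics. The sketch below assumes a hypothetical log-line format (the real layout of logs.txt may differ):

```python
import json
import re

# Hypothetical log-line format; adjust the pattern to the real logs.txt.
LOG_LINE = re.compile(r"features=(?P<n>\d+)\s+time=(?P<secs>[0-9.]+)s")

def logs_to_metrics(lines):
    """Extract (feature count, seconds) pairs from raw log lines and
    return them as a JSON string, roughly the shape of data a script
    like get_logs_metrics.py could hand to a plotting script."""
    metrics = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            metrics.append({"features": int(m["n"]), "seconds": float(m["secs"])})
    return json.dumps(metrics)

sample = ["INFO features=10 time=1.25s", "DEBUG heartbeat", "INFO features=50 time=6.10s"]
print(logs_to_metrics(sample))
```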
To learn more about the Survival-SVM or Random Survival Forest models, read the scikit-survival blog.
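Both model families are commonly evaluated with the concordance index (c-index), which scores how well predicted risks rank patients by survival time. As background, a minimal pure-Python version of Harrell's c-index (scikit-survival ships its own implementation; this is only a didactic sketch):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index: over comparable pairs (the earlier
    time has an observed event), count how often the higher predicted
    risk goes with the shorter survival time. 0.5 ~ random, 1.0 = perfect."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # The pair is comparable only if subject i fails before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties get half credit
    return concordant / comparable

# Risks perfectly anti-ranked with survival time give a c-index of 1.0.
c = concordance_index([2.0, 5.0, 8.0], [True, True, True], [0.9, 0.5, 0.1])
print(c)  # 1.0
```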
Spark has problems importing user-defined modules, so we need to provide a file called scripts.zip that contains all the necessary modules to be distributed among Spark's workers. Run the following commands to get everything working:
- Configure all the experiment's parameters in the times.py file.
- Compress all the needed scripts inside scripts.zip by running:
./zip_modules.sh
- Inside the Spark cluster's master container run:
spark-submit --py-files scripts.zip times.py &> logs.txt &
When the execution is finished, a .csv file named with the exact datetime at which the script was run will remain in the Results folder with all the results obtained. The file logs.txt can be processed by the get_logs_metrics.py script to generate a JSON and plot that data; we recommend storing it in the Logs folder, as the script points to that directory.
This code is distributed under the MIT license.