Source code of the paper Camele & Hasperué, Performance analysis of the Survival-SVM classifier applied to gene expression databases, CACIC (2023).
To run the code you need to install the dependencies:
- Create a virtual env (only once):
python3 -m venv venv
- Activate the virtual env (every time you use the project):
source venv/bin/activate
- To deactivate the virtual env just run:
deactivate
- Install all the dependencies:
pip install -r requirements.txt
The dataset used is Breast Invasive Carcinoma (TCGA, PanCancer Atlas), which is listed on the cBioPortal datasets page. The files data_clinical_patient.txt and data_RNA_Seq_v2_mRNA_median_Zscores.txt must be placed in the Datasets folder.
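As an illustration of how those files could be read, here is a minimal sketch assuming pandas is available (the project's actual loader in utils.py may differ). cBioPortal clinical files are tab-separated with leading metadata lines prefixed by `#`; the column names below follow cBioPortal conventions but the excerpt itself is made up:

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the layout of data_clinical_patient.txt:
# '#'-prefixed metadata lines, then a tab-separated header and data rows.
clinical_txt = (
    "#Patient Identifier\tOverall Survival Status\tOverall Survival (Months)\n"
    "PATIENT_ID\tOS_STATUS\tOS_MONTHS\n"
    "TCGA-A1-A0SB\t0:LIVING\t47.6\n"
    "TCGA-A1-A0SD\t1:DECEASED\t13.8\n"
)

# The real file would be read the same way from the Datasets folder, e.g.
# pd.read_csv("Datasets/data_clinical_patient.txt", sep="\t", comment="#")
clinical = pd.read_csv(io.StringIO(clinical_txt), sep="\t", comment="#")
print(clinical.shape)
```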
The main code is in the core.py file, where the metaheuristic is executed and the results are reported and saved in a CSV file in the Results folder.
The utils.py file contains data-import functions, dataset label-column binarization, preprocessing, and other useful functions.
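The exact implementation in utils.py is not shown here, but label-column binarization for cBioPortal clinical data typically means mapping the OS_STATUS strings ("0:LIVING" / "1:DECEASED") to a boolean event indicator. A hedged sketch (the function name is hypothetical):

```python
def binarize_os_status(status: str) -> bool:
    """Map cBioPortal's OS_STATUS strings to a boolean event indicator:
    True for an observed death ('1:DECEASED'), False for censored ('0:LIVING')."""
    return status.strip().upper().startswith("1")

# Survival models expect (event, time) label pairs, so the status column
# must be turned into booleans before fitting.
labels = ["0:LIVING", "1:DECEASED", "0:LIVING"]
events = [binarize_os_status(s) for s in labels]
print(events)  # [False, True, False]
```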
The metaheuristic algorithm and its variants can be found in the metaheuristics.py file.
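To give a feel for the structure without reproducing the paper's actual algorithm, here is a generic metaheuristic skeleton for feature selection: candidate feature subsets are proposed as bit masks, evaluated with a fitness function, and the best one is kept. The variants in metaheuristics.py follow the same evaluate-and-keep loop, but this random-search stand-in is illustrative only:

```python
import random

def random_search_feature_selection(n_features, fitness, iters=50, seed=0):
    """Generic skeleton: sample random feature subsets (bit masks), score
    each with `fitness`, and return the best mask found. A stand-in for
    the actual metaheuristic, not the algorithm from the paper."""
    rng = random.Random(seed)
    best_mask, best_score = None, float("-inf")
    for _ in range(iters):
        mask = [rng.random() < 0.5 for _ in range(n_features)]
        if not any(mask):
            continue  # skip empty feature subsets
        score = fitness(mask)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score

# Toy fitness: reward subsets containing features 0 and 2, penalize size.
toy_fitness = lambda m: (m[0] + m[2]) - 0.1 * sum(m)
mask, score = random_search_feature_selection(5, toy_fitness)
```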
The times.py file contains the code to evaluate how long the execution takes with different numbers of features. plot_times.py contains the code to plot these times using the JSON file generated by the get_logs_metrics.py script.
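A timing script of this kind boils down to running the same workload with different feature counts and recording wall-clock time. The sketch below is a hedged illustration (the function and the toy workload are hypothetical, not the contents of times.py):

```python
import json
import time

def time_execution(workload, feature_counts):
    """Run `workload(n)` for each feature count and record elapsed
    wall-clock seconds, the kind of data a timing script could dump
    to JSON for later plotting."""
    timings = {}
    for n in feature_counts:
        start = time.perf_counter()
        workload(n)
        timings[n] = time.perf_counter() - start
    return timings

# Toy workload standing in for training a survival model on n features.
toy = lambda n: sum(i * i for i in range(n * 1000))
timings = time_execution(toy, [10, 50, 100])
print(json.dumps({str(k): round(v, 4) for k, v in timings.items()}))
```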
The get_logs_metrics.py script is useful to create a JSON file from log files (see the commands to run the scripts below) and get the metrics to be plotted in the plot_times.py script.
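Turning raw log files into plottable JSON usually amounts to scanning lines with a regular expression and collecting the matched metrics. The sketch below assumes a hypothetical log-line format (the real layout of logs.txt may differ):

```python
import json
import re

# Hypothetical log-line format; adjust the pattern to the real logs.txt.
LOG_LINE = re.compile(r"features=(?P<n>\d+)\s+time=(?P<secs>[0-9.]+)s")

def logs_to_metrics(lines):
    """Extract (feature count, seconds) pairs from raw log lines and
    return them as a JSON string, roughly the shape of data a script
    like get_logs_metrics.py could hand to a plotting script."""
    metrics = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            metrics.append({"features": int(m["n"]), "seconds": float(m["secs"])})
    return json.dumps(metrics)

sample = ["INFO features=10 time=1.25s", "DEBUG heartbeat", "INFO features=50 time=6.10s"]
print(logs_to_metrics(sample))
```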
To learn more about the Survival-SVM or Random Survival Forest models, read the scikit-survival blog.
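Both model families are commonly evaluated with the concordance index (c-index), which scores how well predicted risks rank patients by survival time. As background, a minimal pure-Python version of Harrell's c-index (scikit-survival ships its own implementation; this is only a didactic sketch):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index: over comparable pairs (the earlier
    time has an observed event), count how often the higher predicted
    risk goes with the shorter survival time. 0.5 ~ random, 1.0 = perfect."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # The pair is comparable only if subject i fails before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties get half credit
    return concordant / comparable

# Risks perfectly anti-ranked with survival time give a c-index of 1.0.
c = concordance_index([2.0, 5.0, 8.0], [True, True, True], [0.9, 0.5, 0.1])
print(c)  # 1.0
```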
Spark has problems importing user-defined modules, so we need to provide a file called scripts.zip that contains all the necessary modules to be distributed among Spark's workers. Run the following commands to get everything working:
- Configure all the experiment's parameters in the times.py file.
- Compress all the needed scripts inside scripts.zip by running:
./zip_modules.sh
- Inside the Spark cluster's master container run:
spark-submit --py-files scripts.zip times.py &> logs.txt &
When the execution is finished, a .csv file named with the exact datetime at which the script was run will remain in the Results folder with all the results obtained. The file logs.txt can be processed by the get_logs_metrics.py script to generate a JSON and plot that data; we recommend storing it in the Logs folder, as the script points to that directory.
This code is distributed under the MIT license.