Cracking the blackbox of deep sequence-based protein-protein interaction prediction

This repository contains all datasets and code used to show that sequence-based deep PPI prediction methods achieve their phenomenal results only because of data leakage and because they learn from sequence similarities and node degrees.

We used git-lfs to store some of the files, so make sure to install it before cloning this repository. Most of the code can be run with our main environment (mac, linux). For PIPR, however, a custom environment is needed (mac, linux).

The bioRxiv preprint for this paper is available at https://www.biorxiv.org/content/10.1101/2023.01.18.524543v2. The AIMe Report for this project is available at https://aime-registry.org/report/VRPXym.

Datasets

The original Guo and Huang datasets were obtained from DeepFE and can be found either in their GitHub Repository or under algorithms/DeepFE-PPI/dataset/11188/ (Guo) and algorithms/DeepFE-PPI/dataset/human/ (Huang). The Guo dataset can also be found in the PIPR repository or under algorithms/seq_ppi/yeast/preprocessed/.

The original Du dataset was obtained from the original publication and can be found under Datasets_PPIs/Du_yeast_DIP/.

The Pan dataset can be obtained from the original publication and from the PIPR Repository. It is in algorithms/seq_ppi/sun/preprocessed/.

The Richoux datasets were obtained from their Gitlab. The regular dataset is in algorithms/DeepPPI/data/mirror/, the strict one in algorithms/DeepPPI/data/mirror/double/.

The unbalanced D-SCRIPT dataset was obtained from their Zenodo repository. It is in algorithms/D-SCRIPT-main/dscript-data/.

All original datasets were rewritten into the format used by SPRINT and split into train and test with algorithms/SPRINT/create_SPRINT_datasets.py. They are in algorithms/SPRINT/data/original. The same script was also used to rewire and split the datasets (generate_RDPN); the rewired datasets are in algorithms/SPRINT/data/rewired. Before you run this script, you have to run compute_sim_matrix.py.
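The exact rewiring is implemented in create_SPRINT_datasets.py (generate_RDPN); the snippet below is only an illustrative sketch of one standard way to obtain a degree-preserving rewiring, using networkx double-edge swaps on the positive interactions (function and variable names are ours, not the repository's).

# Illustrative sketch only (not create_SPRINT_datasets.py): rewire a positive
# PPI network while keeping every protein's node degree fixed by repeatedly
# swapping the endpoints of two randomly chosen edges.
import networkx as nx

def rewire_ppis(positive_ppis, swaps_per_edge=10, seed=1234):
    g = nx.Graph(positive_ppis)
    nx.double_edge_swap(
        g,
        nswap=swaps_per_edge * g.number_of_edges(),
        max_tries=100 * swaps_per_edge * g.number_of_edges(),
        seed=seed,
    )
    return list(g.edges())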

Partitions

For the length normalization of the bitscores, calculate the protein lengths by running the following in the Datasets_PPIs/SwissProt directory:

awk '/^>/ {printf("%s\t",substr($0,2)); next;} {print length}' yeast_swissprot_oneliner.fasta > yeast_proteins_lengths.txt
awk '/^>/ {printf("%s\t",substr($0,2)); next;} {print length}' human_swissprot_oneliner.fasta > human_proteins_lengths.txt

The human and yeast proteomes were downloaded from UniProt and sent to the team of SIMAP2. They sent back the similarity data, which we make available at https://doi.org/10.6084/m9.figshare.21510939 (submatrix.tsv.gz). Download it and unzip it into network_data/SIMAP2.

We preprocessed this data with simap_preprocessing.py in order to feed it to the KaHIP kaffpa algorithm (a minimal sketch of the METIS export follows the list):

  1. We separated the file to obtain only human-human and yeast-yeast protein similarities.
  2. We converted the edge lists to networks and mapped the UniProt node labels to integer labels because KaHIP needs METIS files as input, and these can only handle integer node labels.
  3. We exported the networks as METIS files with normalized bitscores as edge weights: human, yeast.
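The actual export is done in simap_preprocessing.py; the following is only a minimal sketch of writing a weighted METIS .graph file from a networkx graph, assuming the edge weights already hold the normalized bitscores. METIS requires positive integer weights, so they are scaled and rounded here; the scaling factor is our assumption.

# Minimal sketch (not simap_preprocessing.py): export a weighted networkx
# graph as a METIS .graph file with 1-based integer node labels.
import networkx as nx

def write_metis(graph, path, weight="weight", scale=1000):
    mapping = {node: i + 1 for i, node in enumerate(sorted(graph.nodes()))}
    relabeled = nx.relabel_nodes(graph, mapping)
    with open(path, "w") as out:
        # header: <#nodes> <#edges> 1 -- the trailing 1 marks weighted edges
        out.write(f"{relabeled.number_of_nodes()} {relabeled.number_of_edges()} 1\n")
        for node in range(1, relabeled.number_of_nodes() + 1):
            entries = []
            for neighbor, data in sorted(relabeled[node].items()):
                # METIS only accepts positive integer edge weights
                entries.append(f"{neighbor} {max(1, round(data.get(weight, 1.0) * scale))}")
            out.write(" ".join(entries) + "\n")
    return mapping  # UniProt ID -> integer label, needed to map partitions back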

If you're using a Mac, you can use our compiled KaHIP version. On Linux, make sure you have OpenMPI installed and run the following commands:

rm -r KaHIP
git clone https://github.com/KaHIP/KaHIP
cd KaHIP/
./compile_withcmake.sh
cd ..

Then, feed the METIS files to the KaHIP kaffpa algorithm with the following commands:

./KaHIP/deploy/kaffpa ./network_data/SIMAP2/human_networks/only_human_bitscore_normalized.graph --seed=1234 --output_filename="./network_data/SIMAP2/human_networks/only_human_partition_bitscore_normalized.txt" --k=2 --preconfiguration=strong
./KaHIP/deploy/kaffpa ./network_data/SIMAP2/yeast_networks/only_yeast_bitscore_normalized.graph --seed=1234 --output_filename="./network_data/SIMAP2/yeast_networks/only_yeast_partition_bitscore_normalized.txt" --k=2 --preconfiguration=strong

The output files containing the partitioning were mapped back to the original UniProt IDs in kahip.py. Nodelists: human, yeast.
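The actual mapping is implemented in kahip.py; the following is only a minimal sketch, assuming that kaffpa writes one block ID per line in the order of the integer node labels used in the METIS file, and that mapping is the UniProt-ID-to-integer dictionary returned by the export step above.

# Minimal sketch (not kahip.py): translate kaffpa's per-node block assignments
# back to UniProt IDs. 'mapping' maps UniProt ID -> 1-based METIS node label.
def read_partition(partition_file, mapping):
    with open(partition_file) as f:
        blocks = [int(line) for line in f if line.strip()]
    label_to_id = {label: uid for uid, label in mapping.items()}
    # line i of the output file holds the block of node i (1-based labels)
    return {label_to_id[i + 1]: block for i, block in enumerate(blocks)}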

The PPIs from the 7 original datasets were then split according to the KaHIP partitions into the blocks Inter, Intra-0, and Intra-1 with rewrite_datasets.py; they are in algorithms/SPRINT/data/partitions.
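rewrite_datasets.py implements the actual split; conceptually, each PPI is assigned to a block based on the partition blocks of its two proteins, as in this minimal sketch (names are ours, not the repository's).

# Minimal sketch (not rewrite_datasets.py): assign each PPI to Intra-0,
# Intra-1, or Inter based on the KaHIP blocks of its two proteins.
def assign_to_blocks(ppis, protein_to_block):
    result = {"Intra-0": [], "Intra-1": [], "Inter": []}
    for p1, p2 in ppis:
        b1, b2 = protein_to_block.get(p1), protein_to_block.get(p2)
        if b1 is None or b2 is None:
            continue  # drop PPIs whose proteins are missing from the partition
        result[f"Intra-{b1}" if b1 == b2 else "Inter"].append((p1, p2))
    return result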

Gold Standard Dataset

We wanted our gold standard dataset to be split into training, validation, and testing, with no overlap between the three sets and as little sequence similarity between them as possible, so that the methods have to learn more complex features. Hence, we partitioned the human proteome into three parts by running:

./KaHIP/deploy/kaffpa ./network_data/SIMAP2/human_networks/only_human_bitscore_normalized.graph --seed=1234 --output_filename="./network_data/SIMAP2/human_networks/only_human_partition_3_bitscore_normalized.txt" --k=3 --preconfiguration=strong

Then, the HIPPIE v2.3 database was downloaded from their website. The dataset was split into training, validation, and testing using this partition. Negative PPIs were sampled randomly such that the node degrees of the proteins from the positive dataset were preserved in expectation in the negative dataset (a minimal sketch of this sampling is given at the end of this section).

The resulting blocks Intra-0, Intra-1, and Intra-2 were redundancy-reduced using CD-HIT. CD-HIT was cloned from their GitHub and built following the instructions given there. The datasets were redundancy-reduced at 40% pairwise sequence similarity by first exporting their FASTA sequences and then running:

./cdhit/cd-hit -i Datasets_PPIs/Hippiev2.3/Intra_0.fasta -o sim_intra0.out -c 0.4 -n 2
./cdhit/cd-hit -i Datasets_PPIs/Hippiev2.3/Intra_1.fasta -o sim_intra1.out -c 0.4 -n 2
./cdhit/cd-hit -i Datasets_PPIs/Hippiev2.3/Intra_2.fasta -o sim_intra2.out -c 0.4 -n 2

Redundancy was also reduced between the datasets:

./cdhit/cd-hit-2d -i Datasets_PPIs/Hippiev2.3/Intra_0.fasta -i2 Datasets_PPIs/Hippiev2.3/Intra_1.fasta -o Datasets_PPIs/Hippiev2.3/sim_intra0_intra_1.out -c 0.4 -n 2
./cdhit/cd-hit-2d -i Datasets_PPIs/Hippiev2.3/Intra_0.fasta -i2 Datasets_PPIs/Hippiev2.3/Intra_2.fasta -o Datasets_PPIs/Hippiev2.3/sim_intra0_intra_2.out -c 0.4 -n 2
./cdhit/cd-hit-2d -i Datasets_PPIs/Hippiev2.3/Intra_1.fasta -i2 Datasets_PPIs/Hippiev2.3/Intra_2.fasta -o Datasets_PPIs/Hippiev2.3/sim_intra1_intra_2.out -c 0.4 -n 2

Then, the redundant sequences were extracted from the output files with the following commands:

less Datasets_PPIs/Hippiev2.3/sim_intra0.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra0.txt
less Datasets_PPIs/Hippiev2.3/sim_intra1.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra1.txt
less Datasets_PPIs/Hippiev2.3/sim_intra2.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra2.txt

less Datasets_PPIs/Hippiev2.3/sim_intra0_intra_1.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra01.txt
less Datasets_PPIs/Hippiev2.3/sim_intra0_intra_2.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra02.txt
less Datasets_PPIs/Hippiev2.3/sim_intra1_intra_2.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra12.txt

These redundant sequences were then filtered out of training, validation, and testing. All Python code for this task can be found in create_gold_standard.py. The data is available at https://doi.org/10.6084/m9.figshare.21591618.v2.
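The sampling mentioned above (negative PPIs with node degrees preserved in expectation) is part of create_gold_standard.py; the following is only a minimal sketch of the idea, drawing both endpoints of every candidate negative pair proportionally to their degrees in the positive set (all names are ours).

# Minimal sketch (not create_gold_standard.py): sample negative PPIs so that
# each protein's expected degree in the negative set matches its degree in
# the positive set, by drawing endpoints proportionally to positive degrees.
import random

def sample_negatives(positives, n_negatives, seed=1234):
    rng = random.Random(seed)
    positive_set = {frozenset(pair) for pair in positives}
    # every protein occurs once per positive PPI it participates in, so
    # sampling from this list is proportional to the positive node degrees
    degree_weighted_proteins = [p for pair in positives for p in pair]
    negatives = set()
    while len(negatives) < n_negatives:
        p1 = rng.choice(degree_weighted_proteins)
        p2 = rng.choice(degree_weighted_proteins)
        if p1 != p2 and frozenset((p1, p2)) not in positive_set:
            negatives.add(frozenset((p1, p2)))
    return [tuple(sorted(pair)) for pair in negatives]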

Methods

Custom baseline ML methods

Our 6 similarity-based baseline ML methods are implemented in algorithms/Custom/.

  1. We converted the SIMAP2 similarity table into an all-against-all similarity matrix in algorithms/Custom/compute_sim_matrix.py.
  2. We reduced the dimensionality of this matrix via PCA, MDS, and node2vec in algorithms/Custom/compute_dim_red.py (a minimal sketch of this pipeline is given at the end of this subsection):
    1. PCA human, yeast
    2. MDS human, yeast
    3. For node2vec, we first converted the similarity matrix into a network and exported its edgelist (human, yeast) and nodelist (human, yeast). Then, we called node2vec as described below.

If you have a Mac, you can use the precompiled node2vec binaries. On Linux, follow these steps:

rm -r snap
git clone https://github.com/snap-stanford/snap.git
cd snap
make all
cd ..

Then, call node2vec with

cd snap/examples/node2vec
./node2vec -i:../../../algorithms/Custom/data/yeast.edgelist -o:../../../algorithms/Custom/data/yeast.emb
./node2vec -i:../../../algorithms/Custom/data/human.edgelist -o:../../../algorithms/Custom/data/human.emb

The RF, SVM, and the 2 node topology methods are implemented in algorithms/Custom/learn_models.py. All tests are executed in algorithms/Custom/run.py. Results were saved to the results folder.
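The full implementations are in compute_dim_red.py and learn_models.py; the sketch below only illustrates the overall pipeline under our own assumptions: PCA-reduce the all-against-all similarity matrix so that every protein becomes a low-dimensional vector, represent each PPI by concatenating its two protein vectors, and train a random forest on these pair features. Function names and parameters are ours.

# Minimal sketch (not the repository code): PCA-reduce an all-against-all
# similarity matrix and train a random forest on concatenated pair features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def protein_embeddings(sim_matrix, protein_ids, n_components=64, seed=42):
    # each row of the similarity matrix is one protein's similarity profile
    emb = PCA(n_components=n_components, random_state=seed).fit_transform(sim_matrix)
    return {pid: emb[i] for i, pid in enumerate(protein_ids)}

def pair_features(ppis, embeddings):
    return np.array([np.concatenate((embeddings[p1], embeddings[p2])) for p1, p2 in ppis])

def train_rf(train_ppis, train_labels, embeddings, seed=42):
    clf = RandomForestClassifier(n_estimators=500, random_state=seed, n_jobs=-1)
    clf.fit(pair_features(train_ppis, embeddings), train_labels)
    return clf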

DeepFE

The code was pulled from their GitHub Repository and updated to the current TensorFlow version. All tests can be run via the Slurm shell script or algorithms/DeepFE-PPI/train_all_datasets.py. Results were saved to the results folder.

Richoux-FC and Richoux-LSTM

The code was pulled from their Gitlab Repository. All tests can be run via the Slurm shell script or algorithms/DeepPPI/keras/train_all_datasets.py. Results were saved to the results folder.

PIPR

The code was pulled from their GitHub Repository and updated to the current TensorFlow version. Activate the PIPR environment before running any PIPR code! All tests can be run via the Slurm shell script or algorithms/seq_ppi/binary/model/lasagna/train_all_datasets.py. Results were saved to the results folder.

D-SCRIPT and Topsy-Turvy

The code was pulled from their GitHub Repository. Embeddings were calculated for all human and all yeast proteins:

dscript embed --seqs Datasets_PPIs/SwissProt/human_swissprot.fasta --outfile human_embedding.h5
dscript embed --seqs Datasets_PPIs/SwissProt/yeast_swissprot.fasta --outfile yeast_embedding.h5

All tests can be run via the slurm scripts in the D-SCRIPT folder or via command line:

dscript train --train train.txt --test test.txt --embedding embedding.h5 --save-prefix ./models/dscript_model -o ./results_dscript/result_training.txt -d 0
dscript evaluate --test test.txt --embedding embedding.h5 --model ./models/dscript_model_final.sav  -o ./results_dscript/result.txt -d 0

dscript train --topsy-turvy --train train.txt --test test.txt --embedding embedding.h5 --save-prefix ./models/tt_model -o ./results_tt/result_training.txt -d 0
dscript evaluate --test test.txt --embedding embedding.h5 --model ./models/tt_model_final.sav  -o ./results_tt/result.txt -d 0

Result metrics were calculated with compute_metrics.py.
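compute_metrics.py contains the actual evaluation; the following is only a minimal, generic sketch of computing AUC and AUPR from true labels and predicted interaction probabilities with scikit-learn.

# Minimal sketch (not compute_metrics.py): AUC and AUPR from labels/scores.
from sklearn.metrics import roc_auc_score, average_precision_score

def compute_scores(y_true, y_score):
    return {
        "AUC": roc_auc_score(y_true, y_score),              # area under the ROC curve
        "AUPR": average_precision_score(y_true, y_score),   # area under the precision-recall curve
    }

# example usage with dummy values
print(compute_scores([1, 0, 1, 1, 0], [0.9, 0.2, 0.7, 0.6, 0.4]))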

SPRINT

The code was pulled from their GitHub Repository. You need a g++ compiler and the Boost library (http://www.boost.org/) to compile the source code.

After downloading Boost, move it to a suitable directory such as /usr/local/. Edit the makefile and adapt the path to Boost (-I /usr/local/boost_1_80_0). Then run:

cd algorithms/SPRINT
mkdir bin
make predict_interactions_serial
make compute_HSPs_serial 

The yeast proteome FASTA file was first transformed with rewrite_yeast_fasta.py such that each sequence occupies only one line.
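rewrite_yeast_fasta.py does this in the repository; below is only a minimal, generic sketch of flattening a multi-line FASTA file (the example call at the bottom is purely illustrative).

# Minimal sketch (not rewrite_yeast_fasta.py): put every FASTA sequence on a
# single line so that downstream scripts can parse the file line by line.
def flatten_fasta(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        seq = []
        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if seq:
                    fout.write("".join(seq) + "\n")
                    seq = []
                fout.write(line + "\n")
            else:
                seq.append(line)
        if seq:
            fout.write("".join(seq) + "\n")

flatten_fasta("Datasets_PPIs/SwissProt/yeast_swissprot.fasta",
              "Datasets_PPIs/SwissProt/yeast_swissprot_oneliner.fasta")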

Then, the proteome was preprocessed with compute_yeast_HSPs.sh.

The preprocessed human proteome was downloaded from the SPRINT website (precomputed similarities). After downloading the data, move it to the HSP folder in algorithms/SPRINT.

Then, tests can be run via the Slurm shell scripts: original, rewired, partitions. Results were saved to the results folder. AUCs and AUPRs were calculated with algorithms/SPRINT/results/calculate_scores.py.

Visualizations

All visualizations were made with the R scripts in visualizations. Plots were saved to visualizations/plots.