This repository contains all datasets and code used to show that sequence-based deep PPI prediction methods achieve their phenomenal results only because of data leakage and because they learn from sequence similarities and node degrees.
We used git-lfs to store some of the files, so make sure to install it before cloning this repository.
Most of the code can be run with our main environment (mac, linux). For PIPR, however, a custom environment is needed (mac, linux).
The bioRxiv preprint for this paper is available at https://www.biorxiv.org/content/10.1101/2023.01.18.524543v2. The AIMe Report for this project is available at https://aime-registry.org/report/VRPXym.
The original Guo and Huang datasets were obtained from DeepFE and can be found either in their GitHub repository or under algorithms/DeepFE-PPI/dataset/11188/ (Guo) and algorithms/DeepFE-PPI/dataset/human/ (Huang). The Guo dataset can also be found in the PIPR repository or under algorithms/seq_ppi/yeast/preprocessed/.
The original Du dataset was obtained from the original publication and can be found under Datasets_PPIs/Du_yeast_DIP/. The Pan dataset can be obtained from the original publication and from the PIPR repository; it is in algorithms/seq_ppi/sun/preprocessed/.
The Richoux datasets were obtained from their GitLab. The regular dataset is in algorithms/DeepPPI/data/mirror/, the strict one in algorithms/DeepPPI/data/mirror/double/. The unbalanced D-SCRIPT dataset was obtained from their Zenodo repository. It is in algorithms/D-SCRIPT-main/dscript-data/.
All original datasets were rewritten into the format used by SPRINT and split into train and test with algorithms/SPRINT/create_SPRINT_datasets.py. They are in algorithms/SPRINT/data/original. This script was also used to rewire and split the datasets (generate_RDPN) (-> algorithms/SPRINT/data/rewired). Before you run this script, you have to run compute_sim_matrix.py.
For the length-normalization of the bitscores, calculate the protein lengths by running the following in the Datasets_PPIs/SwissProt directory:
awk '/^>/ {printf("%s\t",substr($0,2)); next;} {print length}' yeast_swissprot_oneliner.fasta > yeast_proteins_lengths.txt
awk '/^>/ {printf("%s\t",substr($0,2)); next;} {print length}' human_swissprot_oneliner.fasta > human_proteins_lengths.txt
The human and yeast proteomes were downloaded from UniProt and sent to the team of SIMAP2. They sent back the similarity data, which we make available at https://doi.org/10.6084/m9.figshare.21510939 (submatrix.tsv.gz). Download this file and unzip it in network_data/SIMAP2.
We preprocessed this data in order to give it to the KaHIP kaffpa algorithm with simap_preprocessing.py:
- We separated the file to obtain only human-human and yeast-yeast protein similarities.
- We converted the edge lists to networks and converted the UniProt node labels to integer labels because KaHIP needs METIS files as input. These files can only handle integer node labels.
- We exported the networks as METIS files with normalized bitscores as edge weights: human, yeast.
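For reference, a METIS graph file starts with a header line `<#nodes> <#edges> <fmt>` (the fmt flag `001` declares edge weights) and then lists, for node i (1-indexed), its neighbors and edge weights. A minimal sketch of such an export; the actual code in simap_preprocessing.py may differ, e.g. in how the float bitscores are scaled to the positive integer weights METIS requires:

```python
def write_metis(adjacency: dict, path: str) -> None:
    """Write an undirected weighted graph in METIS format.
    `adjacency` maps integer node ids (0..n-1) to {neighbor: weight}.
    METIS node ids are 1-based; fmt flag '001' declares edge weights."""
    n = len(adjacency)
    m = sum(len(nbrs) for nbrs in adjacency.values()) // 2  # each edge stored twice
    with open(path, "w") as f:
        f.write(f"{n} {m} 001\n")  # header: #nodes #edges fmt
        for u in range(n):
            parts = []
            for v, w in sorted(adjacency[u].items()):
                parts.append(f"{v + 1} {int(w)}")  # METIS wants positive integer weights
            f.write(" ".join(parts) + "\n")

# toy example: path graph 0-1-2 with (scaled) bitscore weights
adj = {0: {1: 5}, 1: {0: 5, 2: 3}, 2: {1: 3}}
write_metis(adj, "toy.graph")
```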
If you're using a Mac, you can use our compiled KaHIP version. On Linux, make sure you have OpenMPI installed and run the following commands:
rm -r KaHIP
git clone https://github.com/KaHIP/KaHIP
cd KaHIP/
./compile_withcmake.sh
cd ..
Then, feed the METIS files to the KaHIP kaffpa algorithm with the following commands:
./KaHIP/deploy/kaffpa ./network_data/SIMAP2/human_networks/only_human_bitscore_normalized.graph --seed=1234 --output_filename="./network_data/SIMAP2/human_networks/only_human_partition_bitscore_normalized.txt" --k=2 --preconfiguration=strong
./KaHIP/deploy/kaffpa ./network_data/SIMAP2/yeast_networks/only_yeast_bitscore_normalized.graph --seed=1234 --output_filename="./network_data/SIMAP2/yeast_networks/only_yeast_partition_bitscore_normalized.txt" --k=2 --preconfiguration=strong
The output files containing the partitioning were mapped back to the original UniProt IDs in kahip.py. Node lists: human, yeast.
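kaffpa writes one block id per line, in node order, so the mapping back to accessions only needs the integer-to-UniProt relabeling from the preprocessing step. A sketch of that step (function and variable names here are illustrative, not the actual ones from kahip.py):

```python
def partition_to_nodelists(partition_file: str, int_to_uniprot: dict) -> dict:
    """Group the original UniProt accessions by KaHIP block.
    kaffpa's output contains one block id per line; line i (0-based)
    belongs to the node with integer label i."""
    blocks = {}
    with open(partition_file) as f:
        for node_id, line in enumerate(f):
            blocks.setdefault(int(line), []).append(int_to_uniprot[node_id])
    return blocks
```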
The PPIs from the 7 original datasets were then split according to the KaHIP partitions into the blocks Inter, Intra-0, and Intra-1 with rewrite_datasets.py; they are in algorithms/SPRINT/data/partitions.
We wanted our gold standard dataset to be split into training, validation, and testing. There should be no overlaps between the three datasets and only minimal sequence similarity between them, so that the methods are forced to learn more complex features. Hence, we partitioned the human proteome into three parts by running:
./KaHIP/deploy/kaffpa ./network_data/SIMAP2/human_networks/only_human_bitscore_normalized.graph --seed=1234 --output_filename="./network_data/SIMAP2/human_networks/only_human_partition_3_bitscore_normalized.txt" --k=3 --preconfiguration=strong
Then, the HIPPIE v2.3 database was downloaded from their website. The dataset was split into training, validation, and testing using the partition. Negative PPIs were sampled randomly; the node degrees of the proteins from the positive dataset were preserved in expectation in the negative dataset. The resulting blocks Intra-0, Intra-1, and Intra-2 were redundancy-reduced using CD-HIT. CD-HIT was cloned from their GitHub and built following the instructions given there. The datasets were redundancy-reduced at 40% pairwise sequence similarity by first exporting their fasta sequences and then running:
./cdhit/cd-hit -i Datasets_PPIs/Hippiev2.3/Intra_0.fasta -o sim_intra0.out -c 0.4 -n 2
./cdhit/cd-hit -i Datasets_PPIs/Hippiev2.3/Intra_1.fasta -o sim_intra1.out -c 0.4 -n 2
./cdhit/cd-hit -i Datasets_PPIs/Hippiev2.3/Intra_2.fasta -o sim_intra2.out -c 0.4 -n 2
Redundancy was also reduced between the datasets:
./cdhit/cd-hit-2d -i Datasets_PPIs/Hippiev2.3/Intra_0.fasta -i2 Datasets_PPIs/Hippiev2.3/Intra_1.fasta -o Datasets_PPIs/Hippiev2.3/sim_intra0_intra_1.out -c 0.4 -n 2
./cdhit/cd-hit-2d -i Datasets_PPIs/Hippiev2.3/Intra_0.fasta -i2 Datasets_PPIs/Hippiev2.3/Intra_2.fasta -o Datasets_PPIs/Hippiev2.3/sim_intra0_intra_2.out -c 0.4 -n 2
./cdhit/cd-hit-2d -i Datasets_PPIs/Hippiev2.3/Intra_1.fasta -i2 Datasets_PPIs/Hippiev2.3/Intra_2.fasta -o Datasets_PPIs/Hippiev2.3/sim_intra1_intra_2.out -c 0.4 -n 2
Then, the redundant sequences, i.e., cluster members whose lines end with an identity percentage rather than the representative's '*', were extracted from the output files:
less Datasets_PPIs/Hippiev2.3/sim_intra0.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra0.txt
less Datasets_PPIs/Hippiev2.3/sim_intra1.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra1.txt
less Datasets_PPIs/Hippiev2.3/sim_intra2.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra2.txt
less Datasets_PPIs/Hippiev2.3/sim_intra0_intra_1.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra01.txt
less Datasets_PPIs/Hippiev2.3/sim_intra0_intra_2.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra02.txt
less Datasets_PPIs/Hippiev2.3/sim_intra1_intra_2.out.clstr| grep -E '([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}).*%$'|cut -d'>' -f2|cut -d'.' -f1 > Datasets_PPIs/Hippiev2.3/redundant_intra12.txt
and filtered out of training, validation, and testing. All Python code for this task can be found in create_gold_standard.py. The data is available at https://doi.org/10.6084/m9.figshare.21591618.v2.
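The grep pipeline above keeps the non-representative cluster members (their lines end with "at <identity>%", whereas each cluster's representative ends with "*") and cuts out their UniProt accessions. An equivalent, more readable sketch in Python, assuming the standard CD-HIT .clstr layout:

```python
import re

def redundant_accessions(clstr_path: str) -> list:
    """Collect non-representative members from a CD-HIT .clstr file.
    Member lines look like: '1\t350aa, >P12345... at 65.71%';
    representative lines end with '*' and are skipped."""
    redundant = []
    with open(clstr_path) as f:
        for line in f:
            if line.strip().endswith("%"):
                match = re.search(r">(\S+?)\.\.\.", line)
                if match:
                    redundant.append(match.group(1))
    return redundant
```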
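The degree-preserving negative sampling mentioned above can be sketched as follows: drawing each endpoint from a pool in which every protein appears once per incident positive edge makes a protein's expected degree in the negative set proportional to its degree in the positive set. This is only an illustration; the actual implementation in create_gold_standard.py may differ in its details:

```python
import random

def sample_negatives(pos_edges: list, n_neg: int, seed: int = 1234) -> list:
    """Sample random non-interacting pairs; each endpoint is drawn
    proportionally to its positive-network degree, so node degrees are
    preserved in expectation in the negative dataset.
    Assumes n_neg valid non-edges exist (otherwise this loops forever)."""
    rng = random.Random(seed)
    pos = {frozenset(e) for e in pos_edges}
    # endpoint pool: every node appears once per incident positive edge
    pool = [p for e in pos_edges for p in e]
    negatives = set()
    while len(negatives) < n_neg:
        u, v = rng.choice(pool), rng.choice(pool)
        if u != v and frozenset((u, v)) not in pos:
            negatives.add((min(u, v), max(u, v)))
    return sorted(negatives)
```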
Our 6 similarity-based baseline ML methods are implemented in algorithms/Custom/.
- We converted the SIMAP2 similarity table into an all-against-all similarity matrix in algorithms/Custom/compute_sim_matrix.py.
- We reduced the dimensionality of this matrix via PCA, MDS, and node2vec in algorithms/Custom/compute_dim_red.py:
If you have a Mac, you can use the precompiled node2vec binaries. On Linux, follow these steps:
rm -r snap
git clone https://github.com/snap-stanford/snap.git
cd snap
make all
cd ..
Then, call node2vec with:
cd snap/examples/node2vec
./node2vec -i:../../../algorithms/Custom/data/yeast.edgelist -o:../../../algorithms/Custom/data/yeast.emb
./node2vec -i:../../../algorithms/Custom/data/human.edgelist -o:../../../algorithms/Custom/data/human.emb
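The resulting .emb files follow the word2vec text format: a header line `<num_nodes> <dimensions>`, then one line per node with its integer label and embedding vector. A small loader sketch (our pipeline may parse these files differently):

```python
def load_emb(path: str) -> dict:
    """Read node2vec output: header '<n> <dim>', then
    '<node_id> <dim floats>' on each following line."""
    with open(path) as f:
        n, dim = map(int, f.readline().split())
        vectors = {}
        for line in f:
            fields = line.split()
            vectors[int(fields[0])] = [float(x) for x in fields[1:]]
    # sanity check against the header
    assert len(vectors) == n and all(len(v) == dim for v in vectors.values())
    return vectors
```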
The RF, SVM, and the 2 node topology methods are implemented in algorithms/Custom/learn_models.py. All tests are executed in algorithms/Custom/run.py. Results were saved to the results folder.
The code was pulled from their GitHub Repository and updated to the current tensorflow version. All tests are run via the shell slurm script or algorithms/DeepFE-PPI/train_all_datasets.py. Results were saved to the results folder.
The code was pulled from their Gitlab Repository. All tests can be run via the shell slurm script or algorithms/DeepPPI/keras/train_all_datasets.py. Results were saved to the results folder.
The code was pulled from their GitHub Repository and updated to the current tensorflow version. Activate the PIPR environment for running all PIPR code! All tests are run via the shell slurm script or algorithms/seq_ppi/binary/model/lasagna/train_all_datasets.py. Results were saved to the results folder.
The code was pulled from their GitHub Repository. Embeddings were calculated for all human and all yeast proteins:
dscript embed --seqs Datasets_PPIs/SwissProt/human_swissprot.fasta --outfile human_embedding.h5
dscript embed --seqs Datasets_PPIs/SwissProt/yeast_swissprot.fasta --outfile yeast_embedding.h5
All tests can be run via the slurm scripts in the D-SCRIPT folder or via command line:
dscript train --train train.txt --test test.txt --embedding embedding.h5 --save-prefix ./models/dscript_model -o ./results_dscript/result_training.txt -d 0
dscript evaluate --test test.txt --embedding embedding.h5 --model ./models/dscript_model_final.sav -o ./results_dscript/result.txt -d 0
dscript train --topsy-turvy --train train.txt --test test.txt --embedding embedding.h5 --save-prefix ./models/tt_model -o ./results_tt/result_training.txt -d 0
dscript evaluate --test test.txt --embedding embedding.h5 --model ./models/tt_model_final.sav -o ./results_tt/result.txt -d 0
Result metrics were calculated with compute_metrics.py.
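For orientation, AUC-ROC can be computed without any library via the rank-sum formulation; compute_metrics.py itself may rely on scikit-learn instead:

```python
def roc_auc(labels: list, scores: list) -> float:
    """AUC-ROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive is scored higher than a randomly chosen
    negative (score ties are ignored in this simple sketch)."""
    ranked = sorted(zip(scores, labels))  # ascending by score
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, 1) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```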
The code was pulled from their GitHub Repository. You need a g++ compiler and the boost library (http://www.boost.org/) to compile the source code.
After downloading boost, move it to a fitting directory like /usr/local/. Edit the makefile and adapt the path to boost (-I /usr/local/boost_1_80_0).
Then run:
cd algorithms/SPRINT
mkdir bin
make predict_interactions_serial
make compute_HSPs_serial
The yeast proteome fasta file was first transformed with rewrite_yeast_fasta.py such that each sequence occupies only one line.
Then, the proteome was preprocessed with compute_yeast_HSPs.sh.
The preprocessed human proteome was downloaded from the SPRINT website (precomputed similarities). After downloading the data, move it to the HSP folder in algorithms/SPRINT.
Then tests are run via shell slurm scripts: original, rewired, partitions. Results were saved to the results folder. AUCs and AUPRs were calculated with algorithms/SPRINT/results/calculate_scores.py.
All visualizations were made with the R scripts in visualizations. Plots were saved to visualizations/plots.