John Arevalo edited this page Jul 10, 2024 · 2 revisions

Welcome to MOTI$\mathcal{VE}$!

MOTI$\mathcal{VE}$ is a drug-target interaction (DTI) graph dataset of compounds and genes.

All of the input and processed data files can be found in the Cell Painting Gallery.

Data

Both the input data and processed data files are provided. The inputs folder contains all the annotations collected with their original identifiers, along with features at a more granular level (well-based). The data folder contains the processed features, the standardized annotations, and the data splits ready to be used in the DTI prediction task.

Input data

The input data includes:

  • CellProfiler features (well level)
  • ~800K compound-gene annotations of the form:
    source target rel_type source_id database inchikey
    DB12026 SLCO1B3 DRUG_TRANSPORTER drugbank biokg MZBLZLWXUBZHSL-FZNJKFJKSA-N
    CHEMBL126075 ATAD5 unknown chembl dgidb QVZCXCJXTMIDME-UHFFFAOYSA-N
    5481350 CYP1A2 target pubchem drugrep JFVZFKDSXNQEJW-CQSZACIVSA-N
    DB08912 HSPB2 CuG drugbank hetionet BFSMGDJOXZAERB-UHFFFAOYSA-N
    10251 CCND1 DRUG_INHIBITION_GENE pubchem openbiolink ZONYXWQDUYMKFB-UHFFFAOYSA-N
    DB00643 STK25 UPREGULATES_CHuG drugbank pharmebinet OPXLLQIJSORQAM-UHFFFAOYSA-N
    DB11796 ABCG2 transporter drugbank primekg SWMDAPWAQQTBOG-UHFFFAOYSA-N
  • ~3M gene-gene annotations of the form:
    target_a target_b rel_type database
    LHX8 ABI2 PPI biokg
    UBC PUS1 interacts hetionet
    IFNA10 IFNA21 GENE_CATALYSIS_GENE openbiolink
    CSNK2A1 CDC37 INTERACTS_GiG pharmebinet
    SAR1A STX8 ppi primekg
  • ~10M compound-compound annotations of the form:
    source_a source_b rel_type source_id database inchikey_a inchikey_b
    DB01531 DB01587 DDI drugbank biokg LNNWVNGFPYWNQE-GMIGKAJZSA-N PWAJCNITSBZRBL-UHFFFAOYSA-N
    DB01333 DB01060 resembles drugbank hetionet RDLPVSKMFDYCOR-UEKVPHQBSA-N LSQZJLSUYDQPKJ-NJBDSQKTSA-N
    DB11609 DB12554 INTERACTS_CiC drugbank pharmebinet WCJFBSYALHQBSK-UHFFFAOYSA-N VYVKHNNGDFVQGA-UHFFFAOYSA-N
    DB00532 DB11609 synergistic interaction drugbank primekg GMHKMTDVRCWUDX-UHFFFAOYSA-N WCJFBSYALHQBSK-UHFFFAOYSA-N

Files origin

The inputs folder structure is as follows:

inputs
├── annotations
│   ├── compound_compound.parquet
│   ├── compound_gene.parquet
│   └── gene_gene.parquet
├── compound
│   ├── features.parquet
│   ├── image_locs.parquet
│   └── meta.csv.gz
├── crispr
│   ├── features.parquet
│   ├── image_locs.parquet
│   └── meta.csv.gz
└── orf
    ├── features.parquet
    ├── image_locs.parquet
    └── meta.csv.gz

The features and meta files were obtained from the following sources:

File Origin
compound/features.parquet Cell painting gallery
compound/meta.csv.gz GitHub
crispr/features.parquet Cell painting gallery
crispr/meta.csv.gz GitHub
orf/features.parquet Cell painting gallery
orf/meta.csv.gz GitHub

The */image_locs.parquet and annotations/* files were produced by this project.

Processed data

The processed data includes:

  • Cell Painting feature vectors representing genes and compounds.
  • gene-gene, compound-compound, and compound-gene annotations collected from seven databases.
  • random, cold-source and cold-target data splits.
  • A catalog of Cell Painting images (i.e., the list of images used to compute the features).

The file structure in the processed data folder matches the output of the snakemake pipeline, which downloads, preprocesses, and merges the collected annotations and Cell Painting profiles into 4 graph datasets with the following organization.

Folder format: f"{graph_type}/{gene_repr}/{data_split}/"

data
├── bipartite
│   ├── crispr
│   │   ├── random
│   │   │   └── s_t_labels.parquet
│   │   ├── source
│   │   │   └── s_t_labels.parquet
│   │   ├── target
│   │   │   └── s_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   └── target.parquet
│   └── orf
│       ├── random
│       │   └── s_t_labels.parquet
│       ├── source
│       │   └── s_t_labels.parquet
│       ├── target
│       │   └── s_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       └── target.parquet
├── s_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_s_labels.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   └── target.parquet
│   └── orf
│       ├── random
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── source
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── target
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_s_labels.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       └── target.parquet
├── st_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_s_labels.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   ├── target.parquet
│   │   └── t_t_labels.parquet
│   └── orf
│       ├── random
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── target
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_s_labels.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       ├── target.parquet
│       └── t_t_labels.parquet
├── t_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   ├── target.parquet
│   │   └── t_t_labels.parquet
│   └── orf
│       ├── random
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── target
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       ├── target.parquet
│       └── t_t_labels.parquet
├── all_source.parquet
├── all_s_s_labels.parquet
├── all_s_t_labels.parquet
├── crispr_all_target.parquet
├── crispr_all_t_t_labels.parquet
├── orf_all_target.parquet
└── orf_all_t_t_labels.parquet
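
The folder layout above can be turned into a small path helper. The following is a minimal sketch, not part of the MOTI$\mathcal{VE}$ code base: the function names `split_dir` and `split_edge_files` are hypothetical, and the root folder name `data` is taken from the tree above. The table of edge files per graph type simply restates what the tree lists.

```python
from pathlib import PurePosixPath

# Folder names as they appear in the tree above.
GRAPH_TYPES = ("bipartite", "s_expanded", "t_expanded", "st_expanded")
GENE_REPRS = ("crispr", "orf")
DATA_SPLITS = ("random", "source", "target")

# Edge files listed in the tree for each graph type.
EDGE_FILES = {
    "bipartite": ("s_t_labels.parquet",),
    "s_expanded": ("s_s_labels.parquet", "s_t_labels.parquet"),
    "t_expanded": ("s_t_labels.parquet", "t_t_labels.parquet"),
    "st_expanded": (
        "s_s_labels.parquet",
        "s_t_labels.parquet",
        "t_t_labels.parquet",
    ),
}


def split_dir(graph_type, gene_repr, data_split, root="data"):
    """Build {root}/{graph_type}/{gene_repr}/{data_split} (hypothetical helper)."""
    if graph_type not in GRAPH_TYPES:
        raise ValueError(f"unknown graph_type: {graph_type!r}")
    if gene_repr not in GENE_REPRS:
        raise ValueError(f"unknown gene_repr: {gene_repr!r}")
    if data_split not in DATA_SPLITS:
        raise ValueError(f"unknown data_split: {data_split!r}")
    return PurePosixPath(root) / graph_type / gene_repr / data_split


def split_edge_files(graph_type, gene_repr, data_split, root="data"):
    """List the pre-split edge files expected in one split folder."""
    base = split_dir(graph_type, gene_repr, data_split, root)
    return [base / name for name in EDGE_FILES[graph_type]]


if __name__ == "__main__":
    for path in split_edge_files("st_expanded", "orf", "target"):
        print(path)
```

Passing a different `root` (for example, a local download location) leaves the rest of the path unchanged.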

The files all_source.parquet, crispr_all_target.parquet, and orf_all_target.parquet provide the preprocessed Cell Painting features for the compounds, CRISPR genes, and ORF genes, respectively. Genes have two files because the JUMP Cell Painting dataset includes two types of gene edits, each with its own set of perturbed genes. These profiles have not been merged with the collected annotations. Example: all_source.parquet.

Metadata_InChIKey X_1 X_2 X_3 X_4 X_5
AAAHWCWPZPSPIW -0.670202 -0.294126 0.13568 0.361892 -0.129554
AAAJHRMBUHXWLD -0.252772 -0.40035 0.477609 0.119673 0.826707
AAANUZMCJQUYNX -0.647198 0.772873 -0.748882 0.615011 -1.11449
AAAQFGUYHFJNHI 0.609696 -0.628469 0.18629 -0.051378 0.0620746
AAAROXVLYNJINN -0.433789 -0.156615 -0.246133 -0.175687 -0.352545
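
A profile table like the one above is usually split into an identifier column and a numeric feature matrix before modeling. The sketch below assumes that Metadata_InChIKey is stored as a regular column and that feature columns share the `X_` prefix shown in the example; the helper name `split_features` is hypothetical. The toy frame copies the first two rows and three features from the example instead of reading the real file.

```python
import pandas as pd


def split_features(profiles, id_col="Metadata_InChIKey", prefix="X_"):
    """Separate the identifier column from the X_* feature columns.

    Assumes the identifier is a regular column, not the index.
    """
    feature_cols = [c for c in profiles.columns if str(c).startswith(prefix)]
    ids = profiles[id_col].reset_index(drop=True)
    matrix = profiles[feature_cols].to_numpy(dtype=float)
    return ids, matrix, feature_cols


# Toy rows copied from the example above (first two rows, three features).
toy = pd.DataFrame(
    {
        "Metadata_InChIKey": ["AAAHWCWPZPSPIW", "AAAJHRMBUHXWLD"],
        "X_1": [-0.670202, -0.252772],
        "X_2": [-0.294126, -0.40035],
        "X_3": [0.13568, 0.477609],
    }
)

ids, matrix, feature_cols = split_features(toy)
print(matrix.shape, feature_cols)
```

With the real file, the toy frame would be replaced by the result of `pd.read_parquet("data/all_source.parquet")`, which requires a parquet engine such as pyarrow to be installed.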

The files all_s_t_labels.parquet, all_s_s_labels.parquet, crispr_all_t_t_labels.parquet, and orf_all_t_t_labels.parquet give the full set of compound and gene annotations that we collected from the seven knowledge graphs (KGs) and databases, before merging them with the Cell Painting compounds and genes. Metadata_InChIKey uniquely identifies each compound and Metadata_Symbol uniquely identifies each gene. The database column indicates where the interaction was sourced from; if an interaction appears in multiple sources, only the first instance is kept. Example: all_s_t_labels.parquet.

source Metadata_Symbol rel_type source_id database Metadata_InChIKey
0 DB01776 INS DRUG_TARGET drugbank biokg RLSSMJSEOOYNOY
1 DB02379 LCTL DRUG_TARGET drugbank biokg WQZGKKKJIJFFOK
2 DB07812 AKT2 DRUG_TARGET drugbank biokg TWYNGDRSMHRPSY
3 DB00142 PSAT1 DRUG_TARGET drugbank biokg WHUUTDBJXJRKMK
4 DB13919 AGTR1 DRUG_TARGET drugbank biokg HTQMVQVXFRQIKW
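
The "only the first instance is kept" rule can be illustrated with pandas. This is a sketch under an assumption, not the project's pipeline code: it treats an interaction as the (Metadata_InChIKey, Metadata_Symbol) pair, which the text above does not state explicitly. The helper name `keep_first_source` is hypothetical, and the second toy row is a made-up duplicate added only to show the effect.

```python
import pandas as pd


def keep_first_source(
    annotations, key_cols=("Metadata_InChIKey", "Metadata_Symbol")
):
    """Keep the first row for each interaction key, dropping later duplicates.

    Assumption: an interaction is identified by the compound-gene pair.
    """
    deduped = annotations.drop_duplicates(subset=list(key_cols), keep="first")
    return deduped.reset_index(drop=True)


# Rows 0 and 2 follow the example above; row 1 is a hypothetical duplicate
# of row 0 reported by a second database.
toy = pd.DataFrame(
    {
        "Metadata_InChIKey": ["RLSSMJSEOOYNOY", "RLSSMJSEOOYNOY", "WQZGKKKJIJFFOK"],
        "Metadata_Symbol": ["INS", "INS", "LCTL"],
        "rel_type": ["DRUG_TARGET", "target", "DRUG_TARGET"],
        "database": ["biokg", "primekg", "biokg"],
    }
)

deduped = keep_first_source(toy)
print(deduped)
```

Because `keep="first"` depends on row order, which database a shared interaction is attributed to depends on the order in which the sources were concatenated.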

The folders bipartite (source-target), s_expanded (source-target and source-source), t_expanded (source-target and target-target), and st_expanded (source-target, source-source, and target-target) store the files for each graph type. st_expanded is the most complete graph and includes all of the data in MOTI$\mathcal{VE}$; the others are provided for ablation studies or other applications. Each graph type folder also includes two subfolders, crispr and orf, which indicate the gene perturbation type.

Within these folders, the files source.parquet and target.parquet provide the raw features for each source (compound, indexed by Metadata_InChIKey) and target (gene, indexed by Metadata_Symbol). Additionally, the files source_map.parquet and target_map.parquet map each InChIKey or gene symbol to its node index in the graph, because the edge files refer to nodes only by their index numbers. Example: source_map.parquet.

Metadata_InChIKey 0
AAAQFGUYHFJNHI 0
AAFJXZWCNVJTMK 1
AAKJLRGGTJKAMG 2
AAQOQKQBGPPFNS 3
ABACVOXFUHDKNZ 4
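
Since edge files refer to nodes by index only, the map files are needed to get back to identifiers. The sketch below shows one way to invert such a map with pandas. It assumes the map is available as a Series whose index holds the InChIKeys and whose values are node indices, which is one plausible reading of the example above; the helper name `invert_node_map` and the toy edges are hypothetical.

```python
import pandas as pd


def invert_node_map(node_map):
    """Turn an identifier -> node index Series into node index -> identifier."""
    return pd.Series(node_map.index, index=node_map.to_numpy())


# First three entries of the source_map example above.
source_map = pd.Series(
    [0, 1, 2],
    index=pd.Index(
        ["AAAQFGUYHFJNHI", "AAFJXZWCNVJTMK", "AAKJLRGGTJKAMG"],
        name="Metadata_InChIKey",
    ),
)

# Hypothetical edges that refer to sources by node index only.
edges = pd.DataFrame({"source": [0, 2], "target": [489, 3018]})

index_to_key = invert_node_map(source_map)
edges["source_inchikey"] = edges["source"].map(index_to_key)
print(edges)
```

If the map is loaded with `pd.read_parquet` and comes back as a one-column DataFrame, `frame.iloc[:, 0]` yields the Series this helper expects. The same approach would apply to target_map.parquet and gene symbols.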

We also provide the parquet files for each relevant edge type in these folders, as well as their pre-split versions for the random, source, and target data splits. The splits are provided so that our results and those of others can be reproduced and compared consistently. Example: source/s_t_labels.parquet.

source target subset
0 0 489 train
1 0 3018 message
2 0 7075 valid
3 0 10158 message
4 0 9883 test
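
A pre-split edge file like the one above can be partitioned by its subset column. The following is a minimal sketch with a hypothetical helper name, `split_by_subset`. It treats the subset values as opaque labels and makes no assumption about how each subset is used in training. The toy frame copies the five example rows.

```python
import pandas as pd


def split_by_subset(edges, subset_col="subset"):
    """Return one edge table per subset label, without the label column."""
    return {
        name: group.drop(columns=subset_col).reset_index(drop=True)
        for name, group in edges.groupby(subset_col)
    }


# The five rows shown in the example above.
toy_edges = pd.DataFrame(
    {
        "source": [0, 0, 0, 0, 0],
        "target": [489, 3018, 7075, 10158, 9883],
        "subset": ["train", "message", "valid", "message", "test"],
    }
)

parts = split_by_subset(toy_edges)
for name, table in parts.items():
    print(name, len(table))
```

Each returned table keeps the source and target node indices, so it can be joined with source_map.parquet and target_map.parquet if identifiers are needed.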

All of the code to generate the processed files is accessible in the MOTI$\mathcal{VE}$ repository.
