Home

Welcome to MOTI$\mathcal{VE}$!

MOTI$\mathcal{VE}$ is a drug-target interaction (DTI) graph dataset of compounds and genes.

All of the input and processed data files can be found in the Cell Painting Gallery.

Data

Both the input data and processed data files are provided. The inputs folder contains all the annotations collected with their original identifiers, along with features at a more granular level (well-based). The data folder contains the processed features, the standardized annotations, and the data splits ready to be used in the DTI prediction task.

Input data

The input data includes:

CellProfiler features (well level)

~800K compound-gene annotations of the form:

source	target	rel_type	source_id	database	inchikey
DB12026	SLCO1B3	DRUG_TRANSPORTER	drugbank	biokg	MZBLZLWXUBZHSL-FZNJKFJKSA-N
CHEMBL126075	ATAD5	unknown	chembl	dgidb	QVZCXCJXTMIDME-UHFFFAOYSA-N
5481350	CYP1A2	target	pubchem	drugrep	JFVZFKDSXNQEJW-CQSZACIVSA-N
DB08912	HSPB2	CuG	drugbank	hetionet	BFSMGDJOXZAERB-UHFFFAOYSA-N
10251	CCND1	DRUG_INHIBITION_GENE	pubchem	openbiolink	ZONYXWQDUYMKFB-UHFFFAOYSA-N
DB00643	STK25	UPREGULATES_CHuG	drugbank	pharmebinet	OPXLLQIJSORQAM-UHFFFAOYSA-N
DB11796	ABCG2	transporter	drugbank	primekg	SWMDAPWAQQTBOG-UHFFFAOYSA-N

~3M gene-gene annotations of the form:

target_a	target_b	rel_type	database
LHX8	ABI2	PPI	biokg
UBC	PUS1	interacts	hetionet
IFNA10	IFNA21	GENE_CATALYSIS_GENE	openbiolink
CSNK2A1	CDC37	INTERACTS_GiG	pharmebinet
SAR1A	STX8	ppi	primekg

~10M compound-compound annotations of the form:

source_a	source_b	rel_type	source_id	database	inchikey_a	inchikey_b
DB01531	DB01587	DDI	drugbank	biokg	LNNWVNGFPYWNQE-GMIGKAJZSA-N	PWAJCNITSBZRBL-UHFFFAOYSA-N
DB01333	DB01060	resembles	drugbank	hetionet	RDLPVSKMFDYCOR-UEKVPHQBSA-N	LSQZJLSUYDQPKJ-NJBDSQKTSA-N
DB11609	DB12554	INTERACTS_CiC	drugbank	pharmebinet	WCJFBSYALHQBSK-UHFFFAOYSA-N	VYVKHNNGDFVQGA-UHFFFAOYSA-N
DB00532	DB11609	synergistic interaction	drugbank	primekg	GMHKMTDVRCWUDX-UHFFFAOYSA-N	WCJFBSYALHQBSK-UHFFFAOYSA-N

Files origin

The inputs folder structure is as follows:

inputs
├── annotations
│   ├── compound_compound.parquet
│   ├── compound_gene.parquet
│   └── gene_gene.parquet
├── compound
│   ├── features.parquet
│   ├── image_locs.parquet
│   └── meta.csv.gz
├── crispr
│   ├── features.parquet
│   ├── image_locs.parquet
│   └── meta.csv.gz
└── orf
    ├── features.parquet
    ├── image_locs.parquet
    └── meta.csv.gz

The features and meta files were obtained from the following sources:

File	Origin
`compound/features.parquet`	Cell Painting Gallery
`compound/meta.csv.gz`	GitHub
`crispr/features.parquet`	Cell Painting Gallery
`crispr/meta.csv.gz`	GitHub
`orf/features.parquet`	Cell Painting Gallery
`orf/meta.csv.gz`	GitHub

The */image_locs.parquet and annotations/* files were produced by this project.

Processed data

The processed data includes:

Cell Painting feature vectors representing genes and compounds.
Gene-gene, compound-compound, and compound-gene annotations collected from seven databases.
Random, cold-source, and cold-target data splits.
Cell Painting image catalog (i.e., the list of images used to compute the features).

The file structure of the processed data folder matches the output of the Snakemake pipeline, which downloads, preprocesses, and merges the collected annotations and Cell Painting profiles into four graph datasets, organized as shown below.

Each split folder follows the format f"{graph_type}/{gene_repr}/{data_split}/".

data
├── bipartite
│   ├── crispr
│   │   ├── random
│   │   │   └── s_t_labels.parquet
│   │   ├── source
│   │   │   └── s_t_labels.parquet
│   │   ├── target
│   │   │   └── s_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   └── target.parquet
│   └── orf
│       ├── random
│       │   └── s_t_labels.parquet
│       ├── source
│       │   └── s_t_labels.parquet
│       ├── target
│       │   └── s_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       └── target.parquet
├── s_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_s_labels.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   └── target.parquet
│   └── orf
│       ├── random
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── source
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── target
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_s_labels.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       └── target.parquet
├── st_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_s_labels.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   ├── target.parquet
│   │   └── t_t_labels.parquet
│   └── orf
│       ├── random
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── target
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_s_labels.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       ├── target.parquet
│       └── t_t_labels.parquet
├── t_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   ├── target.parquet
│   │   └── t_t_labels.parquet
│   └── orf
│       ├── random
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── target
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       ├── target.parquet
│       └── t_t_labels.parquet
├── all_source.parquet
├── all_s_s_labels.parquet
├── all_s_t_labels.parquet
├── crispr_all_target.parquet
├── crispr_all_t_t_labels.parquet
├── orf_all_target.parquet
└── orf_all_t_t_labels.parquet
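
The layout above can be enumerated programmatically. The following is a minimal sketch, not part of the MOTI$\mathcal{VE}$ code: the helper name `split_edge_paths` and the `data` root are assumptions, and the edge files listed per graph type are read off the tree above.

```python
from itertools import product
from pathlib import PurePosixPath

# Edge files available for each graph type, as listed in the directory tree.
EDGE_FILES = {
    "bipartite": ["s_t_labels.parquet"],
    "s_expanded": ["s_s_labels.parquet", "s_t_labels.parquet"],
    "t_expanded": ["s_t_labels.parquet", "t_t_labels.parquet"],
    "st_expanded": ["s_s_labels.parquet", "s_t_labels.parquet", "t_t_labels.parquet"],
}
GENE_REPRS = ["crispr", "orf"]
DATA_SPLITS = ["random", "source", "target"]


def split_edge_paths(root="data"):
    """Hypothetical helper: list every pre-split edge file under `root`."""
    paths = []
    for graph_type, gene_repr, data_split in product(EDGE_FILES, GENE_REPRS, DATA_SPLITS):
        for name in EDGE_FILES[graph_type]:
            # Follows the f"{graph_type}/{gene_repr}/{data_split}/" pattern.
            paths.append(PurePosixPath(root) / graph_type / gene_repr / data_split / name)
    return paths


paths = split_edge_paths()
print(len(paths))  # 48 pre-split edge files in total
print(paths[0])    # data/bipartite/crispr/random/s_t_labels.parquet
```

The unsplit edge files and the source/target feature and map files sit one level up, directly under f"{graph_type}/{gene_repr}/".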

The files all_source.parquet, crispr_all_target.parquet, and orf_all_target.parquet provide the preprocessed Cell Painting features for the compounds, CRISPR genes, and ORF genes, respectively. Gene features are split into two files because the JUMP Cell Painting dataset includes two types of gene perturbation, each with its own set of perturbed genes. These sets of profiles have not been merged with the collected annotations. Example: all_source.parquet.

Metadata_InChIKey	X_1	X_2	X_3	X_4	X_5
AAAHWCWPZPSPIW	-0.670202	-0.294126	0.13568	0.361892	-0.129554
AAAJHRMBUHXWLD	-0.252772	-0.40035	0.477609	0.119673	0.826707
AAANUZMCJQUYNX	-0.647198	0.772873	-0.748882	0.615011	-1.11449
AAAQFGUYHFJNHI	0.609696	-0.628469	0.18629	-0.051378	0.0620746
AAAROXVLYNJINN	-0.433789	-0.156615	-0.246133	-0.175687	-0.352545

The files all_s_t_labels.parquet, all_s_s_labels.parquet, crispr_all_t_t_labels.parquet, and orf_all_t_t_labels.parquet give the full set of compound and gene annotations that we collected from the seven knowledge graphs (KGs) and databases, before they are merged with the Cell Painting compounds and genes. Metadata_InChIKey uniquely identifies each compound and Metadata_Symbol uniquely identifies each gene. The database column indicates where the interaction was sourced from; note that if an interaction appears in multiple sources, only the first instance is kept. Example: all_s_t_labels.parquet.

	source	Metadata_Symbol	rel_type	source_id	database	Metadata_InChIKey
0	DB01776	INS	DRUG_TARGET	drugbank	biokg	RLSSMJSEOOYNOY
1	DB02379	LCTL	DRUG_TARGET	drugbank	biokg	WQZGKKKJIJFFOK
2	DB07812	AKT2	DRUG_TARGET	drugbank	biokg	TWYNGDRSMHRPSY
3	DB00142	PSAT1	DRUG_TARGET	drugbank	biokg	WHUUTDBJXJRKMK
4	DB13919	AGTR1	DRUG_TARGET	drugbank	biokg	HTQMVQVXFRQIKW
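
The keep-first rule described above can be illustrated with a small sketch. This is not the pipeline's code: the rows are made up, the helper `keep_first` is hypothetical, and the assumption that a duplicate is defined by the (Metadata_InChIKey, Metadata_Symbol) pair alone is ours.

```python
# Rows as (Metadata_InChIKey, Metadata_Symbol, database).
# The values are invented for illustration only.
rows = [
    ("AAAAAAAAAAAAAA", "GENE1", "biokg"),
    ("AAAAAAAAAAAAAA", "GENE1", "hetionet"),  # same pair reported by a second source
    ("BBBBBBBBBBBBBB", "GENE2", "primekg"),
]


def keep_first(rows):
    """Hypothetical helper: keep only the first row seen for each compound-gene pair."""
    seen = set()
    kept = []
    for inchikey, symbol, database in rows:
        key = (inchikey, symbol)  # assumption: duplicates are identified by the pair alone
        if key not in seen:
            seen.add(key)
            kept.append((inchikey, symbol, database))
    return kept


deduped = keep_first(rows)
print(deduped)
# [('AAAAAAAAAAAAAA', 'GENE1', 'biokg'), ('BBBBBBBBBBBBBB', 'GENE2', 'primekg')]
```

Under this rule, the database value that survives depends on the order in which the sources are processed, so the column is best read as one provenance for the interaction rather than the complete list.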

The folders bipartite (source-target), s_expanded (source-target and source-source), t_expanded (source-target and target-target), and st_expanded (source-target, source-source, and target-target) store the files for each graph type. st_expanded is the most complete graph and includes all of the data in MOTI$\mathcal{VE}$; the others are provided for ablation studies or other applications. Each graph type folder also includes two subfolders, crispr and orf, which indicate the gene perturbation type.

Within these folders, the files source.parquet and target.parquet provide the raw features for each source (compound, indexed by Metadata_InChIKey) and target (gene, indexed by Metadata_Symbol). Additionally, the files source_map.parquet and target_map.parquet map each InChIKey or Gene Symbol to its node index in the graph, as all nodes are referred to by only their index numbers in the edge files. Example: source_map.parquet

Metadata_InChIKey	0
AAAQFGUYHFJNHI	0
AAFJXZWCNVJTMK	1
AAKJLRGGTJKAMG	2
AAQOQKQBGPPFNS	3
ABACVOXFUHDKNZ	4
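
Because edge files refer to nodes by index only, the map usually needs to be inverted to recover identifiers. A minimal sketch using the rows shown above; the edge indices in `edge_sources` are hypothetical:

```python
# First rows of the source_map.parquet example: InChIKey -> node index.
source_map = {
    "AAAQFGUYHFJNHI": 0,
    "AAFJXZWCNVJTMK": 1,
    "AAKJLRGGTJKAMG": 2,
    "AAQOQKQBGPPFNS": 3,
    "ABACVOXFUHDKNZ": 4,
}

# Invert the map so node indices can be translated back to InChIKeys.
index_to_inchikey = {index: inchikey for inchikey, index in source_map.items()}

# Hypothetical source node indices, as they might appear in an edge file.
edge_sources = [4, 0, 2]
edge_inchikeys = [index_to_inchikey[i] for i in edge_sources]
print(edge_inchikeys)
# ['ABACVOXFUHDKNZ', 'AAAQFGUYHFJNHI', 'AAKJLRGGTJKAMG']
```

The same approach applies to target_map.parquet, with Metadata_Symbol in place of Metadata_InChIKey.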

We also provide the parquet files for each relevant edge type in these folders, as well as their pre-split versions for the random, source, and target data splits. The splits are provided so that our results can be reproduced and so that other work can be compared on the same partitions. Example: source/s_t_labels.parquet.

	target	subset
0	489	train
1	3018	message
2	7075	valid
3	10158	message
4	9883	test
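
Edges can then be grouped by the subset column. A minimal sketch over the five example rows above, using plain Python containers rather than any particular dataframe library:

```python
from collections import Counter

# (target, subset) pairs copied from the example rows above.
edges = [
    (489, "train"),
    (3018, "message"),
    (7075, "valid"),
    (10158, "message"),
    (9883, "test"),
]

# Group target node indices by the subset they were assigned to.
by_subset = {}
for target, subset in edges:
    by_subset.setdefault(subset, []).append(target)

# Count how many edges fall into each subset.
counts = Counter(subset for _, subset in edges)

print(by_subset["message"])  # [3018, 10158]
print(counts["train"], counts["valid"], counts["test"])  # 1 1 1
```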

All of the code to generate the processed files is accessible in the MOTI$\mathcal{VE}$ repository.
