MOTI
All of the input and processed data files can be found in the Cell Painting Gallery.
Both the input data and processed data files are provided. The inputs folder contains all the annotations collected with their original identifiers, along with features at a more granular level (well-based). The data folder contains the processed features, the standardized annotations, and the data splits ready to be used in the DTI prediction task.
The input data includes:
- CellProfiler features (well level)
- ~800K compound-gene annotations of the form:
source | target | rel_type | source_id | database | inchikey |
---|---|---|---|---|---|
DB12026 | SLCO1B3 | DRUG_TRANSPORTER | drugbank | biokg | MZBLZLWXUBZHSL-FZNJKFJKSA-N |
CHEMBL126075 | ATAD5 | unknown | chembl | dgidb | QVZCXCJXTMIDME-UHFFFAOYSA-N |
5481350 | CYP1A2 | target | pubchem | drugrep | JFVZFKDSXNQEJW-CQSZACIVSA-N |
DB08912 | HSPB2 | CuG | drugbank | hetionet | BFSMGDJOXZAERB-UHFFFAOYSA-N |
10251 | CCND1 | DRUG_INHIBITION_GENE | pubchem | openbiolink | ZONYXWQDUYMKFB-UHFFFAOYSA-N |
DB00643 | STK25 | UPREGULATES_CHuG | drugbank | pharmebinet | OPXLLQIJSORQAM-UHFFFAOYSA-N |
DB11796 | ABCG2 | transporter | drugbank | primekg | SWMDAPWAQQTBOG-UHFFFAOYSA-N |
- ~3M gene-gene annotations of the form:
target_a | target_b | rel_type | database |
---|---|---|---|
LHX8 | ABI2 | PPI | biokg |
UBC | PUS1 | interacts | hetionet |
IFNA10 | IFNA21 | GENE_CATALYSIS_GENE | openbiolink |
CSNK2A1 | CDC37 | INTERACTS_GiG | pharmebinet |
SAR1A | STX8 | ppi | primekg |
- ~10M compound-compound annotations of the form:
source_a | source_b | rel_type | source_id | database | inchikey_a | inchikey_b |
---|---|---|---|---|---|---|
DB01531 | DB01587 | DDI | drugbank | biokg | LNNWVNGFPYWNQE-GMIGKAJZSA-N | PWAJCNITSBZRBL-UHFFFAOYSA-N |
DB01333 | DB01060 | resembles | drugbank | hetionet | RDLPVSKMFDYCOR-UEKVPHQBSA-N | LSQZJLSUYDQPKJ-NJBDSQKTSA-N |
DB11609 | DB12554 | INTERACTS_CiC | drugbank | pharmebinet | WCJFBSYALHQBSK-UHFFFAOYSA-N | VYVKHNNGDFVQGA-UHFFFAOYSA-N |
DB00532 | DB11609 | synergistic interaction | drugbank | primekg | GMHKMTDVRCWUDX-UHFFFAOYSA-N | WCJFBSYALHQBSK-UHFFFAOYSA-N |
The `inputs` folder structure is as follows:
inputs
├── annotations
│   ├── compound_compound.parquet
│   ├── compound_gene.parquet
│   └── gene_gene.parquet
├── compound
│   ├── features.parquet
│   ├── image_locs.parquet
│   └── meta.csv.gz
├── crispr
│   ├── features.parquet
│   ├── image_locs.parquet
│   └── meta.csv.gz
└── orf
    ├── features.parquet
    ├── image_locs.parquet
    └── meta.csv.gz
The `features` and `meta` files were obtained from the following sources:
File | Origin |
---|---|
compound/features.parquet | Cell Painting Gallery |
compound/meta.csv.gz | GitHub |
crispr/features.parquet | Cell Painting Gallery |
crispr/meta.csv.gz | GitHub |
orf/features.parquet | Cell Painting Gallery |
orf/meta.csv.gz | GitHub |
The `*/image_locs.parquet` and `annotations/*` files were produced by this project.
The processed data includes:
- Cell Painting feature vectors representing genes and compounds.
- Gene-gene, compound-compound, and compound-gene annotations collected from seven databases.
- Random, cold-source, and cold-target data splits.
- A catalog of Cell Painting images (i.e., the list of images used to compute the features).
The file structure in the processed data folder matches the output of the Snakemake pipeline, which downloads, preprocesses, and merges the collected annotations and Cell Painting profiles into four graph datasets. Split-specific files are organized under paths of the form `{graph_type}/{gene_repr}/{data_split}/`:
data
├── bipartite
│   ├── crispr
│   │   ├── random
│   │   │   └── s_t_labels.parquet
│   │   ├── source
│   │   │   └── s_t_labels.parquet
│   │   ├── target
│   │   │   └── s_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   └── target.parquet
│   └── orf
│       ├── random
│       │   └── s_t_labels.parquet
│       ├── source
│       │   └── s_t_labels.parquet
│       ├── target
│       │   └── s_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       └── target.parquet
├── s_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_s_labels.parquet
│   │   │   └── s_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_s_labels.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   └── target.parquet
│   └── orf
│       ├── random
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── source
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── target
│       │   ├── s_s_labels.parquet
│       │   └── s_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_s_labels.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       └── target.parquet
├── st_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_s_labels.parquet
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_s_labels.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   ├── target.parquet
│   │   └── t_t_labels.parquet
│   └── orf
│       ├── random
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── target
│       │   ├── s_s_labels.parquet
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_s_labels.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       ├── target.parquet
│       └── t_t_labels.parquet
├── t_expanded
│   ├── crispr
│   │   ├── random
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── target
│   │   │   ├── s_t_labels.parquet
│   │   │   └── t_t_labels.parquet
│   │   ├── source_map.parquet
│   │   ├── source.parquet
│   │   ├── s_t_labels.parquet
│   │   ├── target_map.parquet
│   │   ├── target.parquet
│   │   └── t_t_labels.parquet
│   └── orf
│       ├── random
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── target
│       │   ├── s_t_labels.parquet
│       │   └── t_t_labels.parquet
│       ├── source_map.parquet
│       ├── source.parquet
│       ├── s_t_labels.parquet
│       ├── target_map.parquet
│       ├── target.parquet
│       └── t_t_labels.parquet
├── all_source.parquet
├── all_s_s_labels.parquet
├── all_s_t_labels.parquet
├── crispr_all_target.parquet
├── crispr_all_t_t_labels.parquet
├── orf_all_target.parquet
└── orf_all_t_t_labels.parquet
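The path convention in the tree above can be sketched in a few lines of Python. This is only an illustration: the helper names `split_label_paths` and `all_split_label_paths` and the default `root="data"` are assumptions, not part of the MOTI code. The label files listed per graph type are read directly off the tree.

```python
from pathlib import PurePosixPath

# Label files present for each graph type, read off the directory tree above.
LABEL_FILES = {
    "bipartite": ["s_t_labels.parquet"],
    "s_expanded": ["s_s_labels.parquet", "s_t_labels.parquet"],
    "t_expanded": ["s_t_labels.parquet", "t_t_labels.parquet"],
    "st_expanded": ["s_s_labels.parquet", "s_t_labels.parquet", "t_t_labels.parquet"],
}
GENE_REPRS = ("crispr", "orf")
DATA_SPLITS = ("random", "source", "target")


def split_label_paths(graph_type, gene_repr, data_split, root="data"):
    """Expected label-file paths for one pre-split graph (hypothetical helper)."""
    if graph_type not in LABEL_FILES:
        raise ValueError(f"unknown graph type: {graph_type!r}")
    if gene_repr not in GENE_REPRS:
        raise ValueError(f"unknown gene representation: {gene_repr!r}")
    if data_split not in DATA_SPLITS:
        raise ValueError(f"unknown data split: {data_split!r}")
    base = PurePosixPath(root) / graph_type / gene_repr / data_split
    return [str(base / name) for name in LABEL_FILES[graph_type]]


def all_split_label_paths(root="data"):
    """Every pre-split label file implied by the tree (hypothetical helper)."""
    return [
        path
        for graph_type in LABEL_FILES
        for gene_repr in GENE_REPRS
        for data_split in DATA_SPLITS
        for path in split_label_paths(graph_type, gene_repr, data_split, root)
    ]
```

For example, `split_label_paths("s_expanded", "orf", "target")` yields the two label files under `data/s_expanded/orf/target/`.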
The files `all_source.parquet`, `crispr_all_target.parquet`, and `orf_all_target.parquet` provide the preprocessed Cell Painting features for the compounds, CRISPR genes, and ORF genes, respectively. Gene features come in two files because the JUMP Cell Painting dataset includes two types of gene perturbation (CRISPR and ORF), each with its own set of perturbed genes. These profiles have not yet been merged with the collected annotations. Example: `all_source.parquet`.
Metadata_InChIKey | X_1 | X_2 | X_3 | X_4 | X_5 |
---|---|---|---|---|---|
AAAHWCWPZPSPIW | -0.670202 | -0.294126 | 0.13568 | 0.361892 | -0.129554 |
AAAJHRMBUHXWLD | -0.252772 | -0.40035 | 0.477609 | 0.119673 | 0.826707 |
AAANUZMCJQUYNX | -0.647198 | 0.772873 | -0.748882 | 0.615011 | -1.11449 |
AAAQFGUYHFJNHI | 0.609696 | -0.628469 | 0.18629 | -0.051378 | 0.0620746 |
AAAROXVLYNJINN | -0.433789 | -0.156615 | -0.246133 | -0.175687 | -0.352545 |
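The `Metadata_InChIKey` values in the example table are 14 characters long, which matches the first block of a standard 27-character InChIKey such as those in the input annotations (for example `MZBLZLWXUBZHSL-FZNJKFJKSA-N`). The following is a minimal sketch of that shortening, assuming the processed identifiers are simply the first block. The helper name `connectivity_block` is hypothetical, and the pipeline's actual standardization step may differ.

```python
def connectivity_block(inchikey):
    """Return the first 14-character block of a standard InChIKey.

    Assumption: processed identifiers equal this first block; this is
    inferred from the example tables, not taken from the pipeline code.
    """
    blocks = inchikey.split("-")
    if len(blocks) != 3 or len(blocks[0]) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return blocks[0]


# Full key copied from the compound-gene input example.
short_key = connectivity_block("MZBLZLWXUBZHSL-FZNJKFJKSA-N")
```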
The files `all_s_t_labels.parquet`, `all_s_s_labels.parquet`, `crispr_all_t_t_labels.parquet`, and `orf_all_t_t_labels.parquet` give the full set of compound and gene annotations collected from the seven knowledge graphs (KGs) and databases, before they are merged with the Cell Painting compounds and genes. `Metadata_InChIKey` uniquely identifies each compound and `Metadata_Symbol` uniquely identifies each gene. The `database` column indicates where the interaction was sourced from; if an interaction appears in multiple sources, only the first instance is kept. Example: `all_s_t_labels.parquet`.
| | source | Metadata_Symbol | rel_type | source_id | database | Metadata_InChIKey |
---|---|---|---|---|---|---|
0 | DB01776 | INS | DRUG_TARGET | drugbank | biokg | RLSSMJSEOOYNOY |
1 | DB02379 | LCTL | DRUG_TARGET | drugbank | biokg | WQZGKKKJIJFFOK |
2 | DB07812 | AKT2 | DRUG_TARGET | drugbank | biokg | TWYNGDRSMHRPSY |
3 | DB00142 | PSAT1 | DRUG_TARGET | drugbank | biokg | WHUUTDBJXJRKMK |
4 | DB13919 | AGTR1 | DRUG_TARGET | drugbank | biokg | HTQMVQVXFRQIKW |
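The "only the first instance is kept" rule can be illustrated with a short sketch. It assumes that two rows describe the same interaction when they share the same `Metadata_InChIKey` and `Metadata_Symbol`; the exact key used by the pipeline is not stated here, so that choice, the helper name `keep_first_per_pair`, and the duplicate row in the sample data are all illustrative assumptions.

```python
def keep_first_per_pair(rows, key_fields=("Metadata_InChIKey", "Metadata_Symbol")):
    """Keep the first row seen for each compound-gene pair (hypothetical helper).

    Assumption: an interaction is identified by the fields in `key_fields`.
    """
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Illustrative rows: the second one is a made-up duplicate from another source.
rows = [
    {"Metadata_InChIKey": "RLSSMJSEOOYNOY", "Metadata_Symbol": "INS", "database": "biokg"},
    {"Metadata_InChIKey": "RLSSMJSEOOYNOY", "Metadata_Symbol": "INS", "database": "primekg"},
    {"Metadata_InChIKey": "TWYNGDRSMHRPSY", "Metadata_Symbol": "AKT2", "database": "biokg"},
]
deduplicated = keep_first_per_pair(rows)
```

With the files loaded into pandas, `DataFrame.drop_duplicates(subset=[...], keep="first")` expresses the same idea.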
The folders `bipartite` (source-target), `s_expanded` (source-target and source-source), `t_expanded` (source-target and target-target), and `st_expanded` (source-target, source-source, and target-target) store the files for each graph type. `st_expanded` is the most complete graph and includes all of the data in MOTI. Each graph-type folder contains the subfolders `crispr` and `orf`, which give the gene perturbation type.
Within these folders, the files `source.parquet` and `target.parquet` provide the raw features for each source (compound, indexed by `Metadata_InChIKey`) and target (gene, indexed by `Metadata_Symbol`). Additionally, the files `source_map.parquet` and `target_map.parquet` map each InChIKey or gene symbol to its node index in the graph, as all nodes are referred to only by their index numbers in the edge files. Example: `source_map.parquet`.
Metadata_InChIKey | 0 |
---|---|
AAAQFGUYHFJNHI | 0 |
AAFJXZWCNVJTMK | 1 |
AAKJLRGGTJKAMG | 2 |
AAQOQKQBGPPFNS | 3 |
ABACVOXFUHDKNZ | 4 |
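Because the edge files refer to nodes only by index, reading an edge usually means inverting these maps. The sketch below works on plain dictionaries. The source entries are copied from the example above, while the target entries and the edge list are made up for illustration, and the helper names are hypothetical. In practice the dictionaries would be built from `source_map.parquet` and `target_map.parquet`, for instance after loading them with `pandas.read_parquet`.

```python
def invert_node_map(node_map):
    """Turn an identifier -> index mapping into index -> identifier."""
    index_to_id = {index: key for key, index in node_map.items()}
    if len(index_to_id) != len(node_map):
        raise ValueError("node indices are not unique")
    return index_to_id


def decode_edges(edges, source_map, target_map):
    """Replace (source index, target index) pairs with (InChIKey, symbol) pairs."""
    index_to_source = invert_node_map(source_map)
    index_to_target = invert_node_map(target_map)
    return [(index_to_source[s], index_to_target[t]) for s, t in edges]


# Source entries copied from the source_map.parquet example.
source_map = {"AAAQFGUYHFJNHI": 0, "AAFJXZWCNVJTMK": 1, "AAKJLRGGTJKAMG": 2}
# Target entries and edges below are illustrative only.
target_map = {"INS": 0, "AKT2": 1}
edges = [(0, 1), (2, 0)]
decoded = decode_edges(edges, source_map, target_map)
```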
We also provide a parquet file for each relevant edge type in these folders, as well as pre-split versions for the random, source, and target data splits. The splits are provided so that results can be reproduced and compared consistently across studies. Example: `source/s_t_labels.parquet`.
| | source | target | subset |
---|---|---|---|
0 | 0 | 489 | train |
1 | 0 | 3018 | message |
2 | 0 | 7075 | valid |
3 | 0 | 10158 | message |
4 | 0 | 9883 | test |
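A pre-split label file can be partitioned by its `subset` column before training or evaluation. The following minimal sketch uses the five example rows above as plain dictionaries. The helper name `edges_by_subset` is hypothetical, and the only subset names assumed are the ones visible in the example (`train`, `message`, `valid`, `test`).

```python
from collections import defaultdict


def edges_by_subset(rows):
    """Group (source, target) index pairs by their `subset` label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["subset"]].append((row["source"], row["target"]))
    return dict(groups)


# Rows copied from the source/s_t_labels.parquet example.
rows = [
    {"source": 0, "target": 489, "subset": "train"},
    {"source": 0, "target": 3018, "subset": "message"},
    {"source": 0, "target": 7075, "subset": "valid"},
    {"source": 0, "target": 10158, "subset": "message"},
    {"source": 0, "target": 9883, "subset": "test"},
]
groups = edges_by_subset(rows)
```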
All of the code to generate the processed files is accessible in the MOTI