# P2 01. Testing the method on subsets of the LINCS dataset #12
### Experiment 0: test the method on 3 plates from LINCS (mixed doses)

I will start off with plates SQ00015231, SQ00015232, and SQ00015233. First, I will remove all low-dose (< 3.33 µM) wells from each plate. Then I will train the model as usual, using the highest-dose (10 µM) replicates for the contrastive learning objective. Finally, the model will be evaluated on MoA prediction using the highest-dose (10 µM) wells. The 3.33 µM wells will not be used during training/validation; they are reserved for the final testing phase.

I have now updated the scripts to work with the LINCS data, but I still need to filter the dose points before training; that is next on the list. I have already trained a model using the original setup but without filtering doses, which means that replicate compound profiles are forced to attract during training even if they were created at different doses.

**Main takeaway**

The replicate mAP was very high, but no improvement was found in MoA prediction (both were also evaluated using different dose points).
### Experiment 1: test the method on 3 plates from LINCS (only 10 µM dose)

I trained a model on plates SQ00015231, SQ00015232, and SQ00015233 using only the 10 µM dose point, this time forming replicate pairs between compounds across plates. The latter is new compared to previous experiments. Earlier, when experimenting on the Stain datasets, I found that training on across-plate replicates reduced the model's ability to generalize to held-out compounds, which may also hurt MoA prediction here. It is possible that more plates are needed for generalization anyway (3 plates did not work well for the Stain data either).

**Main takeaways**

The model improves replicate prediction significantly: it is nearly perfect (1.0 mAP). However, as expected, it does not improve MoA prediction. I will probably have to use more plates during training to improve generalization.

**Results: mAP of replicate and MoA prediction**

*Figure: mAPs, replicate prediction*
*Figure: model mAP, MoA prediction*
*Figure: mAPs, MoA prediction*
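For illustration, here is a minimal sketch of how across-plate replicate pairs for the contrastive objective can be formed. This is not the repo's actual training code; the column names (`Metadata_pert_id`, `Metadata_Plate`) follow common pycytominer/LINCS conventions and are assumptions here.

```python
# Sketch: build positive (attracting) pairs from wells that share a compound
# but sit on different plates. Column names are assumptions.
from itertools import combinations

import pandas as pd

def across_plate_pairs(wells: pd.DataFrame) -> list[tuple[int, int]]:
    """Index pairs of wells with the same compound but on different plates."""
    pairs = []
    for _, group in wells.groupby("Metadata_pert_id"):
        for i, j in combinations(group.index, 2):
            if group.loc[i, "Metadata_Plate"] != group.loc[j, "Metadata_Plate"]:
                pairs.append((i, j))  # positive (attracting) pair
    return pairs
```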
**Baseline LINCS**

It looks like the LINCS results were generated by
From the manuscript:
### Experiment 2: test the method on 6 plates from LINCS (only 10 µM dose)

Using plates SQ00015224, SQ00015223, SQ00015230, SQ00015231, SQ00015229, SQ00015233, and SQ00015232 as the training data and considering across-plate replicates, the model was trained in the same way as in Exp. 1. Using more plates during training is expected to improve MoA prediction performance compared to Exp. 1.

**Main takeaways**

The model is still not generalizing to MoA prediction, probably due to the lack of data. In the next experiment I will try to use ~30 plates.

**Results**

The model converges more slowly than the previous models - I think because there are more compounds that can be used for replicate training, which increases the complexity of the training task. We can also see that more MoAs are taken into account in the evaluation of the MoA prediction task. The model beats BM (benchmark) performance on the replicating task, but not on the MoA prediction task - performance there remains near random (as in Exp. 1). Due to the filtering of lower dose points, the total number of samples is only the size of one plate (~360 samples). This could explain the model's lack of generalization; I may need at least 5 times as many plates (~30).

**mAP of replicate and MoA prediction**

*Figure: mAPs, replicate prediction*
*Figure: model mAP, MoA prediction*
*Figure: mAPs, MoA prediction*
### Experiment 3: test the method on 26 plates with 1439 wells on the server (bugged)

I have now moved the Python environment and all of the data to John's server to accommodate the larger memory requirements of the LINCS dataset. Effectively, these 26 plates only correspond to ~4 plates' worth of training data, which is nowhere close to the 15 I used for training the final model on the Stain datasets. I have documented all the steps I had to take and added them to the README of this repository for future reference.

The model was trained using the same default hyperparameters as before. The updated pipeline uses 1781 features instead of 1783, possibly because some Image or Metadata features are now being filtered out correctly.

**Main takeaways**

Something went wrong during training. I expect the problem lies with the data; I will investigate the issue and improve the training pipeline. After investigating a bit, the issue might be related to the same compounds getting different labels.

**Results**

The loss curves show that the model was not able to learn the task correctly. The validation mAP does keep increasing (although on a much smaller scale than it normally would), which may indicate that the training procedure itself is correct. I will investigate the training data to see if something is wrong there.
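A quick sanity check for the suspected bug - the same compound receiving different training labels - could look like the sketch below. The file path and column names are assumptions based on common LINCS metadata conventions, not the repo's actual code.

```python
# Sketch: flag compounds that were assigned more than one training label.
import pandas as pd

wells = pd.read_csv("aggregated_profiles.csv")  # hypothetical aggregated profiles

# Count how many distinct labels each compound received.
labels_per_compound = wells.groupby("Metadata_pert_id")["label"].nunique()
conflicts = labels_per_compound[labels_per_compound > 1]
print(f"{len(conflicts)} compounds received more than one label")
```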
### Experiment 3 (rerun): test the method on 26 plates with 458 wells on the server

I resolved the issue and reran the experiment as described in #12 (comment). I now aggregate all the wells from all the plates first, then add the perturbation and MoA information per well, and remove any wells that do not contain perturbation information. Finally, I remove all compounds that occur only once in the entire dataset (a sketch of these steps follows below). This results in 458 wells.
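A minimal sketch of the preprocessing steps described above; the file paths, column names, and join key are assumptions, not the repo's actual pipeline.

```python
import pandas as pd

plate_paths = ["SQ00015231.csv", "SQ00015232.csv"]  # hypothetical per-plate profile files
metadata = pd.read_csv("lincs_metadata.csv")        # hypothetical perturbation/MoA table

# 1. Aggregate the wells from all plates into one table.
wells = pd.concat([pd.read_csv(p) for p in plate_paths], ignore_index=True)

# 2. Add perturbation and MoA information per well.
wells = wells.merge(
    metadata[["Metadata_broad_sample", "Metadata_pert_id", "Metadata_moa"]],
    on="Metadata_broad_sample", how="left",
)

# 3. Remove wells without perturbation information.
wells = wells.dropna(subset=["Metadata_pert_id"])

# 4. Remove compounds that occur only once in the entire dataset
#    (raising min_count to 4 reproduces the Experiment 7 filter).
min_count = 2
counts = wells["Metadata_pert_id"].value_counts()
wells = wells[wells["Metadata_pert_id"].isin(counts[counts >= min_count].index)]
```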
Note that this is only equal to ~1.5 plates' worth of data!

**Main takeaways**

The model is still no better at MoA prediction than the benchmark. In previous experiments, the model only started beating the benchmark once I used more than ~9 plates. Another option would be to add more data augmentation, i.e., random sampling from the wells.

**Results**

The model now trains correctly and we can see that the loss curves converge properly. Percent replicating is nearly perfect, while the model still shows no improvement in MoA prediction compared to the benchmark.

**Mean average precision scores**

*Figure: mAP, replicate prediction*
**Plate barcodes**

*SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 4: test the method on 38 plates with 1008 wells

I added more plates and repeated the experiment. The number of wells used still only equals ~3 plates in total. I will have to add even more to make this work.
**Results**

*Figure: replicate prediction*
*Figure: MoA prediction*
**Plate barcodes**

*SQ00015153_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 5: test the method on 81 plates with 3306 wells - batch size (BS) 36 and BS 72
**Main takeaways**
**Results**

Batch size 36:

*Figure: replicate prediction*

Batch size 72:

*Figure: replicate prediction*
*Figure: MLP*
*Figure: BM*
**Plate barcodes**

*SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015153_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 6: test the method on 108 plates with 4691 wells - BS 72

**Main takeaways**

The results are similar to those of Experiment 5.

**Next up**

I will try some different training approaches to see if I can improve model performance further. However, it is possible that this small increase is all we can get, given the large number of different compounds in the dataset.
I will save the last step for last, as needing to optimize these parameters again might mean that the current hyperparameter settings are not robust to new data.

**Results**

*Figure: replicate prediction*
*Figure: MoA prediction*
**Plate barcodes**

SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 6: results updated with a randomly shuffled baseline and normalized mAP scores

I recalculated the results with updates to the eval script. A randomly shuffled baseline is now also calculated and subtracted from the mAP scores. Here I divided the mAP by the sum of the labels (i.e., the number of positive samples in the rank order). This was incorrect, and the mAP is recalculated again in a new comment below (#12 (comment)).

**Main takeaways**
**Normalized mAP scores (with shuffled baseline)**

*Figure: replicate prediction*
Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=56.98555674994308, pvalue=0.0)`
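For reference, statistics in this `Ttest_indResult` format are what scipy's independent two-sample t-test returns when `equal_var=False` selects Welch's variant. The score arrays below are placeholders standing in for the per-query mAPs of the model (MLP) and the benchmark (BM).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
mlp_maps = rng.uniform(0.0, 1.0, size=300)  # placeholder MLP mAP scores
bm_maps = rng.uniform(0.0, 0.8, size=300)   # placeholder BM mAP scores

# equal_var=False drops the equal-variance assumption -> Welch's t-test.
print(ttest_ind(mlp_maps, bm_maps, equal_var=False))
```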
**MoA prediction**

Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=4.663893266927154, pvalue=3.176144361395387e-06)`

**Plate barcodes**

SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 7: training on data with at least 4 replicates

I filtered the training data for compounds that have at least 4 replicates (4 is also the highest number of replicates I found in this dataset).

**Main takeaways**

The results are worse than when using all data. I will try training with

**Normalized mAP scores (with shuffled baseline)**

*Figure: replicate prediction*
*Figure: MoA prediction*
**Plate barcodes**

SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 8: training on only 88 different compounds with 4 replicates each

To reduce complexity, I only train on 10% of the currently preprocessed data. The idea is to mimic the complexity of the JUMP data. However, the amount of data is drastically reduced with respect to JUMP: since there are no replicate plates of the same compounds, the total number of replicates of a single compound is 4 instead of 4 times the number of replicate plates. (A sketch of this compound subsampling follows after the results below.)

**Main takeaways**

Due to the lack of training samples, this task is harder than when using more distinct compounds but also more samples. In the future, I will use all the samples I have at my disposal.

**Results: mAP**

*Figure: replicate prediction*
*Figure: MoA prediction*
Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=1.9471959035173303, pvalue=0.05155618670367708)`
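As referenced above, a sketch of the subsampling: keep 10% of the compounds (88 of them) with all of their replicate wells. The file path and column name are assumptions.

```python
# Sketch: subsample the compound set (not the wells), keeping all replicates
# of each sampled compound.
import pandas as pd

wells = pd.read_csv("preprocessed_profiles.csv")  # hypothetical preprocessed data

compounds = wells["Metadata_pert_id"].drop_duplicates()
subset = compounds.sample(n=88, random_state=42)  # 88 compounds ~= 10% of the data
train_wells = wells[wells["Metadata_pert_id"].isin(subset)]
```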
**Plate barcodes**

SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015230
### Experiment 6: recalculating the mAP again

I now calculate the mAP by correcting it with the random baseline (a sketch of this correction follows at the end of this comment).

**Main takeaways**

The mAP is now closer to what would be expected from the results in the LINCS manuscript, where the majority of the results fall between 0 and 0.1.

**Results: replicate prediction**

Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=89.0721529525361, pvalue=0.0)`
**Results: MoA prediction**
Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=5.879117847435402, pvalue=4.351771882547785e-09)`
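As referenced above, a sketch of the baseline correction, under the assumption that "correcting with the random baseline" means subtracting the mean average precision obtained from randomly shuffled labels (i.e., chance level) from the observed value. This is not necessarily the eval script's exact implementation.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def corrected_ap(y_true: np.ndarray, scores: np.ndarray,
                 n_shuffles: int = 100, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    ap = average_precision_score(y_true, scores)
    # Mean AP over shuffled label vectors approximates the random baseline.
    baseline = np.mean([
        average_precision_score(rng.permutation(y_true), scores)
        for _ in range(n_shuffles)
    ])
    return ap - baseline
```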
### Experiment 9: increasing the level of data augmentation

I reran Experiment 6, but now using 8 sets of cells sampled per compound (instead of 4). To keep the effective batch size the same, I reduced the batch size from 72 to 36. The model crashed near the end, so the results are inconclusive, but they may still give an idea of which direction this type of training is headed.

**Main takeaways**

Although the training loss and replicate prediction performance are worse than in Experiment 6, and even though training was not completed, the mAP for MoA prediction is higher than before (0.0658 instead of 0.0617). This suggests that adding more data augmentation will benefit generalization. (A sketch of this sampling-based augmentation follows at the end of this comment.)

**Results: mAP, MoA prediction**
Total mean mAP shuffled: 0.0

Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=6.754822157536942, pvalue=1.5725563418210224e-11)`
**Results: mAP, replicate prediction**

Welch's t-test between MLP mAP and BM mAP: `Ttest_indResult(statistic=78.30556567539497, pvalue=0.0)`
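As referenced above, a hedged sketch of the augmentation: draw several random subsets of single cells per compound and aggregate each subset into its own profile (`n_sets=8` in this experiment, versus 4 before). The `cells_per_set` value and the column names are assumptions.

```python
import numpy as np
import pandas as pd

def sampled_profiles(cells: pd.DataFrame, n_sets: int = 8,
                     cells_per_set: int = 400, seed: int = 0) -> pd.DataFrame:
    """Aggregate random single-cell subsets into per-compound profiles."""
    rng = np.random.default_rng(seed)
    profiles = []
    for pert, group in cells.groupby("Metadata_pert_id"):
        feats = group.drop(columns=["Metadata_pert_id"])
        for _ in range(n_sets):
            # Sample a subset of cells (without replacement) and average it.
            idx = rng.choice(len(feats), size=min(cells_per_set, len(feats)),
                             replace=False)
            profile = feats.iloc[idx].mean(numeric_only=True)
            profile["Metadata_pert_id"] = pert
            profiles.append(profile)
    return pd.DataFrame(profiles)
```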
### Experiment 9: continued

I reran the evaluation after completing the training loop (100 epochs) for the model described in Experiment 9. Because I also preprocessed all remaining plates overnight, the evaluation includes some plates that were not part of training, and therefore some compounds that were never seen before. This may deflate the mAP numbers somewhat.

**Main takeaways**

Although the training loss and replicate prediction performance are worse than in Experiment 6, the mAP for MoA prediction is higher. This suggests that adding more data augmentation will benefit generalization.

**Next up**

Train a model on ALL LINCS data!

**MoA prediction results**
*Figures: Exp. 9 vs. Exp. 6*
**Replicate prediction results**

*Figures: Exp. 9 vs. Exp. 6*
### Experimental setup

LINCS contains 6 dose points: 0.04 µM, 0.12 µM, 0.37 µM, 1.11 µM, 3.33 µM, and 10 µM. For my experiments, I will use the highest dose point (10 µM) as the training and validation set. The model is trained to create profiles such that replicate compound profiles attract and non-replicate compound profiles repel. It is then validated by evaluating the ability of these profiles to predict MoAs (or find sister compounds). Finally, the model will be tested on the 3.33 µM dose point as a hold-out set; this data should look significantly different from the training and validation data.
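A minimal sketch of this dose-based split. The column name `Metadata_mmoles_per_liter`, the file path, and the matching tolerance are assumptions, not the repo's actual code.

```python
import numpy as np
import pandas as pd

wells = pd.read_csv("lincs_profiles.csv")  # hypothetical aggregated profiles
dose = wells["Metadata_mmoles_per_liter"]

# Train/validate on the highest dose point; hold out 3.33 µM for testing.
train_val = wells[np.isclose(dose, 10.0, atol=0.1)]
test = wells[np.isclose(dose, 3.33, atol=0.1)]
```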
I will follow the same data exclusion protocol as Michael did in his research:
This should result in similar data numbers:
I have put some relevant quotes from Michael's work and the LINCS manuscript here: https://docs.google.com/document/d/1z2U5o91vzBwB-4xtryYn5d3kSWJi8ZSerE_MAzdLT_0/edit#heading=h.sbi6l2r6p5ec