
Deciding on datasets to use #2

Open
shntnu opened this issue Feb 16, 2022 · 11 comments
Labels
question Further information is requested

Comments

@shntnu
Collaborator

shntnu commented Feb 16, 2022

@EchteRobert asked

Last week I believe you mentioned that we should perhaps start with a different dataset than the Stain5(?) dataset. Do you remember which one(s) you had in mind instead?

@niranjchandrasekaran had said:

Here are some options:

- Stain2 and Stain3 - lots of different experimental conditions; 1 plate per condition (no replicates). These two could be good datasets to burn through while Robert is coming up with his methods.
- Stain4, Plate1, Reagent1, and Stain5 - lots of different conditions; 3 or 4 replicate plates per condition. I guess Stain5 is the best dataset, so it could perhaps be used as a holdout set.

Based on this, I vote for starting with Stain2, @EchteRobert

@EchteRobert
Collaborator

LGTM. I see that Percent Strong is used. Could you remind me what that considers as replicates? I can't find it in the issue https://github.com/jump-cellpainting/pilot-analysis/issues/15

@shntnu
Collaborator Author

shntnu commented Feb 16, 2022

@niranjchandrasekaran
Collaborator

@EchteRobert Percent Strong in that analysis is the same as the Percent Replicating metric that you have been using so far. You can use Metadata_broad_sample as the column for grouping replicates.
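For reference, a minimal sketch of what Percent Replicating computes under that grouping, assuming the usual formulation (median pairwise correlation of replicate profiles compared against a null built from random well groups). The threshold and null construction below are simplifications for illustration, not the canonical implementation:

```python
import numpy as np
import pandas as pd

def median_replicate_correlation(profiles, group_col="Metadata_broad_sample"):
    """Median pairwise Pearson correlation among replicate profiles of each perturbation."""
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    out = {}
    for sample, grp in profiles.groupby(group_col):
        if len(grp) < 2:
            continue  # need at least two replicates to correlate
        corr = np.corrcoef(grp[feature_cols].to_numpy())
        iu = np.triu_indices_from(corr, k=1)  # upper triangle, excluding the diagonal
        out[sample] = np.median(corr[iu])
    return pd.Series(out, name="replicate_correlation")

def percent_replicating(profiles, group_col="Metadata_broad_sample",
                        n_null=1000, quantile=0.95, seed=0):
    """Fraction of perturbations whose replicate correlation exceeds a null threshold.

    Null (simplified here): median correlation of randomly drawn groups of wells,
    each the size of a typical replicate group.
    """
    rng = np.random.default_rng(seed)
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    rep_corr = median_replicate_correlation(profiles, group_col)
    group_size = int(profiles.groupby(group_col).size().median())
    null = []
    for _ in range(n_null):
        idx = rng.choice(len(profiles), size=group_size, replace=False)
        corr = np.corrcoef(profiles.iloc[idx][feature_cols].to_numpy())
        iu = np.triu_indices_from(corr, k=1)
        null.append(np.median(corr[iu]))
    threshold = np.quantile(null, quantile)
    return float((rep_corr > threshold).mean())
```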

@EchteRobert
Collaborator

Great! Thank you both!

@EchteRobert
Collaborator

@niranjchandrasekaran I can't find the platemap or metadata for Stain2 on the pilot-analysis GitHub. They seem to have been removed, even though the notebook refers to one of those pages. Do you know if I can find them somewhere else?

@niranjchandrasekaran
Collaborator

@EchteRobert Here are the platemap and metadata files - https://github.com/jump-cellpainting/JUMP-MOA
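For anyone following along, a rough sketch of attaching that platemap and metadata to a profile CSV. The file and column names below are illustrative guesses at the JUMP-MOA layout (a platemap keyed by well_position/broad_sample and a metadata table keyed by broad_sample), so check them against the actual repository:

```python
import pandas as pd

# Hypothetical local copies of files from jump-cellpainting/JUMP-MOA (names are illustrative)
platemap = pd.read_csv("JUMP-MOA_compound_platemap.txt", sep="\t")  # well_position, broad_sample, ...
metadata = pd.read_csv("JUMP-MOA_compound_metadata.tsv", sep="\t")  # broad_sample, moa, ...
profiles = pd.read_csv("BR00112198.csv")                            # per-well aggregated profiles

annotated = (
    profiles
    .merge(platemap, left_on="Metadata_Well", right_on="well_position", how="left")
    .merge(metadata, on="broad_sample", how="left")
)
```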

@EchteRobert
Collaborator

@shntnu @niranjchandrasekaran This is the list I made of the feature columns in the profiles available in the s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/ bucket (fetched with aws s3 cp):

BR00112197binned.csv - 4295 columns - 4293 features
BR00112200.csv - 3530 columns - 3528 features
BR00112203.csv - 4293 features
BR00112199.csv - 4293 features
BR00113818.csv - 4293 features
BR00113819.csv - 4293 features
BR00113820.csv - 4293 features
BR00113821.csv - 4293 features
BR00112197repeat.csv - 4293 features
BR00112197standard.csv - 4293 features
BR00112198.csv - 4293 features
BR00112201.csv - 4293 features
BR00112202.csv - 4293 features
BR00112204.csv - 4293 features

Should I just remove the BR00112200 plate from my data pool and then, moving forward, expect that the other Stain experiments will have the same 4293 features? Or do you think that these features will change?
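One quick way to sanity-check that assumption is to compare the header columns of a few plates directly; a minimal sketch, assuming the CSVs listed above have been copied locally:

```python
import pandas as pd

plates = ["BR00112198.csv", "BR00112200.csv", "BR00112203.csv"]      # any plates to compare
columns = {p: set(pd.read_csv(p, nrows=0).columns) for p in plates}  # nrows=0 reads only the header

reference = columns["BR00112198.csv"]
for plate, cols in columns.items():
    missing = reference - cols
    extra = cols - reference
    print(f"{plate}: {len(cols)} columns; {len(missing)} missing, {len(extra)} extra vs reference")
```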

@niranjchandrasekaran
Collaborator

niranjchandrasekaran commented Feb 25, 2022

According to this issue, plate BR00112200 only has 4 channels, which explains its lower number of features.

@EchteRobert I believe we updated the feature extraction pipeline between Stain4 and Stain5, and therefore the number of features changed in Stain5 (5794 columns; CPJUMP1 likely has the same number of columns). Hence Stain2-4 will likely have 4293 features, but if it is not too difficult, I would suggest that you quickly check all the plates in Stain3 and 4 before proceeding further.

@shntnu
Collaborator Author

shntnu commented Feb 25, 2022

Thanks @niranjchandrasekaran

@EchteRobert here's a way to do it quickly

This command

```
aws s3 ls --recursive s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/ |
  grep backend | grep csv | grep Stain |
  tr -s " " | cut -d" " -f4 |
  parallel --keep-order "echo -n {1}; aws s3 cp s3://cellpainting-gallery/{1} - | csvcut -n | wc -l" |
  grep -v "download failed" |
  tr -s " " | tr " " "," | csvcut -c 2,1 | sort -n > ~/Desktop/stain.csv
```

produces stain.csv; counting then reveals that all the Stain5 plates have 5794 columns, as Niranj said, and the remaining plates indeed all have 4295 columns (i.e. 4293 features + 2 metadata columns):

```
cat ~/Desktop/stain.csv | csvcut -c1 | sort | uniq -c
   1 3530
  60 4295
  60 5794
```
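The same check can also be scripted without parallel/csvkit. Below is a rough boto3 sketch (an alternative for illustration, not what was actually run) that reads only the first chunk of each object and counts header columns by commas, assuming no quoted commas and a header that fits in the first 1 MB:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# cellpainting-gallery is a public bucket, so unsigned (anonymous) requests are fine
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
bucket = "cellpainting-gallery"
prefix = "jump-pilot/source_4/workspace/backend/"

counts = {}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if "Stain" not in key or not key.endswith(".csv"):
            continue
        # fetch only the first 1 MB and count commas in the header line
        body = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-1048575")["Body"].read()
        header = body.split(b"\n", 1)[0]
        counts[key] = header.count(b",") + 1

for key, n_columns in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n_columns, key)
```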

@EchteRobert
Collaborator

Whoa! I didn't know such magic was possible with aws! Thank you, that saves me some time as I was going to do it manually... I'll divide the data into those two groups then.
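If it helps, a small pandas sketch of that split, assuming stain.csv has two columns (column count, then S3 key) and no header row:

```python
import pandas as pd

stain = pd.read_csv("stain.csv", header=None, names=["n_columns", "key"])

old_pipeline = stain.loc[stain["n_columns"] == 4295, "key"].tolist()  # Stain2-4 style profiles
new_pipeline = stain.loc[stain["n_columns"] == 5794, "key"].tolist()  # Stain5 style profiles
print(len(old_pipeline), "plates with 4295 columns;", len(new_pipeline), "plates with 5794 columns")
```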

@shntnu
Collaborator Author

shntnu commented Feb 25, 2022

All that magic is bash, not AWS :) (except the bit about aws s3 cp <object> - to cat the <object>)

EchteRobert added the question (Further information is requested) label Apr 20, 2022
EchteRobert added a commit that referenced this issue Aug 30, 2022