
Deciding on datasets to use #2

Open
shntnu opened this issue Feb 16, 2022 · 11 comments
Labels
question Further information is requested

Comments

@shntnu
Collaborator

shntnu commented Feb 16, 2022

@EchteRobert asked

Last week I believe you mentioned that we should perhaps start with a different dataset than the Stain5(?) dataset. Do you remember which one(s) you had in mind instead?

@niranjchandrasekaran had said:

Here are some options:

- Stain2 and Stain3 - lots of different experimental conditions; 1 plate per condition (no replicates). These two could be good datasets to burn through while Robert is coming up with his methods.
- Stain4, Plate1, Reagent1, and Stain5 - lots of different conditions; 3 or 4 replicate plates per condition. I guess Stain5 is the best dataset, so it could perhaps be used as a holdout set.

Based on this, I vote for starting with Stain2, @EchteRobert

@EchteRobert
Collaborator

LGTM. I see that Percent Strong is used. Could you remind me what that considers as replicates? I can't find it in the issue https://github.com/jump-cellpainting/pilot-analysis/issues/15

@shntnu
Collaborator Author

shntnu commented Feb 16, 2022

@niranjchandrasekaran
Collaborator

@EchteRobert Percent Strong in that analysis is the same as the Percent Replicating metric that you have been using so far. You can use Metadata_broad_sample as the column for grouping replicates.
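For reference, a minimal sketch of what Percent Replicating computes under that grouping, assuming the usual formulation (median pairwise correlation of replicate profiles compared against a null built from random well groups). The threshold and null construction below are simplifications for illustration, not the canonical implementation:

```python
import numpy as np
import pandas as pd

def median_replicate_correlation(profiles, group_col="Metadata_broad_sample"):
    """Median pairwise Pearson correlation among replicate profiles of each perturbation."""
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    out = {}
    for sample, grp in profiles.groupby(group_col):
        if len(grp) < 2:
            continue  # need at least two replicates to correlate
        corr = np.corrcoef(grp[feature_cols].to_numpy())
        iu = np.triu_indices_from(corr, k=1)  # upper triangle, excluding the diagonal
        out[sample] = np.median(corr[iu])
    return pd.Series(out, name="replicate_correlation")

def percent_replicating(profiles, group_col="Metadata_broad_sample",
                        n_null=1000, quantile=0.95, seed=0):
    """Fraction of perturbations whose replicate correlation exceeds a null threshold.

    Null (simplified here): median correlation of randomly drawn groups of wells,
    each the size of a typical replicate group.
    """
    rng = np.random.default_rng(seed)
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    rep_corr = median_replicate_correlation(profiles, group_col)
    group_size = int(profiles.groupby(group_col).size().median())
    null = []
    for _ in range(n_null):
        idx = rng.choice(len(profiles), size=group_size, replace=False)
        corr = np.corrcoef(profiles.iloc[idx][feature_cols].to_numpy())
        iu = np.triu_indices_from(corr, k=1)
        null.append(np.median(corr[iu]))
    threshold = np.quantile(null, quantile)
    return float((rep_corr > threshold).mean())
```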

@EchteRobert
Collaborator

Great! Thank you both!

@EchteRobert
Collaborator

@niranjchandrasekaran I can't find the platemap or metadata for Stain2 on the pilot-analysis GitHub. They seem to have been removed, even though the notebook refers to one of those pages. Do you know if I can find them somewhere else?

@niranjchandrasekaran
Collaborator

@EchteRobert Here are the platemap and metadata files - https://github.com/jump-cellpainting/JUMP-MOA
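For anyone following along, a rough sketch of attaching that platemap and metadata to a profile CSV. The file and column names below are illustrative guesses at the JUMP-MOA layout (a platemap keyed by well_position/broad_sample and a metadata table keyed by broad_sample), so check them against the actual repository:

```python
import pandas as pd

# Hypothetical local copies of files from jump-cellpainting/JUMP-MOA (names are illustrative)
platemap = pd.read_csv("JUMP-MOA_compound_platemap.txt", sep="\t")  # well_position, broad_sample, ...
metadata = pd.read_csv("JUMP-MOA_compound_metadata.tsv", sep="\t")  # broad_sample, moa, ...
profiles = pd.read_csv("BR00112198.csv")                            # per-well aggregated profiles

annotated = (
    profiles
    .merge(platemap, left_on="Metadata_Well", right_on="well_position", how="left")
    .merge(metadata, on="broad_sample", how="left")
)
```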

@EchteRobert
Collaborator

@shntnu @niranjchandrasekaran This is the list I made of the feature columns in the profiles available in the s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/ bucket (fetched with aws s3 cp):

BR00112197binned.csv - 4295 columns - 4293 features
BR00112200.csv - 3530 columns - 3528 features
BR00112203.csv - 4293 features
BR00112199.csv - 4293 features
BR00113818.csv - 4293 features
BR00113819.csv - 4293 features
BR00113820.csv - 4293 features
BR00113821.csv - 4293 features
BR00112197repeat.csv - 4293 features
BR00112197standard.csv - 4293 features
BR00112198.csv - 4293 features
BR00112201.csv - 4293 features
BR00112202.csv - 4293 features
BR00112204.csv - 4293 features

Should I just remove the BR00112200 plate from my data pool and then, moving forward, expect that the other Stain experiments will have the same 4293 features? Or do you think that these features will change?
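One quick way to sanity-check that assumption is to compare the header columns of a few plates directly; a minimal sketch, assuming the CSVs listed above have been copied locally:

```python
import pandas as pd

plates = ["BR00112198.csv", "BR00112200.csv", "BR00112203.csv"]      # any plates to compare
columns = {p: set(pd.read_csv(p, nrows=0).columns) for p in plates}  # nrows=0 reads only the header

reference = columns["BR00112198.csv"]
for plate, cols in columns.items():
    missing = reference - cols
    extra = cols - reference
    print(f"{plate}: {len(cols)} columns; {len(missing)} missing, {len(extra)} extra vs reference")
```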

@niranjchandrasekaran
Collaborator

niranjchandrasekaran commented Feb 25, 2022

According to this issue, plate BR00112200 only has 4 channels, which explains its lower number of features.

@EchteRobert I believe we updated the feature extraction pipeline between Stain4 and Stain5, and therefore the number of features changed in Stain5 (5794 columns; CPJUMP1 likely has the same number of columns). Hence Stain2-4 will likely have 4293 features, but if it is not too difficult, I would suggest that you quickly check all the plates in Stain3 and 4 before proceeding further.

@shntnu
Collaborator Author

shntnu commented Feb 25, 2022

Thanks @niranjchandrasekaran

@EchteRobert here's a way to do it quickly

This command

```
aws s3 ls --recursive s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/ |
  grep backend | grep csv | grep Stain |
  tr -s " " | cut -d" " -f4 |
  parallel --keep-order "echo -n {1}; aws s3 cp s3://cellpainting-gallery/{1} - | csvcut -n | wc -l" |
  grep -v "download failed" |
  tr -s " " | tr " " "," | csvcut -c 2,1 | sort -n > ~/Desktop/stain.csv
```

produces stain.csv; counting then reveals that all the Stain5 plates have 5794 columns, as Niranj said, and the remaining plates indeed all have 4295 columns (i.e. 4293 features + 2 metadata columns):

```
cat ~/Desktop/stain.csv | csvcut -c1 | sort | uniq -c
   1 3530
  60 4295
  60 5794
```
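The same check can also be scripted without parallel/csvkit. Below is a rough boto3 sketch (an alternative for illustration, not what was actually run) that reads only the first chunk of each object and counts header columns by commas, assuming no quoted commas and a header that fits in the first 1 MB:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# cellpainting-gallery is a public bucket, so unsigned (anonymous) requests are fine
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
bucket = "cellpainting-gallery"
prefix = "jump-pilot/source_4/workspace/backend/"

counts = {}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if "Stain" not in key or not key.endswith(".csv"):
            continue
        # fetch only the first 1 MB and count commas in the header line
        body = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-1048575")["Body"].read()
        header = body.split(b"\n", 1)[0]
        counts[key] = header.count(b",") + 1

for key, n_columns in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n_columns, key)
```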

@EchteRobert
Collaborator

Whoa! I didn't know such magic was possible with aws! Thank you, that saves me some time as I was going to do it manually... I'll divide the data into those two groups then.
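If it helps, a small pandas sketch of that split, assuming stain.csv has two columns (column count, then S3 key) and no header row:

```python
import pandas as pd

stain = pd.read_csv("stain.csv", header=None, names=["n_columns", "key"])

old_pipeline = stain.loc[stain["n_columns"] == 4295, "key"].tolist()  # Stain2-4 style profiles
new_pipeline = stain.loc[stain["n_columns"] == 5794, "key"].tolist()  # Stain5 style profiles
print(len(old_pipeline), "plates with 4295 columns;", len(new_pipeline), "plates with 5794 columns")
```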

@shntnu
Collaborator Author

shntnu commented Feb 25, 2022

All that magic is bash, not AWS :) (except the bit about aws s3 cp <object> - to cat the <object>)

EchteRobert added the question (Further information is requested) label Apr 20, 2022
EchteRobert added a commit that referenced this issue Aug 30, 2022