Deciding on datasets to use #2
Comments
LGTM. I see that Percent Strong is used. Could you remind me what that considers as replicates? I can't find it in the issue https://github.com/jump-cellpainting/pilot-analysis/issues/15
Here's the analysis. Does that help? Otherwise, we might need @niranjchandrasekaran to clarify.
@EchteRobert Percent Strong in that analysis is the same as Percent Replicating that you have been using so far. You can use
Great! Thank you both!
@niranjchandrasekaran I can't find the platemap or metadata for Stain2 on the pilot-analysis GitHub. It seems to have been removed as the notebook refers to one of those pages. Do you know if I can find it somewhere else?
@EchteRobert Here are the platemap and metadata files - https://github.com/jump-cellpainting/JUMP-MOA
@shntnu @niranjchandrasekaran This is the list I made of all the feature columns in the profiles that are available in the aws s3 cp s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/ bucket:
BR00112197binned.csv - 4295 columns - 4293 features
Should I just remove the BR00112200 plate from my data pool and, moving forward, expect that the other Stain experiments will have the same 4293 features? Or do you think that these features will change?
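The column and feature counts above can be reproduced from a profile's header row alone. The following is a minimal sketch, not the method used in this thread: it assumes that metadata columns are the ones whose names start with `Metadata_` (a naming convention assumed here, not stated in the thread), and the function name and sample column names are hypothetical.

```python
import csv
import io


def count_columns(csv_file, metadata_prefix="Metadata_"):
    """Return (total columns, feature columns) using only the header row.

    Assumption: every non-feature column name starts with `metadata_prefix`.
    """
    header = next(csv.reader(csv_file))
    n_metadata = sum(1 for name in header if name.startswith(metadata_prefix))
    return len(header), len(header) - n_metadata


# Tiny in-memory stand-in for a profile file (column names are made up).
sample = io.StringIO(
    "Metadata_Plate,Metadata_Well,Cells_AreaShape_Area,Nuclei_AreaShape_Area\n"
    "BR00000000,A01,1.0,2.0\n"
)
total, features = count_columns(sample)
print(total, features)
```

For a real plate, an open file handle to the downloaded CSV would replace the in-memory sample; only the first line is read, so the cost does not depend on the number of wells.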
According to this issue plate
@EchteRobert I believe that, between Stain4 and Stain5, we updated the feature extraction pipeline, and therefore the number of features changed in Stain5 (5794 columns; it is likely that CPJUMP1 also has the same number of columns). Hence Stain2-4 will likely have 4293 features, but if it is not too difficult, I would suggest that you quickly check all the plates in Stain3 and 4 before proceeding further.
Thanks @niranjchandrasekaran
@EchteRobert here's a way to do it quickly. This command

aws s3 ls --recursive s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/|grep backend|grep csv|grep Stain|tr -s " "|cut -d" " -f4|parallel --keep-order "echo -n {1}; aws s3 cp s3://cellpainting-gallery/{1} -|csvcut -n|wc -l"|grep -v "download failed"|tr -s " "|tr " " ","|csvcut -c 2,1|sort -n > ~/Desktop/stain.csv

produces stain.csv, and then counting reveals that all Stain5 plates have 5794 columns, as Niranj said, and the remaining plates indeed all have 4295 columns (i.e. 4293 features + 2 metadata columns).
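The "counting" step on the resulting file could look something like the sketch below. It assumes each row holds a column count followed by the file path, which is how the command's final reordering appears to lay them out; the paths shown are hypothetical placeholders, not real plate names.

```python
import csv
import io
from collections import defaultdict


def group_plates_by_column_count(rows):
    """Group file paths by their column count.

    Assumption: each row is (column count, path), in that order.
    """
    groups = defaultdict(list)
    for count, path in rows:
        groups[int(count)].append(path)
    return dict(groups)


# In-memory stand-in for stain.csv; these paths are made up for illustration.
listing = io.StringIO(
    "4295,example/Stain2/plate_a.csv\n"
    "4295,example/Stain3/plate_b.csv\n"
    "5794,example/Stain5/plate_c.csv\n"
)
groups = group_plates_by_column_count(csv.reader(listing))
for n_columns, paths in sorted(groups.items()):
    print(n_columns, "columns:", len(paths), "plate(s)")
```

Any group whose key is neither 4295 nor 5794 would flag a plate worth inspecting before pooling the data.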
Whoa! I didn't know such magic was possible with aws! Thank you, that saves me some time as I was going to do it manually... I'll divide the data into those two groups then.
All that magic is bash, not AWS :) (except the bit about
@EchteRobert asked
@niranjchandrasekaran had said:
Based on this, I vote for starting with Stain2, @EchteRobert