-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed human genome downloading and added auto-masking feature using dustmasker #57
base: master
Are you sure you want to change the base?
Conversation
…posted to kraken-users Google Group: https://groups.google.com/d/msg/kraken-users/wMNMSPo8Xtw/osYcrx90DgAJ
…en database, when available.
…f low-complexity regions, in line with a suggested script from Adam Rivers on kraken-users Google Group, circa March 2016.
Nice! Thanks Tomer - we use dustmasker ourselves in our pipeline, but we hadn't built it into Kraken as an option. I highly recommend 'dust'-ing any genomes before running Kraken (or any competing program) because of the confusion caused by low-complexity sequences. |
Thanks for your feedback, Dr. Salzberg! As I've communicated to Derrick, Kraken was instrumental in me finishing my PhD, so I'm happy to be able to contribute back. Please let me know if you think that one of the Kraken devs will accept this pull request. I'm curious about your thoughts on masking before or after mapping. I see that Heng Li advises masking after read mapping, and discarding reads that land in masked regions. Is this what you advise with Bowtie2, or do you mask DNA before building a Bowtie2 database? Thanks in advance! |
echo -n "Unpacking..." | ||
tar zxf all.fna.tar.gz | ||
rm all.fna.tar.gz | ||
gunzip *.$EXTENSION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default behavior of gunzip is to remove the compressed file, this causes an error when rm is called after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @SheaML , thanks for your comment.
In the code, I removed the call to 'tar', because we are no longer downloading the all.fna.tar.gz file. Instead, we are downloading a set of gzipped files, as that is how RefSeq is now representing the data. Thus there's no call to 'rm' after a (sub-process) call to 'gunzip'. I think the pull diff shows it pretty clearly in terms of the red lines, which have been removed.
I'm new to GitHub pull requests, so please excuse me if I've confused something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @taltman the next line you added, line 46, "rm *.$EXTENSION" is the one causing the error for me. Sorry, I'm also a github neophyte.
@SheaML Right you are! I did have a commit locally that I failed to push. Resolved. Thanks for catching that and reporting it! |
Hi Derrick,
I've been using Kraken as of late, and decided to incorporate some fixes. First, I incorporated a suggested fix by Silas Kieser for downloading the human genome, and simplified it to use a single wget call:
https://groups.google.com/d/msg/kraken-users/wMNMSPo8Xtw/osYcrx90DgAJ
Next, I incorporated Adam Rivers' suggestion about how to use dustmasker to mask low-complexity regions into kraken-build, along with adding an option '--no-mask' to turn off masking if desired for reproducibility. The software reverts to no masking if dustmasker is not found.
https://groups.google.com/d/msg/kraken-users/jjRe21-qyvw/Kq8DXY45CQAJ
I also updated the documentation to reflect the new soft dependency on dustmasker, and documented the --no-mask option.
Please let me know what you think!
Cheers,
~Tomer