Fixed human genome downloading and added auto-masking feature using dustmasker #57

Open
wants to merge 8 commits into master
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
src/classify
src/db_shrink
src/db_sort
src/*.o
src/make_seqid_to_taxid_map
src/set_lcas
14 changes: 13 additions & 1 deletion docs/MANUAL.markdown
@@ -65,6 +65,11 @@ read the paragraph about MiniKraken, below.
sort of development package installed will have all of the above
listed programs and libraries available.

Kraken will use the program [dustmasker], when it is installed, to mask
low-complexity regions of library sequences, reducing the number of
false-positive hits. This is highly recommended. [dustmasker] is
bundled with NCBI BLAST.

Finally, if you want to build your own database, you will need to
install the [Jellyfish] $k$-mer counter. Note that Kraken only
supports use of Jellyfish version 1. Jellyfish version 2 is not
@@ -86,7 +91,7 @@ read the paragraph about MiniKraken, below.
required for this database is also only 4 GB.

[Jellyfish]: http://www.cbcb.umd.edu/software/jellyfish/

[dustmasker]: https://www.ncbi.nlm.nih.gov/books/NBK279681/

Installation
============
@@ -459,6 +464,13 @@ To build a custom database:
size (in GB) for the database. This allows you to create a MiniKraken
database without having to create a full Kraken database first.

5) Turning off low-complexity masking: by default, Kraken will use the
program `dustmasker` (bundled with NCBI BLAST), if it is installed and
in the system path, to mask low-complexity regions of library sequences,
which are a primary source of false-positive hits. If you do not want
the library sequences to be masked, provide the `--no-mask` option to
disable masking.

A full list of options for `kraken-build` can be obtained using
`kraken-build --help`.

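The manual text above assumes `dustmasker` is discoverable on the PATH. A minimal sketch, not part of the patch, for checking that before a build; it mirrors the `which dustmasker` test used in build_kraken_db.sh below:

    # Report whether dustmasker is available; if it is not, the build will
    # still run, but library sequences will not be masked.
    if command -v dustmasker > /dev/null
    then
        echo "dustmasker found at $(command -v dustmasker)"
    else
        echo "dustmasker not found; low-complexity regions will not be masked" >&2
    fi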
25 changes: 21 additions & 4 deletions scripts/build_kraken_db.sh
@@ -76,11 +76,28 @@ else
echo "Hash size not specified, using '$KRAKEN_HASH_SIZE'"
fi

find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0 | \
xargs -0 cat | \
jellyfish count -m $KRAKEN_KMER_LEN -s $KRAKEN_HASH_SIZE -C -t $KRAKEN_THREAD_CT \
-o database /dev/fd/0
if [ -z "$KRAKEN_NO_MASK" ] && which dustmasker > /dev/null
then

find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0 | \
xargs -0 cat \
| dustmasker -outfmt fasta \
| sed -e '/>/!s/a\|c\|g\|t/N/g' \
| jellyfish count -m $KRAKEN_KMER_LEN -s $KRAKEN_HASH_SIZE -C -t $KRAKEN_THREAD_CT \
-o database /dev/fd/0

else

[ -z "$KRAKEN_NO_MASK" ] \
&& echo "WARNING: dustmasker not found, database will not be masked for low-complexity regions."

find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0 | \
xargs -0 cat \
| jellyfish count -m $KRAKEN_KMER_LEN -s $KRAKEN_HASH_SIZE -C -t $KRAKEN_THREAD_CT \
-o database /dev/fd/0

fi

# Merge only if necessary
if [ -e "database_1" ]
then
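To make the masking branch above concrete, here is a standalone sketch (not part of the patch; the file name example.fa and its contents are invented). `dustmasker -outfmt fasta` writes low-complexity stretches in lowercase, and the `sed` expression rewrites lowercase bases on non-header lines to N, so Jellyfish never counts k-mers spanning them:

    # Hypothetical FASTA with an obvious low-complexity (poly-A) stretch.
    cat > example.fa <<'EOF'
    >example_seq
    ACGTACGTGGCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCTGATCGATCG
    EOF

    # Masked regions come back in lowercase; sed converts them to N.
    dustmasker -in example.fa -outfmt fasta \
        | sed -e '/>/!s/a\|c\|g\|t/N/g'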
68 changes: 36 additions & 32 deletions scripts/download_genomic_library.sh
@@ -28,22 +28,22 @@ set -u # Protect against uninitialized vars.
set -e # Stop on error

LIBRARY_DIR="$KRAKEN_DB_NAME/library"
NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
NCBI_SERVER="ftp.ncbi.nih.gov"
FTP_SERVER="ftp://$NCBI_SERVER"
RSYNC_SERVER="rsync://$NCBI_SERVER"
THIS_DIR=$PWD
EXTENSION=genomic.fna.gz

case "$1" in
"bacteria")
mkdir -p $LIBRARY_DIR/Bacteria
cd $LIBRARY_DIR/Bacteria
if [ ! -e "lib.complete" ]
then
rm -f all.fna.tar.gz
wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz
wget -N $FTP_SERVER/refseq/release/bacteria/*.$EXTENSION
echo -n "Unpacking..."
tar zxf all.fna.tar.gz
rm all.fna.tar.gz
gunzip *.$EXTENSION

Review comment (@SheaML):

The default behavior of gunzip is to remove the compressed file; this causes an error when rm is called afterwards.

Author reply (@taltman):

Hi @SheaML, thanks for your comment.

In the code, I removed the call to 'tar' because we are no longer downloading the all.fna.tar.gz file. Instead, we are downloading a set of gzipped files, as that is how RefSeq now represents the data. Thus there is no call to 'rm' after a (sub-process) call to 'gunzip'. I think the pull diff shows this pretty clearly in terms of the red lines, which have been removed.

I'm new to GitHub pull requests, so please excuse me if I've confused something.

Reply (@slambert-git, Mar 10, 2017):

Hi @taltman, the next line you added, line 46, "rm *.$EXTENSION", is the one causing the error for me. Sorry, I'm also a GitHub neophyte.

rm *.$EXTENSION
echo " complete."
touch "lib.complete"
else
@@ -55,11 +55,10 @@ case "$1" in
cd $LIBRARY_DIR/Plasmids
if [ ! -e "lib.complete" ]
then
rm -f plasmids.all.fna.tar.gz
wget $FTP_SERVER/genomes/Plasmids/plasmids.all.fna.tar.gz
wget -N $FTP_SERVER/refseq/release/plasmid/*.$EXTENSION
echo -n "Unpacking..."
tar zxf plasmids.all.fna.tar.gz
rm plasmids.all.fna.tar.gz
gunzip *.$EXTENSION
rm *.$EXTENSION
echo " complete."
touch "lib.complete"
else
@@ -71,15 +70,10 @@ case "$1" in
cd $LIBRARY_DIR/Viruses
if [ ! -e "lib.complete" ]
then
rm -f all.fna.tar.gz
rm -f all.ffn.tar.gz
wget $FTP_SERVER/genomes/Viruses/all.fna.tar.gz
wget $FTP_SERVER/genomes/Viruses/all.ffn.tar.gz
wget -N $FTP_SERVER/refseq/release/viral/*.$EXTENSION
echo -n "Unpacking..."
tar zxf all.fna.tar.gz
tar zxf all.ffn.tar.gz
rm all.fna.tar.gz
rm all.ffn.tar.gz
gunzip *.$EXTENSION
rm *.$EXTENSION
echo " complete."
touch "lib.complete"
else
@@ -91,23 +85,33 @@ case "$1" in
cd $LIBRARY_DIR/Human
if [ ! -e "lib.complete" ]
then
# get list of CHR_* directories
wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/
directories=$(perl -nle '/^d/ and /(CHR_\w+)\s*$/ and print $1' .listing)
rm .listing

# For each CHR_* directory, get GRCh* fasta gzip file name, d/l, unzip, and add
for directory in $directories
do
wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/$directory/
file=$(perl -nle '/^-/ and /\b(hs_ref_GRCh\S+\.fa\.gz)\s*$/ and print $1' .listing)
[ -z "$file" ] && exit 1
rm .listing
wget $FTP_SERVER/genomes/H_sapiens/$directory/$file
gunzip "$file"
done
## Download all files in a single invocation of wget:
wget \
--no-directories \
--recursive \
--level=2 \
--accept "hs_ref_GRCh*.fa.gz" \
$FTP_SERVER/genomes/Homo_sapiens/

touch "lib.complete"
# # get list of CHR_* directories
# wget --spider --no-remove-listing $FTP_SERVER/genomes/Homo_sapiens/
# directories=$(perl -nle '/^d/ and /(CHR_\w+)\s*$/ and print $1' .listing)
# rm .listing

# # For each CHR_* directory, get GRCh* fasta gzip file name, d/l, unzip, and add
# for directory in $directories
# do
# wget --spider --no-remove-listing $FTP_SERVER/genomes/Homo_sapiens/$directory/
# file=$(perl -nle '/^-/ and /\b(hs_ref_GRCh\w+\.fa\.gz)\s*$/ and print $1' .listing)
# [ -z "$file" ] && exit 1
# rm .listing
# wget $FTP_SERVER/genomes/H_sapiens/$directory/$file
# gunzip *.gz
#done

gunzip *.gz
touch "lib.complete"
else
echo "Skipping download of human genome, already downloaded here."
fi
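The review thread above notes that `gunzip` deletes the `.gz` files it decompresses, so the `rm *.$EXTENSION` that follows matches nothing and, under `set -e`, aborts the script. A standalone sketch of that interaction (the file name demo.genomic.fna is invented; `gunzip -k` is mentioned only as one possible workaround, not as the patch's fix):

    set -e
    EXTENSION=genomic.fna.gz

    # Simulate one downloaded library file.
    printf '>demo\nACGT\n' > demo.genomic.fna
    gzip demo.genomic.fna              # creates demo.genomic.fna.gz

    gunzip *.$EXTENSION                # decompresses AND deletes the .gz file
    rm *.$EXTENSION                    # glob matches nothing, rm fails, set -e exits

    # Keeping the compressed copy, e.g. with `gunzip -k *.$EXTENSION`, would let
    # the later rm succeed, at the cost of briefly holding both copies on disk.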
20 changes: 13 additions & 7 deletions scripts/kraken-build
@@ -53,6 +53,7 @@ my (
$max_db_size,
$work_on_disk,
$shrink_block_offset,
$no_mask,

$dl_taxonomy,
$dl_library,
@@ -98,6 +99,7 @@ GetOptions(
"max-db-size=s", \$max_db_size,
"work-on-disk", \$work_on_disk,
"shrink-block-offset=i", \$shrink_block_offset,
"no-mask", \$no_mask,

"download-taxonomy" => \$dl_taxonomy,
"download-library=s" => \$dl_library,
@@ -152,13 +154,14 @@ if ($max_db_size !~ /^$/ && $max_db_size <= 0) {
die "Can't have negative max database size.\n";
}

$ENV{"KRAKEN_DB_NAME"} = $db;
$ENV{"KRAKEN_THREAD_CT"} = $threads;
$ENV{"KRAKEN_DB_NAME"} = $db;
$ENV{"KRAKEN_THREAD_CT"} = $threads;
$ENV{"KRAKEN_MINIMIZER_LEN"} = $minimizer_len;
$ENV{"KRAKEN_KMER_LEN"} = $kmer_len;
$ENV{"KRAKEN_HASH_SIZE"} = $hash_size;
$ENV{"KRAKEN_MAX_DB_SIZE"} = $max_db_size;
$ENV{"KRAKEN_WORK_ON_DISK"} = $work_on_disk;
$ENV{"KRAKEN_KMER_LEN"} = $kmer_len;
$ENV{"KRAKEN_HASH_SIZE"} = $hash_size;
$ENV{"KRAKEN_MAX_DB_SIZE"} = $max_db_size;
$ENV{"KRAKEN_WORK_ON_DISK"} = $work_on_disk;
$ENV{"KRAKEN_NO_MASK"} = $no_mask;

if ($dl_taxonomy) {
download_taxonomy();
@@ -235,10 +238,13 @@ Options:
(default: 1)
--work-on-disk Perform most operations on disk rather than in
RAM (will slow down build in most cases)
--no-mask Do not use "dustmasker" to mask low-complexity
regions in library sequences.

EOF
exit $exit_code;
}

##'
sub display_help {
usage(0);
}
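For completeness, a hypothetical invocation exercising the new flag alongside existing `kraken-build` options (the database name custom_db is illustrative):

    # Build a custom database without low-complexity masking; omit --no-mask
    # to keep the default dustmasker behavior added by this pull request.
    kraken-build --build --db custom_db --threads 8 --no-mask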