Revise doc times, prep for 1.0
DerrickWood committed Oct 20, 2017
1 parent 22dc088 commit 7a9ace0
Showing 4 changed files with 32 additions and 29 deletions.
27 changes: 13 additions & 14 deletions docs/MANUAL.html
@@ -13,7 +13,7 @@
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>

<p class="version">Version 0.10.5-beta</p>
<p class="version">Version 1.0</p>

<p>Operating Manual</p>
</div>
@@ -43,9 +43,9 @@ <h1 id="introduction">Introduction</h1>
<h1 id="system-requirements">System Requirements</h1>
<p>Note: Users concerned about the disk or memory requirements should read the paragraph about MiniKraken, below.</p>
<ul>
<li><p><strong>Disk space</strong>: Construction of Kraken's standard database will require at least 500 GB of disk space as of Oct. 2017. Customized databases may require more or less space. After construction, the minimum required database files require 250 GB of disk space. Disk space used is linearly proportional to the number of distinct <span class="math inline"><em>k</em></span>-mers; as of Oct. 2017, Kraken's default database contains approximately 34.3 billion (3.4e10) distinct <span class="math inline"><em>k</em></span>-mers.</p>
<li><p><strong>Disk space</strong>: Construction of Kraken's standard database will require at least 500 GB of disk space. Customized databases may require more or less space. Disk space used is linearly proportional to the number of distinct <span class="math inline"><em>k</em></span>-mers; as of Oct. 2017, Kraken's default database contains just over 14.4 billion (1.44e10) distinct <span class="math inline"><em>k</em></span>-mers.</p>
<p>In addition, the disk used to store the database should be locally-attached storage. Storing the database on a network filesystem (NFS) partition can cause Kraken's operation to be very slow, or to be stopped completely. As NFS accesses are much slower than local disk accesses, both preloading and database building will be slowed by use of NFS.</p></li>
<li><p><strong>Memory</strong>: To run efficiently, Kraken requires enough free memory to hold the database in RAM. While this can be accomplished using a ramdisk, Kraken supplies a utility for loading the database into RAM via the OS cache. The default database size is 140 GB (as of Oct. 2017), and so you will need at least that much RAM if you want to build or run with the default database.</p></li>
<li><p><strong>Memory</strong>: To run efficiently, Kraken requires enough free memory to hold the database in RAM. While this can be accomplished using a ramdisk, Kraken supplies a utility for loading the database into RAM via the OS cache. The default database size is 170 GB (as of Oct. 2017), and so you will need at least that much RAM if you want to build or run with the default database.</p></li>
<li><p><strong>Dependencies</strong>: Kraken currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++, and need to be compiled using g++. Multithreading is handled using OpenMP. Downloads of NCBI data are performed by wget and in some cases, by rsync. Most Linux systems that have any sort of development package installed will have all of the above listed programs and libraries available.</p>
<p>Finally, if you want to build your own database, you will need to install the <a href="http://www.cbcb.umd.edu/software/jellyfish/">Jellyfish</a> <span class="math inline"><em>k</em></span>-mer counter. Note that Kraken only supports use of Jellyfish version 1. Jellyfish version 2 is not yet compatible with Kraken.</p></li>
<li><p><strong>Network connectivity</strong>: Kraken's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as <code>ftp_proxy</code> or <code>RSYNC_PROXY</code>) in order to get these commands to work properly.</p></li>
@@ -74,22 +74,21 @@ <h1 id="kraken-databases">Kraken Databases</h1>
<p>Other files may be present as part of the database build process.</p>
<p>In interacting with Kraken, you should not have to directly reference any of these files, but rather simply provide the name of the directory in which they are stored. Kraken allows both the use of a standard database as well as custom databases; these are described in the sections <a href="#standard-kraken-database">Standard Kraken Database</a> and <a href="#custom-databases">Custom Databases</a> below, respectively.</p>
<h1 id="standard-kraken-database">Standard Kraken Database</h1>
<p>NOTE: Building the standard Kraken database downloads and uses all complete bacterial, archeal, and viral genomes in Refseq at the time of the build. As of October 2017, this includes ~25,000 genomes, requiring 33GB of disk space. The build process will then require approximately 420GB of additional disk space. After building this standard database, usage of the datbaase will require uses to keep only the database.idx, database.kdb, and taxonomy/ files, which requires approximately 240GB of disk space. When running a sample against this database, users will need 115GB of RAM. IF you do not have these computational resources or need to test against the Refseq database of ~20,000 genomes, we recommend building a custom database with only the genomes needed for your application. </p>
<p>To create the standard Kraken database, you can use the following command:</p>
<pre><code>kraken-build --standard --db $DBNAME</code></pre>
<p>(Replace &quot;<code>$DBNAME</code>&quot; above with your preferred database name/location.)</p>
<p>(Replace &quot;<code>$DBNAME</code>&quot; above with your preferred database name/location. Please note that the database will use approximately 500 GB of disk space during creation.)</p>
<p>This will download NCBI taxonomic information, as well as the complete genomes in RefSeq for the bacterial, archaeal, and viral domains. After downloading all this data, the build process begins; this is the most time-consuming step. If you have multiple processing cores, you can run this process with multiple threads, e.g.:</p>
<pre><code>kraken-build --standard --threads 16 --db $DBNAME</code></pre>
<p>Using 16 threads on a computer with 256 GB of RAM and a jellyfish hash-size of 12800M, the build process took approximately 15 hours (steps with an asterisk have some multi-threading enabled) in October 2017. Please note that the time required for building the database depends on the number of genomic sequences:</p>
<pre><code>
1h36m30s *Step 1 (create set)
<pre><code>kraken-build --standard --threads 24 --db $DBNAME</code></pre>
<p>Using 24 threads on a computer (an AWS r4.8xlarge instance) with 244 GB of RAM, the build process took approximately 5 hours (steps with an asterisk have some multi-threading enabled) in October 2017:</p>
<pre><code> 24m50s *Step 1 (create set)
n/a Step 2 (reduce database, optional and skipped)
9h30m13s *Step 3 (sort set)
9m04s Step 4 (GI number to sequence ID map - now obsolete)
1m20s Step 5 (Sequence ID to taxon map)
6h24m20s *Step 6 (set LCA values)
154m53s *Step 3 (sort set)
n/a Step 4 (GI number to sequence ID map - now obsolete)
&lt;1s Step 5 (Sequence ID to taxon map)
127m28s *Step 6 (set LCA values)
-------
17h41m37s Total build time</code></pre>
5h7m11s Total build time</code></pre>
<p>This process used the automatically estimated jellyfish hash size of 20170976000.</p>
<p>Note that if any step (including the initial downloads) fails, the build process will abort. However, <code>kraken-build</code> will produce checkpoints throughout the installation process, and will restart the build at the last incomplete step if you attempt to run the same command again on a partially-built database.</p>
<p>To create a custom database, or to use a database from another source, see <a href="#custom-databases">Custom Databases</a>.</p>
<p>Notes for users with lower amounts of RAM:</p>
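The revised manual text says disk usage is linearly proportional to the number of distinct k-mers, so the new k-mer count and the new database size can be cross-checked against each other. The sketch below assumes roughly 12 bytes per stored k-mer record. That per-record size is an assumption made for illustration and is not stated anywhere in this commit, and `estimate_db_gb` is a hypothetical helper, not a Kraken command.

```sh
# Hypothetical helper (not part of Kraken): estimate database size in GB
# (10^9 bytes) from a distinct k-mer count and an ASSUMED record size.
estimate_db_gb() {
    # $1 = number of distinct k-mers, $2 = assumed bytes per k-mer record
    awk -v n="$1" -v b="$2" 'BEGIN { printf "%.1f\n", n * b / 1000000000 }'
}

# 14.4 billion distinct k-mers (figure from the revised manual text),
# 12 bytes per record (assumption, see above).
estimate_db_gb 14400000000 12
```

Under that assumption the estimate comes out at 172.8 GB, in the neighborhood of the 170 GB quoted for the default database. A different record size would scale the result proportionally.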
30 changes: 17 additions & 13 deletions docs/MANUAL.markdown
@@ -39,7 +39,8 @@ read the paragraph about MiniKraken, below.
least 500 GB of disk space. Customized databases may require
more or less space. Disk space used is linearly proportional
to the number of distinct $k$-mers; as of Oct. 2017, Kraken's
default database contains just under 34.3 billion (3.4e10) distinct $k$-mers.
default database contains just over 14.4 billion (1.44e10)
distinct $k$-mers.

In addition, the disk used to store the database should be
locally-attached storage. Storing the database on a network
@@ -51,7 +52,7 @@ read the paragraph about MiniKraken, below.
* **Memory**: To run efficiently, Kraken requires enough free memory to
hold the database in RAM. While this can be accomplished using a
ramdisk, Kraken supplies a utility for loading the database into
RAM via the OS cache. The default database size is 140 GB (as of
RAM via the OS cache. The default database size is 170 GB (as of
Oct. 2017), and so you will need at least that much RAM if you want
to build or run with the default database.

@@ -162,21 +163,24 @@ process begins; this is the most time-consuming step. If you
have multiple processing cores, you can run this process with
multiple threads, e.g.:

kraken-build --standard --threads 16 --db $DBNAME
kraken-build --standard --threads 24 --db $DBNAME

Using 16 threads on a computer with 250 GB of RAM and a jellyfish
hash-size of 12800M, the build process took approximately 15 hours
Using 24 threads on a computer (an AWS r4.8xlarge instance)
with 244 GB of RAM, the build process took approximately 5 hours
(steps with an asterisk have some multi-threading enabled) in
October 2017:

1h36m30s *Step 1 (create set)
n/a Step 2 (reduce database, optional and skipped)
9h30m13s *Step 3 (sort set)
9m25s Step 4 (GI number to sequence ID map - now obsolete)
1m20s Step 5 (Sequence ID to taxon map)
6h24m20s *Step 6 (set LCA values)
--------
17h41m37s Total build time
24m50s *Step 1 (create set)
n/a Step 2 (reduce database, optional and skipped)
154m53s *Step 3 (sort set)
n/a Step 4 (GI number to sequence ID map - now obsolete)
<1s Step 5 (Sequence ID to taxon map)
127m28s *Step 6 (set LCA values)
-------
5h7m11s Total build time

This process used the automatically estimated jellyfish hash size
of 20170976000.

Note that if any step (including the initial downloads) fails,
the build process will abort. However, `kraken-build` will
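As a quick consistency check on the revised timing table, the per-step durations can be summed and compared with the reported total. This is an illustrative sketch only: `to_seconds` and `format_hms` are hypothetical helper names, not part of Kraken, and the sub-second Step 5 is ignored.

```sh
# Hypothetical helpers (not part of Kraken) for checking the timing table.

# Convert a duration such as "154m53s" or "5h7m11s" to whole seconds.
to_seconds() {
    printf '%s\n' "$1" | awk '{
        s = $0; total = 0
        # Consume one "<number><unit>" token at a time from the front.
        while (match(s, /^[0-9]+[hms]/)) {
            n = substr(s, 1, RLENGTH - 1) + 0
            u = substr(s, RLENGTH, 1)
            if (u == "h") total += n * 3600
            else if (u == "m") total += n * 60
            else total += n
            s = substr(s, RLENGTH + 1)
        }
        print total
    }'
}

# Format whole seconds as XhYmZs.
format_hms() {
    printf '%dh%dm%ds\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Steps 1, 3, and 6 from the October 2017 build; Step 5 (<1s) is ignored.
total=0
for step in 24m50s 154m53s 127m28s; do
    total=$(( total + $(to_seconds "$step") ))
done
format_hms "$total"
```

The script prints `5h7m11s`, which agrees with the total build time reported in the new table.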
2 changes: 1 addition & 1 deletion docs/top.html
@@ -1,7 +1,7 @@
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>

<p class="version">Version 0.10.5-beta</p>
<p class="version">Version 1.0</p>

<p>Operating Manual</p>
</div>
2 changes: 1 addition & 1 deletion install_kraken.sh
@@ -19,7 +19,7 @@

set -e

VERSION="0.10.6-unreleased"
VERSION="1.0"

if [ -z "$1" ] || [ -n "$2" ]
then