diff --git a/docs/MANUAL.html b/docs/MANUAL.html
index ace8bdc..6829d86 100644
--- a/docs/MANUAL.html
+++ b/docs/MANUAL.html
@@ -11,11 +11,11 @@
Kraken taxonomic sequence classification system
+Kraken taxonomic sequence classification system
-Version 0.10.5-beta
+Version 0.10.5-beta
-Operating Manual
+Operating Manual
Disk space: Construction of Kraken's standard database will require at least 160 GB of disk space. Customized databases may require more or less space. Disk space used is linearly proportional to the number of distinct k-mers; as of Feb. 2015, Kraken's default database contains just under 6 billion (6e9) distinct k-mers.
In addition, the disk used to store the database should be locally-attached storage. Storing the database on a network filesystem (NFS) partition can cause Kraken's operation to be very slow, or to be stopped completely. As NFS accesses are much slower than local disk accesses, both preloading and database building will be slowed by use of NFS.
Memory: To run efficiently, Kraken requires enough free memory to hold the database in RAM. While this can be accomplished using a ramdisk, Kraken supplies a utility for loading the database into RAM via the OS cache. The default database size is 75 GB (as of Feb. 2015), and so you will need at least that much RAM if you want to build or run with the default database.
Dependencies: Kraken currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++, and need to be compiled using g++. Downloads of NCBI data are performed by wget and in some cases, by rsync.
+Dependencies: Kraken currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++, and need to be compiled using g++. Multithreading is handled using OpenMP. Downloads of NCBI data are performed by wget and in some cases, by rsync. Most Linux systems that have any sort of development package installed will have all of the above listed programs and libraries available.
Finally, if you want to build your own database, you will need to install the Jellyfish k-mer counter. Note that Kraken only supports use of Jellyfish version 1. Jellyfish version 2 is not yet compatible with Kraken.
Network connectivity: Kraken's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as ftp_proxy or RSYNC_PROXY) in order to get these commands to work properly.
MiniKraken: To allow users with low-memory computing environments to use Kraken, we supply a reduced standard database that can be downloaded from the Kraken web site. When Kraken is run with a reduced database, we call it MiniKraken.
@@ -222,7 +222,14 @@
At present, we have not yet developed a confidence score with a solid probabilistic interpretation for Kraken. However, we have developed a simple scoring scheme that has yielded good results for us, and we've made that available in the kraken-filter script. The approach we use allows a user to specify a threshold score in the [0,1] interval; the kraken-filter script then will adjust labels up the tree until the label's score (described below) meets or exceeds that threshold. If a label at the root of the taxonomic tree would not have a score exceeding the threshold, the sequence is called unclassified by kraken-filter.
A sequence label's score is a fraction C/Q, where C is the number of k-mers mapped to LCA values in the clade rooted at the label, and Q is the number of k-mers in the sequence that lack an ambiguous nucleotide (i.e., they were queried against the database). Consider the example of the LCA mappings in Kraken's output given earlier:
-"562:13 561:4 A:31 0:1 562:3" would indicate that: * the first 13 k-mers mapped to taxonomy ID #562 * the next 4 k-mers mapped to taxonomy ID #561 * the next 31 k-mers contained an ambiguous nucleotide * the next k-mer was not in the database * the last 3 k-mers mapped to taxonomy ID #562
+"562:13 561:4 A:31 0:1 562:3" would indicate that:
+In this case, ID #561 is the parent node of #562. Here, a label of #562 for this sequence would have a score of C/Q = (13+3)/(13+4+1+3) = 16/21. A label of #561 would have a score of C/Q = (13+4+3)/(13+4+1+3) = 20/21. If a user specified a threshold over 16/21, kraken-filter would adjust the original label from #562 to #561; if the threshold was greater than 20/21, the sequence would become unclassified.
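The C/Q arithmetic above can be reproduced with a short Python sketch. This is only an illustration of the scoring rule as the manual describes it, not the code that kraken-filter itself runs. The function names (`in_clade`, `clade_score`, `adjust_label`) and the `parents` dictionary are hypothetical, and the toy taxonomy contains only the two IDs from the example, with #561 treated as the top of the tree.

```python
from fractions import Fraction

def in_clade(taxid, label, parents):
    """Return True if taxid is label or a descendant of label."""
    node = taxid
    while node is not None:
        if node == label:
            return True
        node = parents.get(node)  # None once we walk past the top of the toy tree
    return False

def clade_score(lca_string, label, parents):
    """Compute C/Q for a label from a Kraken-style LCA mapping string."""
    c = q = 0
    for field in lca_string.split():
        taxid, count = field.split(":")
        count = int(count)
        if taxid == "A":
            continue  # ambiguous k-mers were never queried, so they are not in Q
        q += count
        # "0" means the k-mer was queried but not found in the database
        if taxid != "0" and in_clade(taxid, label, parents):
            c += count
    return Fraction(c, q)

def adjust_label(lca_string, label, threshold, parents):
    """Move the label up the tree until its score meets the threshold.

    Returns None (unclassified) if no ancestor in the toy tree qualifies.
    """
    node = label
    while node is not None:
        if clade_score(lca_string, node, parents) >= threshold:
            return node
        node = parents.get(node)
    return None

# Hypothetical toy taxonomy: #561 is the parent of #562 and has no parent here.
parents = {"562": "561"}
mapping = "562:13 561:4 A:31 0:1 562:3"

print(clade_score(mapping, "562", parents))  # 16/21
print(clade_score(mapping, "561", parents))  # 20/21
print(adjust_label(mapping, "562", Fraction(4, 5), parents))  # 561
```

With a threshold of 0.8, which lies between 16/21 and 20/21, the sketch moves the label from #562 to #561, matching the behavior described above. In a real taxonomy #561 has further ancestors up to the root, so the walk would continue past it before the sequence is called unclassified.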
kraken-filter is used like this:
kraken-filter --db $DBNAME [--threshold NUM] kraken.output
diff --git a/docs/MANUAL.markdown b/docs/MANUAL.markdown
index 5546bd0..903a1ea 100644
--- a/docs/MANUAL.markdown
+++ b/docs/MANUAL.markdown
@@ -587,6 +587,7 @@ they were queried against the database). Consider the example of the
LCA mappings in Kraken's output given earlier:
"562:13 561:4 A:31 0:1 562:3" would indicate that:
+
* the first 13 $k$-mers mapped to taxonomy ID #562
* the next 4 $k$-mers mapped to taxonomy ID #561
* the next 31 $k$-mers contained an ambiguous nucleotide