kraken2-build FTP connection error (timeout) #272

Nick243 · 2020-06-25T13:17:57Z

Hello,

I am attempting to build a custom (standard plus fungi) kraken2 database and keep getting a timeout error. Have others seen this before, and does anyone have suggestions?

I appear to have installed the NCBI taxonomy successfully with:
module load kraken/2.0.8
kraken2-build --download-taxonomy --db /users/olljt2/kraken/db/. --threads 24 --use-ftp

However, when I try to install a database with:
kraken2-build --download-library bacteria --db /users/olljt2/kraken/db/. --threads 24 --use-ftp

I get the following error:
Step 1/2: Performing ftp file transfer of requested files
rsync_from_ncbi.pl: FTP connection error: Net::FTP: connect: timeout

Same thing happens when our cluster administrator attempts to install the files. I tried this on a few different occasions thinking/hoping maybe it was an issue on the NCBI side. I also tried with multiple different databases. Each time I get the same error. The files look to start to download, but do not seem to finish.

The full code used was:
module load kraken/2.0.8

kraken2-build --download-taxonomy --db /users/olljt2/kraken/db/. --threads 24 --use-ftp

kraken2-build --download-library bacteria --db /users/olljt2/kraken/db/. --threads 24 --use-ftp
kraken2-build --download-library archaea --db /users/olljt2/kraken/db/. --threads 24 --use-ftp
kraken2-build --download-library viral --db /users/olljt2/kraken/db/. --threads 24 --use-ftp
kraken2-build --download-library fungi --db /users/olljt2/kraken/db/. --threads 24 --use-ftp
kraken2-build --download-library human --db /users/olljt2/kraken/db/. --threads 24 --use-ftp
kraken2-build --download-library UniVec_Core --db /users/olljt2/kraken/db/. --threads 24 --use-ftp

kraken2-build --build --db /users/olljt2/kraken/db/. --threads 24

The --build command returned:
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.062s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 981576 bytes
Capacity estimation complete. [0.095s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 2 bits reserved for taxid.
Completed processing of 3137 sequences, 687518 bp
Writing data to disk... complete.
Database files completed. [7.798s]
Database construction complete. [Total: 7.994s]

Which looks to be just a few sequences.
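One way to confirm that the downloads were incomplete before running --build is to count what actually landed in the library directory. The following is a minimal sketch, not part of kraken2: it assumes the library FASTA files sit under `<db>/library/` with a `.fna` extension, and the function names `count_fasta_records` and `summarize_library` are made up for illustration.

```python
from pathlib import Path


def count_fasta_records(fasta_path):
    """Return (number of sequences, total bases) for one FASTA file."""
    n_seqs = 0
    n_bases = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_seqs += 1          # each header line starts one sequence
            else:
                n_bases += len(line.strip())
    return n_seqs, n_bases


def summarize_library(db_dir):
    """Map each .fna file under <db_dir>/library to its (sequences, bases) counts.

    Assumption: library files live under <db_dir>/library and end in .fna.
    """
    library = Path(db_dir) / "library"
    summary = {}
    for fasta in sorted(library.rglob("*.fna")):
        summary[str(fasta.relative_to(library))] = count_fasta_records(fasta)
    return summary
```

Calling `summarize_library("/users/olljt2/kraken/db")` and printing the result would show, per library, whether only a handful of sequences were downloaded, which can then be compared with the "Completed processing of ... sequences" line in the build log.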

Any thoughts or suggestions would be greatly appreciated!

Thanks in advance,

Nick

R-Wright-1 · 2020-06-30T22:44:46Z

Hi,

I just wanted to add to this that I've been having similar issues. I have also been able to download the taxonomy fine, as well as the non-redundant protein database, but have problems with getting the standard database, and have also tried getting bacteria only with no luck.

We had thought it might be firewall issues on the server, but I've now also tried it on a different server, my own laptop and someone else's desktop, and each gives the same error:
rsync_from_ncbi.pl: unexpected FTP path (new server?) for na

I also tried the amendments to the rsync_from_ncbi.pl script suggested here as well as here, and still haven't been able to get it to work. I assume that it's not actually an issue with the ftp path as this works for the non-redundant database, and these amendments should be skipping any NAs.

If anyone has any suggestions then I would appreciate that, too!

Thanks,
Robyn

R-Wright-1 · 2020-07-07T22:45:51Z

Hi again,

Just wanted to update in case this is helpful to anyone else (and maybe this will help @Nick243). I have found that the scripts given in this repository work for downloading the databases (in my case slightly edited to download protein rather than genomic sequences), and then adding these to the library and building the database using the regular kraken2 scripts worked fine.
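The workflow described above (download the FASTA files separately, add each one to the library, then build) could be scripted roughly as follows. This is a sketch under assumptions: it relies on the `--add-to-library`, `--build`, `--db` and `--threads` options of kraken2-build, assumes the downloaded files are uncompressed and end in `.fna`, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def add_to_library_commands(fasta_dir, db_dir, pattern="*.fna"):
    """Build one 'kraken2-build --add-to-library' command per FASTA file."""
    return [
        ["kraken2-build", "--add-to-library", str(fasta), "--db", str(db_dir)]
        for fasta in sorted(Path(fasta_dir).glob(pattern))
    ]


def build_command(db_dir, threads=1):
    """Build the final 'kraken2-build --build' command."""
    return ["kraken2-build", "--build", "--db", str(db_dir), "--threads", str(threads)]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A run would then look like `run_all(add_to_library_commands("downloads", "db") + [build_command("db", threads=24)])`, where `downloads` and `db` are placeholder directory names.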

Edited to say that this was working, and I built a database with over 1000 bacterial genomes, but at some point it stopped working. I don't know any Perl, so I wrote an almost equivalent Python script that takes options for all domains (including an extra option for human only), and for either DNA or protein sequences. It also checks whether you have already downloaded a sequence, so if it gets stopped for any reason you don't need to start over, and at the end it writes a text file listing any problems it had while downloading. It's here in case anyone is interested.
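The two behaviours described above (skip files that are already present, and record failures in a text file instead of stopping) can be illustrated with a short sketch. This is not the script linked in this comment; the names `fetch_url` and `download_all` and the `.part` temporary suffix are assumptions made for the example.

```python
import os
import urllib.request


def fetch_url(url, dest):
    """Default fetcher: download url to dest using urllib."""
    urllib.request.urlretrieve(url, dest)


def download_all(urls, out_dir, log_path, fetch=fetch_url):
    """Download each URL into out_dir, skipping files that already exist.

    Failures are written to log_path rather than aborting the run.
    Returns (downloaded, skipped, failed) lists of file names.
    """
    os.makedirs(out_dir, exist_ok=True)
    downloaded, skipped, failed = [], [], []
    for url in urls:
        name = url.rstrip("/").split("/")[-1]
        dest = os.path.join(out_dir, name)
        if os.path.exists(dest) and os.path.getsize(dest) > 0:
            skipped.append(name)      # already fetched in an earlier run
            continue
        tmp = dest + ".part"
        try:
            fetch(url, tmp)
            os.replace(tmp, dest)     # only a finished download gets the final name
            downloaded.append(name)
        except Exception as err:
            if os.path.exists(tmp):
                os.remove(tmp)        # do not leave a partial file behind
            failed.append((name, str(err)))
    with open(log_path, "w") as log:
        for name, reason in failed:
            log.write(f"{name}\t{reason}\n")
    return downloaded, skipped, [name for name, _ in failed]
```

Writing to a temporary name first means an interrupted transfer is never mistaken for a completed file on the next run, so re-running the same call only retries what is missing.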

Thanks,
Robyn

Nick243 · 2020-07-21T20:52:42Z

Hi Robyn,

Thanks so much for sharing your script. This is extremely helpful!

I was able to access and pull down the files. I am looking to download the complete refseq genomes (dna sequences) for bacteria, archaea, fungi, virus, and human. I was hoping to confirm I implemented this correctly using your program.

I ran:
module load python3/3.6.3
python download_domain.py --domain bacteria --complete True --ext dna
python download_domain.py --domain archaea --complete True --ext dna
python download_domain.py --domain viral --complete True --ext dna
python download_domain.py --domain fungi --complete True --ext dna
python download_domain.py --domain vertebrate_mammalian --complete True --ext dna --human True

This looks to have worked beautifully. Before trying to build the Kraken2 database, however, I was hoping I could ask:

When pulling down only the human file, should I be specifying --human True or --human False?
The above code pulled down the vertebrate_mammalian assembly_summary.txt file, but no .txt.fna file for the GRCh38.p13 reference. Should there be a .fna file as well?
The above code pulled down only 12 fungal .fna files. Would this be roughly in line with your expectations?

Thanks again for sharing the script and in advance for any thoughts!

Nick

R-Wright-1 · 2020-07-21T21:07:43Z

Hi Nick,

No worries! Glad someone else can make use of it. And yes that looks correct.

I just had a quick look, and it looks like the issue is that the reference human genome is a 'Chromosome' rather than 'Complete' - I've added a bit to the script so that it ignores the Complete part if downloading the human genome, but just running with --complete False would have the same effect.
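To see why a 'Complete'-only filter misses a 'Chromosome'-level assembly, here is a minimal sketch of filtering an assembly_summary.txt file by assembly level. It is not the code from the script discussed here. It assumes the file is tab-separated, that its last `#` comment line holds the column names, and that the columns `assembly_accession`, `organism_name`, `assembly_level` and `ftp_path` are present; `select_assemblies` is a hypothetical name.

```python
import csv


def select_assemblies(summary_path, levels=("Complete Genome", "Chromosome")):
    """Return (accession, organism, level, ftp_path) for rows at the given levels.

    Assumption: the last '#' line before the data rows lists the column names.
    """
    selected = []
    header = None
    with open(summary_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row:
                continue
            if row[0].startswith("#"):
                # Strip the leading '#' so the first column name is usable as a key.
                header = [col.lstrip("# ").strip() for col in row]
                continue
            if header is None:
                raise ValueError("no header line found before data rows")
            record = dict(zip(header, row))
            if record.get("assembly_level") in levels:
                selected.append((
                    record.get("assembly_accession"),
                    record.get("organism_name"),
                    record.get("assembly_level"),
                    record.get("ftp_path"),
                ))
    return selected
```

With `levels=("Complete Genome",)` a row marked `Chromosome` is dropped, while the default shown here keeps both levels. Adding `"Scaffold"` or `"Contig"` to `levels` would widen the selection to draft assemblies.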

Robyn

jenniferlu717 · 2020-09-10T02:21:44Z

@Nick243 apologies for the late response. Issues with downloading are harder to debug as they vary from system to system, but essentially your server is having trouble connecting with NCBI. You can try downloading the files without the --use-ftp switch and see if that works any better.
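Since the behaviour depends on which protocols the server can reach, a quick reachability check can help decide whether to keep or drop --use-ftp. The sketch below is a diagnostic aid only and makes assumptions: that the downloads target `ftp.ncbi.nlm.nih.gov`, that the default (non-FTP) mode uses rsync on its standard port 873, and that FTP uses port 21. The function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_ncbi(host="ftp.ncbi.nlm.nih.gov",
               ports=(("ftp", 21), ("rsync", 873), ("https", 443))):
    """Report which of the assumed NCBI service ports are reachable from this machine."""
    return {name: can_connect(host, port) for name, port in ports}
```

If `check_ncbi()` reports `rsync` as reachable but `ftp` as not, running without --use-ftp is the more promising option, and vice versa. An open control port does not guarantee that FTP data transfers will succeed, because a firewall may still block the separate data connections.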

@R-Wright-1 We have updated the code to fix the na error. But yes, the default downloads only look for complete/chromosome level assemblies. You can modify the rsync_from_ncbi.pl script to include other assembly levels (Line 40 - add in Contig/Scaffold). However, draft genomes are much more likely to have contamination that may skew the results.

As this issue is a few months old, I'm going to close it for now. If you continue to have problems with the newest code update, please open a new issue.
Thanks,
Jen Lu

jenniferlu717 closed this as completed Sep 10, 2020
