[Assembly_fetch] Add support for downloading from NCBI Virus database #607

Open
kapsakcj opened this issue Sep 6, 2024 · 1 comment
Labels
enhancement This issue is a new feature or request

kapsakcj commented Sep 6, 2024

🆒

📌 Explain the Request

Some GenBank accessions cannot be downloaded with the command currently used in the Assembly_fetch workflow:

datasets download genome accession ~{ncbi_accession}

In code here: https://github.com/theiagen/public_health_bioinformatics/blob/5be343354f716d77e9e4a0fb4a2ec10eb3bc00a5/tasks/utilities/data_import/task_ncbi_datasets.wdl#L27C5-L28C24

For example, the accession OM900516.2 fails with this message:

$ datasets download genome accession OM900516.2  --filename OM900516.zip  --assembly-version latest   --include genome
Error: invalid or unsupported assembly accession: OM900516

Use datasets download genome accession <command> --help for detailed help about a command.

The reason is that these kinds of accessions are only accessible through the NCBI Virus data package, so you have to specify a different subcommand to download the genome (and other associated files).

This command works:

$ datasets download virus genome accession OM900516.2  --filename OM900516.2.zip --include genome
New version of client (16.27.2) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Downloading: OM900516.2.zip    15.1kB valid zip structure -- files not checked
Validating package [================================================] 100% 5/5
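
As a rough sketch of how the workflow could choose between the two commands above, the accession format can be used as a hint. This assumes that assembly accessions start with GCA_ or GCF_ and that anything else (like OM900516.2) should be tried against the virus data package; that second assumption will not hold for every nucleotide accession. The function names `choose_subcommand` and `build_download_cmd` are hypothetical and not taken from the dev branch.

```shell
# Sketch only: pick the datasets subcommand from the accession format.
# Assumption: GCA_/GCF_ prefixes mark assembly accessions; everything else
# is treated as a candidate for the NCBI Virus data package.
choose_subcommand() {
  case "$1" in
    GCA_*|GCF_*) echo "genome" ;;
    *)           echo "virus genome" ;;
  esac
}

# Print (do not run) the download command for a given accession.
build_download_cmd() {
  accession="$1"
  echo "datasets download $(choose_subcommand "$accession") accession ${accession} --filename ${accession}.zip --include genome"
}
```

Calling `build_download_cmd OM900516.2` prints the same virus command shown above, so the output can be inspected before anything is downloaded.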

I started a dev branch called cjk-assembly-fetch for this a long time ago, but it fell by the wayside as other, higher priorities arose.

It would be good to continue making commits to this branch and complete the support. Things that need to be done:

  • update the Docker image to the latest version of the NCBI datasets CLI tool. StaPH-B has one that's a little more up to date, though likely not the absolute latest version
  • clean up output file handling and ensure that regular downloads that do not use the datasets download virus subcommand are unaffected by the changes
  • there may be other CLI features we want to make accessible to the user. It would be good to review the NCBI datasets documentation prior to further code dev.
  • test on a variety of accessions, both virus and non-virus, to ensure functionality.
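
For the last two items, one possible shape is a fallback: try the regular genome download first and only use the virus subcommand if that fails, so existing behavior stays the default. This is a minimal sketch and not what the dev branch does. `fetch_with_fallback` and the `DATASETS_BIN` override are hypothetical names; the override exists only so the CLI can be swapped for a stub when testing.

```shell
# Sketch only. DATASETS_BIN is an assumed override, defaulting to the real CLI.
DATASETS_BIN="${DATASETS_BIN:-datasets}"

# Try the regular genome download; fall back to the virus data package.
# Prints which route worked ("genome", "virus") or "failed" on stdout.
fetch_with_fallback() {
  accession="$1"
  zip="${accession}.zip"
  # CLI output is sent to stderr so stdout carries only the route name.
  if "$DATASETS_BIN" download genome accession "$accession" \
       --filename "$zip" --include genome >&2; then
    echo "genome"
  elif "$DATASETS_BIN" download virus genome accession "$accession" \
       --filename "$zip" --include genome >&2; then
    echo "virus"
  else
    echo "failed"
    return 1
  fi
}
```

Looping this over a mixed list of virus and non-virus accessions and recording the printed route for each would give a quick check that regular downloads still take the genome path.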
@kapsakcj kapsakcj added the enhancement This issue is a new feature or request label Sep 6, 2024

kapsakcj commented Sep 6, 2024

@emily-smith1 had success with the dev branch as it stands today: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/dataAnalysis_SARS-CoV-2_CA-CDC/job_history/6cd834cc-aa53-4d40-8a27-4554edcbae7b

I'm glad we didn't delete this branch! * pats self on back * 😄
