Releases: theiagen/public_health_viral_genomics
v2.3.2
PHVG v2.3.2 Patch Release
This patch release updates the Mercury workflows and adds a new output variable, ivar_variant_proportion_intermediate.
Mercury patches
This release adds the "covv_consortium" column to the output GISAID metadata file in the Mercury workflows. This new optional column has been added to the metadata formatters, which can be found here: Mercury_PE/_SE_Prep at gs://theiagen-public-files/terra/mercury-files/Terra_Metadata_Formatter_2023_05_22.xlsx, and Mercury_Prep_N_Batch at gs://theiagen-public-files/terra/mercury-files/Mercury_Prep_N_Batch_SC2_Metadata_Formatter_2023_05_22.xlsx.
Also, empty date values will now fail more informatively in Mercury_Prep_N_Batch.
New output variable
The variant_call task has been modified to now calculate the proportion of variants at intermediate allele frequencies (60-90%). This value is reported in the output column ivar_variant_proportion_intermediate for workflows that use iVar to perform variant calling (TheiaCoV_Illumina_PE and TheiaCoV_Illumina_SE).
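As a rough illustration of the calculation described above, the following sketch computes the share of variants whose allele frequency falls between 60% and 90%. It assumes an iVar-style variants TSV with an ALT_FREQ column, inclusive bounds, and a denominator of all reported variants; the function name is hypothetical and the actual variant_call task may differ on any of these points.

```python
import csv
import io


def intermediate_variant_proportion(tsv_text, low=0.6, high=0.9):
    """Return the fraction of variant rows whose ALT_FREQ lies in [low, high].

    Assumptions (not confirmed by the release notes): bounds are inclusive
    and the denominator is every variant row in the table.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    freqs = [float(row["ALT_FREQ"]) for row in reader]
    if not freqs:
        return 0.0
    intermediate = sum(1 for freq in freqs if low <= freq <= high)
    return intermediate / len(freqs)


# Toy table with only the columns needed for this sketch.
example = (
    "REGION\tPOS\tREF\tALT\tALT_FREQ\n"
    "MN908947.3\t241\tC\tT\t0.99\n"
    "MN908947.3\t3037\tC\tT\t0.75\n"
    "MN908947.3\t14408\tC\tT\t0.62\n"
    "MN908947.3\t23403\tA\tG\t0.12\n"
)
print(intermediate_variant_proportion(example))  # 0.5
```

Two of the four toy variants fall in the intermediate range, so the sketch reports 0.5.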
What's Changed
- Update README.md by @kevinlibuit in #218
- Add "consortium" to Mercury_Prep by @sage-wright in #222
- Add intermediate frequency mutations screen TheiaCoV by @michellescribner in #220
- Mercury Fix when collection_date is missing by @sage-wright in #219
- update checksums by @sage-wright in #223
Full Changelog: v2.3.1...v2.3.2
v2.3.1
PHVG v2.3.1 release notes
This patch release adds capability for detection of mutations known to be associated with Tamiflu resistance, includes bug fixes for Influenza Type B subtyping, and updates default input parameters (pangolin docker image, nextclade_dataset_tag, nextclade docker image).
New Features
- New column tamiflu_resistance_aa_subs containing nextclade-detected substitutions that have been described in the literature to confer resistance to tamiflu (Influenza-specific)
- New optional boolean input parameters for Mercury_Prep_N_Batch: using_clearlabs_data, using_reads_dehosted, usa_territory
- New optional input parameter for Freyja_Plot workflow: mincov
Default Docker Images and Input Parameter Updates
- Default pangolin docker image: staphb/pangolin:4.2-pdata-1.18.1.1
- Default nextclade docker image: nextstrain/nextclade:2.11.0
- Default nextclade_dataset_tag for SARS-CoV-2: 2023-02-25T12:00:00Z
- Default freyja docker image: staphb/freyja:1.3.11
Other Changes
- Bug fix: Type B Influenza subtypes no longer duplicated from ABRicate output
- Updates to GitHub Actions workflows for automated testing
Documentation can be found here: https://theiagen.notion.site/Theiagen-Public-Health-Resources-a4bd134b0c5c4fe39870e21029a30566
What's Changed
- Expose minimum coverage option in Freyja_Plot by @sage-wright in #211
- Enable alternative read and assembly files by @sage-wright in #210
- Fix Bug RE Type-B subtyping (TheiaCoV_Illumina PE flu track) by @kevinlibuit in #216
- update nextclade TSV parsing for SC2: clade_legacy. Also update Flu & nextclade by @kapsakcj and @cimendes in #213
- update default pangolin docker to staphb/pangolin:4.2-pdata-1.18.1.1 and nextclade_dataset_tag for SC2 by @kapsakcj in #217
Full Changelog: v2.3.0...v2.3.1
v2.3.0
PHVG v2.3.0 Release Notes
This minor release introduces organism track updates for the TheiaCoV workflow series as well as a new workflow for preparing and submitting metadata to public repositories (Mercury_Prep_N_Batch).
Updates to the TheiaCoV Workflow Series
Organism track updates:
- “MPXV” for monkeypox analysis: VADR annotation assessment enabled (was previously not supported)
- "WNV" for West Nile Virus analysis: VADR annotation assessment enabled (was previously not supported)
- "flu" for influenza analysis: will initiate genome assembly with IRMA and characterization with ABRicate against InsaFlu database and NextClade; available in TheiaCoV_Illumina_PE only
- "HIV" for Human Immunodeficiency Virus analysis: will initiate consensus assembly by alignment (BWA + iVar or minimap2 + Medaka for Illumina and ONT read data, respectively) and characterization with Quasitools HyDRA for antiretroviral drug resistance detection
Note: The default value for the organism variable is “sars-cov-2”
QC and read processing modules updates:
- Option to utilize fastp rather than trimmomatic for read processing
- Reads processed by BBDuk are now output in order, which helps to ensure that downstream alignments are consistent
Mercury Prep-N-Batch Workflow
The Mercury_Prep_N_Batch workflow combines the previously separate Mercury_PE/SE_Prep and Mercury_Batch workflows into one.
This workflow functions as follows:
Step 1: Performs supermassive metadata wrangling (task sm_metadata_wrangling in task_mercury_file_wrangling)
- downloads the entire origin Terra table where the data, analysis results, metadata, etc. are stored.
- extracts the samples that the user intends to upload
- creates some standard variables that are used multiple times (such as year, isolate, etc.)
- determines which organism is being run (currently only supports sars-cov-2 and mpox) and sets the required and optional variables for each file that is being created (e.g., BioSample vs SRA vs GISAID vs GenBank/BankIt)
- removes any entries that do not meet predetermined quality thresholds (vadr_num_alerts and number_N)
- removes any entries that do not have all required fields present, and writes the samples that were removed to a table that also lists what fields were missing
- renames columns as appropriate
- reformats columns as appropriate
- compiles all required and optional information in TSV files
- renames files with the submission_id and edits fasta headers as appropriate
- uploads read files to the Theiagen SRA GCP Google bucket
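The filtering part of Step 1 can be sketched roughly as follows. This is a minimal illustration, not the sm_metadata_wrangling task itself: the function names, the "sample" key, the example field names, and the threshold values are all hypothetical, and the comparison direction (values above a maximum are removed) is an assumption.

```python
def is_blank(value):
    """Treat None and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def filter_samples(rows, required_fields, max_vadr_alerts, max_number_n):
    """Drop rows failing the quality thresholds, then split on required fields.

    Returns (kept_rows, excluded_rows), where each excluded row records the
    sample name and the required fields it was missing.
    """
    kept, excluded = [], []
    for row in rows:
        # Quality screen (assumed direction: higher values are worse).
        if int(row["vadr_num_alerts"]) > max_vadr_alerts:
            continue
        if int(row["number_N"]) > max_number_n:
            continue
        missing = [field for field in required_fields if is_blank(row.get(field))]
        if missing:
            excluded.append({"sample": row["sample"], "missing_fields": ",".join(missing)})
        else:
            kept.append(row)
    return kept, excluded


rows = [
    {"sample": "s1", "collection_date": "2023-01-05", "country": "USA",
     "vadr_num_alerts": "0", "number_N": "12"},
    {"sample": "s2", "collection_date": "", "country": "USA",
     "vadr_num_alerts": "0", "number_N": "40"},
    {"sample": "s3", "collection_date": "2023-02-11", "country": "USA",
     "vadr_num_alerts": "3", "number_N": "9000"},
]
kept, excluded = filter_samples(
    rows, ["collection_date", "country"], max_vadr_alerts=0, max_number_n=5000
)
print([row["sample"] for row in kept])  # ['s1']
print(excluded)
```

Here s3 is dropped by the quality screen, while s2 is reported in the exclusion table because its collection_date is empty.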
Step 2: If sars-cov-2, trim GenBank fasta files of terminal Ns (task trim_genbank_fastas in task_mercury_file_wrangling.wdl)
- uses VADR to trim terminal ambiguous nucleotides
- returns the edited fasta file
Step 3: If mpox, put metadata into sqn format (task table2asn in task_mercury_file_wrangling.wdl)
- soft links the .sbt, .fsa, and .src files so that they share a common name
- converts the data into a sqn file with table2asn so it can be emailed to NCBI
New Documentation
Detailed documentation has been created for all workflows in the PHVG v2.3.0 repository.
What's Changed
- citation.cff update by @kapsakcj in #172
- New VADR output: .zip of output fasta files by @kapsakcj in #171
- VADR updates for MPXV; update default nextclade_dataset_tag and docker by @kapsakcj in #175
- Add optional arguments input to trimmomatic task and add fastp task by @michellescribner in #182
- Rp3 add support for adapter files in bbduk, update ci test by @kapsakcj in #186
- adds support for running VADR on WNV samples by @kapsakcj in #190
- Azure compatibility by @sage-wright in #193
- Adding flu organism track by @kevinlibuit in #194
- Fja hiv merge dev by @frankambrosio3 in #198
- the Mercury_Prep_N_Batch workflow by @sage-wright in #196
- Fix conditional logic in Mercury Prep N Batch by @sage-wright in #199
- Fix lowercase things by @sage-wright in #201
- Ensure bbduk outputs are ordered by @kevinlibuit in #202
- Update version and SC2 references by @kevinlibuit in #203
- quality exclusion write out by @sage-wright in #204
- Smw excluded mercury dev by @sage-wright in #205
New Contributors
- @michellescribner made their first contribution in #182
Full Changelog: v2.2.0...v2.3.0
v2.2.0
This release introduces TheiaCoV amenability to non-SARS-CoV-2 (e.g., MPXV) genomic characterization.
NOTE: Use of TheiaCoV for MPXV will require modified input variables; e.g., primer_bed and reference_genome. Please view our public Notion page for information on recommended input variables for MPXV genomic characterization.
Use of TheiaCoV for SARS-CoV-2 will not require any change to input variables; i.e., SARS-CoV-2 characterization is the default behavior of the TheiaCoV workflows. Please view our public Notion page to find the latest recommended workspace data elements for SARS-CoV-2 genomic characterization.
TheiaCoV amenability to non-SARS-CoV-2 genomic characterization
- An organism variable has been implemented to indicate what organism you want to analyze. This is intended to allow for expansion of the workflow to other viruses not currently supported in the future.
  - The default value is “sars-cov-2”
  - Change to “MPXV” for monkeypox analysis
- A new Boolean variable trim_primers indicates whether or not you want to trim primers. This is most applicable when analyzing data generated without primers; e.g., a metagenomic approach. Because of this change, the primer_bed variable is now optional and no longer will appear in the same location on the workflow input page. You must indicate a primer_bed file in order to trim primers. When you switch to this new version, the primer file will be inherited to the correct place so no change is required for SARS-CoV-2 users.
  - The default value is true; primer trimming will occur unless indicated otherwise.
- SC2-specific calculations have been moved to a new task so these calculations are performed only on SC2 samples, and output variables such as s_gene_percent_coverage are now prefaced by sc2_, for example sc2_s_gene_percent_coverage, in order to indicate this variable is specific for SC2.
- VADR is only performed on SC2 samples.
  - VADR is able to be run on MPXV samples but this release does not support this. Future releases will enable this feature.
- Kraken2 has a new input variable target_org that enables the user to specify a target organism to pull from the Kraken2 report; e.g., if this value is set to "Monkeypox virus", the kraken_target_org percentage will populate with the percentage of MPXV identified in the sample.
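To illustrate how a target organism percentage might be pulled from a Kraken2 report, here is a minimal sketch. It assumes the standard six-column, tab-delimited Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, indented name); the function name and the choice to return None when the organism is absent are assumptions, not the workflow's actual behavior.

```python
def kraken_target_org_percent(report_text, target_org):
    """Return the clade percentage for target_org, or None if it is not listed."""
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        # Names are indented with spaces to show taxonomic depth.
        if fields[5].strip() == target_org:
            return float(fields[0])
    return None


# Toy report with made-up read counts.
report = (
    " 12.50\t125\t125\tU\t0\tunclassified\n"
    " 87.50\t875\t75\tR\t1\troot\n"
    " 80.00\t800\t800\tS\t10244\t          Monkeypox virus\n"
)
print(kraken_target_org_percent(report, "Monkeypox virus"))  # 80.0
```

The name match is exact after stripping indentation, so the value passed as target_org would need to match the report's spelling.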
New features
- Updated documentation is now available on our readthedocs page
- Pangolin:
  - A new pango_lineage_expanded output variable has been created that is enabled by default through the expanded_lineage Boolean input variable. This output lists the pangolin lineage without any aliases (e.g., BA.5 → B.1.1.529.5)
  - --skip-scorpio and --skip-designation-cache are now Boolean inputs that are defaulted to false.
- Freyja:
  - Two new workflows have been added: Freyja_Update, a workflow to create updated Freyja reference materials, and Freyja_Dash, a workflow to create an interactive HTML visualization of aggregated Freyja demixed output
  - The docker image has been updated to v1.3.10 for all Freyja tasks.
  - New boolean inputs have been created to enable bootstrapping (bootstrap; default=false) and use of confirmed lineages only (confirmed_only; default=false)
  - A new integer input indicating the number of bootstraps is only used when bootstrap is true (number_bootstraps)
  - NOTE: Use of a dashboard configuration file is recommended for the Freyja_Dash workflow to create lineage groups and avoid “too many lineages” error messages. An example configuration file can be found here.
- Nextclade:
  - The Nextclade task has been modified to be compatible with versions ≥v2.0.0.
  - The default dataset tag has been updated to 2022-07-26T12:00:00Z
  - The default docker image has been updated to nextstrain/nextclade:2.4.0
  - NOTE: In order to incorporate Nextclade v2.0.0, modifications were made that render our SARS-CoV-2 genomics characterization workflows (e.g., TheiaCoV_Illumina_PE) incompatible with older versions of Nextclade.
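The alias expansion reported in pango_lineage_expanded (e.g., BA.5 → B.1.1.529.5) can be pictured with the short sketch below. It is only an illustration: the one-entry alias table and the function name are hypothetical, and Pangolin performs the real expansion using its own alias data.

```python
# Hypothetical, one-entry alias table for illustration only.
ALIASES = {"BA": "B.1.1.529"}


def expand_lineage(lineage, alias_map):
    """Replace an aliased prefix with its full parent lineage, if known."""
    prefix, _, suffix = lineage.partition(".")
    parent = alias_map.get(prefix)
    if parent is None:
        return lineage  # no alias to expand
    return f"{parent}.{suffix}" if suffix else parent


print(expand_lineage("BA.5", ALIASES))     # B.1.1.529.5
print(expand_lineage("B.1.1.7", ALIASES))  # B.1.1.7
```

Lineages whose prefix is not in the table are returned unchanged.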
What's Changed
- Update TheiaCoV workflows to utilize nextclade v2 by @kevinlibuit in #156
- PHVG Read The Docs update by @emmadoughty and @michellescribner in #154
- Add expanded-lineage output to pangolin4 task and associated workflows by @kevinlibuit in #157
- Reorganize PHVG for MPXV by @sage-wright in #159
- Freyja Updates by @kevinlibuit in #160
- Update versions by @sage-wright in #161
- Adds expanded to update by @sage-wright in #162
- add miniwdl check workflow by @rpetit3 in #158
- Capture Freyja Versions by @kevinlibuit in #164
- Capture reads from alignment by @kevinlibuit in #165
- fix genome length calc by @sage-wright in #166
New Contributors
- @emmadoughty and @michellescribner made their first contributions in #154
Full Changelog: v2.1.2...v2.2.0
v2.1.2
This patch release addresses an issue identified with the TheiaCoV_Augur_Prep workflow
- Overambitious attempt at syntax standardization introduced a bug where wdl variables were not being written to TheiaCoV_Augur_Prep output metadata files; the syntax is now standardized and the bug is now squashed. 🐛👢
Other modifications
- Updated default pangolin_docker_image (staphb/pangolin:4.0.6-pdata-1.8)
- Updated default nextclade_dataset_tag (2022-04-28T12:00:00Z)
What's Changed
- Fix PHVG v2.1.1 bug and update default images and tags by @sage-wright in #141
Full Changelog: v2.1.1...v2.1.2
v2.1.1
This patch release addresses issues identified with the TheiaCoV_Augur_Run workflows
- CSV elements in metadata_merged now properly converted into CSV format
- Multiple TheiaCoV_Augur_Run tasks modified to allow for graceful memory telemetry failure, described by @dpark01 here
Other Modifications:
- Addition of the pangolin_arguments variable allows for additional user-defined arguments; e.g., --skip-scorpio
What's Changed
- Smw fix mem dev by @sage-wright in #137
- Enables --skip-scorpio functionality by @sage-wright in #139
- Updated version by @sage-wright in #140
Full Changelog: v2.1.0...v2.1.1
v2.1.0
This minor release modifies the pangolin task to ensure compatibility with Pangolin ≥v4.0.4
NOTE: In order to incorporate Pangolin ≥v4.0.4, modifications were made that render our SARS-CoV-2 genomics characterization workflows (e.g. TheiaCoV_Illumina_PE) incompatible with older versions of Pangolin.
- Default docker image for pangolin4 task set to: quay.io/staphb/pangolin:4.0.4-pdata-1.2.133
Other Modifications:
New Features
- An s_gene_percent_coverage calculation was added to all TheiaCoV workflows for SARS-CoV-2 genomic characterization that incorporate an alignment step (TheiaCoV_ClearLabs, TheiaCoV_Illumina_PE, TheiaCoV_Illumina_SE, and TheiaCoV_ONT).
  - An additional TSV file is made that includes the percent coverage of all genes in SC2 genomes, assuming Wuhan-1 reference genome positions. It can be found under this column: percent_gene_coverage
- A min_depth input variable was created for TheiaCoV_Illumina_PE and TheiaCoV_Illumina_SE workflows to specify the minimum depth of coverage required to call a base in the final assembly output and a variant in the VCF output.
  - The default value for min_depth is 100.
  - This parameter replaces the separate min_depth parameters of two tasks, consensus and variant_call. These variables have been consolidated.
- The NextClade dataset tag used is now an output value generated in our SARS-CoV-2 genomics characterization workflows (e.g. TheiaCoV_Illumina_PE) under column: nextclade_ds_tag.
- The TheiaCoV_Augur_Run merged_metadata output file is now in CSV format to be compatible with both Auspice and MicrobeTrace.
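A per-gene percent coverage value of the kind described above could be computed along these lines. This is a sketch under stated assumptions: coverage is taken to mean the share of gene positions at or above a depth threshold, the threshold default simply mirrors min_depth, and the S gene coordinates are assumed Wuhan-1 positions; the workflow's actual definition is not given in these notes.

```python
# Assumed S gene coordinates (1-based, inclusive) on the Wuhan-1 reference.
S_GENE = (21563, 25384)


def percent_gene_coverage(depths, start, end, min_depth=100):
    """Percent of positions in [start, end] with depth >= min_depth.

    depths[i] holds the read depth at 1-based reference position i + 1.
    """
    region = depths[start - 1:end]
    if not region:
        return 0.0
    covered = sum(1 for depth in region if depth >= min_depth)
    return round(100.0 * covered / len(region), 2)


# Toy example: a 10-base "genome" with a gene spanning positions 3-6.
toy_depths = [0, 0, 150, 150, 20, 120, 0, 0, 0, 0]
print(percent_gene_coverage(toy_depths, 3, 6))  # 75.0
```

For a real sample, the same function would be called with a full-length depth list and the gene's coordinates, for example percent_gene_coverage(depths, *S_GENE).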
Default Docker Image Updates
- Default Nextclade docker image updated to: nextstrain/nextclade:1.11.0
- Default nextclade_dataset_tag updated to: 2022-03-31T12:00:00Z
- Default Freyja docker image updated to: quay.io/staphb/freyja:1.3.2
Bug Fixes
- Several Mercury output files were labeled as CSV files when they were actually TSV files. This is fixed. #112
Pull Requests and Resolved Issues
- added sed line (#115) by @sage-wright in #118
- update docs by @kevinlibuit in #120
- Percent gene coverage calculations by @sage-wright in #126
- converted derived_cols.tsv to a csv file by @sage-wright in #125
- Various patches by @kevinlibuit in #127
- pangolin v4 & updating nextclade defaults across 6 workflows by @kapsakcj in #128
- Update to Freyja v1.3.4 by @kevinlibuit in #130
- Cjk pangolin v4 dev by @kapsakcj in #131
- update default pangolin docker image to 4.0.4 by @kapsakcj in #132
- Update task_versioning.wdl by @kevinlibuit in #134
Full Changelog: v2.0.0...v2.1.0
v2.0.0
This major release renames workflows to utilize the TheiaCoV tag (previously Titan) and adds five new workflows for public health viral genomics.
Workflow names changed and modifications made:
- Titan_Augur_Prep → TheiaCoV_Augur_Prep
- Titan_Augur_Run → TheiaCoV_Augur_Run
  - Allow subsampling via user-defined builds.yml file
  - Update default nextstrain docker images (nextstrain/base:build-20210127T135203Z → nextstrain/base:build-20210218T081251)
- Titan_ClearLabs
  - Update default consensus task docker container image (quay.io/staphb/artic-ncov2019:1.3.0 → quay.io/staphb/artic-ncov2019:1.3.0-medaka-1.4.3)
    - Note: quay.io/staphb/artic-ncov2019:1.3.0 & quay.io/staphb/artic-ncov2019-epi2me are both compatible alternative docker images
  - Use of fastq-scan rather than fastqc to calculate number of reads and pairs
  - Allow for use of a user-defined reference genome for consensus genome assembly
    - reference_genome consensus task input variable
- Titan_Illumina_PE → TheiaCoV_Illumina_PE
  - Default minimum coverage changed from 20x to 100x (ivar consensus and ivar variants tasks)
  - Use of fastq-scan rather than fastqc to calculate number of reads and pairs
  - Allow for use of a user-defined reference genome for consensus genome assembly
    - reference_genome workflow input variable
- Titan_Illumina_SE → TheiaCoV_Illumina_SE
  - Default minimum coverage changed from 20x to 100x (ivar consensus and ivar variants tasks)
  - Use of fastq-scan rather than fastqc to calculate number of reads and pairs
  - Allow for use of a user-defined reference genome for consensus genome assembly
    - reference_genome workflow input variable
- Titan_ONT → TheiaCoV_ONT
  - Update default consensus task docker container image (quay.io/staphb/artic-ncov2019:1.3.0-medaka-1.4.3 → quay.io/staphb/artic-ncov2019-epi2me)
    - Note: quay.io/staphb/artic-ncov2019:1.3.0 & quay.io/staphb/artic-ncov2019:1.3.0-medaka-1.4.3 are both compatible alternative docker images
  - Use of fastq-scan rather than fastqc to calculate number of reads and pairs
  - Allow for use of a user-defined reference genome for consensus genome assembly
    - reference_genome consensus task input variable
- Titan_FASTA → TheiaCoV_FASTA
- Titan-GC → TheiaCoV-GC
Workflows Added:
- TheiaCoV_Validate
  - Workflow that allows for the rapid comparison of critical output values generated by differing versions of TheiaCoV workflows for SARS-CoV-2 genomic characterization for bioinformatics validation purposes
- TheiaCoV_DistanceTree
  - Workflow that allows for Augur distance trees to be generated without refinement
- Workflows for SARS-CoV-2 Wastewater Data Analysis
  - Freyja_FASTQ
    - Workflow that allows running of the Freyja software with raw paired-end fastq files
      - This workflow will generate the required alignment that is used as input to the freyja variants command that is then analyzed with freyja demix
  - Freyja_Plot
    - Workflow to visualize Freyja outputs using the freyja plot command
  - TheiaCoV_WWVC
    - Workflow for waste water variant calling that incorporates a modified version of the CDPHE's WasteWaterVariantCalling WDL Workflow
Other modifications:
- Default docker images updated for Pangolin (staphb/pangolin:3.1.11-pangolearn-2021-08-24 → quay.io/staphb/3.1.20-pangolearn-2022-02-02), VADR (staphb/vadr:1.3 → quay.io/staphb/1.4.1-models-1.3-2) and Nextclade (nextstrain/nextclade:1.3.0 → nextstrain/nextclade:1.10.3) and Nextclade dataset tag (2021-06-25T00:00:00Z → 2022-02-07T12:00:00Z) in all TheiaCoV workflows for SARS-CoV-2 genomic characterization (TheiaCoV_ClearLabs, TheiaCoV_FASTA, TheiaCoV_Illumina_PE, TheiaCoV_Illumina_SE, and TheiaCoV_ONT)
  - NOTE: In order to incorporate Nextclade ≥v1.10.0, modifications to the nextclade_one_sample task were made that render it incompatible with older versions of Nextclade.
- Inclusion of S-gene coverage calculation in all TheiaCoV workflows for SARS-CoV-2 genomic characterization that incorporate an alignment step (TheiaCoV_ClearLabs, TheiaCoV_Illumina_PE, TheiaCoV_Illumina_SE, and TheiaCoV_ONT)
- Mercury_Batch now requires Array[String] (i.e. gcp_uri) for the sra_reads input (was Array[File]); this change avoids the need for localization into the VM before transferring to the transfer bucket for SRA read submission, drastically decreasing runtime
  - This modification means that a zipped file of reads for web portal submission is no longer produced if a gcp_bucket is not specified; instead, users are encouraged to utilize the zip_column_content workflow from the Theiagen Terra_Utilities repository to generate these files.
- Implementation of a repository style guide
v1.5.3
Patch to address vulnerability in Mercury Prep workflows to the inadvertent removal of internal Ns when preparing assemblies for GenBank submission
This patch replaces the sed one liner that removed leading N's from assembly files in preparation for GenBank submission with the NCBI fasta-trim-terminal-ambigs.pl script as the sed solution was found to be vulnerable to inadvertent removal of non-terminal Ns in multi-line assembly files.
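The failure mode described above can be reproduced with a small sketch. The per-line function below is a hypothetical stand-in for a line-oriented pattern (the actual sed command is not shown in these notes), and the second function is a simplified illustration that handles a single record and only the character N; it is not the NCBI fasta-trim-terminal-ambigs.pl script.

```python
import re


def strip_leading_n_per_line(fasta_text):
    """Naive approach: strip leading Ns from every sequence line."""
    out = []
    for line in fasta_text.splitlines():
        out.append(line if line.startswith(">") else re.sub(r"^N+", "", line))
    return "\n".join(out)


def trim_terminal_n(fasta_text, width=60):
    """Join the sequence first, then strip Ns only from its two ends.

    Assumes a single-record FASTA.
    """
    lines = fasta_text.splitlines()
    header, seq = lines[0], "".join(lines[1:])
    seq = seq.strip("Nn")
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + wrapped)


# The second sequence line starts with Ns that are internal to the genome.
fasta = ">sample1\nNNACGT\nNNGGCC\nTTNN"
print(strip_leading_n_per_line(fasta))  # internal Ns on line 2 are lost
print(trim_terminal_n(fasta, width=6))  # internal Ns are preserved
```

Because the naive version works line by line, any wrapped line that happens to begin with N loses bases from the middle of the assembly, which changes the sequence length and coordinates.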
Other modifications made
- NextClade default image updated to v1.3.0; nextclade_one_sample task modified to accommodate changes in sourcing reference files
- GISAID metadata passage_history field auto-populated as original in the Mercury Prep workflows; other required fields (patient_age, patient_gender, and patient_status) populated as unknown if no input value is provided
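The default-filling behavior for GISAID metadata described above might look like the following sketch. The function name and the dictionary-based record are hypothetical; only the field names and the default values (original, unknown) come from these notes.

```python
def fill_gisaid_defaults(record):
    """Return a copy of record with the defaults described above applied."""
    filled = dict(record)
    # passage_history is always set to "original".
    filled["passage_history"] = "original"
    for field in ("patient_age", "patient_gender", "patient_status"):
        value = filled.get(field)
        # Treat absent or whitespace-only values as "no input provided".
        if value is None or str(value).strip() == "":
            filled[field] = "unknown"
    return filled


print(fill_gisaid_defaults({"patient_age": "34", "patient_gender": ""}))
```

Values that are supplied, such as patient_age here, are left untouched.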
v1.5.2
Minor release to update the Mercury Workflows
The Mercury workflows (Mercury_PE_Prep, Mercury_SE_Prep, and Mercury_Batch) have been updated to enable the inclusion of all required and suggested metadata as per the PHA4GE SARS-CoV-2 Contextual Data Specifications.
In addition to the files for submission to GISAID and GenBank, the Mercury workflows now prepare files for both BioSample registration and SRA submission. A protocol to utilize these new workflows for SC2 data submission has been made publicly available on Protocols.io.
Other modifications made
- Pangolin task modified to capture all software and reference versions; outputs have changed accordingly:
  - pangolin_version: deprecated
  - pangolin_usher_version: deprecated
  - pangolin_versions: all pangolin software and reference data versions
  - pangolin_assignment_version: version captured from the final pangolin report, i.e. version of inference approach utilized to make the final pango lineage assignment
- Titan workflows for genomic characterization modified to remove the pangolin_docker_image input parameter
  - The pangolin_docker_image is now an optional input parameter for the pangolin3 task titled docker
  - The default value for the pangolin3.docker input parameter has been set to staphb/pangolin:3.1.11-pangolearn-2021-08-24
- nextclade_one_sample task modified to allow processing of 0bp assembly files (PR by @HNH0303 #64)
- titan_augur_run workflow modified to address bug regarding processing of unmasked inputs (PR by @dpark01 #62)