Add rule to annotate GIHSN samples #198

Merged
merged 1 commit into from
Dec 17, 2024

@joverlee521 joverlee521 commented Nov 20, 2024

Description of proposed changes

Adds a new `gihsn_sample` column to the metadata to indicate whether
"GIHSN" was found in the `strain` as a proxy for whether the sample came
from the Global Influenza Hospital Surveillance Network (GIHSN).
Follows the existing pattern of using `True` and `False` boolean values.
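
For readers without the diff at hand, the annotation can be sketched roughly as follows. This is a minimal illustration, not the rule merged in this PR: the function name `annotate_gihsn`, the case-sensitive substring match, and the example rows are all assumptions.

```python
import csv
import io


def annotate_gihsn(rows, strain_field="strain", new_field="gihsn_sample"):
    """Yield a copy of each metadata row with a True/False GIHSN flag added."""
    for row in rows:
        annotated = dict(row)
        strain = row.get(strain_field) or ""
        # Assumption: a case-sensitive substring match on the strain name.
        annotated[new_field] = str("GIHSN" in strain)
        yield annotated


# Made-up, tab-delimited metadata used only for illustration.
example_tsv = (
    "strain\tdate\n"
    "A/Example/GIHSN-001/2024\t2024-01-05\n"
    "A/Example/002/2024\t2024-02-10\n"
)
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
annotated_rows = list(annotate_gihsn(reader))
```

The flag is written as the strings `True` and `False` so that it matches the boolean convention mentioned above once the table is serialized back to TSV.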

Related issue(s)

Resolves #196

Checklist

  • Checks pass


@huddlej huddlej left a comment

Thank you for taking this one, @joverlee521! Seeing the implementation clarifies my original concerns about adding such a specific column to the metadata when the column represents <1% of all HA sequences and the same information exists in the strain name already.

Maybe we could figure out what the specific uses for this column would be before committing to this change? I can imagine two main use cases:

  1. Users want to filter for sequences from the GIHSN in their subsampling logic for their builds, so they either can include those strains or exclude them. For this use case, users could use an argument like --query "'GIHSN' in strain" to select for those strains.
  2. Users want to filter/color sequences in a tree by the GIHSN status. For this case, users would need an additional column in the metadata like the one from this PR to enable filtering/coloring in Auspice. If users are running their own builds, they could use a custom rule based on the same code in this PR to add the column to their subsampled metadata just before augur export runs (we could help them with this). If users want to apply these filters/colorings to our public builds, then we could use the same approach of adding the custom metadata column in the workflow just before export.
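
The first use case, selecting or excluding GIHSN strains by name alone, amounts to a substring test on the strain. A minimal sketch in plain Python, assuming a case-sensitive match; the helper `select_by_gihsn` and the strain names are hypothetical and not part of the workflow:

```python
def select_by_gihsn(strains, include=True):
    """Return strains whose name contains 'GIHSN' (or does not, if include=False)."""
    return [name for name in strains if ("GIHSN" in name) == include]


# Made-up strain names used only for illustration.
strains = [
    "A/Example/GIHSN-001/2024",
    "A/Example/002/2024",
    "B/Example/GIHSN-017/2023",
]

gihsn_only = select_by_gihsn(strains)
without_gihsn = select_by_gihsn(strains, include=False)
```

This only covers subsampling by name. Filtering or coloring in Auspice still needs the flag to exist as a metadata column, which is the second use case.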

If the use cases are more about selecting/filtering sequences by sampling strategy, GISAID has a "sampling strategy" column in their metadata. It isn't populated for these GIHSN samples, but we could implement a derived version of that column for which we populate empty strings with "GIHSN", for example.
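
A derived column along these lines could look something like the sketch below. The field name `sampling_strategy` and the helper `derive_sampling_strategy` are hypothetical, and the rule shown (fill only empty values, and only for strains whose name contains "GIHSN") is one possible reading of the idea rather than an agreed design.

```python
def derive_sampling_strategy(row, strain_field="strain",
                             strategy_field="sampling_strategy"):
    """Return the existing sampling strategy, or 'GIHSN' when it is empty
    and the strain name contains 'GIHSN'."""
    existing = (row.get(strategy_field) or "").strip()
    strain = row.get(strain_field) or ""
    if not existing and "GIHSN" in strain:
        return "GIHSN"
    return existing


# Made-up rows used only for illustration.
rows = [
    {"strain": "A/Example/GIHSN-001/2024", "sampling_strategy": ""},
    {"strain": "A/Example/002/2024", "sampling_strategy": ""},
    {"strain": "A/Example/003/2024", "sampling_strategy": "Sentinel"},
]
derived = [derive_sampling_strategy(row) for row in rows]
```

Existing non-empty values are left untouched, so the derived column never overwrites what the source metadata already provides.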

@rneher Are there other reasons we might need to have GIHSN as a column in our metadata for all samples?

@joverlee521 joverlee521 marked this pull request as draft November 21, 2024 17:22

rneher commented Dec 17, 2024

I think having this as a column is probably easiest. This should compress very well, and we are only talking about a hundred thousand rows rather than millions.

rneher commented Dec 17, 2024

I think the chosen column name is fine. This should help us include them preferentially in the build and add a coloring.

@joverlee521 joverlee521 marked this pull request as ready for review December 17, 2024 18:11
@joverlee521
Contributor Author

Thanks both! I'll merge this so that we'll have the new column after Thursday's update.

@joverlee521 joverlee521 merged commit 20f74cf into master Dec 17, 2024
3 checks passed
@joverlee521 joverlee521 deleted the gishn-strains branch December 17, 2024 18:13
Development

Successfully merging this pull request may close these issues.

Annotate GIHSN strains