Add rule to annotate GIHSN samples #198
Adds a new `gihsn_sample` column to the metadata to indicate whether "GIHSN" was found in the `strain`, as a proxy for whether the sample came from the Global Influenza Hospital Surveillance Network (GIHSN). Follows the existing pattern of using `True` and `False` boolean values. Resolves #196
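For readers who want a concrete picture of the annotation, here is a minimal sketch of the idea. It is not the code from this PR: the function name `add_gihsn_column`, the tab-separated input, and the case-sensitive substring match are all assumptions made for illustration, and the strain names in the usage note are made up.

```python
import csv
import io


def add_gihsn_column(tsv_text: str) -> str:
    """Return the TSV text with an extra `gihsn_sample` column.

    Hypothetical helper: assumes a tab-separated table that has a
    `strain` column and a case-sensitive match on "GIHSN".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["gihsn_sample"]

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Store the flag as the strings "True"/"False", following the
        # boolean convention the description mentions.
        row["gihsn_sample"] = str("GIHSN" in row["strain"])
        writer.writerow(row)
    return out.getvalue()
```

Called on a two-row table with the made-up strains `A/Lyon/GIHSN-001/2023` and `A/Texas/50/2012`, the first row would get `True` and the second `False` in the new column.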
Thank you for taking this one, @joverlee521! Seeing the implementation clarifies my original concerns about adding such a specific column to the metadata when the column represents <1% of all HA sequences and the same information exists in the strain name already.
Maybe we could figure out what the specific uses for this column would be before committing to this change? I can imagine two main use cases:
- Users want to filter for sequences from the GIHSN in their subsampling logic for their builds, so they can either include those strains or exclude them. For this use case, users could use an argument like `--query "'GIHSN' in strain"` to select for those strains.
- Users want to filter/color sequences in a tree by the GIHSN status. For this case, users would need an additional column in the metadata like the one from this PR to enable filtering/coloring in Auspice. If users are running their own builds, they could use a custom rule based on the same code in this PR to add the column to their subsampled metadata just before `augur export` runs (we could help them with this). If users want to apply these filters/colorings to our public builds, then we could use the same approach of adding the custom metadata column in the workflow just before export.
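The first use case (including or excluding GIHSN strains during subsampling) amounts to a substring test on the strain name. A rough sketch in plain Python, outside of any Augur command, might look like the following. The function name `filter_by_gihsn` and the list-of-dicts record format are hypothetical.

```python
from typing import Dict, List


def filter_by_gihsn(
    records: List[Dict[str, str]], keep_gihsn: bool = True
) -> List[Dict[str, str]]:
    """Keep only GIHSN records (keep_gihsn=True) or only the others.

    Hypothetical helper: each record is assumed to be a dict with a
    `strain` key, and "GIHSN" is matched case-sensitively.
    """
    return [r for r in records if ("GIHSN" in r["strain"]) == keep_gihsn]
```

Passing `keep_gihsn=False` gives the complementary set, which corresponds to the "exclude them" half of the use case.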
If the use cases are more about selecting/filtering sequences by sampling strategy, GISAID has a "sampling strategy" column in their metadata. It isn't populated for these GIHSN samples, but we could implement a derived version of that column for which we populate empty strings with "GIHSN", for example.
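The derived-column alternative could be sketched as below. This is only an illustration of the suggestion: the column name `sampling_strategy` and the function name are assumptions, and existing non-empty values are left untouched.

```python
from typing import Dict


def derive_sampling_strategy(
    record: Dict[str, str], column: str = "sampling_strategy"
) -> Dict[str, str]:
    """Return a copy of the record with an empty sampling strategy
    filled in as "GIHSN" when the strain name contains "GIHSN".

    Hypothetical helper: the column name is an assumption, not the
    actual GISAID field name.
    """
    updated = dict(record)
    updated.setdefault(column, "")
    # Only fill empty values so curated entries are never overwritten.
    if not updated[column].strip() and "GIHSN" in updated["strain"]:
        updated[column] = "GIHSN"
    return updated
```

Compared with a dedicated boolean column, this keeps a single field that could later hold other sampling strategies as well.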
@rneher Are there other reasons we might need to have GIHSN as a column in our metadata for all samples?
I think having this as a column is probably easiest. This should compress very well, and we are only talking about hundreds of thousands of rows rather than millions.
I think the chosen column name is fine. This should help us include them preferentially in the build and add a coloring.
Thanks both! I'll merge this so that we'll have the new column after Thursday's update.