Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow subsetting TOL classifier #59

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Allow subsetting TOL classifier #59

wants to merge 4 commits into from

Conversation

johnbradley
Copy link
Collaborator

Adds functions to TreeOfLifeClassifier to allow subsetting the embeddings. The get_label_data() method returns a dataframe of the labels for the TOL txt embeddings. The create_taxa_filter() method creates a filter (boolean array) for the txt embeddings based on a taxa and values. The apply_filter() method filters the classifier for a filter. The filter is a boolean array with the same length as the txt embeddings.

Part of #56

Adds functions to TreeOfLifeClassifier to allow subsetting
the embeddings. The get_label_data() method returns a dataframe
of the labels for the TOL txt embeddings. The create_taxa_filter()
method creates a filter (boolean array) for the txt embeddings
based on a taxa and values. The apply_filter() method filters the
classifier for a filter. The filter is a boolean array with the
same length as the txt embeddings.

Part of #56
Copy link
Member

@egrace479 egrace479 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions for clarity. The notebook was very helpful for visualizing the update.

README.md Outdated Show resolved Hide resolved
src/bioclip/predict.py Outdated Show resolved Hide resolved
tests/test_predict.py Outdated Show resolved Hide resolved
johnbradley and others added 3 commits November 12, 2024 14:31
Copy link
Member

@egrace479 egrace479 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice update!

@johnbradley johnbradley requested review from thompsonmj and removed request for hlapp November 13, 2024 20:54
@thompsonmj
Copy link
Contributor

thompsonmj commented Nov 13, 2024

Should there maybe be a disclaimer somewhere that incorrect training spellings for taxa may adversely affect the filter bevahior?
For example, @egrace479 noted that both "Ursus arctos" and "Ursus arctus" appear in the training split unfortunately.

This would impact the example in the notebook:

label_data = classifier.get_label_data()
taxa_filter = ~label_data.species.isin(["Ursus arctos", "Ursus arctos syriacus"])

@hlapp
Copy link
Member

hlapp commented Nov 14, 2024

I don't think we need a special disclaimer for that. Otherwise we'll also need disclaimers for all the synonyms etc that aren't properly consolidated to canonical name, etc. Those who want to understand the prediction in detail will invariably have to familiarize themselves with the paper and the data and processing code repositories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants