Allow subsetting TOL classifier #59

johnbradley · 2024-10-29T20:34:12Z

Adds methods to TreeOfLifeClassifier to allow subsetting the embeddings. The get_label_data() method returns a dataframe of the labels for the TOL txt embeddings. The create_taxa_filter() method creates a filter (a boolean array) for the txt embeddings from a taxonomic rank and a list of values. The apply_filter() method restricts the classifier to the embeddings selected by a filter. The filter must be a boolean array with the same length as the txt embeddings.

Part of #56
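
To make the filter mechanics concrete, here is a minimal standalone sketch of the boolean-mask idea described above. It does not call pybioclip; the names `LABELS`, `EMBEDDINGS`, `make_taxa_filter`, and `apply_mask` are hypothetical stand-ins, and plain lists are used in place of the dataframe and embedding array.

```python
# Hypothetical stand-in for the label table: one row per text embedding.
LABELS = [
    {"order": "Carnivora", "species": "Ursus arctos"},
    {"order": "Carnivora", "species": "Ursus americanus"},
    {"order": "Rodentia", "species": "Mus musculus"},
]

# Hypothetical stand-in for the text embeddings, aligned row-for-row with LABELS.
EMBEDDINGS = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]


def make_taxa_filter(labels, rank, values):
    """Return one boolean per label row: True where row[rank] is in values."""
    wanted = set(values)
    return [row[rank] in wanted for row in labels]


def apply_mask(items, mask):
    """Keep only the items whose mask entry is True."""
    if len(mask) != len(items):
        raise ValueError("mask must have the same length as the items it filters")
    return [item for item, keep in zip(items, mask) if keep]


# Build a mask for one order, then subset labels and embeddings with the same mask
# so the two stay aligned.
mask = make_taxa_filter(LABELS, "order", ["Carnivora"])
subset_labels = apply_mask(LABELS, mask)
subset_embeddings = apply_mask(EMBEDDINGS, mask)
```

Applying the same mask to both the labels and the embeddings is what keeps the two aligned after subsetting, which is why the mask length has to match the number of embeddings.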


egrace479

A few suggestions for clarity. The notebook was very helpful for visualizing the update.

README.md

src/bioclip/predict.py

tests/test_predict.py

Co-authored-by: Elizabeth Campolongo <[email protected]>

egrace479

Nice update!

thompsonmj · 2024-11-13T22:20:16Z

Should there be a disclaimer somewhere that misspelled taxon names in the training data may adversely affect the filter behavior?
For example, @egrace479 noted that, unfortunately, both "Ursus arctos" and "Ursus arctus" appear in the training split.

This would impact the example in the notebook:

label_data = classifier.get_label_data()
taxa_filter = ~label_data.species.isin(["Ursus arctos", "Ursus arctos syriacus"])
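
A small standalone sketch of the concern, assuming a toy list of species labels rather than the real training labels: an exact-match exclusion leaves the misspelled variant in place, and a similarity check from the standard library's `difflib` is one possible way to surface such near-duplicates. The `near_matches` helper and the 0.9 cutoff are illustrative assumptions, not part of pybioclip.

```python
from difflib import get_close_matches

# Toy labels, including the misspelling mentioned above.
species_labels = [
    "Ursus arctos",
    "Ursus arctus",
    "Ursus arctos syriacus",
    "Ursus americanus",
]
excluded = ["Ursus arctos", "Ursus arctos syriacus"]

# Exact-match exclusion, analogous to ~isin(...): True means the label is kept.
keep_mask = [name not in excluded for name in species_labels]
kept = [name for name, keep in zip(species_labels, keep_mask) if keep]


def near_matches(target, candidates, cutoff=0.9):
    """Return labels that are very similar to target but not identical to it."""
    others = sorted(set(candidates) - {target})
    return get_close_matches(target, others, n=5, cutoff=cutoff)


# Labels that look like spelling variants of the name being excluded.
suspects = near_matches("Ursus arctos", species_labels)
```

Here `kept` still contains "Ursus arctus", so the misspelled label would remain a candidate prediction even though the correctly spelled one is excluded.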

hlapp · 2024-11-14T03:17:51Z

I don't think we need a special disclaimer for that. Otherwise we'll also need disclaimers for all the synonyms etc that aren't properly consolidated to canonical name, etc. Those who want to understand the prediction in detail will invariably have to familiarize themselves with the paper and the data and processing code repositories.

johnbradley requested review from hlapp and egrace479 October 29, 2024 20:39

egrace479 reviewed Nov 12, 2024


johnbradley and others added 3 commits November 12, 2024 14:31

Update README.md

3477628

Co-authored-by: Elizabeth Campolongo <[email protected]>

Update tests/test_predict.py

fab4863

Co-authored-by: Elizabeth Campolongo <[email protected]>

Improve txt_names function name

1fd3204

johnbradley requested a review from egrace479 November 12, 2024 19:49

egrace479 approved these changes Nov 12, 2024


johnbradley requested review from thompsonmj and removed request for hlapp November 13, 2024 20:54
