✨ Enable features with dtype = 'str' #2226

Merged
merged 25 commits into from
Nov 28, 2024

Conversation

falexwolf (Member) commented Nov 28, 2024

Summary

  • It's now possible to define features of dtype="str"
  • Edge cases around categorical and non-categorical string features are now covered in the user feedback; categorical feature types are no longer auto-inferred, so the user needs to choose the dtype explicitly
  • Feature dtype="number" got renamed to dtype="num"
  • You can no longer manually annotate features that are internal to the dataset, so that the curation state can't be corrupted
  • If you try to re-create a Feature with an inconsistent dtype, you'll get an error
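
The last point can be illustrated with a toy registry. This is only a sketch of the described behavior, not lamindb's implementation: the class `ToyFeatureRegistry`, its `create` method, and the error message are all hypothetical.

```python
# Toy stand-in for a feature registry; this is NOT lamindb's implementation.
class ToyFeatureRegistry:
    def __init__(self):
        self._dtypes = {}  # feature name -> dtype string

    def create(self, name: str, dtype: str) -> str:
        """Register a feature; reject a re-creation with a different dtype."""
        existing = self._dtypes.get(name)
        if existing is not None and existing != dtype:
            raise ValueError(
                f"Feature {name!r} already exists with dtype={existing!r}, "
                f"you passed dtype={dtype!r}"
            )
        self._dtypes[name] = dtype
        return dtype


registry = ToyFeatureRegistry()
registry.create("sample_note", "str")
registry.create("sample_note", "str")  # same dtype again: accepted

try:
    registry.create("sample_note", "cat")  # inconsistent dtype: rejected
except ValueError as error:
    message = str(error)
```

The point of the design is that the first registration fixes the dtype, so a later call cannot silently change how existing annotations are interpreted.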

The valid feature dtypes are now:

FeatureDtype = Literal[
    "cat",  # categorical variables
    "num",  # numerical variables
    "str",  # string variables
    "int",  # integer variables
    "float",  # float variables
    "bool",  # boolean variables
    "date",  # date variables
    "datetime",  # datetime variables
    "object",  # this is a pandas type, we're only using it for complicated types, not for strings
]

In addition, specifications like cat[ULabel] or cat[bionty.CellType] continue to be valid.
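
A minimal sketch of how such dtype strings could be checked, assuming a plain membership test plus a pattern for the bracketed specifications. The helper `is_valid_dtype` and the regular expression `CAT_SPEC` are hypothetical and not taken from lamindb.

```python
import re
from typing import Literal, get_args

FeatureDtype = Literal[
    "cat", "num", "str", "int", "float", "bool", "date", "datetime", "object"
]

# Hypothetical pattern for "cat[Registry]" or "cat[module.Registry]".
CAT_SPEC = re.compile(r"cat\[[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*\]")


def is_valid_dtype(dtype: str) -> bool:
    """Return True for a plain dtype or a cat[...] specification."""
    return dtype in get_args(FeatureDtype) or CAT_SPEC.fullmatch(dtype) is not None


candidates = ["str", "num", "number", "cat[ULabel]", "cat[bionty.CellType]"]
checks = {dtype: is_valid_dtype(dtype) for dtype in candidates}
```

In this sketch, the renamed `"number"` is the only candidate that fails the check.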

Features with dtype="str" are not annotated with values at the observation level

Annotate a dataset with feature values via features.add_values() as with all other dtypes.

>>> import lamindb as ln
>>> import pandas as pd

>>> df = pd.DataFrame({"sample_note": ["was ok", "looks naah", "pretty! 🤩"]})
>>> artifact = ln.Artifact.from_df(df, description="My blob").save()

>>> ln.Feature(name="study_note", dtype="str").save()
>>> artifact.features.add_values({
...     "study_note": "We had a great time performing this experiment and the results look compelling."
... })
>>> artifact.features
Feature values -- external
    'study_note': str = We had a great time performing this experiment and the results look compelling.

Observation-level features can only be registered through the Curator flow. As with numerical and boolean dtypes, and unlike categorical dtypes, the values of a dtype="str" feature do not get annotated.

Example: The values for "sample_note" do not appear in the output of describe(). The feature only appears as part of the feature set that characterizes the dataset.

>>> ln.Feature(name="sample_note", dtype="str").save()

>>> curator = ln.Curator.from_df(df)
>>> artifact = curator.save_artifact(description="My example")
>>> artifact.features
Feature sets
    'columns': Feature = 'sample_note'
Feature values -- external
    'study_note': str = We had a great time performing this experiment and the results look compelling.
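
The rule described above can be sketched in plain Python, assuming that only features whose dtype starts with "cat" carry value annotations. The function `summarize` and the two dictionaries are hypothetical illustrations that reuse column names from this description; they are not part of lamindb.

```python
# Hypothetical mapping from column name to feature dtype.
feature_dtypes = {
    "cell_medium": "cat[ULabel]",
    "sample_note": "str",
}
columns = {
    "cell_medium": ["DMSO", "IFNG", "DMSO"],
    "sample_note": ["was ok", "looks naah", "pretty! 🤩"],
}


def summarize(columns, feature_dtypes):
    """Split columns into the feature set and the value-annotated subset."""
    feature_set = list(columns)  # every column is part of the feature set
    annotated_values = {
        name: sorted(set(values))
        for name, values in columns.items()
        if feature_dtypes[name].startswith("cat")  # only categoricals carry values
    }
    return feature_set, annotated_values


feature_set, annotated_values = summarize(columns, feature_dtypes)
```

Here `sample_note` is listed in `feature_set` but has no entry in `annotated_values`, mirroring how a dtype="str" column shows up only under "Feature sets".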

Docs changes

[Three before/after screenshot pairs of the docs]

Comprehensive example

Create a comprehensive test schema that covers the new dtype "str":

ln.Feature(name="cell_medium", dtype="cat[ULabel]").save()
ln.Feature(name="sample_note", dtype="str").save()
ln.Feature(name="cell_type_by_expert", dtype="cat[bionty.CellType]").save()
ln.Feature(name="cell_type_by_model", dtype="cat[bionty.CellType]").save()
ln.Feature(name="temperature", dtype="float").save()
ln.Feature(name="study", dtype="cat[ULabel]").save()
ln.Feature(name="date_of_study", dtype="date").save()
ln.Feature(name="study_note", dtype="str").save()

Build a test case:

import anndata as ad
import bionty as bt
import lamindb as ln
import pandas as pd

# define the data in the dataset
# it's a mix of numerical measurements and observation-level metadata
dataset_dict = {
    "CD8A": [1, 2, 3],
    "CD4": [3, 4, 5],
    "CD14": [5, 6, 7],
    "cell_medium": ["DMSO", "IFNG", "DMSO"],
    "sample_note": ["was ok", "looks naah", "pretty! 🤩"],
    "cell_type_by_expert": ["B cell", "T cell", "T cell"],
    "cell_type_by_model": ["B cell", "T cell", "T cell"],
}
# define the dataset-level metadata
metadata = {
    "temperature": 21.6,
    "study": "Candidate marker study 1",
    "date_of_study": "2024-12-01",
    "study_note": "We had a great time performing this experiment and the results look compelling.",
}
# the dataset as DataFrame
dataset_df = pd.DataFrame(dataset_dict, index=["sample1", "sample2", "sample3"])
dataset_ad = ad.AnnData(dataset_df.iloc[:, :3], obs=dataset_df.iloc[:, 3:])
# curate dataset
curator = ln.Curator.from_anndata(
    dataset_ad,
    var_index=bt.Gene.symbol,
    categoricals={
        "cell_medium": ln.ULabel.name,
        "cell_type_by_expert": bt.CellType.name,
        "cell_type_by_model": bt.CellType.name,
    },
    organism="human",
)
artifact = curator.save_artifact(key="example_datasets/dataset1.h5ad")
# annotate with dataset-level features
artifact.features.add_values(metadata)
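
Before passing such a metadata dictionary to add_values(), one might sanity-check it against the declared dtypes. The following is a rough sketch under that assumption: the `declared` mapping and the `matches` helper are hypothetical, cover only the dtypes used here, and do not reflect how lamindb validates values internally.

```python
from datetime import date

# Hypothetical declared dtypes for the dataset-level features of the schema above.
declared = {
    "temperature": "float",
    "study": "cat[ULabel]",
    "date_of_study": "date",
    "study_note": "str",
}
metadata = {
    "temperature": 21.6,
    "study": "Candidate marker study 1",
    "date_of_study": "2024-12-01",
    "study_note": "We had a great time performing this experiment and the results look compelling.",
}


def matches(value, dtype: str) -> bool:
    """Check one value against one declared dtype (sketch, not exhaustive)."""
    if dtype == "float":
        return isinstance(value, float)
    if dtype == "date":
        try:
            date.fromisoformat(value)  # expects an ISO string such as "2024-12-01"
        except (TypeError, ValueError):
            return False
        return True
    if dtype == "str" or dtype.startswith("cat"):
        return isinstance(value, str)
    return False  # dtypes not covered by this sketch


mismatches = [key for key, value in metadata.items() if not matches(value, declared[key])]
```

For the metadata above, `mismatches` stays empty; a non-ISO date string or a numeric `study_note` would be reported by key.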

Validate the description of this test dataset:

Artifact(uid='Tm5s2sgI1F48eLLI0000', is_latest=True, key='example_datasets/dataset1.h5ad', suffix='.h5ad', type='dataset', size=23560, hash='voB-uoihaivmNskhV7osPQ', n_observations=3, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-28 13:13:31 UTC)
  Provenance
    .storage: Storage = '/Users/falexwolf/repos/laminhub/rest-hub/sub/lamindb/default_storage_unit_core'
    .created_by: User = 'falexwolf'
  Labels
    .cell_types: bionty.CellType = 'B cell', 'T cell'
    .ulabels: ULabel = 'DMSO', 'IFNG', 'Candidate marker study 1'
  Feature sets
    'var' = 'CD8A', 'CD4', 'CD14'
    'obs' = 'cell_medium', 'sample_note', 'cell_type_by_expert', 'cell_type_by_model'
  Feature values -- internal
    'cell_type_by_expert': cat[bionty.CellType] = B cell, T cell
    'cell_type_by_model': cat[bionty.CellType] = B cell, T cell
    'cell_medium': cat[ULabel] = DMSO, IFNG
  Feature values -- external
    'study': cat[ULabel] = Candidate marker study 1
    'date_of_study': date = 2024-12-01
    'study_note': str = We had a great time performing this experiment and the results look compelling.
    'temperature': float = 21.6

@falexwolf falexwolf changed the title ✨ Refactor features and allow str dtype ✨ Enable features width dtype = 'str' Nov 28, 2024
@falexwolf falexwolf requested review from sunnyosun and Zethson and removed request for sunnyosun November 28, 2024 13:18
@Zethson Zethson changed the title ✨ Enable features width dtype = 'str' ✨ Enable features with dtype = 'str' Nov 28, 2024
@falexwolf falexwolf linked an issue Nov 28, 2024 that may be closed by this pull request

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 87.91209% with 11 lines in your changes missing coverage. Please review.

Project coverage is 92.90%. Comparing base (c54f99f) to head (0eda652).
Report is 12 commits behind head on main.

Files with missing lines            Patch %   Lines
lamindb/core/_feature_manager.py    64.70%    6 Missing ⚠️
lamindb/_feature.py                 87.80%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2226      +/-   ##
==========================================
+ Coverage   92.36%   92.90%   +0.53%     
==========================================
  Files          54       54              
  Lines        6566     6662      +96     
==========================================
+ Hits         6065     6189     +124     
+ Misses        501      473      -28     

github-actions bot commented Nov 28, 2024

@github-actions github-actions bot temporarily deployed to pull request November 28, 2024 21:26 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 28, 2024 23:18 Inactive
@falexwolf falexwolf merged commit 9cf70ea into main Nov 28, 2024
15 of 16 checks passed
@falexwolf falexwolf deleted the df branch November 28, 2024 23:19
Development

Successfully merging this pull request may close these issues.

🚸 Improve the "pass a dtype" message for Feature constructor
2 participants