🎨 Centralize a small test dataset for curation & queries #2234

Merged: falexwolf merged 3 commits into main from pandastypes on Nov 29, 2024

falexwolf
Member

@falexwolf falexwolf commented Nov 29, 2024

This adds ln.core.datasets.small_dataset1 and uses it for additional tests of how FeatureManager deals with pandas and numpy types.

from typing import Any, Literal

import anndata as ad
import pandas as pd


def small_dataset1(
    format: Literal["df", "anndata"],
) -> tuple[pd.DataFrame, dict[str, Any]] | ad.AnnData:
    # define the data in the dataset
    # it's a mix of numerical measurements and observation-level metadata
    dataset_dict = {
        "CD8A": [1, 2, 3],
        "CD4": [3, 4, 5],
        "CD14": [5, 6, 7],
        "cell_medium": ["DMSO", "IFNG", "DMSO"],
        "sample_note": ["was ok", "looks naah", "pretty! 🤩"],
        "cell_type_by_expert": ["B cell", "T cell", "T cell"],
        "cell_type_by_model": ["B cell", "T cell", "T cell"],
    }
    # define the dataset-level metadata
    metadata = {
        "temperature": 21.6,
        "study": "Candidate marker study 1",
        "date_of_study": "2024-12-01",
        "study_note": "We had a great time performing this study and the results look compelling.",
    }
    # the dataset as DataFrame
    dataset_df = pd.DataFrame(dataset_dict, index=["sample1", "sample2", "sample3"])
    dataset_ad = ad.AnnData(
        dataset_df.iloc[:, :3], obs=dataset_df.iloc[:, 3:], uns=metadata
    )
    if format == "df":
        return dataset_df, metadata
    else:
        return dataset_ad
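
To illustrate the kind of pandas/numpy type handling the new tests target, here is a minimal sketch. It is not lamindb's FeatureManager code: it rebuilds a subset of the columns above so it runs on its own, and `to_builtin` is a hypothetical helper introduced only for this example.

```python
import numpy as np
import pandas as pd

# Subset of the small_dataset1 columns, rebuilt here so the sketch is self-contained.
df = pd.DataFrame(
    {
        "CD8A": [1, 2, 3],
        "CD4": [3, 4, 5],
        "CD14": [5, 6, 7],
        "cell_medium": ["DMSO", "IFNG", "DMSO"],
    },
    index=["sample1", "sample2", "sample3"],
)

# Same positional split as the AnnData branch: the first 3 columns are
# numerical measurements, the remaining columns are observation-level metadata.
measurements = df.iloc[:, :3]
obs = df.iloc[:, 3:]

# A scalar pulled from an integer column is a numpy integer, not a builtin int.
value = df["CD8A"].iloc[0]
assert isinstance(value, np.integer)
assert not isinstance(value, int)


def to_builtin(x):
    """Hypothetical helper: convert a numpy scalar to its builtin Python equivalent."""
    return x.item() if isinstance(x, np.generic) else x
```

Values from string columns such as `cell_medium` are already builtin `str` objects, so a helper like this passes them through unchanged. Code that validates feature values by builtin type would have to account for both cases.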

@sunnyosun, @Zethson, @Koncopd -- my hope is that we can keep iterating on this dataset and re-use it across test scenarios relating to curation.

It's in fact pretty hard to come up with good test datasets that cover different cases comprehensively.

@falexwolf falexwolf changed the title 🎨 Refactor test datasets 🎨 Centralize a small test dataset for curation Nov 29, 2024

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 76.31579% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.88%. Comparing base (c54f99f) to head (6daad43).
Report is 15 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lamindb/core/_feature_manager.py | 69.23% | 8 Missing ⚠️ |
| lamindb/_feature.py | 91.66% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2234      +/-   ##
==========================================
+ Coverage   92.36%   92.88%   +0.51%     
==========================================
  Files          54       54              
  Lines        6566     6687     +121     
==========================================
+ Hits         6065     6211     +146     
+ Misses        501      476      -25     



@github-actions github-actions bot temporarily deployed to pull request November 29, 2024 13:28 Inactive
@falexwolf falexwolf merged commit 7019150 into main Nov 29, 2024
15 of 16 checks passed
@falexwolf falexwolf deleted the pandastypes branch November 29, 2024 13:45
@falexwolf falexwolf changed the title 🎨 Centralize a small test dataset for curation 🎨 Centralize a small test dataset for curation & queries Nov 30, 2024