Commit

♻️ Add .standardize() to Curator and refactor (#2186)
Signed-off-by: zethson <[email protected]>
Co-authored-by: zethson <[email protected]>
Co-authored-by: Alex Wolf <[email protected]>
3 people authored Nov 27, 2024
1 parent 2e29863 commit 52492e7
Showing 5 changed files with 620 additions and 361 deletions.
123 changes: 106 additions & 17 deletions docs/curate-df.ipynb
@@ -7,19 +7,76 @@
"source": [
"# Curate DataFrames and AnnDatas\n",
"\n",
"Curating datasets typically means three things:\n",
"Curating a dataset with LaminDB means three things:\n",
"\n",
"1. Validate: ensure a dataset meets predefined _validation criteria_\n",
"2. Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers\n",
"3. Annotate: link a dataset against metadata records\n",
"1. **Validate:** ensure the dataset meets predefined _validation criteria_\n",
"2. **Standardize:** transform the dataset so that it meets validation criteria, e.g., by fixing typos or using standard instead of ad hoc identifiers\n",
"3. **Annotate:** link the dataset against validated metadata so that it becomes queryable\n",
"\n",
"In LaminDB, valid metadata is metadata that's stored in a metadata registry and _validation criteria_ merely defines a mapping onto a field of a registry.\n",
"If a dataset passes validation, curating it takes two lines of code:\n",
"\n",
"```{admonition} Example\n",
"```python\n",
"curator = ln.Curator.from_df(df, ...) # create a Curator and pass criteria in \"...\"\n",
"curator.save_artifact() # validates the content of the dataset and saves it as an annotated artifact\n",
"```\n",
"\n",
"`\"Experiment 1\"` is a valid value for `ULabel.name` if a record with this name exists in the {class}`~lamindb.ULabel` registry.\n",
"Beyond having valid content, the curated dataset is now queryable via the metadata identifiers found in it, because they have been validated and linked against LaminDB registries.\n",
"\n",
"```"
":::{admonition} Definition: valid metadata identifier\n",
"\n",
"An identifier like `\"Experiment 1\"` is a valid value for `ULabel.name` if a record with `name` `\"Experiment 1\"` exists in the {class}`~lamindb.ULabel` registry.\n",
"\n",
"```python\n",
"categoricals = {\"experiment\": ln.ULabel.name} # the validation constraint\n",
"curator = ln.Curator.from_df(df, categoricals=categoricals)\n",
"curator.validate()\n",
"```\n",
"\n",
"The DataFrame validates if:\n",
"\n",
"- there is a column with name `\"experiment\"` in the dataframe whose values are all found in the `name` field of the {class}`~lamindb.ULabel` registry\n",
"- the column name `\"experiment\"` is found in the `name` field of the {class}`~lamindb.Feature` registry\n",
"\n",
":::\n",
"\n",
"Beyond validating metadata identifiers, LaminDB also validates data types and dataset schema.\n",
"\n",
":::{dropdown} How does validation in LaminDB compare to validation in pandera?\n",
"\n",
"Like LaminDB, [pandera](https://pandera.readthedocs.io/) validates the _dataset schema_ (i.e., column names and `dtype`s).\n",
"\n",
"`pandera` is only available for `DataFrame`-like datasets and cannot annotate datasets, i.e., it can't make them queryable.\n",
"\n",
"However, it offers an API for range checks, both for numerical and string-like data. If you need such checks, you can combine LaminDB and pandera-based validation.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import pandera as pa\n",
"\n",
"# data to validate\n",
"df = pd.DataFrame({\n",
"    \"column1\": [1, 4, 0, 10, 9],\n",
"    \"column2\": [-1.3, -1.4, -2.9, -10.1, -20.4],\n",
"    \"column3\": [\"value_1\", \"value_2\", \"value_3\", \"value_2\", \"value_1\"],\n",
"})\n",
"\n",
"# define schema\n",
"schema = pa.DataFrameSchema({\n",
"    \"column1\": pa.Column(int, checks=pa.Check.le(10)),\n",
"    \"column2\": pa.Column(float, checks=pa.Check.lt(-1.2)),\n",
"    \"column3\": pa.Column(str, checks=[\n",
"        pa.Check.str_startswith(\"value_\"),\n",
"        # define custom checks as functions that take a Series as input and\n",
"        # output a boolean or a boolean Series\n",
"        pa.Check(lambda s: s.str.split(\"_\", expand=True).shape[1] == 2)\n",
"    ]),\n",
"})\n",
"\n",
"validated_df = schema(df) # this corresponds to curator.validate() in LaminDB\n",
"print(validated_df)\n",
"```\n",
"\n",
":::"
]
},
{
@@ -42,7 +99,7 @@
"id": "946a3371",
"metadata": {},
"source": [
"## Validate a DataFrame"
"## Curate a DataFrame"
]
},
{
Expand Down Expand Up @@ -72,7 +129,7 @@
"df = pd.DataFrame(\n",
" {\n",
" \"temperature\": [37.2, 36.3, 38.2],\n",
" \"cell_type\": [\"cerebral pyramidal neuron\", \"astrocyte\", \"oligodendrocyte\"],\n",
" \"cell_type\": [\"cerebral pyramidal neuron\", \"astrocytic glia\", \"oligodendrocyte\"],\n",
" \"assay_ontology_id\": [\"EFO:0008913\", \"EFO:0008913\", \"EFO:0008913\"],\n",
" \"donor\": [\"D0001\", \"D0002\", \"D0003\"]\n",
" },\n",
Expand Down Expand Up @@ -134,22 +191,54 @@
"curate.validate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7acf0d2",
"metadata": {},
"outputs": [],
"source": [
"# check the non-validated terms\n",
"curate.non_validated"
]
},
{
"cell_type": "markdown",
"id": "7c157df6",
"id": "8c2417c7",
"metadata": {},
"source": [
"## Register new metadata values\n",
"For `cell_type`, we saw that \"cerebral pyramidal neuron\" and \"astrocytic glia\" are not validated.\n",
"\n",
"If you see \"non-validated\" values, you'll need to decide whether to add them to your registries or \"fix\" them in your dataset."
"First, let's standardize the synonym \"astrocytic glia\" as suggested."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35b3ce8e",
"metadata": {},
"outputs": [],
"source": [
"curate.standardize(\"cell_type\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "336293ac",
"metadata": {},
"outputs": [],
"source": [
"# now we have only one non-validated term left\n",
"curate.non_validated"
]
},
{
"cell_type": "markdown",
"id": "8c2417c7",
"id": "c1bfe41c",
"metadata": {},
"source": [
"For `cell_type`, we saw that 'cerebral pyramidal neuron' is not validated, let's understand which cell type in the public ontology might be the actual match."
"For \"cerebral pyramidal neuron\", let's understand which cell type in the public ontology might be the actual match."
]
},
{
Expand Down Expand Up @@ -244,7 +333,7 @@
"id": "b9d09a10",
"metadata": {},
"source": [
"## Validate an AnnData\n",
"## Curate an AnnData\n",
"\n",
"Here we additionally specify which `var_index` to validate against."
]
@@ -466,7 +555,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.10.15"
},
"nbproject": {
"id": "WOK3vP0bNGLx",
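The validate, standardize, re-check loop that the notebook cells in this diff walk through can be sketched in plain Python. This is only an illustration of the idea and not LaminDB's implementation: the `registry` set, the `synonyms` mapping, and the helper names `non_validated` and `standardize` are hypothetical stand-ins for a registry lookup.

```python
# Hypothetical stand-ins for a metadata registry; this is NOT LaminDB's API.
registry = {"astrocyte", "oligodendrocyte"}  # names assumed to be valid
synonyms = {"astrocytic glia": "astrocyte"}  # assumed synonym -> standard name


def non_validated(values, registry):
    """Return values missing from the registry, in order, without duplicates."""
    missing = []
    for value in values:
        if value not in registry and value not in missing:
            missing.append(value)
    return missing


def standardize(values, synonyms):
    """Replace known synonyms with their standard name; keep other values."""
    return [synonyms.get(value, value) for value in values]


cell_types = ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"]

before = non_validated(cell_types, registry)    # two terms fail validation
cell_types = standardize(cell_types, synonyms)  # the synonym is mapped
after = non_validated(cell_types, registry)     # one term is left to resolve

print(before)
print(after)
```

Standardizing resolves only values with a known synonym. A term such as "cerebral pyramidal neuron" still needs a manual decision, which is why the notebook goes on to look for a match in the public ontology.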
