♻️ Add .standardize() to Curator and refactor (#2186)

Signed-off-by: zethson <[email protected]> Co-authored-by: zethson <[email protected]> Co-authored-by: Alex Wolf <[email protected]>
laminlabs · Nov 27, 2024 · 52492e7
1 parent 2e29863
commit 52492e7

Showing 5 changed files with 620 additions and 361 deletions.
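
The notebook changes below add a validate, standardize, re-check loop around `curate.standardize("cell_type")`. As a rough illustration of that loop, here is a minimal stdlib-only sketch. It does not use LaminDB: `REGISTRY`, `non_validated`, and `standardize` are hypothetical stand-ins, and the synonym mapping from "astrocytic glia" to "astrocyte" is an assumption inferred from the value the diff replaces.

```python
# Hypothetical stand-in for a registry: canonical names plus known synonyms.
# The entries are illustrative assumptions, not taken from LaminDB or an ontology release.
REGISTRY = {
    "cell_type": {
        "names": {"astrocyte", "oligodendrocyte"},
        "synonyms": {"astrocytic glia": "astrocyte"},
    }
}


def non_validated(columns, registry):
    """Return, per column, the values that are not canonical names in the registry."""
    result = {}
    for key, spec in registry.items():
        missing = [value for value in columns[key] if value not in spec["names"]]
        if missing:
            result[key] = missing
    return result


def standardize(columns, registry, key):
    """Replace known synonyms in one column with their canonical names, in place."""
    synonyms = registry[key]["synonyms"]
    columns[key] = [synonyms.get(value, value) for value in columns[key]]


# Same cell_type values as the DataFrame in the notebook diff
columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"]
}

before = non_validated(columns, REGISTRY)  # two terms fail validation
standardize(columns, REGISTRY, "cell_type")  # maps the synonym to its canonical name
after = non_validated(columns, REGISTRY)  # only the unknown term remains

print(before)
print(after)
```

In this sketch, standardizing only resolves values that are known synonyms; a term such as "cerebral pyramidal neuron" that matches neither a name nor a synonym stays non-validated and needs a manual decision, which mirrors the flow in the notebook.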
diff --git a/docs/curate-df.ipynb b/docs/curate-df.ipynb
@@ -7,19 +7,76 @@
    "source": [
     "# Curate DataFrames and AnnDatas\n",
     "\n",
-    "Curating datasets typically means three things:\n",
+    "Curating a dataset with LaminDB means three things:\n",
     "\n",
-    "1. Validate: ensure a dataset meets predefined _validation criteria_\n",
-    "2. Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers\n",
-    "3. Annotate: link a dataset against metadata records\n",
+    "1. **Validate:** ensure the dataset meets predefined _validation criteria_\n",
+    "2. **Standardize:** transform the dataset so that it meets validation criteria, e.g., by fixing typos or using standard instead of ad hoc identifiers\n",
+    "3. **Annotate:** link the dataset against validated metadata so that it becomes queryable\n",
     "\n",
-    "In LaminDB, valid metadata is metadata that's stored in a metadata registry and _validation criteria_ merely defines a mapping onto a field of a registry.\n",
+    "If a dataset passes validation, curating it takes two lines of code:\n",
     "\n",
-    "```{admonition} Example\n",
+    "```python\n",
+    "curator = ln.Curator.from_df(df, ...)  # create a Curator and pass criteria in \"...\"\n",
+    "curator.save_artifact()                # validates the content of the dataset and saves it as an annotated artifact\n",
+    "```\n",
     "\n",
-    "`\"Experiment 1\"` is a valid value for `ULabel.name` if a record with this name exists in the {class}`~lamindb.ULabel` registry.\n",
+    "Beyond having valid content, the curated dataset is now queryable via the metadata identifiers found in it, because they have been validated and linked against LaminDB registries.\n",
     "\n",
-    "```"
+    ":::{admonition} Definition: valid metadata identifier\n",
+    "\n",
+    "An identifier like `\"Experiment 1\"` is a valid value for `ULabel.name` if a record with `name` `\"Experiment 1\"` exists in the {class}`~lamindb.ULabel` registry.\n",
+    "\n",
+    "```python\n",
+    "categoricals = {\"experiment\": ln.ULabel.name}  # the validation constraint\n",
+    "curator = ln.Curator.from_df(df, categoricals=categoricals)\n",
+    "curator.validate()\n",
+    "```\n",
+    "\n",
+    "The DataFrame validates if:\n",
+    "\n",
+    "- there is a column named `\"experiment\"` in the DataFrame whose values are all found in the `name` field of the {class}`~lamindb.ULabel` registry\n",
+    "- the column name `\"experiment\"` is found in the `name` field of the {class}`~lamindb.Feature` registry\n",
+    "\n",
+    ":::\n",
+    "\n",
+    "Beyond validating metadata identifiers, LaminDB also validates data types and dataset schema.\n",
+    "\n",
+    ":::{dropdown} How does validation in LaminDB compare to validation in pandera?\n",
+    "\n",
+    "Like LaminDB, [pandera](https://pandera.readthedocs.io/) validates the _dataset schema_ (i.e., column names and `dtype`s).\n",
+    "\n",
+    "`pandera` only supports `DataFrame`-like datasets and cannot annotate them, i.e., it cannot make datasets queryable.\n",
+    "\n",
+    "However, it offers an API for range checks on both numerical and string-like data. If you need such checks, you can combine LaminDB and pandera-based validation.\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "import pandera as pa\n",
+    "\n",
+    "# data to validate\n",
+    "df = pd.DataFrame({\n",
+    "    \"column1\": [1, 4, 0, 10, 9],\n",
+    "    \"column2\": [-1.3, -1.4, -2.9, -10.1, -20.4],\n",
+    "    \"column3\": [\"value_1\", \"value_2\", \"value_3\", \"value_2\", \"value_1\"],\n",
+    "})\n",
+    "\n",
+    "# define schema\n",
+    "schema = pa.DataFrameSchema({\n",
+    "    \"column1\": pa.Column(int, checks=pa.Check.le(10)),\n",
+    "    \"column2\": pa.Column(float, checks=pa.Check.lt(-1.2)),\n",
+    "    \"column3\": pa.Column(str, checks=[\n",
+    "        pa.Check.str_startswith(\"value_\"),\n",
+    "        # define custom checks as functions that take a Series as input and\n",
+    "        # output a boolean or a boolean Series\n",
+    "        pa.Check(lambda s: s.str.split(\"_\", expand=True).shape[1] == 2)\n",
+    "    ]),\n",
+    "})\n",
+    "\n",
+    "validated_df = schema(df)  # this corresponds to curator.validate() in LaminDB\n",
+    "print(validated_df)\n",
+    "```\n",
+    "\n",
+    ":::"
    ]
   },
   {
@@ -42,7 +99,7 @@
    "id": "946a3371",
    "metadata": {},
    "source": [
-    "## Validate a DataFrame"
+    "## Curate a DataFrame"
    ]
   },
   {
@@ -72,7 +129,7 @@
     "df = pd.DataFrame(\n",
     "    {\n",
     "        \"temperature\": [37.2, 36.3, 38.2],\n",
-    "        \"cell_type\": [\"cerebral pyramidal neuron\", \"astrocyte\", \"oligodendrocyte\"],\n",
+    "        \"cell_type\": [\"cerebral pyramidal neuron\", \"astrocytic glia\", \"oligodendrocyte\"],\n",
     "        \"assay_ontology_id\": [\"EFO:0008913\", \"EFO:0008913\", \"EFO:0008913\"],\n",
     "        \"donor\": [\"D0001\", \"D0002\", \"D0003\"]\n",
     "    },\n",
@@ -134,22 +191,54 @@
     "curate.validate()"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7acf0d2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# check the non-validated terms\n",
+    "curate.non_validated"
+   ]
+  },
   {
    "cell_type": "markdown",
-   "id": "7c157df6",
+   "id": "8c2417c7",
    "metadata": {},
    "source": [
-    "## Register new metadata values\n",
+    "For `cell_type`, we saw that \"cerebral pyramidal neuron\" and \"astrocytic glia\" are not validated.\n",
     "\n",
-    "If you see \"non-validated\" values, you'll need to decide whether to add them to your registries or \"fix\" them in your dataset."
+    "First, let's standardize the synonym \"astrocytic glia\" as suggested."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "35b3ce8e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "curate.standardize(\"cell_type\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "336293ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# now we have only one non-validated term left\n",
+    "curate.non_validated"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "8c2417c7",
+   "id": "c1bfe41c",
    "metadata": {},
    "source": [
-    "For `cell_type`, we saw that 'cerebral pyramidal neuron' is not validated, let's understand which cell type in the public ontology might be the actual match."
+    "For \"cerebral pyramidal neuron\", let's check which cell type in the public ontology might be the actual match."
    ]
   },
   {
@@ -244,7 +333,7 @@
    "id": "b9d09a10",
    "metadata": {},
    "source": [
-    "## Validate an AnnData\n",
+    "## Curate an AnnData\n",
     "\n",
     "Here we additionally specify which `var_index` to validate against."
    ]
@@ -466,7 +555,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.13"
+   "version": "3.10.15"
   },
   "nbproject": {
    "id": "WOK3vP0bNGLx",