
feat: add demo for rivulet dataset conversion #444

Merged: 6 commits into 2.0 on Jan 14, 2025

Conversation

anshumankomawar (Collaborator)

Summary

This demo shows how to create a dataset with Deltacat, starting from a Parquet file and expanding it with additional columns and records. New data (with the expanded schema) is appended without modifying the original Parquet file, enabling efficient updates. Finally, the dataset is exported to feather files.
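For reference, a minimal sketch of the flow, assembled from snippets quoted later in this thread (the field-add and record-write calls are assumptions; their exact names aren't shown here):

import deltacat as dc

# Load the Parquet file into a dataset keyed on "id"
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri="./contacts.parquet",
    metadata_uri=".",
    merge_keys="id",
)

# Widen the schema with new columns (method name assumed)
dataset.add_fields([
    ("is_active", dc.Datatype.bool()),
])

# Append new records to feather files without rewriting the original Parquet
dataset_writer = dataset.writer(file_format="feather")
dataset_writer.write(new_records)  # hypothetical call; new_records elided

# Read back the merged view (QueryExpression import elided; see discussion below)
for record in dataset.scan(QueryExpression()).to_arrow():
    print(record)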

Rationale

Showcase current functionality and provide an example for users.

Changes

Added demo script to deltacat examples folder (i.e. deltacat/examples/rivulet/).

Impact

No impact to existing code.

Testing

Unit tests (make test).

Regression Risk

Checklist

• Unit tests covering the changes have been added
  • If this is a bugfix, regression tests have been added
• E2E testing has been performed

Comment on lines 77 to 80

# Step 5 (Optional): Read data from feather file.
#read_feather = feather.read_feather('./.riv-meta-contacts/data/<replace with generated file>')
#print(read_feather)
Collaborator

This isn't quite right, because it's asking you to stick your nose into the private .riv-meta files. You want to either do a dataset.scan() and get back the now-merged records and print those, or do a (TBD) dataset.export(file-format=parquet/feather), then read that exported file.

Collaborator

Ideally, have an example of both.
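Something like this, where the export call is the TBD API mentioned above and its signature is assumed:

import pyarrow.feather as feather

# Option 1: scan the dataset and print the merged records
for record in dataset.scan().to_arrow():
    print(record)

# Option 2 (TBD API, shape assumed): export the dataset, then read it back
dataset.export("./contacts_export.feather", file_format="feather")
print(feather.read_feather("./contacts_export.feather"))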

Collaborator Author

That makes sense; I hadn't run into the scanner yet. I'll update it to use the scanner for now (and switch to the dataset export feature in the future).

@thesalus (Collaborator) left a comment

How do we want to structure the demos?

If not as a notebook, the existing examples follow a common pattern with configurable args that I think we can replicate here.
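Roughly this shape, assuming the other examples use an argparse-style entry point (the exact pattern isn't quoted in this thread):

import argparse

def run(**kwargs):
    # demo steps go here
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="rivulet dataset conversion demo")
    # No args defined yet; the parser is kept for consistency with other examples
    run(**vars(parser.parse_args()))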

import deltacat as dc

# Step 1: Create a simple 3x3 Parquet file using pyarrow
parquet_file_path = "../contacts.parquet"
Collaborator

Minor: just as a matter of convenience, can we avoid creating these files within directories that are visible to the git repo (e.g., use a temporary directory or some .gitignored directory)?

Similar comment on the metadata_uri.
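For example, with the standard library's tempfile (a sketch; the variable names are taken from the demo snippets in this thread):

import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    parquet_file_path = str(Path(tmp) / "contacts.parquet")
    metadata_uri = tmp  # keeps the rivulet metadata out of the repo too
    # ... run the demo against these paths; everything is removed on exit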

Collaborator Author

I think it would be useful for the user to be able to view the generated files if required. But I'll go ahead and add the generated files to the gitignore.

Collaborator Author

I have added the configurable args as well. I'm going to leave it empty for now since I don't think this demo needs any configuration, but this should help keep it consistent with the other demos.

Collaborator

Following up on Anthony's comment: why not use notebooks for the demo? It seems to be pretty standard. For example:
https://github.com/Eventual-Inc/Daft/tree/main/tutorials
https://github.com/lancedb/vectordb-recipes/tree/main/tutorials/RAG-with_MatryoshkaEmbed-Llamaindex

Collaborator Author

I followed the other existing examples in the deltacat repository, but from the links you shared, notebooks look like the standard for user-facing demos. It shouldn't take much to update them.

Comment on lines 78 to 81

# Step 5: Read data from feather file.
read_records = list(dataset.scan(QueryExpression()).to_arrow())
[print(record) for record in read_records]
@flliver (Collaborator) Jan 10, 2025

Closer! This still isn't Pythonic: instead of passing in an empty QueryExpression, give the query expression a default value so you can do dataset.scan().to_arrow().

Also, don't wrap it in list(); the generator can already be iterated, and examples should always use the simplest, Pythonic expression, i.e.

# Print records
for record in dataset.scan().to_arrow():
    print(record)

is super clear, obvious, and Pythonic.
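One way to give scan() that default without a mutable default argument (a sketch of the suggestion, assuming QueryExpression() means match-all):

from typing import Optional

def scan(self, query: Optional[QueryExpression] = None):
    # Default to a match-all query so callers can just write dataset.scan()
    query = query if query is not None else QueryExpression()
    ...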

Collaborator Author

Ah, I see what you mean; I didn't realize it was already being treated as an iterator. Will update it.

@anshumankomawar anshumankomawar marked this pull request as draft January 11, 2025 02:27
@anshumankomawar anshumankomawar marked this pull request as ready for review January 11, 2025 02:43
for record in dataset.scan().to_arrow():
    print(record)
for record in dataset.scan().to_pydict():
    print(record.values())
Collaborator Author

Prints the values for each record instead of the pyarrow representation.


def run(**kwargs):
    # Step 1: Create a simple 3x3 Parquet file using pyarrow
    parquet_file_path = "./contacts.parquet"
Collaborator

nitpick: consider using pathlib here, like Path.cwd() / "contacts.parquet"
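i.e., something like:

from pathlib import Path

parquet_file_path = Path.cwd() / "contacts.parquet"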


"""
This demo showcases
1. How to create a dataset from a Parquet file using Deltacat.
Collaborator

The fact that you are breaking it into these modular steps makes me think this would be good to refactor as a notebook, where each step is a cell.

name="contacts",
file_uri=parquet_file_path,
metadata_uri=".",
merge_keys="id"
Collaborator

Sanity check: can merge keys here be either a string or a list of strings?

Collaborator Author

Yup, it can be either.
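For illustration, both accepted forms (the composite "region" key is hypothetical):

# Single merge key
dataset = dc.Dataset.from_parquet(
    name="contacts", file_uri=parquet_file_path, metadata_uri=".",
    merge_keys="id",
)

# Composite merge key, list form
dataset = dc.Dataset.from_parquet(
    name="contacts", file_uri=parquet_file_path, metadata_uri=".",
    merge_keys=["id", "region"],
)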

# Step 2: Load the Parquet file into a Dataset
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri=parquet_file_path,
Collaborator

Does file_uri here only accept a single file? It would be more interesting to give an example using a glob path.

@anshumankomawar (Collaborator Author) Jan 13, 2025

It does support a directory as well, but I'm having issues with the current implementation. I'll update the demo to support a file directory in a future update.
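For reference, the glob version might eventually look like this (hypothetical; as noted, the current implementation has issues with it):

dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri="./data/*.parquet",  # glob over many Parquet files
    metadata_uri=".",
    merge_keys="id",
)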

("is_active", dc.Datatype.bool())
])

# Step 4: Append two new records, including values for the new columns. The cool thing with deltacat datasets is
Collaborator

nitpick: consider a block comment

@anshumankomawar anshumankomawar marked this pull request as draft January 14, 2025 00:33
@anshumankomawar anshumankomawar marked this pull request as ready for review January 14, 2025 00:34
.gitignore Outdated
Comment on lines 32 to 38

# Generated Files
**/.riv-meta-contacts/
**/contacts.parquet

# PyInstaller
# Usually these files are written by a python script from a template
Collaborator

Just do **/.riv-meta-* and **/*.parquet so we catch all the intermediate files.

Comment on lines 69 to 71
" merge_keys=\"id\" # specify the merge key column\n",
")\n",
"print(\"Loaded dataset from Parquet file.\")"
Collaborator

Don't do comments that just restate the obvious. Either remove the comment or describe what the point of a merge key is.

Comment on lines 79 to 81
"cell_type": "markdown",
"source": "### Step 3: Add two new columns to the Dataset",
"id": "ec9db8f54290095b"
Collaborator

fields (columns)

Comment on lines 100 to 103
"source": [
"### Step 4: Append two new Records\n",
"The cool thing withdeltacat datasets is that deltacat will not attempt to\n",
"rewrite the existing Parquet file; instead, they will store additional data\n",
Collaborator

No capital on "Records"; missing a space in "withdeltacat".

Comment on lines 111 to 113
"source": [
"# Open a new writer that will write new data to feather files\n",
"dataset_writer = dataset.writer(file_format=\"feather\")\n",
Collaborator

obvious comment is obvious. Either remove or describe what a writer does and why you'd want one.

Collaborator

Off-topic thought: I don't think we actually need a writer as an external class. I think we could just do dataset.write(new_rows). The only reason to have a writer is if you want to modify the output configuration of how data is written to the dataset.

In that case, I think we want something similar to add_schema(): an add_writer() that creates a new writer configuration and appends it to the metadata, after which you can optionally do dataset.write(records, writer=<named_writer>). Then you can update the default writer for the dataset and it travels with the dataset, which matches our core tenet that the dataset is portable (and, as part of that, defines its own writer configurations).
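A sketch of that proposed shape (none of this is implemented; the names come from the comment above):

# Default writer configuration travels with the dataset
dataset.write(new_records)

# Register a named writer configuration in the dataset metadata,
# analogous to add_schema(), then select it explicitly on write
dataset.add_writer("feather_default", file_format="feather")
dataset.write(new_records, writer="feather_default")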

Collaborator

Note: don't act on any of that in this commit; just keep it in mind.

@flliver (Collaborator) left a comment

Do the final nitpicks and merge it!

@anshumankomawar anshumankomawar merged commit 376a7ed into 2.0 Jan 14, 2025
2 of 3 checks passed