
feat: add demo for rivulet dataset conversion #444

Merged: 6 commits into 2.0 on Jan 14, 2025

Conversation

anshumankomawar (Collaborator)

Summary

This demo shows how to create a dataset with Deltacat, starting from a Parquet file and expanding it with additional columns and records. New data (with the expanded schema) is appended without modifying the original Parquet file, enabling efficient updates. Finally, the dataset is exported to feather files.
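For reference, a minimal sketch of the flow, assembled from snippets quoted later in this thread (the field-add and record-write calls are assumptions; their exact names aren't shown here):

import deltacat as dc

# Load the Parquet file into a dataset keyed on "id"
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri="./contacts.parquet",
    metadata_uri=".",
    merge_keys="id",
)

# Widen the schema with new columns (method name assumed)
dataset.add_fields([
    ("is_active", dc.Datatype.bool()),
])

# Append new records to feather files without rewriting the original Parquet
dataset_writer = dataset.writer(file_format="feather")
dataset_writer.write(new_records)  # hypothetical call; new_records elided

# Read back the merged view (QueryExpression import elided; see discussion below)
for record in dataset.scan(QueryExpression()).to_arrow():
    print(record)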

Rationale

Showcase current functionality and provide an example for users.

Changes

Added demo script to deltacat examples folder (i.e. deltacat/examples/rivulet/).

Impact

No impact to existing code.

Testing

Unit tests (make test).

Regression Risk

Checklist

• Unit tests covering the changes have been added
  • If this is a bugfix, regression tests have been added
• E2E testing has been performed

Comment on lines 77 to 80

# Step 5 (Optional): Read data from feather file.
#read_feather = feather.read_feather('./.riv-meta-contacts/data/<replace with generated file>')
#print(read_feather)
Collaborator

This isn't quite right, because it's asking you to stick your nose into the private .riv-meta files. You want to either do a dataset.scan() and get back the now-merged records and print those, or do a (TBD) dataset.export(file-format=parquet/feather), then read that exported file.

Collaborator

Ideally, have an example of both.
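Something like this, where the export call is the TBD API mentioned above and its signature is assumed:

import pyarrow.feather as feather

# Option 1: scan the dataset and print the merged records
for record in dataset.scan().to_arrow():
    print(record)

# Option 2 (TBD API, shape assumed): export the dataset, then read it back
dataset.export("./contacts_export.feather", file_format="feather")
print(feather.read_feather("./contacts_export.feather"))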

Collaborator Author

That makes sense; I hadn't run into the scanner yet. I'll update it to use the scanner for now (and switch to the dataset export feature in the future).

@thesalus (Collaborator) left a comment

How do we want to structure the demos?

If not as a notebook, the existing examples follow a common pattern with configurable args that I think we can replicate here.
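Roughly this shape, assuming the other examples use an argparse-style entry point (the exact pattern isn't quoted in this thread):

import argparse

def run(**kwargs):
    # demo steps go here
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="rivulet dataset conversion demo")
    # No args defined yet; the parser is kept for consistency with other examples
    run(**vars(parser.parse_args()))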

import deltacat as dc

# Step 1: Create a simple 3x3 Parquet file using pyarrow
parquet_file_path = "../contacts.parquet"
Collaborator

Minor: just as a matter of convenience, can we avoid creating these files within directories that are visible to the git repo (e.g., use a temporary directory or some .gitignored directory)?

Similar comment on the metadata_uri.
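For example, with the standard library's tempfile (a sketch; the variable names are taken from the demo snippets in this thread):

import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    parquet_file_path = str(Path(tmp) / "contacts.parquet")
    metadata_uri = tmp  # keeps the rivulet metadata out of the repo too
    # ... run the demo against these paths; everything is removed on exit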

Collaborator Author

I think it would be useful for the user to be able to view the generated files if required. But I'll go ahead and add the generated files to the gitignore.

Collaborator Author

I have added the configurable args as well. I'm going to leave it empty for now since I don't think this demo needs any configuration, but this should help keep it consistent with the other demos.

Collaborator

Following up on Anthony's comment: why not use notebooks for the demo? It seems to be pretty standard. For example:
https://github.com/Eventual-Inc/Daft/tree/main/tutorials
https://github.com/lancedb/vectordb-recipes/tree/main/tutorials/RAG-with_MatryoshkaEmbed-Llamaindex

Collaborator Author

I followed the other existing examples in the deltacat repository, but from the links you shared, notebooks look like the standard for user-facing demos. It shouldn't take much to update them.

Comment on lines 78 to 81

# Step 5: Read data from feather file.
read_records = list(dataset.scan(QueryExpression()).to_arrow())
[print(record) for record in read_records]
@flliver (Collaborator) Jan 10, 2025

Closer! This still isn't Pythonic: instead of passing in an empty QueryExpression, give the query expression a default value so you can do dataset.scan().to_arrow().

Also, don't wrap it in list(); the generator can already be iterated, and examples should always use the simplest, Pythonic expression, i.e.

# Print records
for record in dataset.scan().to_arrow():
    print(record)

is super clear, obvious, and Pythonic.
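One way to give scan() that default without a mutable default argument (a sketch of the suggestion, assuming QueryExpression() means match-all):

from typing import Optional

def scan(self, query: Optional[QueryExpression] = None):
    # Default to a match-all query so callers can just write dataset.scan()
    query = query if query is not None else QueryExpression()
    ...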

Collaborator Author

Ah, I see what you mean; I didn't realize it was already being treated as an iterator. Will update it.

@anshumankomawar anshumankomawar marked this pull request as draft January 11, 2025 02:27
@anshumankomawar anshumankomawar marked this pull request as ready for review January 11, 2025 02:43
for record in dataset.scan().to_arrow():
    print(record)
for record in dataset.scan().to_pydict():
    print(record.values())
Collaborator Author

Prints the values for each record instead of the pyarrow representation.


def run(**kwargs):
    # Step 1: Create a simple 3x3 Parquet file using pyarrow
    parquet_file_path = "./contacts.parquet"
Collaborator

nitpick: consider using pathlib here, like Path.cwd() / "contacts.parquet"
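i.e., something like:

from pathlib import Path

parquet_file_path = Path.cwd() / "contacts.parquet"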


"""
This demo showcases
1. How to create a dataset from a Parquet file using Deltacat.
Collaborator

The fact that you are breaking it into these modular steps makes me think this would be good to refactor as a notebook, where each step is a cell.

name="contacts",
file_uri=parquet_file_path,
metadata_uri=".",
merge_keys="id"
Collaborator

Sanity check: can merge keys here be either a string or a list of strings?

Collaborator Author

Yup, it can be either.
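For illustration, both accepted forms (the composite "region" key is hypothetical):

# Single merge key
dataset = dc.Dataset.from_parquet(
    name="contacts", file_uri=parquet_file_path, metadata_uri=".",
    merge_keys="id",
)

# Composite merge key, list form
dataset = dc.Dataset.from_parquet(
    name="contacts", file_uri=parquet_file_path, metadata_uri=".",
    merge_keys=["id", "region"],
)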

# Step 2: Load the Parquet file into a Dataset
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri=parquet_file_path,
Collaborator

Does file_uri here only accept a single file? It would be more interesting to give an example using a glob path.

@anshumankomawar (Collaborator Author) Jan 13, 2025

It does support a directory as well, but I'm having issues with the current implementation. I'll update the demo to support a file directory in a future update.
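For reference, the glob version might eventually look like this (hypothetical; as noted, the current implementation has issues with it):

dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri="./data/*.parquet",  # glob over many Parquet files
    metadata_uri=".",
    merge_keys="id",
)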

("is_active", dc.Datatype.bool())
])

# Step 4: Append two new records, including values for the new columns. The cool thing with deltacat datasets is
Collaborator

nitpick: consider a block comment

@anshumankomawar anshumankomawar marked this pull request as draft January 14, 2025 00:33
@anshumankomawar anshumankomawar marked this pull request as ready for review January 14, 2025 00:34
.gitignore Outdated
Comment on lines 32 to 38

# Generated Files
**/.riv-meta-contacts/
**/contacts.parquet

# PyInstaller
# Usually these files are written by a python script from a template
Collaborator

Just do **/.riv-meta-* and **/*.parquet so we catch all the intermediate files.

Comment on lines 69 to 71
" merge_keys=\"id\" # specify the merge key column\n",
")\n",
"print(\"Loaded dataset from Parquet file.\")"
Collaborator

Don't do comments that just restate the obvious. Either remove the comment or describe what the point of a merge key is.

Comment on lines 79 to 81
"cell_type": "markdown",
"source": "### Step 3: Add two new columns to the Dataset",
"id": "ec9db8f54290095b"
Collaborator

fields (columns)

Comment on lines 100 to 103
"source": [
"### Step 4: Append two new Records\n",
"The cool thing withdeltacat datasets is that deltacat will not attempt to\n",
"rewrite the existing Parquet file; instead, they will store additional data\n",
Collaborator

No capital on "Records"; missing a space in "withdeltacat".

Comment on lines 111 to 113
"source": [
"# Open a new writer that will write new data to feather files\n",
"dataset_writer = dataset.writer(file_format=\"feather\")\n",
Collaborator

obvious comment is obvious. Either remove or describe what a writer does and why you'd want one.

Collaborator

Off-topic thought: I don't think we actually need a writer as an external class. I think we could just do dataset.write(new_rows). The only reason to have a writer is if you want to modify the output configuration of how data is written to the dataset.

In that case, I think we want something similar to add_schema(): an add_writer() that creates a new writer configuration and appends it to the metadata, after which you can optionally do dataset.write(records, writer=<named_writer>). Then you can update the default writer for the dataset and it travels with the dataset, which matches our core tenet that the dataset is portable (and, as part of that, defines its own writer configurations).
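A sketch of that proposed shape (none of this is implemented; the names come from the comment above):

# Default writer configuration travels with the dataset
dataset.write(new_records)

# Register a named writer configuration in the dataset metadata,
# analogous to add_schema(), then select it explicitly on write
dataset.add_writer("feather_default", file_format="feather")
dataset.write(new_records, writer="feather_default")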

Collaborator

Note: don't act on any of that in this commit; just keep it in mind.

@flliver (Collaborator) left a comment

Do the final nitpicks and merge it!

@anshumankomawar anshumankomawar merged commit 376a7ed into 2.0 Jan 14, 2025
2 of 3 checks passed