feat: add demo for rivulet dataset conversion #444
Conversation
# Step 5 (Optional): Read data from feather file.
#read_feather = feather.read_feather('./.riv-meta-contacts/data/<replace with generated file>')
#print(read_feather)
This isn't quite right, because it's asking you to stick your nose into the private .riv-meta files. You want to either do a dataset.scan() and get back the now merged files and print those, or you want to do a (TBD) dataset.export(file-format=parquet/feather), then read that exported file.
Ideally, have an example of both.
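Something along these lines (the export call is the TBD API mentioned above, so its shape is only a guess):

# Option 1: scan the dataset and print the merged records
for record in dataset.scan().to_arrow():
    print(record)

# Option 2 (hypothetical, pending the export API): export to a file, then read it back
# dataset.export(file_format="feather")
# read_feather = feather.read_feather('<path to exported file>')
# print(read_feather)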
That makes sense, I hadn't run into the scanner yet. Will update it to use the scanner for now (and will switch to the dataset export feature in the future).
How do we want to structure the demos?
If not as a notebook, it looks like the existing examples follow an established pattern with configurable args that I think we can replicate here.
import deltacat as dc

# Step 1: Create a simple 3x3 Parquet file using pyarrow
parquet_file_path = "../contacts.parquet"
Minor: just as a matter of convenience, can we avoid creating these files within directories that are visible to the git repo (e.g., use a temporary directory or some .gitignored directory)?
Similar comment on the metadata_uri.
I think it would be useful for the user to be able to view the generated files if required. But I'll go ahead and add the generated files to the gitignore.
I have added the configurable args as well. Going to leave them empty for now since I don't think this demo needs any configuration, but this should help stay consistent with the other demos.
Following up on Anthony's comment: why not use notebooks for the demo? It seems to be pretty standard. For example:
https://github.com/Eventual-Inc/Daft/tree/main/tutorials
https://github.com/lancedb/vectordb-recipes/tree/main/tutorials/RAG-with_MatryoshkaEmbed-Llamaindex
I followed the other existing examples in the deltacat repository, but from the links you shared, notebooks look like the standard for user-facing demos. It shouldn't take much to update them.
# Step 5: Read data from feather file.
read_records = list(dataset.scan(QueryExpression()).to_arrow())
[print(record) for record in read_records]
Closer! This violates pythonic behavior: instead of passing in an empty QueryExpression, give the query expression a default value so you can just do dataset.scan().to_arrow().
Also, don't wrap it in list(); the generator is already iterable, and examples should always use the simplest, most pythonic expression, i.e.
# Print records
for record in dataset.scan().to_arrow():
print(record)
Is super clear/obvious/pythonic.
Ah, I see what you mean; I didn't realize it was already being treated as an iterator. Will update it.
for record in dataset.scan().to_arrow():
    print(record)
for record in dataset.scan().to_pydict():
    print(record.values())
Prints the values for each record instead of the pyarrow representation.
def run(**kwargs):
    # Step 1: Create a simple 3x3 Parquet file using pyarrow
    parquet_file_path = "./contacts.parquet"
nitpick: consider using pathlib here, like Path.cwd() / "contacts.parquet"
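i.e., something like:

from pathlib import Path

# Build the path from the current working directory instead of a raw string
parquet_file_path = Path.cwd() / "contacts.parquet"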
"""
This demo showcases
1. How to create a dataset from a Parquet file using Deltacat.
The fact that you are breaking it into these modular steps makes me think this would be good to refactor as a notebook where each step is a cell.
name="contacts", | ||
file_uri=parquet_file_path, | ||
metadata_uri=".", | ||
merge_keys="id" |
sanity check - merge keys here can be either a string or a list of strings?
yup it can be either.
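Schematically, both forms (the second merge key here is purely hypothetical):

dataset = dc.Dataset.from_parquet(..., merge_keys="id")
dataset = dc.Dataset.from_parquet(..., merge_keys=["id", "email"])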
# Step 2: Load the Parquet file into a Dataset
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri=parquet_file_path,
Does the file path here only accept a single file? It would be more interesting to give an example using a glob path.
It does support a directory as well, but I'm having issues with the current implementation. I'll update the demo to use a file directory in a future update.
("is_active", dc.Datatype.bool()) | ||
]) | ||
|
||
# Step 4: Append two new records, including values for the new columns. The cool thing with deltacat datasets is |
nitpick: consider a block comment
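e.g. (the tail of the comment is completed from the notebook text further down):

# Step 4: Append two new records, including values for the new columns.
# The cool thing with deltacat datasets is that deltacat will not attempt
# to rewrite the existing Parquet file; instead, it stores the additional
# data alongside it.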
.gitignore (Outdated)

# Generated Files
**/.riv-meta-contacts/
**/contacts.parquet

# PyInstaller
# Usually these files are written by a python script from a template
just do **/.riv-meta-*/ and **/*.parquet so we catch all the intermediate files.
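i.e., the entries above would become:

# Generated Files
**/.riv-meta-*/
**/*.parquet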
" merge_keys=\"id\" # specify the merge key column\n", | ||
")\n", | ||
"print(\"Loaded dataset from Parquet file.\")" |
Don't do comments that just restate the obvious. Either remove the comment or describe what the point of a merge key is.
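e.g., something like this (the merge semantics here are paraphrased from the scan discussion above, so treat the wording as a sketch):

merge_keys="id"  # rows sharing the same "id" are merged on read instead of duplicated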
"cell_type": "markdown", | ||
"source": "### Step 3: Add two new columns to the Dataset", | ||
"id": "ec9db8f54290095b" |
fields (columns)
"source": [ | ||
"### Step 4: Append two new Records\n", | ||
"The cool thing withdeltacat datasets is that deltacat will not attempt to\n", | ||
"rewrite the existing Parquet file; instead, they will store additional data\n", |
no capital on "Records", and "withdeltacat" is missing a space.
"source": [ | ||
"# Open a new writer that will write new data to feather files\n", | ||
"dataset_writer = dataset.writer(file_format=\"feather\")\n", |
obvious comment is obvious. Either remove or describe what a writer does and why you'd want one.
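e.g., a version that explains the why (the phrasing is a sketch based on the append discussion above):

# A writer appends new records as feather files alongside the original
# data, so the source Parquet file never has to be rewritten
dataset_writer = dataset.writer(file_format="feather")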
Off topic thought: I don't think we actually need a writer as an external class. I think we could just do dataset.write(new_rows). The only reason to have a writer is if you want to modify the output configuration of how data is written to the dataset. In that case, I think we want to do something similar to add_schema(): we add_writer(), which creates a new writer configuration and appends it to the metadata, and then you can optionally do dataset.write(records, writer=<named_writer>). Then you can update the default writer for the dataset and it goes with the dataset, which matches our core tenet that the dataset is portable (and, as part of that, defines its own writer configurations).
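A rough sketch of that shape (every call here is hypothetical; none of this exists yet):

# Simple case: write using the dataset's default writer configuration
dataset.write(new_rows)

# Like add_schema(): register a named writer config in the dataset's
# metadata so it travels with the (portable) dataset
dataset.add_writer(name="feather_writer", file_format="feather")

# Optionally write with a specific named writer
dataset.write(records, writer="feather_writer")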
Note: don't do anything I said there in this commit, just keep it in mind.
Do the final nitpicks and merge it!
Summary
This demo shows how to create a dataset using Deltacat, starting with a Parquet file and expanding it with additional columns and records. New data (with the expanded schema) is appended without modifying the original Parquet file, enabling efficient updates. Finally, the dataset is exported to feather files.
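For reference, a condensed sketch of the calls the demo exercises, pieced together from the diffs above:

import deltacat as dc

# Load the Parquet file into a dataset, merging rows on "id"
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri=parquet_file_path,
    metadata_uri=".",
    merge_keys="id",
)

# Append new records through a feather writer; the original Parquet
# file is left untouched
dataset_writer = dataset.writer(file_format="feather")

# Read the merged records back
for record in dataset.scan().to_arrow():
    print(record)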
Rationale
Showcase current functionality and provide an example for users.
Changes
Added a demo script to the deltacat examples folder (i.e. deltacat/examples/rivulet/).
Impact
No impact to existing code.
Testing
Unit tests (make test).
Regression Risk
Checklist
Unit tests covering the changes have been added
E2E testing has been performed