add kwarg chunksize for default data partitioning for write #400
I've changed the condition for automatic partitioning to be In addition, I've changed
Okay, I was wrong. I misunderstood what I've moved back to Tables.rows to ensure we get rows out. I'm not sure what the best solution is here. EDIT:
Added compat for DataFrames via Extras
We could allow users to optionally provide the Schema in the Base.open constructor of the Writer object. If a user makes use of this, then we should validate that the actual schema of each chunk matches the expected schema.
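A minimal sketch of what such a per-chunk check could look like, using only Base Julia. The names `chunk_schema` and `validate_chunk` are hypothetical and are not part of Arrow.jl or Tables.jl; chunks are assumed to be column tables in the form of NamedTuples of vectors.

```julia
# Hypothetical helper: describe a column table (NamedTuple of vectors)
# as a list of name => element-type pairs.
chunk_schema(chunk::NamedTuple) = [name => eltype(col) for (name, col) in pairs(chunk)]

# Hypothetical check: throw if a chunk's schema differs from the expected one.
function validate_chunk(expected, chunk)
    actual = chunk_schema(chunk)
    actual == expected ||
        throw(ArgumentError("chunk schema $actual does not match expected $expected"))
    return chunk
end

expected = [:id => Int, :score => Union{Missing,Float64}]

good = (id = [1, 2], score = Union{Missing,Float64}[0.5, missing])
bad  = (id = [3, 4], score = [0.1, 0.2])  # eltype is Float64: missingness is not declared

validate_chunk(expected, good)            # passes and returns the chunk
# validate_chunk(expected, bad)           # would throw an ArgumentError
```

A strict equality check like this rejects a chunk whose column type is narrower than expected; whether a real implementation should convert such chunks instead is a design decision this sketch does not settle.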
""" | ||
function write end

write(io_or_file; kw...) = x -> write(io_or_file, x; kw...)

-function write(file_path, tbl; kwargs...)
+function write(file_path, tbl; chunksize::Union{Nothing,Integer}=64000, kwargs...)
I think chunksize should move to be a new field in Writer with default kwarg value set in the Base.open constructor on L170. This would eliminate the code duplication.
    if !isnothing(chunksize) && Tables.istable(tbl) && Tables.rowaccess(tbl)
        @assert chunksize >= 0 "chunksize must be >= 0"
        if hasmethod(Iterators.partition,(typeof(tbl),))
            tbl_source = Iterators.partition(tbl, chunksize)
Can we use Iterators.partition from Base rather than DataFrames to prevent adding one more dependency?
https://docs.julialang.org/en/v1/base/iterators/#Base.Iterators.partition
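For reference, a small sketch of how Iterators.partition from Base behaves on a plain row table, with no extra dependency. The row table here (a Vector of NamedTuples) and the chunk size of 4 are illustrative assumptions, not values from the PR.

```julia
# A row table in its simplest form: a Vector of NamedTuples.
rows = [(id = i, value = i^2) for i in 1:10]

chunksize = 4  # illustrative; the PR's default is 64000

# Iterators.partition is lazy; collect materializes the chunks.
chunks = collect(Iterators.partition(rows, chunksize))

# The last chunk holds the remainder, so chunk lengths are 4, 4 and 2.
chunk_lengths = length.(chunks)
```

Each chunk is a view into the original vector, so no rows are copied until a chunk is consumed.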
This PR proposes to introduce automated partitioning of the provided tables when writing. It follows my findings from benchmarking against PyArrow.
Nowadays, most machines are multithreaded and Arrow.write() provides multithreaded writing for partitioned data. However, a user must explicitly partition their data. Unfortunately, most users do not realize that both their write and subsequent read operations will not be multithreaded without such partitioning (there is an issue to improve the docs).
This PR defaults to partitioning data if it's larger than 64K rows (should be beneficial on most systems) to enable better Arrow.jl performance on both read and write.
Implementation:
chunksize (maps to PyArrow and should be broadly understood)
chunksize of 64000 rows, as per PyArrow.write_feather
chunksize=nothing
partitioning is done via Iterators.partition(Tables.rows(tbl),chunksize) for all Tables.jl-compatible sources (checks Tables.istable), changed to Iterators.partition(tbl,chunksize) to avoid missingness getting lost (eg, for DataFrames)
Some resources:
Iterators.partition in 1.5 Release
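To illustrate the missingness concern from the implementation notes, here is a hedged sketch in Base Julia only. It assumes a column table stored as a NamedTuple of vectors and chunks it by slicing every column with the same index range, which keeps each column's declared element type even in chunks that happen to contain no missing values. This is not the code from the PR, and the table and chunk size are made up for the example.

```julia
# Column table where only the last row has a missing score.
score = Vector{Union{Missing,Float64}}(fill(1.0, 10))
score[end] = missing
tbl = (id = collect(1:10), score = score)

chunksize = 4  # illustrative; the PR's default is 64000
nrows = length(tbl.id)

# Slice every column with the same index range.
# map over a NamedTuple keeps the column names.
chunks = [map(col -> col[idx], tbl) for idx in Iterators.partition(1:nrows, chunksize)]

# Every chunk keeps the declared element type, including the first two,
# which contain no missing values at all.
score_types = [eltype(c.score) for c in chunks]
```

Chunks built this way could then be handed to the writer as separate partitions, for example through Tables.partitioner, assuming Arrow.jl and Tables.jl are installed; that step is left out so the sketch stays dependency-free.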