-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pre-allocate buffer #422
Pre-allocate buffer #422
Conversation
Waiting on release of |
any thoughts on this? I know you were aware of this issue, and thanks a lot for looking into quite a few issues here and in the TransdocingStream |
If we let transcode to its own allocation it will allocate a small vector, start filling it, resize the vector, fill it some more, resize the vector, etc. Instead in this commit we pre-allocate a vector of the corect size and pass it to transcode(). Inspired by apache#399
697df28
to
383d0fb
Compare
I have some thoughts. One solution to "my dataset is larger than memory" is partitioning. If your dataset is partitioned in such a way that each partition fits in memory, you can iterate it with
You can do this right now without requiring any additional features from this package. In contrast what is discussed in #340 (which is: don't decompress if you don't have to) is a different approach, but doesn't yet exist. Currently I have some commits that add the feature to multi-thread decompression at the buffer level. I will be trying to upstream what I have so far. The difficulty is that these commits touch a lot of code, so this won't happen overnight. I imagine couple of weeks? On top of that it should be straightforward to implement what is discussed in #340. |
the problem of this approach is it's decompressing every column. Consider examples such as: Decompressing every column would be super slower if I'm only using a small % of columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@quinnj / @ericphanson - any comments before we merge? |
nope, LGTM! |
If we let transcode to its own allocation it will allocate a small vector, start filling it, resize the vector, fill it some more, resize the vector, etc. Instead in this commit we pre-allocate a vector of the corect size and pass it to transcode(). Inspired by apache#399
If we let transcode to its own allocation it will allocate a small vector, start filling it, resize the vector, fill it some more, resize the vector, etc.
Instead in this commit we pre-allocate a vector of the corect size and pass it to transcode().
Inspired by #399