-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PECO-953] Optimize CloudFetchResultHandler memory consumption #204
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…ltsHelper into provider of TRowSet Signed-off-by: Levko Kravets <[email protected]>
…vider interface Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
…size Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
Signed-off-by: Levko Kravets <[email protected]>
kravets-levko
requested review from
arikfr,
superdupershant,
yunbodeng-db,
susodapop,
nithinkdb and
andrefurlan-db
as code owners
November 22, 2023 10:56
kravets-levko
temporarily deployed
to
azure-prod
November 28, 2023 12:18 — with
GitHub Actions
Inactive
rcypher-databricks
approved these changes
Nov 28, 2023
Merged
kravets-levko
temporarily deployed
to
azure-prod
November 30, 2023 20:57 — with
GitHub Actions
Inactive
kravets-levko
changed the title
Optimize CloudFetchResultHandler memory consumption
[PECO-953] Optimize CloudFetchResultHandler memory consumption
Nov 30, 2023
nithinkdb
approved these changes
Dec 4, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
PECO-953
Initially, we implemented CloudFetch result handler on top of Arrow result handler. The problem with this solution was that Arrow result handler operates with batches returned directly in
TRowSet
s, which are small (maximum tens of kilobytes) and contain few hundred records each. So Arrow handler was just immediately unpacking and processing the whole batch.On the other hand, CloudFetch results are stored in files of 10MB and more, and attempt to unpack the whole file lead to quite intensive memory usage (in some cases up to 1GB Nodejs RSS / 700MB heap).
The solution is to not unpack the whole received file. Instead, we read and process batches one by one. Each of them is small (approximately like batches returned in
TRowSet
), and usually by the time when next batch is requested - previous one is no longer needed, so Node can collect and reuse this memory easily.Profiler reports before and after the changes (10'000'000 records two fields each, 1 concurrent download for better visibility)
Before
After
Notes for reviewers
Previously
ArrowResultHandler
was collecting arrow batches and converting them to objects,CloudFetchResultHandler
was inherited fromArrowResultHandler
and was overriding batch collecting method.Now,
ArrowResultHandler
andCloudFetchResultHandler
are separated. Both just collect raw (binary) arrow batches - each using own way - and pass them toArrowResultConverter
.ArrowResultConverter
contains data conversion code that previously was inArrowResultHandler
, but uses new mechanism to unpack binary batches (old one was reading all records at once, new one reads them one my one).Tests were mostly updated to reflect those changes, no much new code added there.