[PECO-953] Optimize CloudFetchResultHandler memory consumption #204

Merged
merged 15 commits into from
Dec 4, 2023

@kravets-levko kravets-levko commented Nov 21, 2023

PECO-953

Initially, we implemented the CloudFetch result handler on top of the Arrow result handler. The problem with this approach is that the Arrow result handler works with batches returned directly in TRowSets, which are small (tens of kilobytes at most) and contain a few hundred records each, so the Arrow handler simply unpacked and processed each whole batch immediately.

CloudFetch results, on the other hand, are stored in files of 10MB and more, and attempting to unpack a whole file led to quite intensive memory usage (in some cases up to 1GB Node.js RSS / 700MB heap).

The solution is to not unpack the whole received file. Instead, we read and process batches one by one. Each batch is small (roughly the size of the batches returned in a TRowSet), and by the time the next batch is requested the previous one is usually no longer needed, so Node can collect and reuse that memory easily.
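The batch-by-batch idea can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `packBatches` and `BatchReader` are hypothetical names, and the length-prefixed framing only stands in for the real Arrow IPC format used by CloudFetch files. The point is that the reader hands out one batch per `next()` call instead of materializing every batch up front.

```typescript
// Hypothetical stand-in for a downloaded file: a sequence of chunks, each
// stored as a 4-byte big-endian length followed by the payload.
// Real CloudFetch files are Arrow IPC streams; this framing is illustrative only.
function packBatches(batches: Uint8Array[]): Uint8Array {
  const total = batches.reduce((sum, batch) => sum + 4 + batch.length, 0);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  let offset = 0;
  for (const batch of batches) {
    view.setUint32(offset, batch.length);
    out.set(batch, offset + 4);
    offset += 4 + batch.length;
  }
  return out;
}

// Hypothetical reader: returns the next batch on demand, or undefined at the end.
class BatchReader {
  private readonly file: Uint8Array;
  private readonly view: DataView;
  private offset = 0;

  constructor(file: Uint8Array) {
    this.file = file;
    this.view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  }

  next(): Uint8Array | undefined {
    if (this.offset >= this.file.length) {
      return undefined;
    }
    const length = this.view.getUint32(this.offset);
    const start = this.offset + 4;
    this.offset = start + length;
    // subarray() returns a view over the file buffer, so nothing is copied here.
    return this.file.subarray(start, this.offset);
  }
}
```

In this sketch the raw file still stays in memory; what changes is that only one batch at a time needs to exist in decoded form, so a consumer that drops each batch before asking for the next keeps the live set small.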

  • Optimize CloudFetch result handler
  • Add/update tests

Profiler reports before and after the changes (10'000'000 records with two fields each, 1 concurrent download for better visibility):

Before

image

After

image

Notes for reviewers

Previously, ArrowResultHandler collected Arrow batches and converted them to objects. CloudFetchResultHandler inherited from ArrowResultHandler and overrode the batch-collecting method.

Now ArrowResultHandler and CloudFetchResultHandler are separate. Both just collect raw (binary) Arrow batches, each in its own way, and pass them to ArrowResultConverter. ArrowResultConverter contains the data conversion code that was previously in ArrowResultHandler, but uses a new mechanism to unpack binary batches (the old one read all records at once, the new one reads them one by one).

Tests were mostly updated to reflect these changes; not much new code was added there.
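A rough sketch of that separation, for readers who have not opened the diff. All names and signatures here (`RawBatchSource`, `InlineBatchSource`, `DownloadedBatchSource`, `BatchConverter`, `collectRows`) are hypothetical and are not the driver's actual classes; the sketch is synchronous for brevity, and the `decode` callback stands in for the real Arrow unpacking.

```typescript
// Hypothetical common shape: each source only collects raw binary batches.
interface RawBatchSource {
  hasMore(): boolean;
  fetchNext(): Uint8Array[];
}

// Stand-in for the inline case: batches arrive directly in row sets.
class InlineBatchSource implements RawBatchSource {
  private readonly rowSets: Uint8Array[][];
  private index = 0;

  constructor(rowSets: Uint8Array[][]) {
    this.rowSets = rowSets;
  }

  hasMore(): boolean {
    return this.index < this.rowSets.length;
  }

  fetchNext(): Uint8Array[] {
    if (!this.hasMore()) return [];
    const batches = this.rowSets[this.index];
    this.index += 1;
    return batches;
  }
}

// Stand-in for the download case: batches come from files fetched by link.
class DownloadedBatchSource implements RawBatchSource {
  private readonly links: string[];
  private readonly download: (link: string) => Uint8Array[];
  private index = 0;

  constructor(links: string[], download: (link: string) => Uint8Array[]) {
    this.links = links;
    this.download = download;
  }

  hasMore(): boolean {
    return this.index < this.links.length;
  }

  fetchNext(): Uint8Array[] {
    if (!this.hasMore()) return [];
    const link = this.links[this.index];
    this.index += 1;
    return this.download(link);
  }
}

// Shared conversion step: decodes one raw batch per call, whatever the source.
class BatchConverter<Row> {
  private readonly source: RawBatchSource;
  private readonly decode: (batch: Uint8Array) => Row[];
  private pending: Uint8Array[] = [];

  constructor(source: RawBatchSource, decode: (batch: Uint8Array) => Row[]) {
    this.source = source;
    this.decode = decode;
  }

  hasMore(): boolean {
    return this.pending.length > 0 || this.source.hasMore();
  }

  fetchNext(): Row[] {
    if (this.pending.length === 0 && this.source.hasMore()) {
      this.pending = [...this.source.fetchNext()];
    }
    const batch = this.pending.shift();
    return batch === undefined ? [] : this.decode(batch);
  }
}

// Helper for the sketch: drains a converter into a single array of rows.
function collectRows<Row>(converter: BatchConverter<Row>): Row[] {
  const rows: Row[] = [];
  while (converter.hasMore()) {
    rows.push(...converter.fetchNext());
  }
  return rows;
}
```

With this shape, the conversion code exists in one place and neither source needs to inherit from the other; swapping the source does not change how rows are produced.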

@kravets-levko kravets-levko marked this pull request as ready for review November 22, 2023 10:56
Base automatically changed from fix-max-rows-behavior to main November 28, 2023 11:53
@kravets-levko kravets-levko mentioned this pull request Nov 29, 2023
@kravets-levko kravets-levko changed the title Optimize CloudFetchResultHandler memory consumption [PECO-953] Optimize CloudFetchResultHandler memory consumption Nov 30, 2023

@nithinkdb nithinkdb left a comment

LGTM

@kravets-levko kravets-levko merged commit 5c5b87f into main Dec 4, 2023
5 checks passed
@kravets-levko kravets-levko deleted the optimize-cloudfetch-handler branch December 4, 2023 20:14