
add default compression to write_dataframe function to compress dl2 #1165

Merged: vuillaut merged 10 commits into main from compress_dl2 on Oct 5, 2023

Conversation

vuillaut (Member)

fixes #1163
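For context, the fix amounts to passing compression filters to the PyTables writer by default. A minimal sketch of what such a `write_dataframe` could look like (signature, defaults, and structure here are illustrative, not the exact PR diff):

```python
import tables


def write_dataframe(dataframe, outfile, table_path, mode="a", index=False,
                    complib="blosc:zstd", complevel=1):
    """Write a pandas DataFrame to HDF5 with compression enabled by default.

    Sketch only: the actual lstchain signature may differ.
    """
    filters = tables.Filters(complevel=complevel, complib=complib)
    group, table_name = table_path.rsplit("/", maxsplit=1)
    with tables.open_file(outfile, mode=mode) as f:
        # Passing a structured array as the description both defines the
        # schema and injects the data into the new table.
        f.create_table(
            group or "/",
            table_name,
            dataframe.to_records(index=index),
            createparents=True,
            filters=filters,
        )
```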

codecov bot commented Sep 18, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (bb39662) 73.97% compared to head (c3cc978) 73.97%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1165   +/-   ##
=======================================
  Coverage   73.97%   73.97%           
=======================================
  Files         124      124           
  Lines       12647    12647           
=======================================
  Hits         9356     9356           
  Misses       3291     3291           
| File | Coverage Δ |
|------|-----------|
| lstchain/io/io.py | 77.82% <100.00%> (ø) |


moralejo (Collaborator)

Well spotted... indeed DL2 files are much bulkier than the corresponding DL1b ones; funny we have not noticed.
Do you know (or can you test) how much the default compression level impacts reading speed?

vuillaut (Member, Author) commented Sep 20, 2023

> Well spotted... indeed DL2 files are much bulkier than the corresponding DL1b ones; funny we have not noticed. Do you know (or can you test) how much the default compression level impacts reading speed?

Hi @moralejo

I just ran the test and here are the results.
Based on these, I would actually advocate for a default compression level of 1: beyond that, the further gains in file size and reading time are marginal, while the writing time keeps increasing.
What do you think?

| Compression level | File size (MB) | Write time (s) | Read time (s) |
|------------------:|---------------:|---------------:|--------------:|
| 0 | 406.6 | 4.6 | 1.2 |
| 1 | 255.7 | 11.0 | 3.8 |
| 2 | 254.8 | 11.3 | 3.9 |
| 3 | 254.1 | 12.2 | 3.9 |
| 4 | 252.9 | 13.1 | 3.9 |
| 5 | 252.5 | 14.8 | 4.0 |
| 6 | 251.9 | 17.6 | 4.3 |
| 7 | 251.8 | 20.5 | 4.2 |
| 8 | 251.7 | 33.1 | 4.5 |
| 9 | 251.7 | 41.8 | 4.4 |
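The benchmark script itself isn't shown in the thread; a minimal sketch of how such a measurement could be run (the input file path is a placeholder, the input table is assumed pandas-readable, and the complib for this first test is assumed to be PyTables' default zlib):

```python
import os
import time

import pandas as pd
import tables

# Placeholder input: any DL2 parameters table written by lstchain.
df = pd.read_hdf("dl2_test_file.h5", key="/dl2/event/telescope/parameters/LST_LSTCam")

for complevel in range(10):
    outfile = f"dl2_complevel{complevel}.h5"
    filters = tables.Filters(complevel=complevel, complib="zlib")

    # Write the table with the given compression level and time it.
    t0 = time.perf_counter()
    with tables.open_file(outfile, mode="w") as f:
        f.create_table("/", "params", df.to_records(index=False), filters=filters)
    write_time = time.perf_counter() - t0

    # Read the full table back and time it.
    t0 = time.perf_counter()
    with tables.open_file(outfile) as f:
        _ = f.root.params[:]
    read_time = time.perf_counter() - t0

    size_mb = os.path.getsize(outfile) / 1e6
    print(f"level {complevel}: {size_mb:6.1f} MB  write {write_time:5.1f} s  read {read_time:4.1f} s")
```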

maxnoe (Member) commented Sep 20, 2023

@vuillaut Did you test that locally or on the cluster? Because I'd suspect that write and read speeds would actually go up when writing much less data to the slowish network file system on the cluster.

vuillaut (Member, Author)

> @vuillaut Did you test that locally or on the cluster? Because I'd suspect that write and read speeds would actually go up when writing much less data to the slowish network file system on the cluster.

Indeed, I tested locally on my laptop; let me run this on the cluster.

vuillaut (Member, Author) commented Sep 20, 2023

I updated the table in my previous comment with numbers from the test at La Palma.

morcuended (Member)

+1 to setting the compression level to 1 by default

vuillaut (Member, Author)

I have set the default complevel=1 and redone the test with blosc:zstd as complib.

| Compression level | File size (MB) | Write time (s) | Read time (s) |
|------------------:|---------------:|---------------:|--------------:|
| 0 | 406.70 | 4.27 | 1.38 |
| 1 | 268.55 | 4.27 | 1.90 |
| 2 | 263.42 | 6.13 | 1.70 |
| 3 | 262.20 | 7.56 | 1.73 |
| 4 | 260.82 | 9.55 | 1.70 |
| 5 | 260.27 | 12.03 | 1.67 |
| 6 | 260.08 | 25.49 | 1.81 |
| 7 | 259.62 | 38.49 | 1.72 |
| 8 | 257.56 | 51.92 | 1.71 |
| 9 | 257.47 | 90.88 | 1.77 |

(I don't think the read times are very meaningful; they probably fluctuate too much with cluster usage.)
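For reference, selecting the Blosc/Zstandard codec in PyTables is just a different Filters specification; a minimal sketch (how exactly this is wired into lstchain's writers is in the PR diff):

```python
import tables

# Blosc wrapping the Zstandard codec; complevel=1 gives most of the size
# reduction at almost no write-time cost (see the table above).
zstd_filters = tables.Filters(complevel=1, complib="blosc:zstd")
```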

I think this is ready for review.

morcuended (Member) left a comment

thanks. I left some comments

(two review threads on lstchain/io/io.py, now resolved)
morcuended previously approved these changes Oct 3, 2023
morcuended (Member) left a comment

looks good to me

vuillaut (Member, Author) commented Oct 3, 2023

Hey @morcuended
Sorry, I did not see your review!
Your comments made me think that maybe we never tried the different compression levels for images, so I ran the test for those as well (the impact for other data is less important, IMO).

| Compression level | File size (MB) | Write time (s) | Read time (s) |
|------------------:|---------------:|---------------:|--------------:|
| 0 | 939.11 | 11.30 | 0.96 |
| 1 | 483.43 | 4.58 | 1.60 |
| 2 | 482.69 | 5.69 | 1.65 |
| 3 | 482.31 | 8.29 | 1.64 |
| 4 | 481.49 | 8.29 | 1.59 |
| 5 | 481.54 | 7.82 | 1.52 |
| 6 | 481.11 | 14.99 | 1.69 |
| 7 | 481.29 | 25.21 | 1.51 |
| 8 | 481.85 | 40.10 | 1.58 |
| 9 | 481.06 | 216.87 | 1.72 |

I'd say the conclusion is the same as for the parameters, so I changed the default complevel to 1 for everything.

I also simplified the write_dataframe function to use ctapipe's write_table.

Could you review again, please?
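For readers unfamiliar with ctapipe, the simplification presumably converts the DataFrame to an astropy Table and delegates writing to ctapipe.io.write_table, which applies its own default compression filters; a minimal sketch under that assumption (the exact keywords in the merged code may differ):

```python
from astropy.table import Table
from ctapipe.io import write_table


def write_dataframe(dataframe, outfile, table_path, index=False):
    """Sketch: write a pandas DataFrame via ctapipe's write_table helper."""
    table = Table.from_pandas(dataframe, index=index)
    # ctapipe handles HDF5 node creation and compression internally.
    write_table(table, outfile, table_path, overwrite=True)
```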

(review thread on lstchain/io/io.py, resolved)
vuillaut (Member, Author) commented Oct 5, 2023

Thanks Daniel!

vuillaut merged commit 675b245 into main on Oct 5, 2023
vuillaut deleted the compress_dl2 branch on October 5, 2023 at 14:11
moralejo (Collaborator)

Apologies for the missed review, and thanks for this, @vuillaut!

Labels: none
Projects: none
Closes: compress dl2 data (#1163)
4 participants