Use of Zstd compression #2937
Comments
I'm also not happy with the extra steps needed, and there is now an HDF5 function that lets us control the filter path programmatically, which means we can solve this whole problem. But for now, set HDF5_PLUGIN_PATH, or else accept the default plugin install location and then you don't have to set anything. Unfortunately, I don't know the details for CMake, but for autoconf I think you use --with-plugin-dir with no argument, and that will use the default location. I will try to swing around to this code and make this easier in a future release. Make sure you also take a look at the quantize feature, which can really improve compression sizes and speeds. |
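As an interim workaround for the point above, an application can set the search path itself before any HDF5-based library is loaded. The following is a minimal sketch, not part of netCDF or HDF5: the helper name `ensure_plugin_path` and the example directory are hypothetical, and it assumes HDF5 consults the HDF5_PLUGIN_PATH environment variable when it first looks for filter plugins.

```python
import os


def ensure_plugin_path(plugin_dir, env=None):
    """Point HDF5 at plugin_dir unless a search path is already set.

    Returns the value HDF5_PLUGIN_PATH holds afterwards.
    """
    env = os.environ if env is None else env
    # Respect an existing, non-empty setting so users can still override it.
    if not env.get("HDF5_PLUGIN_PATH"):
        env["HDF5_PLUGIN_PATH"] = plugin_dir
    return env["HDF5_PLUGIN_PATH"]


if __name__ == "__main__":
    # Hypothetical location; substitute the directory your build reports.
    print(ensure_plugin_path("/usr/local/hdf5/lib/plugin"))
```

Call this before importing or initializing anything that links against HDF5, since a value set after the library has initialized its plugin search may be ignored.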
The underlying problem is that we would need to create a list of compressors |
We decide collectively, based on our judgement of what is most useful and can be sustainably supported. Just as we decided to include zstandard. Criteria include:
I was at the HDF5 workshop for particle physics teams, and they were all using lz4 because it was so much faster. So that's the next one I'll look at. Fortunately John provided an lz4 class for netcdf-java. Recall that the CCR project exists to prototype and explore. So I would suggest that we deal with zstandard today, and if more compressors are to be built in, we deal with them on a case-by-case basis, having thoroughly tested our ideas in CCR. The problems with zstandard, though, are not in the API but in configuration and initialization. We need to make that easier for Greg and others like him... |
@DennisHeimbigner are you already using BLOSC or LZ4 for Zarr stuff? |
We use BLOSC for zarr/nczarr. |
Yes, I thought I was doing that, but still had to define the environment variable. Will look again to see what I was missing during build process to make this work.
Yes, that is working and was very simple to get going. Thanks for all the work you all are doing on netCDF. |
@gsjaardema I would be really interested in any final results you get for the new compression methods - that is, percent faster, or percent improvement in compressed size... |
I haven't been able to get netCDF/HDF5 to find the plugins unless I specify the |
What value do you assume for the default plugin location? |
The library is configured with the location of the local HDF5 plugin directory and that is correctly echoed from
If I just run my executable, it doesn't find the Zstd compression filter with 4.9.2 or with
Then it correctly finds the Zstd compression filter... |
OK, so the way it should work (but apparently does not) is that if you keep your plugins in the directory you told configure, you should not have to set the environment var... |
Based on a reading of docs/filters.md, it looks like you need to set the environment variable at runtime: (my highlighting)
I can somewhat control this from my application which would do the writing, but then (based on minimal attempts), it looks like any downstream application that wants to read my file that has zstd compression in it will also have to set the plugin path environment variable or it will fail to read the file. I would like to use zstandard or even some of the other filters, but currently I think I will be setting myself up for lots of complaints from people who write files with compressed variables and then try to read the file a day/week/month later and have no idea why it fails. Ideally:
This is not a complaint, I really appreciate the work that has gone into this and definitely want to use it... |
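A downstream reader could at least fail with a clearer message by checking for the plugin before opening the file. This is a rough sketch under stated assumptions: the function name `find_filter_plugins` is hypothetical, it matches plugin files by a substring of the file name (the actual plugin file names depend on the installation), and it inspects only the directories listed in HDF5_PLUGIN_PATH, not any default location compiled into the library.

```python
import os


def find_filter_plugins(keyword, search_path=None):
    """Return plugin files whose name contains keyword, in search-path order."""
    if search_path is None:
        search_path = os.environ.get("HDF5_PLUGIN_PATH", "")
    matches = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and directories that do not exist
        for name in sorted(os.listdir(directory)):
            if keyword in name.lower():
                matches.append(os.path.join(directory, name))
    return matches


if __name__ == "__main__":
    found = find_filter_plugins("zstd")
    if found:
        print("zstd plugin candidates:", found)
    else:
        print("No zstd plugin found in the directories listed in "
              "HDF5_PLUGIN_PATH.")
```

An empty result does not prove the read will fail, because the library may still find the plugin in its built-in default directory; it only tells the user where to look first.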
Maybe I am misreading / misinterpreting item 4 in the docs/filters.md I quoted above, but that seems to require setting HDF5_PLUGIN_PATH at runtime no matter what... |
Closely related issues: #2753 |
One problem we are up against is that if we use the H5PLxxx API |
Thanks for the reminder about |
In Issue #2753, I appear to have promised |
I also note that we may have an option to completely bypass the HDF5 dynamic loading |
@gsjaardema I agree this has to be fixed. But how? |
@edwardhartnett I'm not sure what the best solution is. Just looking at Zstd, the "easy" solution seems to be to treat it the same as zlib, quantize, shuffle, and szip: compile it directly into the library and not rely on any plugin paths or other runtime loading. If it is there at build time, it is there at runtime. This doesn't scale well, since then how do you handle blosc or Z123 or the next five ultimate compression libraries... So I think there is also the issue of how to handle plugins in general...
The difficulty for my usage is that I want to be able to query at my build time what capabilities are available in netCDF, HDF5, CGNS, matio, and maybe the other libraries I use, then decide in my code how to build my libraries and what capabilities to expose/support; my libraries are then used in other applications. If Zstd is advertised as supported by netCDF, then I should just be able to link with netCDF and support Zstd instead of having to wonder if something will happen at runtime that will cause Zstd to not be available. There is enough difficulty in making sure the entire tool chain on multiple hosts will all have netCDF libraries that support Zstd, quantize, and other features, without adding the risk that all of this is tested at build/install time but then fails at runtime...
If plugins are the way to do it, then I would like the plugin directory that I specify at build time to be searched at runtime without me having to specify anything at runtime. If something does change, then specifying HDF5_PLUGIN_PATH or some other environment variable is helpful, and the ability to add new capabilities through plugins is nice to have... I think for Zstd since there is an explicit
There is still the difficulty of using a new feature that does not exist in older versions of the library. Quantize is nice since it is done at write time and does not need to be supported in the applications that are reading the file.
Compression is harder since it is needed at both write and read time, so the entire toolchain needs to be updated to support it once it becomes possible to write files using it. It is also difficult because an older library doesn't even know what Zstd is, so it can't give a meaningful error message about a feature created after the reading library was installed... (We still get some random failures at times when a user's path points to an older netCDF application that doesn't know about netcdf-4 or some other feature that has existed for an eternity...) So I'm not sure my rambling has any solutions or recommendations in it. It is a hard problem, and for read/write libraries the problem is even harder since the need for the capability (zstd, netcdf-4) follows the file, which can move among multiple hosts and be consumed by applications that are not always under our control... |
Currently, the HDF5 API is not very good at exporting that info. NetCDF is doable. Do not know about the other libraries you mention. |
I am going to try to start tackling this piecemeal. |
Question: when you built libhdf5, did you set the option
|
@gsjaardema did you get this resolved? I have just made a bunch of changes for next release to make this work a little better, and to document it. Hopefully that will make it easier for future users. If there's no remaining problem, please close this issue. |
I will try to look at it this week. Thanks for all the work you and Dennis did on this. |
OK, I am trying
I then simply did a
And then the Configuration Summary:
So it looks like the plugin installation directory is not being persisted and gets reset on a subsequent reconfigure. |
This is correct and expected behavior; expected in that I've just been working in that part of the code and, indeed, that is what the logic dictates should happen. I'm open to the discussion about having the cached value used if it is set! |
Should it be cached? |
If I have configured the CMake build and edit the CMakeCache.txt to, for example, change the build from The
I have ended up with a directory named "YES" on some of the builds. If I explicitly set Appreciate all the work being done in this area, but just giving some feedback on some non-intuitive (at least to me) behavior I am seeing. |
I use the CMake build, so use:
I was not using this at the beginning, but started at some point in this process... |
OK, but what specifically can I do to make it better? Cache HDF5_DEFAULT_PLUGINDIR? |
Something related to the plugindir should be cached such that if I configure and then It is unclear what |
I also see that there are two listings of the plugin directory in the libnetcdf.settings:
In my configurations/builds, both of those show the same value which I assume is correct, but I'm unclear what to do if they ever differ and if they are always supposed to be the same, should one of them be removed and hopefully simplify some logic somewhere. |
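One way to notice if the two listings ever diverge is to compare them in a small script run after each build. The sketch below is illustrative only: the function names and the key names in the sample text are hypothetical, and the real entries in libnetcdf.settings may be spelled differently. It assumes nothing beyond the file containing "Key: value" lines.

```python
def plugin_dir_entries(settings_text):
    """Collect 'Key: value' lines whose key mentions a plugin directory."""
    entries = {}
    for line in settings_text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep and "plugin" in key.lower():
            entries[key.strip()] = value.strip()
    return entries


def plugin_dirs_agree(settings_text):
    """True when every plugin-related entry holds the same value."""
    return len(set(plugin_dir_entries(settings_text).values())) <= 1


if __name__ == "__main__":
    # Hypothetical excerpt; the real key names may differ.
    sample = """\
Plugin Install Prefix:  /opt/netcdf/plugins
Plugin Search Path:     /opt/netcdf/plugins
Zstandard Support:      yes
"""
    print(plugin_dir_entries(sample))
    print("consistent:", plugin_dirs_agree(sample))
```

If the two values are always meant to be identical, a check like this failing would be a sign of a configuration bug rather than a supported setup.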
OK, let me take a look at this... |
I'm a little late to the party, but have been starting to look at using the Zstd library for compression in netCDF.
Am I misunderstanding something, or do I have to explicitly set the environment variable HDF5_PLUGIN_DIR to the location of the directory containing the filter for zstd prior to running an application that wants to use Zstd compression? It seems like using Zstd instead of zlib is making me jump through lots of hoops instead of "just working" like zlib has/does... Thinking that maybe I am missing something...