Implement ZstdZarrCompressor #149
Pull Request Test Coverage Report for Build 9654302116
💛 - Coveralls
An alternative to the string lookup for the compressor would be to just pass in
struct ZstdZarrCompressor <: Zarr.Compressor
    compressor::CodecZstd.ZstdCompressor
    decompressor::CodecZstd.ZstdDecompressor
end
This doesn't support multithreaded use IIUC. I think this should be like
Lines 129 to 131 in f436713
struct ZlibCompressor <: Compressor
    clevel::Int
end
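One way to read the suggestion above, as a minimal sketch: the compressor struct holds only immutable settings, and any mutable codec state is created inside each call, so concurrent tasks never share it. `LevelOnlyCompressor`, `ScratchContext`, and `compress_chunk` are hypothetical names, and the "compression" is a placeholder copy rather than a call into CodecZstd or Zarr.jl.

```julia
# Hypothetical stand-in for the mutable state a real codec keeps internally.
# Not part of Zarr.jl or CodecZstd.
mutable struct ScratchContext
    level::Int
    buffer::Vector{UInt8}
end

# Settings-only compressor: immutable, so sharing it between tasks is safe.
struct LevelOnlyCompressor
    clevel::Int
end

# Each call builds its own context, so no mutable state is shared.
function compress_chunk(c::LevelOnlyCompressor, data::Vector{UInt8})
    ctx = ScratchContext(c.clevel, UInt8[])
    append!(ctx.buffer, data)  # placeholder for the real compression step
    return ctx.buffer
end

c = LevelOnlyCompressor(3)
chunks = [rand(UInt8, 16) for _ in 1:8]

# One task per chunk, all sharing the same immutable compressor.
tasks = [Threads.@spawn compress_chunk(c, chunk) for chunk in chunks]
results = fetch.(tasks)
```

The trade-off in this sketch is that a context is allocated per call; reusing contexts would need a per-task pool or a lock.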
I initially wrote it like this, but then I was thinking about all the other potential parameters, even if they do not need to be serialized. I think what we should implement is the ability to copy a compressor.
Frankly, I'm somewhat confused about why one actually needs to serialize the compression level into the array metadata. You do not need that information to decompress the data.
Maybe I don't understand this correctly, but what would happen in a scenario where a user opens an existing array and wants to add some new data? Of course one can set a different compression level for the new chunks, but for consistency of the dataset I think it is good to write all compression parameters to the metadata struct.
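The consistency argument above can be sketched as a round trip through metadata: the level is written when the array is created and read back when it is reopened, so new chunks use the same setting. `StoredLevelCompressor`, `to_metadata`, `from_metadata`, and the `"id"` and `"level"` keys are hypothetical names for illustration, not the Zarr.jl serialization API.

```julia
# Hypothetical settings-only compressor; the field name follows the
# ZlibCompressor example in this thread.
struct StoredLevelCompressor
    clevel::Int
end

# Write every setting into a metadata dictionary (key names are assumed).
to_metadata(c::StoredLevelCompressor) =
    Dict{String,Any}("id" => "sketch", "level" => c.clevel)

# Rebuild the compressor from stored metadata when an existing array is reopened.
from_metadata(d::AbstractDict) = StoredLevelCompressor(d["level"])

original = StoredLevelCompressor(5)
meta = to_metadata(original)

# Chunks appended later reuse the stored level instead of a session default.
reopened = from_metadata(meta)
```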
But this implementation respects all of this, and the other compressors in Zarr.jl currently don't work multithreaded either, so this is OK from my side.
In addition to the thread safety issues, this also leaks memory.
For other potential parameters what about something like:
https://github.com/nhz2/ChunkCodecs.jl/blob/799b154bd400633f0ae3bd1cf78d0cc95957f2cf/ChunkCodecLibZstd/src/encode.jl#L21-L25
struct ZstdEncodeOptions <: EncodeOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end
Where the advanced parameters are set with ZSTD_CCtx_setParameter after the compression level and checksum options are set.
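The ordering described above can be sketched without calling into libzstd by building the sequence of (parameter => value) pairs that would be handed to ZSTD_CCtx_setParameter one by one. `SketchZstdOptions` and `parameter_sequence` are hypothetical names, and the numeric ids are assumed to match the `ZSTD_cParameter` enum in `zstd.h`; treat this as an illustration, not the ChunkCodecLibZstd implementation.

```julia
# Assumed numeric ids from zstd.h's ZSTD_cParameter enum.
const ZSTD_c_compressionLevel = Cint(100)
const ZSTD_c_windowLog = Cint(101)
const ZSTD_c_checksumFlag = Cint(201)

# Hypothetical mirror of the options struct quoted above (no supertype here).
struct SketchZstdOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end

# Return the pairs in the order they would be applied:
# compression level, then checksum flag, then the advanced parameters.
function parameter_sequence(o::SketchZstdOptions)
    seq = Pair{Cint, Cint}[
        ZSTD_c_compressionLevel => o.compressionLevel,
        ZSTD_c_checksumFlag => Cint(o.checksum),
    ]
    append!(seq, o.advanced_parameters)
    return seq
end

opts = SketchZstdOptions(3, true, [ZSTD_c_windowLog => Cint(20)])
seq = parameter_sequence(opts)
```

Because the advanced parameters come last in this sequence, a pair that repeats the level or checksum id would be applied after the dedicated fields.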
bump
I've been working on this in https://github.com/nhz2/ChunkCodecs.jl/tree/main/ChunkCodecLibZstd
if compressor isa AbstractString
    if haskey(compressortypes, String(compressor))
        compressor = compressortypes[compressor]()
This would make it impossible to set custom compression levels for the compression algorithm. Do we need another keyword argument for zcreate that gets passed to the compressor constructor?
The idea here is that the simple option of just passing a string will give you default compression options. If you want to specify the compression level, you can use the compressor constructor and pass the instantiated compressor instance.
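A minimal sketch of the two paths described above, assuming a registry shaped like `compressortypes`: a string selects a default-constructed compressor, while an instance is passed through unchanged. `AbstractSketchCompressor`, `SketchZstd`, `sketch_compressortypes`, `resolve_compressor`, and the default level of 3 are hypothetical stand-ins, not the code in this PR.

```julia
abstract type AbstractSketchCompressor end

struct SketchZstd <: AbstractSketchCompressor
    level::Int
end
SketchZstd() = SketchZstd(3)  # assumed default level

# Registry from name to type, in the spirit of `compressortypes`.
const sketch_compressortypes = Dict{String, Type}("zstd" => SketchZstd)

# A string gives the default options; an instance is used as-is.
function resolve_compressor(compressor)
    if compressor isa AbstractString
        name = String(compressor)
        haskey(sketch_compressortypes, name) ||
            throw(ArgumentError("unknown compressor: $name"))
        return sketch_compressortypes[name]()
    end
    return compressor
end

default_c = resolve_compressor("zstd")          # default options
custom_c = resolve_compressor(SketchZstd(10))   # caller-chosen level
```

With this shape, no extra keyword argument is needed for the level: callers who care about it construct the compressor themselves.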
Thanks for the PR and sorry for missing it for such a long time. Probably we need to rebase and test this again. @mkitti in case you don't have the time right now I can try to rebase as well. Just let me know.
This implements ZstdZarrCompressor which wraps around CodecZstd as a package extension.
Part of the complication of using package extensions is getting a reference to new types defined in the extension. I created a mechanism by which you can specify the compressor as a string, which is then used to look up the type in a dictionary.
I'm also wondering if there might be a general way to wrap TranscodingStreams codecs into Zarr compressors.