Implement ZstdZarrCompressor #149

Open
wants to merge 6 commits into
base: master

Member

@mkitti mkitti commented Jun 25, 2024

This implements ZstdZarrCompressor which wraps around CodecZstd as a package extension.

Part of the complication of using package extensions is getting a reference to new types defined in the extension. I created a mechanism by which you can specify the compressor as a string, which is then used to look up the type in a dictionary.
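
The string lookup described above can be sketched in plain Julia. This is only an illustration under assumptions: the `compressortypes` dictionary name follows the diff under review, while `AbstractCompressor`, `DemoZstdCompressor`, `resolve_compressor`, and the default level of 3 are hypothetical stand-ins and not Zarr.jl API.

```julia
# Hypothetical stand-ins for Zarr.Compressor and the type defined in the extension.
abstract type AbstractCompressor end

struct DemoZstdCompressor <: AbstractCompressor
    clevel::Int
end
DemoZstdCompressor() = DemoZstdCompressor(3)  # assumed default level

# Registry that the extension would fill in when it loads.
const compressortypes = Dict{String, Type{<:AbstractCompressor}}()
compressortypes["zstd"] = DemoZstdCompressor

# An already constructed instance passes through unchanged.
resolve_compressor(c::AbstractCompressor) = c

# A string selects a registered type and builds it with default options.
function resolve_compressor(name::AbstractString)
    key = String(name)
    haskey(compressortypes, key) || throw(ArgumentError("unknown compressor: $key"))
    return compressortypes[key]()
end
```

With this shape, `resolve_compressor("zstd")` yields a default-configured compressor, while `resolve_compressor(DemoZstdCompressor(7))` keeps the caller's settings.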

I'm also wondering if there might be a general way to wrap TranscodingStreams codecs into Zarr compressors.

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9654302116

Details

  • 27 of 34 (79.41%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 88.316%

Changes Missing Coverage:
  File                 Covered Lines   Changed/Added Lines   %
  src/Compressors.jl   4               11                    36.36%
Files with Coverage Reduction:
  File                 New Missed Lines   %
  src/Compressors.jl   1                  83.08%
Totals:
  Change from base Build 8981180163: -0.3%
  Covered Lines: 839
  Relevant Lines: 950

💛 - Coveralls

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9655786902

Details

  • 25 of 32 (78.13%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.4%) to 88.291%

Changes Missing Coverage:
  File                 Covered Lines   Changed/Added Lines   %
  src/Compressors.jl   4               11                    36.36%
Files with Coverage Reduction:
  File                 New Missed Lines   %
  src/Compressors.jl   1                  83.08%
Totals:
  Change from base Build 8981180163: -0.4%
  Covered Lines: 837
  Relevant Lines: 948

💛 - Coveralls

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9656097799

Details

  • 26 of 32 (81.25%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 88.397%

Changes Missing Coverage:
  File                 Covered Lines   Changed/Added Lines   %
  src/Compressors.jl   5               11                    45.45%
Files with Coverage Reduction:
  File                 New Missed Lines   %
  src/Compressors.jl   1                  84.62%
Totals:
  Change from base Build 8981180163: -0.3%
  Covered Lines: 838
  Relevant Lines: 948

💛 - Coveralls

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9657292182

Details

  • 32 of 32 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.5%) to 89.135%

Totals Coverage Status
Change from base Build 8981180163: 0.5%
Covered Lines: 845
Relevant Lines: 948

💛 - Coveralls

@mkitti
Member Author

mkitti commented Jun 25, 2024

An alternative to the string lookup for the compressor would be to pass in a CodecZstd.ZstdCompressor directly, specifically an instance created by CodecZstd.ZstdFrameCompressor(). Via a conversion mechanism, we could then wrap that instance in a Zarr.Compressor.
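
One way such a conversion mechanism might look is sketched below, using `Base.convert` so that a codec object is wrapped automatically when it is stored in a field typed as a compressor. All names here (`AbstractCompressor`, `DemoCodec`, `WrappedCodec`, `DemoArrayMetadata`) are hypothetical stand-ins for illustration; none of them is taken from Zarr.jl or CodecZstd.

```julia
# Stand-in for Zarr.Compressor.
abstract type AbstractCompressor end

# Stand-in for a codec object such as one constructed by a codec package.
struct DemoCodec
    level::Int
end

# Wrapper that makes a codec usable wherever a compressor is expected.
struct WrappedCodec{C} <: AbstractCompressor
    codec::C
end

# Teach `convert` how to wrap the codec; default constructors call
# `convert` for each field, so the wrapping happens implicitly.
Base.convert(::Type{AbstractCompressor}, c::DemoCodec) = WrappedCodec(c)

struct DemoArrayMetadata
    compressor::AbstractCompressor
end

# The raw codec is converted to a WrappedCodec on construction.
meta = DemoArrayMetadata(DemoCodec(5))
```

The same idea could also generalize to other TranscodingStreams codecs by adding further `convert` methods, though that is a design question for the PR rather than something this sketch settles.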

Comment on lines +16 to +19
struct ZstdZarrCompressor <: Zarr.Compressor
    compressor::CodecZstd.ZstdCompressor
    decompressor::CodecZstd.ZstdDecompressor
end
Member

This doesn't support multithreaded use IIUC. I think this should be like

Zarr.jl/src/Compressors.jl

Lines 129 to 131 in f436713

struct ZlibCompressor <: Compressor
    clevel::Int
end
where the struct only contains the parameters of the codec.
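
A rough sketch of that parameters-only design for Zstd follows. It assumes CodecZstd and TranscodingStreams are installed and uses `ZstdCompressor(level = ...)`, `ZstdDecompressor()`, `TranscodingStreams.initialize`, `TranscodingStreams.finalize`, and `transcode`; `ZstdParams`, `compress_chunk`, and `decompress_chunk` are hypothetical names, not part of this PR. Each call builds its own codec and releases it afterwards, so no native state is shared between tasks.

```julia
using CodecZstd
import TranscodingStreams

# Hypothetical parameters-only compressor: no native handles are stored,
# so a single instance can be shared between tasks.
struct ZstdParams
    clevel::Int
end

function compress_chunk(p::ZstdParams, data::Vector{UInt8})
    codec = ZstdCompressor(level = p.clevel)  # fresh codec for this call
    TranscodingStreams.initialize(codec)
    try
        return transcode(codec, data)
    finally
        TranscodingStreams.finalize(codec)    # release native resources
    end
end

function decompress_chunk(::ZstdParams, data::Vector{UInt8})
    codec = ZstdDecompressor()
    TranscodingStreams.initialize(codec)
    try
        return transcode(codec, data)
    finally
        TranscodingStreams.finalize(codec)
    end
end
```

Creating a codec per chunk costs some allocation; pooling codecs per task would be an alternative if that overhead turns out to matter.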

Member Author

I initially wrote it like this, but then I was thinking about all the other potential parameters, even if they do not need to be serialized. I think what we should implement is the ability to copy a compressor.

Frankly, I'm somewhat confused about why one actually needs to serialize the compression level into the array metadata. You do not need that information to decompress the data.

Collaborator

Maybe I don't understand this correctly, but what would happen in a scenario where a user opens an existing array and wants to add some new data? Of course one can set a different compression level for the new chunks, but for consistency of the dataset I think it is good to write all compression parameters to the metadata struct.
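
To make the consistency argument concrete, here is a small sketch of a metadata round trip: the parameters are written out when the array is created and read back when it is reopened, so appended chunks use the same settings. The type, the function names, and the `"id"`/`"level"` keys are assumptions chosen for illustration and are not confirmed against Zarr.jl.

```julia
# Hypothetical parameters-only compressor.
struct DemoZstdCompressor
    clevel::Int
end

# Write every parameter that affects how new chunks are encoded.
to_metadata(c::DemoZstdCompressor) =
    Dict{String, Any}("id" => "zstd", "level" => c.clevel)

# Rebuild the compressor when an existing array is reopened.
function from_metadata(d::AbstractDict)
    d["id"] == "zstd" || throw(ArgumentError("unexpected compressor id: $(d["id"])"))
    return DemoZstdCompressor(Int(d["level"]))
end
```

Decompression would work without the level, as noted earlier in the thread; the stored value only matters for chunks written later.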

Collaborator

But this implementation respects all of this, and the other compressors in Zarr.jl currently don't work multithreaded either, so this is OK from my side.

Member

In addition to the thread safety issues, this also leaks memory.

Member

For other potential parameters what about something like:
https://github.com/nhz2/ChunkCodecs.jl/blob/799b154bd400633f0ae3bd1cf78d0cc95957f2cf/ChunkCodecLibZstd/src/encode.jl#L21-L25

struct ZstdEncodeOptions <: EncodeOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end

Where the advanced parameters are set with ZSTD_CCtx_setParameter after the compression level and checksum options are set.
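
The ordering described here can be illustrated without calling into libzstd: the sketch below only builds the sequence of (parameter, value) pairs in the order they would be passed to ZSTD_CCtx_setParameter, so later advanced entries override earlier settings. `DemoZstdOptions` and `parameter_calls` are hypothetical, and the numeric identifiers are assumed values of the `zstd.h` constants that should be checked against the header.

```julia
# Hypothetical mirror of the options struct quoted above.
struct DemoZstdOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end

# Assumed values of the zstd.h parameter identifiers; verify before use.
const ZSTD_c_compressionLevel = Cint(100)
const ZSTD_c_checksumFlag = Cint(201)

# Ordered list of (parameter => value) settings: level first, then the
# checksum flag, then the advanced parameters in the order given.
function parameter_calls(o::DemoZstdOptions)
    calls = Pair{Cint, Cint}[
        ZSTD_c_compressionLevel => o.compressionLevel,
        ZSTD_c_checksumFlag => Cint(o.checksum),
    ]
    append!(calls, o.advanced_parameters)
    return calls
end
```

Because the advanced entries come last, an advanced entry that repeats the compression-level identifier would take precedence over the `compressionLevel` field.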

@lazarusA

bump

@nhz2
Member

nhz2 commented Dec 19, 2024

I've been working on this in https://github.com/nhz2/ChunkCodecs.jl/tree/main/ChunkCodecLibZstd


if compressor isa AbstractString
    if haskey(compressortypes, String(compressor))
        compressor = compressortypes[compressor]()
Collaborator

This would make it impossible to set custom compression levels for the compression algorithm. Do we need another keyword argument for zcreate that gets passed to the compressor constructor?

Member Author

The idea here is that the simple option of just passing a string will give you default compression options. If you want to specify the compression level, you can use the compressor constructor and pass the instantiated compressor instance.

Collaborator

@meggart meggart left a comment

Thanks for the PR, and sorry for missing it for such a long time. We probably need to rebase and test this again. @mkitti, in case you don't have the time right now, I can try to rebase as well. Just let me know.

5 participants