Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ZstdFrameCompressor and ZstdFrameDecompressor for non-streaming API #46

Closed
wants to merge 3 commits into from

Conversation

mkitti
Copy link
Member

@mkitti mkitti commented May 14, 2024

Also add ZstdError exception type for more descriptive errors. ZstdError is only
implemented for the new compressors. They will be applied to the streaming compressors
for a breaking version bump.

Also add ZstdError exception type for more descriptive errors. ZstdError is only
implemented for the new compressors. They will be applied to the streaming compressors
for a breaking version bump.
Copy link

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 75.65217% with 28 lines in your changes are missing coverage. Please review.

Project coverage is 42.85%. Comparing base (ad7288a) to head (2f697b7).

Files Patch % Lines
src/frameCompression.jl 76.08% 11 Missing ⚠️
src/frameDecompression.jl 80.00% 9 Missing ⚠️
src/libzstd.jl 68.18% 7 Missing ⚠️
src/compression.jl 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   33.01%   42.85%   +9.84%     
==========================================
  Files           5        7       +2     
  Lines         524      637     +113     
==========================================
+ Hits          173      273     +100     
- Misses        351      364      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mkitti mkitti requested a review from nhz2 May 14, 2024 08:27
@mkitti
Copy link
Member Author

mkitti commented May 14, 2024

This is meant to support JuliaIO/JLD2.jl#560 which requires the non-streaming compression API where blocks know their decompressed size.

@mkitti
Copy link
Member Author

mkitti commented May 14, 2024

Could you try to dev this with your JLD2.jl changes, @milankl?

@nhz2
Copy link
Member

nhz2 commented May 14, 2024

Could the ZSTD_CCtx_setPledgedSrcSize function in https://facebook.github.io/zstd/zstd_manual.html#Chapter5 be used to set the frame size and also keep the streaming API? Also is there a reason for ZstdFrameDecompressor? IIUC the regular ZstdDecompressor should work with both formats.

@nhz2
Copy link
Member

nhz2 commented May 14, 2024

Also, not directly related to this PR, but maybe there should be a new package for supporting non-streaming codecs with a simpler API like https://numcodecs.readthedocs.io/en/stable/abc.html

@milankl
Copy link

milankl commented May 14, 2024

After installing your branch here and my branch from JuliaIO/JLD2.jl#560, I get this

julia> using JLD2, CodecZstd, HDF5, H5Zzstd
 │ Package H5Zzstd not found, but a package named H5Zzstd is available from a
 │ registry.
 │ Install package?
 │   (@v1.10) pkg> add H5Zzstd
 └ (y/n/o) [y]: y
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package H5Zzstd [f6f2d980]:
 H5Zzstd [f6f2d980] log:
 ├─possible versions are: 0.1.0-0.1.1 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions: 0.1.0-0.1.1
 └─restricted by compatibility requirements with CodecZstd [6b39b394] to versions: uninstalled — no versions left
   └─CodecZstd [6b39b394] log:
     ├─possible versions are: 0.8.2 or uninstalled
     └─CodecZstd [6b39b394] is fixed to version 0.8.2

Not sure I can make sense of this?

EDIT: This is because with

(@v1.10) pkg> st H5Zzstd
Status `~/.julia/environments/v1.10/Project.toml`
  [f6f2d980] H5Zzstd v0.1.1

The CodecZstd version is limited to

(@v1.10) pkg> status --outdated
Status `~/.julia/environments/v1.10/Project.toml`
⌅ [6b39b394] CodecZstd v0.7.2 (<v0.8.2): H5Zzstd

@milankl
Copy link

milankl commented May 14, 2024

IIUC the regular ZstdDecompressor should work with both formats.

I can test this @nhz2.

EDIT: Yes, works as before so doesn't matter whether setting ZstdDecompressor or ZstdFrameDecompressor in JLD2, see JuliaIO/JLD2.jl#560

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.6101634539467033

julia> save("test_with_zstd_compression.jld2", "A", A, compress=ZstdFrameCompressor())

julia> A == load("test_with_zstd_compression.jld2", "A")
true

@milankl
Copy link

milankl commented May 14, 2024

@mkitti this works:

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.1653492903631878

julia> save("test_with_zstd_compression.jld2", "A", A, compress=ZstdFrameCompressor())

julia> A == load("test_with_zstd_compression.jld2", "A")
true

while compres=ZstdCompressor() throws an unsupported error as expected...
And decompression via HDF5 too (in another environment due to the pkg conflicts above)

julia> using HDF5, H5Zzstd

julia> h5open("test_with_zstd_compression.jld2") do h5f
          h5f["A"][]
       end
1000×1000 Matrix{Float64}:
 0.165349  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
                                                                       
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

@mkitti
Copy link
Member Author

mkitti commented May 14, 2024

Also, not directly related to this PR, but maybe there should be a new package for supporting non-streaming codecs with a simpler API like https://numcodecs.readthedocs.io/en/stable/abc.html

At the end of the day, I think the public API for transcoding streams is actually simpler and more flexible. I'm involved over in numcodecs as well actually.

Transcoding streams is simpler in that there is only transcode rather than encode and decode. However, we can also stream.

The implementation here is trickier initially, but ultimately but can be manipulated for a lot of circumstances.

You are correct that the non-streaming API is basically just a special case of the streaming API. The non-streaming "frame" API assumes

  1. We have all the bytes needed immediately available.
  2. We know the total length of the result.

An important consequence of the above is that we can do a single allocation for the result whereas for streaming we allow for multiple allocations to extend the buffers as needed.

@milankl
Copy link

milankl commented May 14, 2024

I don't have any opinion on the streaming vs non-streaming API... I just would like to get his here merged so that we can merge JuliaIO/JLD2.jl#560 and have Zstd compression in JLD2 😉

@nhz2
Copy link
Member

nhz2 commented May 14, 2024

Can this be implemented as a new pledgedSrcSize keyword argument with a default value of ZSTD_CONTENTSIZE_UNKNOWN when constructing the ZstdCompressor? The docs for ZSTD_getFrameContentSize say decompressed size is an optional field.

error[] = ZstdError(code)
return 0, 0, :error
else
return Int(input.size), Int(code), :end
Copy link
Member

@nhz2 nhz2 May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think process can return :end here. Generally when compressing, :end is only returned if input is empty and no extra output needs to be written for the current frame.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are either going to return :end or :ok here as in compression.jl:

return Δin, Δout, input.size == 0 && code == 0 ? :end : :ok

Here, unlike the streaming API, we consumed the entire input buffer and have written the entire output by invoking ZSTD_compress2. There is no continuation. There is no more input to process.

By the completion of ZSTD_compress2 there are two possible outcomes. We have either successfully compressed the data into the output buffer. How else would you describe the state after ZSTD_compress2 runs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller of process is allowed to break up the input data because process is a streaming API. Those small inputs could be appended to an internal buffer in the codec, and then only after input.size is set to zero, signaling there are no more bytes in the frame, ZSTD_compress2 is called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the best way to detect additional bytes added and throw an error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to assume there will be additional bytes in the frame until process is called with input size zero.
There should be a new process2 similar to ZSTD_compressStream2 with an additional endOp argument, to allow one-shot compression of a frame.

@mkitti
Copy link
Member Author

mkitti commented May 14, 2024

Can this be implemented as a new pledgedSrcSize keyword argument with a default value of ZSTD_CONTENTSIZE_UNKNOWN when constructing the ZstdCompressor? The docs for ZSTD_getFrameContentSize say decompressed size is an optional field.

If we pursue this path, we would need to figure out how to modify JLD2.jl as well to work with this.

@nhz2
Copy link
Member

nhz2 commented May 14, 2024

Okay, I took a look at JuliaIO/JLD2.jl#560 and why it would be nicer to have a special ZstdFrameCompressor that matches what HDF5 is doing. Maybe adding an internal buffer to build up the frame before calling ZSTD_compress2 when input size is zero would work well. I think the same approach could be done to add Blosc and LZ4.

@mkitti
Copy link
Member Author

mkitti commented May 19, 2024

I'm starting to favor the approach in #49 where we might be able to get the streaming API to save the encoded size in the frame by providing endOp = :end. I'm not sure how to access that keyword via the TranscodingStreams API yet.

We may end using a similar API that exists here. For example, ZstdFrameCompressor() would create a ZstdCompressor that would initially pass endOp = :end.

Alternatively, we try the approach in JuliaIO/TranscodingStreams.jl#215.

@mkitti
Copy link
Member Author

mkitti commented May 29, 2024

Closing in favor of #52 .

@mkitti mkitti closed this May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants