Larger file size with compression than without #625

Closed
jbrea opened this issue Dec 18, 2024 · 3 comments

@jbrea

jbrea commented Dec 18, 2024

I was puzzled to find that JLD2 files written with compression are quite often larger than those written without: 2.8M versus 2.3M in the code below. (I discovered this when I wanted to store an array of DataFrames.) For the data below, zip 3.0 reduces the compressed file to 1.2M, which is close to the 904K that JLD2 achieves with compression when the data is completely flat.

julia> using Pkg

julia> Pkg.activate(temp = true);
  Activating new project at `/tmp/jl_BE21K1`

julia> Pkg.add(["JLD2", "CodecZlib"])

julia> Pkg.status()
Status `/tmp/jl_BE21K1/Project.toml`
  [944b1d66] CodecZlib v0.7.6
  [033835bb] JLD2 v0.5.10

julia> using JLD2, CodecZlib

julia> data = [[randn(10) for _ in 1:120] for _ in 1:100];

julia> fname = tempname() * ".jld2";

julia> jldsave(fname, false; data);

julia> run(`du -h $fname`);
2.3M	/tmp/jl_uRqrgLkncj.jld2

julia> fname_compressed = tempname() * ".jld2";

julia> jldsave(fname_compressed, true; data);

julia> run(`du -h $fname_compressed`);
2.8M	/tmp/jl_aRgVstKzmD.jld2

julia> fname_compressed_flat = tempname() * ".jld2";

julia> jldsave(fname_compressed_flat, true; data = vcat(vcat(data...)...));

julia> run(`du -h $fname_compressed_flat`);
904K	/tmp/jl_WHSFsXIBQX.jld2

julia> zipname = tempname() * ".zip";

julia> run(`zip $zipname $fname_compressed`);
  adding: tmp/jl_aRgVstKzmD.jld2 (deflated 59%)

julia> run(`du -h $zipname`);
1.2M	/tmp/jl_kHETil8d6r.zip

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 5850U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
@JonasIsensee
Collaborator

JonasIsensee commented Dec 18, 2024

Hi @jbrea,

What you are seeing happens because your test data is not really compressible (with lossless compression).
(Random floating-point numbers have a lot of entropy.)

When you use

julia> data = [[randn(10) for _ in 1:120] for _ in 1:100];
julia> jldsave(fname_compressed, true; data);

JLD2 will try to compress each of the 100*120 short arrays individually. As said above, they can't really be compressed, and on top of that each one gets a bit of metadata for the compression library, so the file ends up larger.

When you use the "external" compression, what you are seeing is mostly the compression of the JLD2 metadata.
(For every individual array, there is some metadata describing the element type, the shape, and so on. That is exactly the same 12000 times.)
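
The per-array overhead can be reproduced outside of JLD2. The following is only a rough sketch that calls CodecZlib's `transcode` directly; it is not JLD2's actual code path, and the names `chunks`, `to_bytes`, `raw_size`, `chunked_size` and `flat_size` are made up for illustration. It compares compressing many 10-element `Float64` arrays one by one with compressing the same numbers as a single buffer.

```julia
using CodecZlib, Random

Random.seed!(1)  # fixed seed only so that runs are repeatable

# 1200 short arrays of random floats, similar in shape to the example above
chunks = [randn(10) for _ in 1:1200]

# View the floats as raw bytes (hypothetical helper, not part of JLD2)
to_bytes(x) = collect(reinterpret(UInt8, x))

raw_size = sum(sizeof, chunks)  # 80 bytes per array

# Compress every short array on its own: each result carries its own zlib framing
chunked_size = sum(length(transcode(ZlibCompressor, to_bytes(c))) for c in chunks)

# Compress the same numbers as one contiguous buffer
flat_size = length(transcode(ZlibCompressor, to_bytes(reduce(vcat, chunks))))

println("raw: ", raw_size, "  per-array: ", chunked_size, "  flat: ", flat_size)
```

With random floats the per-array total should come out larger than the raw size, because the fixed zlib framing outweighs any saving on an 80-byte input.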

For comparison, here is what you get with a modified example where each array holds 10 random integers from 1:10:

julia> data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]; # 10 random integers in the interval 1:10

julia> fname = tempname() * ".jld2";

julia> jldsave(fname, false; data);

julia> run(`du -h $fname`);
2.2M	/tmp/jl_l6bg9sbqrY.jld2

julia> fname_compressed = tempname() * ".jld2";

julia> jldsave(fname_compressed, true; data);

julia> run(`du -h $fname_compressed`);
2.0M	/tmp/jl_f4la5I3a7Q.jld2

julia> zipname = tempname() * ".zip";

julia> run(`zip $zipname $fname_compressed`);
  adding: tmp/jl_f4la5I3a7Q.jld2 (deflated 81%)

julia> run(`du -h $zipname`);
372K	/tmp/jl_Zov0zCfy1i.zip

julia> fname_compressed_flat = tempname() * ".jld2";

julia> jldsave(fname_compressed_flat, true; data = vcat(vcat(data...)...));

julia> run(`du -h $fname_compressed_flat`);
88K	/tmp/jl_BWDcvgvCX6.jld2

julia> fname_uncompressed_flat = tempname() * ".jld2";

julia> jldsave(fname_uncompressed_flat, false; data = vcat(vcat(data...)...));

julia> run(`du -h $fname_uncompressed_flat`);
940K	/tmp/jl_JGJqdqjeH8.jld2

julia> run(`zip $zipname $fname_uncompressed_flat`);
  adding: tmp/jl_JGJqdqjeH8.jld2 (deflated 91%)

julia> run(`du -h $zipname`);
460K	/tmp/jl_Zov0zCfy1i.zip

Here, you can see that the most efficient way is to use JLD2 compression with a single flattened dataset.
When working with floating-point numbers, I would not recommend using compression.
There is significant compute involved and the size reduction is usually too small to be worth it.
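
If the nesting is needed again after loading, one option is to store the flattened vector together with the shape information. This is a minimal sketch, assuming a regular shape (every inner array has the same length) and the `jldsave(fname, true; ...)` call style used above with JLD2 v0.5; the dataset names `flat`, `inner_len` and `per_outer` are arbitrary choices for this example.

```julia
using JLD2, CodecZlib

data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]

# Flatten to one long vector; reduce(vcat, ...) avoids splatting thousands of arguments
flat = reduce(vcat, reduce(vcat, data))

# Shape information needed to rebuild the nesting (assumes a regular shape)
inner_len = 10
per_outer = 120

fname = tempname() * ".jld2"
jldsave(fname, true; flat, inner_len, per_outer)

# Read everything back and rebuild the nested structure
restored = jldopen(fname, "r") do f
    v, n, m = f["flat"], f["inner_len"], f["per_outer"]
    inner = [v[i:i+n-1] for i in 1:n:length(v)]          # split into short arrays
    [inner[j:j+m-1] for j in 1:m:length(inner)]          # group them again
end
```

For irregular data one would have to store the individual lengths instead of two constants.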

@jbrea
Author

jbrea commented Dec 19, 2024

I see, thanks for the explanation. Would it therefore make sense to check whether the metadata can be compressed when jldsave is called with compression?

@JonasIsensee
Collaborator

No, that does not really make sense. JLD2 should always be able to open a file even without having compression libraries installed. (It may not be able to read the dataset, but it can at least say which library needs to be loaded to do so.)

Of course, you can always try to externally compress the whole file.
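
External compression does not need the `zip` command line tool. As a rough sketch, assuming CodecZlib is available, the whole file can be gzipped from Julia with `transcode`; the file names here are just temporary paths for illustration. The catch is that the archive has to be unpacked again before JLD2 can open it.

```julia
using JLD2, CodecZlib

data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]

fname = tempname() * ".jld2"
jldsave(fname; data)  # no JLD2-internal compression

# Gzip the whole file, JLD2 metadata included
gzname = fname * ".gz"
write(gzname, transcode(GzipCompressor, read(fname)))

println("plain: ", filesize(fname), " bytes, gzipped: ", filesize(gzname), " bytes")

# JLD2 cannot read the .gz directly, so unpack it to a regular file first
unpacked = tempname() * ".jld2"
write(unpacked, transcode(GzipDecompressor, read(gzname)))

restored = jldopen(unpacked, "r") do f
    f["data"]
end
```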
