Larger file size with compression than without #625

jbrea · 2024-12-18T14:06:29Z

I was puzzled to find that JLD2 files written with compression are quite often larger than those written without: 2.8M versus 2.3M in the code below (I discovered this when I wanted to store an array of DataFrames). For the data below, zip 3.0 reduces the compressed file to 1.2M, which is close to the 904K that JLD2 achieves when the data is completely flat.

julia> using Pkg

julia> Pkg.activate(temp = true);
  Activating new project at `/tmp/jl_BE21K1`

julia> Pkg.add(["JLD2", "CodecZlib"])

julia> Pkg.status()
Status `/tmp/jl_BE21K1/Project.toml`
  [944b1d66] CodecZlib v0.7.6
  [033835bb] JLD2 v0.5.10

julia> using JLD2, CodecZlib

julia> data = [[randn(10) for _ in 1:120] for _ in 1:100];

julia> fname = tempname() * ".jld2";

julia> jldsave(fname, false; data);

julia> run(`du -h $fname`);
2.3M	/tmp/jl_uRqrgLkncj.jld2

julia> fname_compressed = tempname() * ".jld2";

julia> jldsave(fname_compressed, true; data);

julia> run(`du -h $fname_compressed`);
2.8M	/tmp/jl_aRgVstKzmD.jld2

julia> fname_compressed_flat = tempname() * ".jld2";

julia> jldsave(fname_compressed_flat, true; data = vcat(vcat(data...)...));

julia> run(`du -h $fname_compressed_flat`);
904K	/tmp/jl_WHSFsXIBQX.jld2

julia> zipname = tempname() * ".zip";

julia> run(`zip $zipname $fname_compressed`);
  adding: tmp/jl_aRgVstKzmD.jld2 (deflated 59%)

julia> run(`du -h $zipname`);
1.2M	/tmp/jl_kHETil8d6r.zip

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 5850U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

JonasIsensee · 2024-12-18T16:23:40Z

Hi @jbrea,

What you are seeing happens because your test data is not really compressible (with lossless compression): random floating-point numbers have a lot of entropy.

When you use

julia> data = [[randn(10) for _ in 1:120] for _ in 1:100];
julia> jldsave(fname_compressed, true; data);

JLD2 will try to compress each of the 100*120 short arrays individually. As said above, they can't really be compressed, and on top of that the compression library has to add a bit of metadata of its own to each one.
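The per-array overhead can be illustrated without JLD2 at all. The following is a minimal sketch that calls zlib directly through CodecZlib's `transcode`; it is an assumption that this mirrors what happens to each dataset inside the file, and the helper `raw_bytes` is made up for the illustration.

```julia
using CodecZlib, Random

Random.seed!(1)
data = [[randn(10) for _ in 1:120] for _ in 1:100]

# Hypothetical helper: the raw bytes of a numeric vector (80 bytes for 10 Float64 values).
raw_bytes(v) = collect(reinterpret(UInt8, v))

# Compress a single short array: the zlib framing alone outweighs any savings.
one_raw = raw_bytes(data[1][1])
one_compressed = transcode(ZlibCompressor, one_raw)
println("single array: ", length(one_raw), " -> ", length(one_compressed), " bytes")

# Compress all 12000 arrays one by one and sum the sizes ...
individual_total = sum(length(transcode(ZlibCompressor, raw_bytes(v)))
                       for inner in data for v in inner)

# ... versus compressing one flattened array in a single call.
flat = reduce(vcat, reduce(vcat, data))
flat_total = length(transcode(ZlibCompressor, raw_bytes(flat)))

println("raw payload:           ", sizeof(flat), " bytes")
println("compressed one by one: ", individual_total, " bytes")
println("compressed flat:       ", flat_total, " bytes")
```

The numbers only cover the array payload, not the JLD2 metadata, so they will not match the `du` output above.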

When you use "external" compression, what you are seeing is the compression of the JLD2 metadata.
(For every individual array, there is some metadata describing the element type, the shape, and so on. It is exactly the same 12000 times.)

For comparison, here is what you get with a modified example where each array holds 10 random integers from 1:10:

julia> data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]; # 10 random integers in the interval 1:10

julia> fname = tempname() * ".jld2";

julia> jldsave(fname, false; data);

julia> run(`du -h $fname`);
2.2M	/tmp/jl_l6bg9sbqrY.jld2

julia> fname_compressed = tempname() * ".jld2";

julia> jldsave(fname_compressed, true; data);

julia> run(`du -h $fname_compressed`);
2.0M	/tmp/jl_f4la5I3a7Q.jld2

julia> zipname = tempname() * ".zip";

julia> run(`zip $zipname $fname_compressed`);
  adding: tmp/jl_f4la5I3a7Q.jld2 (deflated 81%)

julia> run(`du -h $zipname`);
372K	/tmp/jl_Zov0zCfy1i.zip

julia> fname_compressed_flat = tempname() * ".jld2";

julia> jldsave(fname_compressed_flat, true; data = vcat(vcat(data...)...));

julia> run(`du -h $fname_compressed_flat`);
88K	/tmp/jl_BWDcvgvCX6.jld2

julia> fname_uncompressed_flat = tempname() * ".jld2";

julia> jldsave(fname_uncompressed_flat, false; data = vcat(vcat(data...)...));

julia> run(`du -h $fname_uncompressed_flat`);
940K	/tmp/jl_JGJqdqjeH8.jld2

julia> run(`zip $zipname $fname_uncompressed_flat`);
  adding: tmp/jl_JGJqdqjeH8.jld2 (deflated 91%)

julia> run(`du -h $zipname`);
460K	/tmp/jl_Zov0zCfy1i.zip

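Here, you can see that the most efficient way is to use JLD2 compression with a single flattened dataset (88K).
(The last `zip` call reuses `$zipname`, so the 460K archive most likely holds both files; the 91% deflation puts the flattened file alone at roughly 85K.)
When working with floating-point numbers, I would not recommend using compression.
There is significant compute involved and the compression ratio is usually not large enough to justify it.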
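If the inner arrays all have the same length, as in the example, flattening loses no information: the nested structure can be rebuilt from the known sizes. A minimal sketch, assuming the fixed 10 × 120 × 100 layout of the test data (the names `flat`, `A` and `restored` are only for illustration):

```julia
data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]

# Flatten to one vector, equivalent to vcat(vcat(data...)...) but without splatting.
flat = reduce(vcat, reduce(vcat, data))

# View the flat vector as a 10 x 120 x 100 array: A[:, j, i] is data[i][j].
A = reshape(flat, 10, 120, 100)

# Rebuild the nested vectors after loading.
restored = [[A[:, j, i] for j in 1:120] for i in 1:100]
```

Storing `flat` (or `A`) as a single dataset means the per-array metadata is written once rather than 12000 times. For ragged data, the lengths of the inner arrays would have to be stored alongside.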

jbrea · 2024-12-19T07:43:30Z

I see, thanks for the explanation. Would it therefore make sense to check if the metadata can be compressed, when jldsave is called with compression?

JonasIsensee · 2024-12-19T12:54:57Z

No, that does not really make sense. JLD2 should always be able to open a file, even without compression libraries installed. (It may not be able to read the dataset, but it can at least say which library needs to be loaded to do so.)

Of course, you can always try to externally compress the whole file.
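External compression does not have to go through the `zip` command line tool. Here is a minimal sketch that gzips the whole file from Julia with CodecZlib's `transcode` and unpacks it again before reading; the file names are arbitrary, and the file is written without JLD2 compression so that gzip sees the raw data and metadata.

```julia
using JLD2, CodecZlib

data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]

# Write an ordinary, uncompressed JLD2 file.
fname = tempname() * ".jld2"
jldsave(fname; data)

# Compress the entire file, metadata included, into a .gz next to it.
gzname = fname * ".gz"
write(gzname, transcode(GzipCompressor, read(fname)))
println(filesize(fname), " -> ", filesize(gzname), " bytes")

# JLD2 cannot open the .gz directly: decompress to a regular file first.
restored_name = tempname() * ".jld2"
write(restored_name, transcode(GzipDecompressor, read(gzname)))
restored = jldopen(restored_name, "r") do f
    f["data"]
end
```

The cost is that the file must be fully decompressed before any dataset can be read.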

JonasIsensee closed this as completed Dec 19, 2024
