Padding is not handled gracefully #31

Closed
maleadt opened this issue Jun 17, 2024 · 4 comments


maleadt commented Jun 17, 2024

MWE:

using CodecBzip2: transcode, Bzip2Compressor, Bzip2Decompressor

function main()
    # generate some data
    input = Pipe()
    output = Pipe()
    cmd = `bzip2`
    proc = run(pipeline(cmd, stdout=output, stdin=input); wait=false)
    close(output.in)
    writer = @async begin
        write(input, "Hello, world!")
        close(input)
    end
    reader = @async read(output)
    wait(proc)
    compressed = fetch(reader)

    # add some padding
    push!(compressed, 0)

    # verify we can decompress using `bunzip2`
    mktempdir() do dir
        path = joinpath(dir, "test.bz2")
        write(path, compressed)
        run(`bunzip2 $path`)
        println(read(joinpath(dir, "test"), String))
    end

    uncompressed = transcode(Bzip2Decompressor, compressed)
    String(uncompressed)
end

The bunzip2 tool generates a warning, but continues to decompress:

bunzip2: /var/folders/5m/zq0fq7r91f7_5qb1c31vgy5h0000gn/T/jl_B16lhs/test.bz2: trailing garbage after EOF ignored
Hello, world!

CodecBzip2.jl fails:

ERROR: CodecBzip2.BZ2Error(-5)

... where -5 seems to be BZ_DATA_ERROR_MAGIC.
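For reference, the numeric code can be checked against the return codes defined in `bzlib.h` (libbzip2 1.0.x). The lookup below is only a sketch: `BZ_RETURN_CODES` and `bz_code_name` are hypothetical names introduced here for illustration and are not part of CodecBzip2.jl. libbzip2 documents `BZ_DATA_ERROR_MAGIC` as meaning the data does not begin with the expected magic bytes, so a plausible reading (an assumption, not confirmed in this thread) is that the padding byte is being treated as the start of another stream.

```julia
# Return codes as defined in bzlib.h (libbzip2 1.0.x).
# The table and helper are illustrative only, not a CodecBzip2.jl API.
const BZ_RETURN_CODES = Dict{Int,String}(
     0 => "BZ_OK",
     1 => "BZ_RUN_OK",
     2 => "BZ_FLUSH_OK",
     3 => "BZ_FINISH_OK",
     4 => "BZ_STREAM_END",
    -1 => "BZ_SEQUENCE_ERROR",
    -2 => "BZ_PARAM_ERROR",
    -3 => "BZ_MEM_ERROR",
    -4 => "BZ_DATA_ERROR",
    -5 => "BZ_DATA_ERROR_MAGIC",
    -6 => "BZ_IO_ERROR",
    -7 => "BZ_UNEXPECTED_EOF",
    -8 => "BZ_OUTBUFF_FULL",
    -9 => "BZ_CONFIG_ERROR",
)

# Hypothetical helper: map a numeric code to its bzlib.h macro name.
bz_code_name(code::Integer) = get(BZ_RETURN_CODES, Int(code), "unknown bzip2 return code")

println(bz_code_name(-5))  # prints BZ_DATA_ERROR_MAGIC
```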

nhz2 (Member) commented Dec 17, 2024

What code is producing a bzip2 format data stream with trailing garbage? Is this part of some other format?
Without more context, I'm not sure how to fix this issue.

maleadt (Author) commented Dec 17, 2024

Indeed, part of another file format where I know how large the compressed section is. I now have to jump through some hoops to find the end markers and truncate the section so that the decompressor can handle it: https://github.com/JuliaGPU/Metal.jl/blob/60a9e34ebc98714a705af1d28b47bff67f25dcb9/src/compiler/library.jl#L339-L385

nhz2 (Member) commented Dec 17, 2024

Okay, I think you want to stop at the "end of a chunk", i.e. at the end of the first bzip2 data stream.

There is no simple function for doing this, but the following should work:

julia> function decode_first_bzip2_data_stream(compressed::Vector{UInt8}; max_size=typemax(Int))
           stream = Bzip2DecompressorStream(IOBuffer(compressed); stop_on_end=true)
           try
               u = read(stream, max_size)
               eof(stream) || error("max_size is too small")
               return u
           finally
               close(stream) # needed to prevent memory leaks
           end
       end
decode_first_bzip2_data_stream (generic function with 1 method)

julia> u = zeros(UInt8, 1000000);

julia> c = transcode(Bzip2Compressor, u);

julia> decode_first_bzip2_data_stream(c) == u
true

julia> decode_first_bzip2_data_stream([c; c;]) == u
true

julia> decode_first_bzip2_data_stream([c; zeros(UInt8,10);]) == u
true

julia> decode_first_bzip2_data_stream(c; max_size=20)
ERROR: max_size is too small
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] decode_first_bzip2_data_stream(compressed::Vector{UInt8}; max_size::Int64)
   @ Main ./REPL[20]:5
 [3] top-level scope
   @ REPL[26]:1

nhz2 (Member) commented Dec 18, 2024

With #43 this can be simplified to:

julia> function decode_first_bzip2_data_stream(compressed::Vector{UInt8}; max_size=typemax(Int))
           stream = Bzip2DecompressorStream(IOBuffer(compressed); stop_on_end=true)
           u = read(stream, max_size)
           eof(stream) || error("max_size is too small")
           return u
       end

Could you reopen this issue if it doesn't work?
