
Data decompresses with C++ but throws exceptions when decompressed with managed lzma #32

Open
JohnJScott opened this issue Apr 23, 2020 · 6 comments

@JohnJScott

As the subject states, I have a data block 0x100043 bytes in length that decompresses to 0x100000 bytes fine in C++ (code below), but throws a bad-data exception in Lzma2Dec_DecodeToDic() (line 323) when using managed lzma.

It's possible I am using the API incorrectly; that would be the first thing to verify.

The Lzma2Dec_DecodeToDic functions in managed lzma and C++ LZMA 19.00 are just different enough that there's no easy change to make things work. It looks like the C# is a direct port of a different version of the C++.

C++ decode (which works fine)

TEST_METHOD_CATEGORY( SimpleDecompressBlob, "SevenZip" )
{
	Log( "7-zip decompress blob test" );
	SetWorkingDirectoryForTests();

	// Load the compressed blob (0x100043 bytes)
	size_t source_length = 0;
	uint8* compressed_memory = LoadFileToMemory( "TestData/Binary.dat", &source_length );

	// The decompressed size is known to be 1 MiB (0x100000 bytes)
	size_t dest_length = 1024 * 1024;
	uint8* dest = static_cast< uint8* >( calloc( 1, dest_length ) );

	// One-call decode; the prop byte (12) encodes the dictionary size used when compressing
	ISzAlloc alloc = { SevenZipMalloc, SevenZipFree };
	ELzmaStatus status = LZMA_STATUS_NOT_SPECIFIED;
	SRes result = Lzma2Decode( dest, &dest_length, compressed_memory, &source_length, 12, ELzmaFinishMode::LZMA_FINISH_END, &status, &alloc );
	Assert::IsTrue( result == SZ_OK, L"Lzma2Decode returned error" );

	free( dest );
}

C# code, which throws (for testing, I put this at the top of sandbox-7z Program.cs)

public class FLzma2Data
{
	public byte[] SourceData = null;
	public int SourceOffset = -1;
	public int SourceLength = -1;
	public byte[] DestinationData = null;
	public int DestinationOffset = -1;
	public int DestinationLength = -1;
	public byte DictionarySize = 40;
}

[STAThread]
static void Main()
{
	FLzma2Data data = new FLzma2Data();

	// Read the whole compressed file into memory
	FileInfo Info = new FileInfo( "d:/Temp/Binary.dat" );
	BinaryReader reader = new BinaryReader( Info.OpenRead() );
	data.SourceData = reader.ReadBytes( ( int )Info.Length );
	data.SourceLength = ( int )Info.Length;
	data.SourceOffset = 0;
	reader.Close();

	// The decompressed size is known to be 1 MiB
	data.DestinationData = new byte[1024 * 1024];
	data.DestinationLength = 1024 * 1024;
	data.DestinationOffset = 0;

	data.DictionarySize = 12;

	ManagedLzma.LZMA2.Decoder decoder = new ManagedLzma.LZMA2.Decoder( new ManagedLzma.LZMA2.DecoderSettings( 12 ) );
	// This is where the bad-data exception from Lzma2Dec_DecodeToDic is thrown
	decoder.Decode( data.SourceData, data.SourceOffset, data.SourceLength, data.DestinationLength, true );
	decoder.ReadOutputData( data.DestinationData, data.DestinationOffset, data.DestinationLength );
}

I could send you the file, but I don't want to bog down the forum with a meg of incompressible data =)

What are your recommendations?

Cheers
John

@weltkante
Owner

weltkante commented Apr 23, 2020

I'll have a look later today and see if I can reproduce this; I don't see any immediate mistake in the calling code you posted. If you want to speed this up, put your file somewhere you can take it back down again, send me an email, or tell me the compression settings and library you used if you compressed the data yourself. Otherwise I'll just try various settings to see if it's a general problem.

The SDK version used while porting was 9.22 (included in the repo), so yes, it's way older than 19.00, but I checked a year or two ago and there hadn't been any relevant changes to the code (which makes sense, since it needs to remain compatible). I'll check again, of course, to see if there are any recent bugfixes for edge cases.

@JohnJScott
Author

JohnJScott commented Apr 23, 2020 via email

@weltkante
Owner

weltkante commented Apr 23, 2020

Thanks, you can take down the file if you want.

I was able to reproduce the issue, and the file also decodes with the 9.22 C++ source (i.e. the 9.22 C++ one-call API is able to decode it). The "problem" is that your C++ snippet uses the "One Call Interface" while ManagedLzma uses the incremental API. Replicating what the "one call interface" does allows decoding in C#, so it's not a bug in the low-level decoding routines, but rather in how they are called.

Either LZMA2 itself has an edge-case bug in the incremental API or I'm using it incorrectly. I'll have to write C++ code to decode your file incrementally and see if that succeeds. It might take a few days to figure everything out, sorry. I'll probably have something ready early next week.
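For reference, decoding incrementally against the C SDK would look roughly like the sketch below (untested, reusing SevenZipMalloc/SevenZipFree, uint8 and the prop byte 12 from the test snippet above, and driving the SDK's Lzma2Dec_DecodeToBuf helper in small slices):

static SRes DecodeIncrementally( uint8* dest, size_t dest_length, const uint8* src, size_t source_length )
{
	ISzAlloc alloc = { SevenZipMalloc, SevenZipFree };

	CLzma2Dec state;
	Lzma2Dec_Construct( &state );
	RINOK( Lzma2Dec_Allocate( &state, 12 /* prop byte */, &alloc ) );
	Lzma2Dec_Init( &state );

	size_t in_pos = 0, out_pos = 0;
	ELzmaStatus status = LZMA_STATUS_NOT_SPECIFIED;
	SRes res = SZ_OK;

	while ( out_pos < dest_length )
	{
		// feed input and output in 4 KiB slices to exercise the incremental path
		SizeT in_size = source_length - in_pos; if ( in_size > 4096 ) in_size = 4096;
		SizeT out_size = dest_length - out_pos; if ( out_size > 4096 ) out_size = 4096;

		res = Lzma2Dec_DecodeToBuf( &state, dest + out_pos, &out_size, src + in_pos, &in_size, LZMA_FINISH_ANY, &status );
		in_pos += in_size;
		out_pos += out_size;

		// stop on error or when the decoder makes no progress
		if ( res != SZ_OK || ( in_size == 0 && out_size == 0 ) )
			break;
	}

	Lzma2Dec_Free( &state, &alloc );
	return res;
}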

> It seems that an end mark is written even though I don’t request it - that’s a problem for a different day.

It's been a while, but I think the "end mark" is a feature only used in LZMA, not in LZMA2. The way LZMA2 works is by splitting the source into chunks so they can be compressed in parallel (or be stored uncompressed if compression doesn't reduce the size). Either way, it stores the size of each chunk in a control code, so as far as LZMA2 is concerned it will always know where the stream ended and return the "ended with mark" status; even if the last chunk was LZMA and ended with the "maybe ended" flag, it replaces that because it has better information.
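To illustrate the chunk framing (a sketch based on the public LZMA2 format description, not code from either library): every chunk starts with a control byte, and the chunk sizes are stored explicitly in the header, so the decoder always knows where the current chunk, and therefore the stream, ends.

// control == 0x00        -> end of stream
// control == 0x01 / 0x02 -> uncompressed chunk, followed by 2 bytes of (size - 1)
// control >= 0x80        -> LZMA chunk: bits 0..4 hold the high bits of
//                           (unpacked size - 1), bits 5..6 the reset mode
struct Lzma2ChunkHeader
{
	bool end_of_stream;
	bool compressed;
	uint32_t unpacked_size;
	uint32_t packed_size; // equals unpacked_size for uncompressed chunks
};

static bool ParseChunkHeader( const uint8_t* p, size_t available, Lzma2ChunkHeader& out )
{
	if ( available < 1 )
		return false;

	uint8_t control = p[0];
	out.end_of_stream = ( control == 0 );
	if ( out.end_of_stream )
		return true;

	if ( control < 0x80 )
	{
		// uncompressed chunk (1 = reset dictionary, 2 = keep dictionary)
		if ( control > 2 || available < 3 )
			return false;
		out.compressed = false;
		out.unpacked_size = ( ( uint32_t )p[1] << 8 | p[2] ) + 1;
		out.packed_size = out.unpacked_size;
	}
	else
	{
		// LZMA chunk; a props byte follows when bits 5..6 request a props reset
		if ( available < 5 )
			return false;
		out.compressed = true;
		out.unpacked_size = ( ( ( uint32_t )( control & 0x1F ) << 16 ) | ( ( uint32_t )p[1] << 8 ) | p[2] ) + 1;
		out.packed_size = ( ( ( uint32_t )p[3] << 8 ) | p[4] ) + 1;
	}
	return true;
}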

@weltkante
Owner

weltkante commented Apr 24, 2020

Found the problem: I have a bug in translating the eof argument of the Decode method when you simultaneously pass a huge input buffer. The C# decoders are built for streaming and incremental decoding, so passing a huge input buffer can exceed the capacities of the decoding algorithm; I need to detect and handle this case correctly.

Technical details: I need to translate eof because the C++ code demands to know when the output stream ends, not when the input stream ends. This doesn't fit well with C# APIs, which are usually designed around signaling when the input stream ends, so I designed the API with compatibility with that pattern in mind.

There are two things to note:

  • your snippet ignores the return value of Decode and just assumes it processed all the data, which is not necessarily the case; you'll have to write a loop to decode properly
  • if you fix your code and write a loop, you can easily work around my bug by leaving eof false until only a little input remains

Sorry for not providing a "one call" decoding API, but .NET is not good with large buffers. The decoder needs to allocate internal buffers, and I don't like having an API which can become unstable and "sometimes" throw OOM depending on the memory fragmentation of the rest of the .NET program. Things are designed to keep buffers at reasonable sizes, but that means streaming is mandatory.

Here's an example of how the loop can look for your snippet, including a workaround for the eof detection (I think 16 bytes are a good enough safety margin; if not, there are other ways to detect this, but it really should be done automatically, so I'm leaving it at this for now and instead working on the proper fix):

int pInput = 0;
int pOutput = 0;
using (var decoder = new Decoder(new DecoderSettings(12)))
{
    do
    {
        // Pass eof = true only once fewer than 16 input bytes remain,
        // to work around the eof translation bug described above.
        pInput += decoder.Decode(data.SourceData, data.SourceOffset + pInput, data.SourceLength - pInput, data.DestinationLength - pOutput, data.SourceLength - pInput < 16);

        // Drain whatever output the decoder has produced so far.
        pOutput += decoder.ReadOutputData(data.DestinationData, data.DestinationOffset + pOutput, data.DestinationLength - pOutput);
    }
    while (!decoder.IsOutputComplete);
}

@weltkante weltkante added the bug label Apr 24, 2020
@weltkante weltkante self-assigned this Apr 24, 2020
@JohnJScott
Author

JohnJScott commented Apr 25, 2020 via email

@weltkante
Owner

weltkante commented Apr 25, 2020

It returns that it generated the correct amount of decompressed data (0x100000) and processed all but a single byte of the source data (0x100042 of 0x100043 bytes). If I change the finish mode to LZMA_FINISH_END, then it reports that all source data has been consumed (0x100043 bytes).

Note that the returned status is LZMA_STATUS_NOT_FINISHED, meaning the decoder can't tell you whether the stream is really done or whether another chunk would begin if more data were present. This might be a design flaw in the decoder, because from looking at the code it might be able to differentiate the cases if it tried harder; as it is, you need to use LZMA_FINISH_END to make sure you didn't observe a truncated stream.
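In other words, for the one-call path the safe pattern looks roughly like this (a sketch reusing dest, compressed_memory and alloc from the test snippet above; compressed_length stands in for the real input size):

size_t dest_length = 1024 * 1024;
size_t source_length = compressed_length;
ELzmaStatus status = LZMA_STATUS_NOT_SPECIFIED;

// LZMA_FINISH_END makes the decoder verify that the input really ends here,
// instead of stopping one byte early with LZMA_STATUS_NOT_FINISHED
SRes result = Lzma2Decode( dest, &dest_length, compressed_memory, &source_length, 12, LZMA_FINISH_END, &status, &alloc );

if ( result != SZ_OK || status != LZMA_STATUS_FINISHED_WITH_MARK )
{
	// the data is corrupt or the stream was truncated
}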
