
JpegDecoder: Optimize HuffmanScanDecoder #1607

Open
4 tasks done
saucecontrol opened this issue Apr 17, 2021 · 6 comments

@saucecontrol
Contributor

saucecontrol commented Apr 17, 2021

Prerequisites

  • I have written a descriptive issue title
  • I have verified that I am running the latest version of ImageSharp
  • I have verified that the problem exists in both DEBUG and RELEASE mode
  • I have searched open and closed issues to ensure it has not already been reported

Description

To pair with #1597, I have identified some opportunities for improvement in baseline scan decoding performance (haven't looked at progressive yet, but the same ideas might apply).

I had a look through the asm from HuffmanScanDecoder.DecodeBlockBaseline(), and while there are some small codegen issues that are easy to fix without refactoring, I was only able to knock about 70 bytes off the code -- out of a total method size of 2400 bytes including inlined callees -- which was barely measurable in the full decode benchmarks.

The bulk of that full method is made up of calls to BufferedReadStream.ReadByte(), which is inlined a total of 12 times into the outer method. BufferedReadStream itself is quite clever in that it allows ReadByte() to be devirtualized and inlined, but each byte read still comes with significant overhead.

```csharp
public override int ReadByte()
{
    if (this.readerPosition >= this.Length)
    {
        return -1;
    }

    // Our buffer has been read.
    // We need to refill and start again.
    if (this.readBufferIndex > this.maxBufferIndex)
    {
        this.FillReadBuffer();
    }

    this.readerPosition++;
    unsafe
    {
        return this.pinnedReadBuffer[this.readBufferIndex++];
    }
}
```

When inlined into the caller, this compiles to the following x64 asm:

```asm
 mov         r8,qword ptr [r14]             ; read 8-byte BufferedReadStream base pointer to r8
 mov         rcx,qword ptr [r8+28h]         ; read 8-byte `readerPosition` field
 cmp         rcx,qword ptr [r8+30h]         ; read 8-byte `Length` property
 jl          LENGTH_OK                      ; if `readerPosition` < `Length`, continue to read
 mov         r9d,0FFFFFFFFh                 ; else load -1
 jmp         END                            ;   and return it
LENGTH_OK:
 mov         ecx,dword ptr [r8+3Ch]         ; read 4-byte `readBufferIndex` field
 cmp         ecx,dword ptr [r8+38h]         ; read 4-byte `maxBufferIndex` field
 jle         BUFFER_FILLED                  ; if `readBufferIndex` <= `maxBufferIndex`, continue to buffer read
 mov         qword ptr [rsp+28h],r8         ; else spill r8 to stack (it may be overwritten by method call)
 mov         rcx,r8                         ;   set up `this` arg for method
 call        FillReadBuffer()               ;   <-- that
 mov         r8,qword ptr [rsp+28h]         ;   restore r8 from stack
BUFFER_FILLED:
 inc         qword ptr [r8+28h]             ; increment `readerPosition` field
 mov         rcx,qword ptr [r8+20h]         ; read 8-byte `pinnedReadBuffer` field
 mov         r9d,dword ptr [r8+3Ch]         ; read 4-byte `readBufferIndex` field
 lea         r10d,[r9+1]                    ;   increment it
 mov         dword ptr [r8+3Ch],r10d        ;   write it back
 movsxd      r8,r9d                         ; sign-extend `readBufferIndex` to native int length
 movzx       r9d,byte ptr [rcx+r8]          ; read 1 byte @ `pinnedReadBuffer`+`readBufferIndex` and return it
END:
```

The movsxd could be eliminated by using a native-sized readBufferIndex, but it would still be a huge amount of overhead for fetching a single byte of data.

Additionally, the huffman decoder has checks and recovery built in for handling bad data which add overhead -- not necessarily to execution in this case since they're unlikely to be hit, but they add to code size.

I would think (and I'm about to show my ignorance of JPEG huffman coding, so correct me if I'm wrong) that it should be possible to build a happy path through the decoder that depends on eagerly peeking a number of bytes (a count that should suffice for 99+% of all blocks) from the stream into a stack buffer and working through those without the need for individual byte loads. If that's something like 64 bytes, since that would be 1:1 compression, it would be easy to satisfy. If it needs to handle a few degenerate cases where the compressed data is larger than the original, maybe double that?

The current resilient implementation could serve as a backstop so that if the happy path doesn't work for some reason, it could bail and restart that block with the slow path. This shouldn't happen often, and it should be only for cases where perf becomes secondary to security and correctness.
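To sketch the shape of that peek-and-rewind idea (all names here are made up for illustration; a real implementation would work against `BufferedReadStream`'s internal buffer rather than `Stream.Position`):

```csharp
using System;
using System.IO;

static class FastPathSketch
{
    // Illustrative only: fill a caller-provided (possibly stackalloc'd) buffer
    // with up to chunk.Length bytes, then rewind so the caller can commit only
    // the bytes the fast-path decoder actually consumed, or bail to the
    // resilient slow path without losing its place.
    public static int PeekChunk(Stream stream, Span<byte> chunk)
    {
        long start = stream.Position;
        int total = 0;
        while (total < chunk.Length)
        {
            int read = stream.Read(chunk.Slice(total));
            if (read == 0)
            {
                break; // end of stream: the slow path handles the remainder
            }

            total += read;
        }

        stream.Position = start; // rewind: nothing is committed yet
        return total;
    }
}
```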

I don't have the time to work on restructuring the code just yet, but does that sound reasonable, @JimBobSquarePants?

@JimBobSquarePants
Member

I'm assuming the worst offender is this section here?

```csharp
private ulong GetBytes()
{
    ulong temp = 0;
    for (int i = 0; i < JpegConstants.Huffman.FetchLoop; i++)
    {
        int b = this.ReadStream();

        // Found a marker.
        if (b == JpegConstants.Markers.XFF)
        {
            int c = this.ReadStream();
            while (c == JpegConstants.Markers.XFF)
            {
                // Loop here to discard any padding FF bytes on terminating marker,
                // so that we can save a valid marker value.
                c = this.ReadStream();
            }

            // We accept multiple FF bytes followed by a 0 as meaning a single FF data byte.
            // This data pattern is not valid according to the standard.
            if (c != 0)
            {
                this.Marker = (byte)c;
                this.badData = true;
                this.MarkerPosition = this.stream.Position - 2;
            }
        }

        temp = (temp << 8) | (ulong)(long)b;
    }

    return temp;
}
```

I've considered using SSE there before to mask/compare 64 bits at a time for a happy path, which should save us a lot of time, but preloading the full MCU and using a fast/slow path for the whole thing is a new one. It would be amazing if we could pull that off!
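On the SIMD side, once the entropy data sits in a buffered chunk, the 0xFF search can be phrased as a span scan, which the runtime already vectorizes internally. A hedged sketch (the helper name is invented):

```csharp
using System;

static class MarkerScan
{
    // Find the next 0xFF byte in a buffered chunk. Everything before that
    // index is guaranteed marker-free entropy data and can be shifted into
    // the bit buffer without per-byte checks. Span<byte>.IndexOf is
    // vectorized by the runtime, so this effectively compares a whole
    // register's worth of bytes at a time.
    public static int NextFF(ReadOnlySpan<byte> data) => data.IndexOf((byte)0xFF);
}
```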

If that's something you could find time for at some point it would be most welcome.

BTW, if you're looking for some real stinkers to optimize, take a look here: #1476 (comment). The Emit method is a real stinker just now. I'm considering writing a buffered writer as a sticky plaster for some of the short writes, but I reckon someone with your kind of mind could look at it and find many better ways to improve it.

@saucecontrol
Contributor Author

Yep, that'd be the one. I can't think of a way to SIMDfy huffman decoding since you're parsing an unknown number of bits per symbol, but I'll study up on the algorithm and give it some thought.

I can definitely see how the stream writes would be an issue in the encoder. Could probably handle that the same way, with a happy path that works in a local 64-byte buffer and writes it out all at once on completion. I assume as well that if you're encoding, you can avoid the degenerate cases where the compressed data is larger, but maybe that can still happen when you work with a pre-defined dictionary. I have some reading to do...
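A minimal sketch of that local-buffer idea for the encoder side (illustrative names only; a real version would live inside the Huffman encoder and handle the spill case deliberately):

```csharp
using System;
using System.IO;

// Accumulate short writes in a 64-byte scratch buffer and hand them to the
// stream in one Write call on completion, instead of hitting
// Stream.WriteByte for every emitted byte.
sealed class ScratchWriter
{
    private readonly byte[] buffer = new byte[64];
    private int count;

    public void WriteByte(Stream output, byte value)
    {
        if (this.count == this.buffer.Length)
        {
            // Degenerate case: the block didn't fit, spill early.
            this.Flush(output);
        }

        this.buffer[this.count++] = value;
    }

    public void Flush(Stream output)
    {
        output.Write(this.buffer, 0, this.count);
        this.count = 0;
    }
}
```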

@saucecontrol
Contributor Author

Just a note in case anyone else wants to look at this before I get time. I forgot the values being encoded/decoded here are quantized DCT coefficients, so they're 16-bit (10-bit actually, but you know...). That makes the 1:1 compression case 128 bytes for the MCU. Additionally, the ITU spec (Rec T.81, Annex C) explicitly limits huffman code lengths to 16 bits, meaning the absolute worst case would be 256 bytes for the MCU. So this is doable both on the decoder and encoder side.
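One way to read that arithmetic, with the assumptions made explicit (the 256-byte figure here is my reconstruction: a maximal 16-bit code plus up to 16 value bits per coefficient):

```csharp
using System;

static class McuSizing
{
    public const int CoefficientsPerBlock = 64; // one 8x8 DCT block
    public const int BitsPerCoefficient = 16;   // quantized coefficients held in 16 bits
    public const int MaxHuffmanCodeBits = 16;   // Rec. T.81, Annex C limit

    // 1:1 "compression": 64 coefficients at 16 bits each.
    public static int OneToOneBytes =>
        CoefficientsPerBlock * BitsPerCoefficient / 8;

    // Absolute worst case: maximal code length plus value bits for every coefficient.
    public static int WorstCaseBytes =>
        CoefficientsPerBlock * (MaxHuffmanCodeBits + BitsPerCoefficient) / 8;
}
```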

@JimBobSquarePants
Member

@br3aker Would you consider this issue still relevant?

@br3aker
Contributor

br3aker commented Dec 28, 2022

> @br3aker Would you consider this issue still relevant?

Oh sorry, I completely missed your message. I'd say 'yes': I have a couple of WIP local branches which benefit from various huffman-related optimizations, but they are not so major, ~5-7% in my current local branch. Maybe we can squeeze a bit more from it - who knows :)

Hopefully next year I'll be able to spend time on these optimizations - I really want to push the JPEG codec to be at least as fast as its competitors.

@JimBobSquarePants
Member

@br3aker you're a performance machine!
