CMPTO_J2k Encoding / Decoding Latency #416

Open
slm-sean opened this issue Sep 25, 2024 · 7 comments

@slm-sean

Hi,

As suggested, I wanted to look into whether there is a way to reduce the latency of the CMPTO J2K codec. We measured the latency by comparing the source signal on a reference monitor with burned-in timecode, side by side with a monitor of the same model displaying the output of an UltraGrid decoder.

We have observed a latency of approximately 4-6 frames when encoding an 8-10 bit 4:2:2/4:4:4 signal. The latency increases to approximately 6-7 frames when encoding 12-bit 4:4:4.
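
For context, converting those frame counts to wall-clock time at the frame rates discussed in this thread (just arithmetic, no new measurements):

    # Convert an observed latency in frames to milliseconds at a given frame rate.
    def frames_to_ms(frames, fps):
        return frames * 1000.0 / fps

    for fps in (23.98, 30.0):
        print(fps, [round(frames_to_ms(f, fps)) for f in (4, 6, 7)])
    # 23.98 fps -> 167, 250, 292 ms for 4, 6, 7 frames
    # 30.00 fps -> 133, 200, 233 ms for 4, 6, 7 frames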

Below are screenshots of the reported video encoding times and the corresponding settings. As expected, reducing the quality reduces the encoding time of each frame. I have not yet verified whether this also reduces the end-to-end latency, which I hope to test soon. Is this the behaviour I should expect to see?

UHD 444 10bit - Quality=1 - MCT Enabled - Tiles=1 - Pool=1 (screenshot of reported encoding times)
UHD 444 10bit - Quality=0.5 - MCT Enabled - Tiles=1 - Pool=1 (screenshot of reported encoding times)
UHD 444 12bit - Quality=1 - MCT Enabled - Tiles=1 - Pool=1 (screenshot of reported encoding times)
UHD 444 12bit - Quality=0.6 - MCT Enabled - Tiles=1 - Pool=1 (screenshot of reported encoding times)
UHD 444 12bit - Quality=0.5 - MCT Enabled - Tiles=1 - Pool=1 (screenshot of reported encoding times)

For our use case, we feel that this solution is right on the edge of being acceptable for the majority of our users, so even a reduction of 2-3 frames could greatly improve the experience and accuracy of the remote work being done.

Thanks in advance.

@MartinPulec
Collaborator

(just for reference, this relates to GH-406, e.g. this comment)

@MartinPulec
Collaborator

Hi, I've played with it a bit on a GeForce 1080 Ti; the baseline command is:

uv -VV -F ratelimit:5 -t testcard:patt=noise:mod=dci4:c=R10k:fps=30 \
    -c cmpto_j2k:quality=X.Y:rate=10M:pool_size=1:tile_limit=1

(tile_limit and pool_size are set to match your settings; quality is the value being varied)

The important thing is the ratelimit option - I've added it to measure the latency when the encoder is not fully utilized. Here are the results I got:

quality | latency (duration) in ms, with ratelimit:5 | latency (ms) without ratelimit
0.3     | 32.0 ± 0.1                                 | 28.7 ± 0.1
0.5     | 44.3 ± 0.2                                 | 75.1 ± 0.3
1.0     | 69.9 ± 0.3                                 | 127.0 ± 0.2

(it seems to work in a similar fashion if R12L is used instead)
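
For reference, the ± figures are an aggregate over many per-frame durations; a minimal sketch of one way to produce such numbers (the sample values are placeholders, not real measurements, and this is not necessarily how the table above was computed):

    import statistics

    # Hypothetical per-frame encode durations in ms (placeholders, not real data).
    durations_ms = [69.7, 70.1, 69.8, 70.2, 69.9, 69.7]

    mean = statistics.mean(durations_ms)
    stdev = statistics.stdev(durations_ms)      # sample standard deviation
    sem = stdev / len(durations_ms) ** 0.5      # standard error of the mean
    print(f"{mean:.1f} ± {stdev:.1f} ms (SEM {sem:.2f} ms)")
    # prints: 69.9 ± 0.2 ms (SEM 0.09 ms)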

So my thoughts are the following:

  1. quality (obviously) increases the latency when the device is not fully utilized, but it seems to scale fairly linearly
  2. if the device is fully utilized, the latency grows up to the buffer sizes (this is more obvious if pool_size is kept at its default of 4)
  3. in case 2 above, the FPS drops to approximately 1/duration; for the 215 ms you mentioned, it should be around 5 (a quick check of this arithmetic is sketched below)
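
A quick check of the arithmetic in point 3, using the durations mentioned above:

    # When the encoder is fully utilized, throughput is roughly 1 / per-frame duration.
    for duration_ms in (215.0, 127.0, 69.9):
        print(f"{duration_ms:5.1f} ms/frame -> ~{1000.0 / duration_ms:.1f} fps")
    # 215.0 ms -> ~4.7 fps, 127.0 ms -> ~7.9 fps, 69.9 ms -> ~14.3 fps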

So, at least according to my measurements, the latency scales roughly linearly with the quality unless performance is the bottleneck.

Can you confirm these conclusions? Namely, that for the 215 ms case you get ~5 fps, and whether -F ratelimit:3 improves the duration? I was measuring with synthetic content, so the results may differ for real video content.

@slm-sean
Author

Hi Martin,

Here are my results based on video captured via a Blackmagic I/O card. The captured signal is UHD 23.98 4:4:4 12-bit (R12L). I have switched systems for my testing platform, as the GPU I was previously using died; I am now using a 1080 Ti, so I expect our results to be fairly similar.

quality | latency (duration) in ms, with ratelimit:5 | latency (ms) without ratelimit
0.3     | 25 ± 5                                     | 27 ± 2
0.5     | 33 ± 6                                     | 33 ± 2
1.0     | 61 ± 3                                     | 113 ± 2

We are seeing pretty similar results by the looks of it. If I rate-limit to 3 with a quality setting of 1.0, the duration is around 62-65 ms, in line with the results for quality 1.0 and a ratelimit of 5.

  1. I agree: increasing the quality increases the latency, and it scales roughly linearly
  2. I have found this as well. Increasing the pool size to 4 increases the latency to ~144 ms if quality is set to 1.0
  3. In the case of the 215 ms per-frame latency, I cannot confirm what our FPS was. With my current configuration, and a latency of 113 ms without rate limiting, I am achieving around 16 FPS

My colleagues and I have been able to confirm a few things:

  1. Reducing the quality does not reduce the end-to-end latency of the stream. We still see 5-6 frames of latency whether quality is set to 1.0, 0.6, etc., assuming we are able to achieve full frame rate for the source image.
  2. On the encoding side, setting the tile and pool limits to 1 shaved off 1 frame of latency.
  3. I was able to shave off an additional frame of latency by reducing the playout buffer down to 1 ms (I believe it was originally set to 32?).

These changes can have negative effects on stream stability, so in the long term I hope to find an alternative way to reduce latency if possible.

Do you know how much latency is introduced by the DeckLink capture / streaming protocol / DeckLink output stages? My assumption was that we should expect 1 frame for each of those steps, in addition to the per-frame encoding time; I don't have any evidence to back up those numbers, though.
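
To make that assumption concrete, here is a minimal sketch of the budget at 23.98 fps; every stage figure below is an assumption or a round number from this thread, not a measurement:

    # Rough end-to-end latency budget at 23.98 fps (frame time ~41.7 ms).
    # Every stage estimate is an assumption for illustration, not a measurement.
    FPS = 23.98
    FRAME_MS = 1000.0 / FPS

    stages_ms = {
        "DeckLink capture":   FRAME_MS,  # assumed ~1 frame
        "J2K encode":         40.0,      # per-frame encode duration (target)
        "network / protocol": FRAME_MS,  # assumed ~1 frame
        "J2K decode":         40.0,      # per-frame decode duration (target)
        "DeckLink output":    FRAME_MS,  # assumed ~1 frame
    }

    total_ms = sum(stages_ms.values())
    print(f"total ~{total_ms:.0f} ms ~= {total_ms / FRAME_MS:.1f} frames")
    # total ~205 ms ~= 4.9 frames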

@alatteri

Sorry if this seems rude, but the 1080 Ti is a rather old card. Would these issues be solved simply by using a newer-generation GPU?

@slm-sean
Author

Hey, not rude at all. I don't think the issue here is GPU power. What I'm finding is that even if I lower the quality of the stream, which reduces the per-frame encode duration, I don't necessarily see an improvement in latency.

We are currently comparing against another service that we believe uses UltraGrid and Comprimato on the back end, and it appears to be several frames faster (~3 frames) than the main branch of UltraGrid; I'm trying to find where the additional latency is coming from. I could be off the mark here, and perhaps the Comprimato encode/decode process is not the issue.

I have several systems deployed using RTX 4000 Ada GPUs, and while they can handle more complex, heavy, noise-based encodes/decodes and more simultaneous streams than the 1080 Ti can, I don't see an end-to-end latency improvement for streams that are not utilizing 100% of the GPU's capabilities.

Assuming we can keep the per-frame encoding and decoding durations below 40 ms, that should add only 2 frames of latency end to end (excluding any issues with the signal chain upstream and downstream of the UltraGrid process). I am currently measuring 5-6 frames of latency. We were able to shave 2 frames off by reducing the encoder pool size and by modifying a buffer within UltraGrid that I believe is there to accommodate packet retransmission or reordering, but modifying that buffer is not ideal.

Other perspectives are always welcome!

@MartinPulec
Collaborator

> I was able to shave off an additional frame of latency by reducing the playout buffer down to 1 ms (I believe it was originally set to 32?)

Do you mean the playout buffer delay? It is in milliseconds, so 32 ms. But 32 is just the initial value; it is overridden by 1/fps later anyway (i.e. one frame time, e.g. ~42 ms at 23.98 fps).

> Do you know how much latency is introduced by the DeckLink capture / streaming protocol / DeckLink output stages? My assumption was that we should expect 1 frame for each of those steps, in addition to the per-frame encoding time; I don't have any evidence to back up those numbers, though.

More or less, yes. The compression time can vary, as you can see. We did some evaluation in the past, but it isn't up to date. At the time, the latency was mostly influenced by the latency of the display device, which has improved with newer devices, so the 4K Extreme values may be representative these days.

@MartinPulec
Collaborator

> Reducing the quality does not reduce the end-to-end latency of the stream. We still see 5-6 frames of latency whether quality is set to 1.0, 0.6, etc., assuming we are able to achieve full frame rate for the source image.

But you must also add the actual compression and decompression durations. Provided we assume a baseline latency of e.g. 3 frames, as on the linked page, you must add a frame time for compression and another for decompression, which yields 5 frames, so that seems like a legitimate value to me. Getting e.g. 1 frame of E2E latency (without compression/decompression, to yield 3 overall) doesn't sound realistic to me.
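
Spelling that sum out at the 23.98 fps mentioned earlier in the thread (just arithmetic, no new measurements):

    # Assumed baseline (capture + transport + display) of ~3 frames, per the linked
    # evaluation, plus roughly one frame time each for compression and decompression.
    FPS = 23.98
    FRAME_MS = 1000.0 / FPS
    total_frames = 3 + 1 + 1
    print(f"{total_frames} frames ~= {total_frames * FRAME_MS:.0f} ms at {FPS} fps")
    # 5 frames ~= 209 ms at 23.98 fps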
