Understanding the Joint Autoregressive Model #165
Hi, I'm having some difficulty wrapping my head around why we can parallelize the entropy estimation during a forward pass for training, but need to sequentially compute the scale and mean at each spatial location (h, w) when compressing the latent variable. I think I understand the general idea of autoregressive models, where we mask unseen variables so that we can model the conditional probabilities. However, isn't the masked convolution in the context prediction (together with the 1x1 convolutions in the entropy parameters network) already ensuring that each index of the computed Gaussian parameters is context dependent? That is, by my understanding, the way the weights are applied in the convolutions should ensure that the output scales and means are computed autoregressively. So why do we need to apply the masked convolution and entropy parameters network patch-wise during compression, rather than parallelizing the process as in the pure latent variable models, i.e. FP, SHP, MSHP? BR
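
For reference, this is roughly what I mean by the masked convolution enforcing causality (a minimal sketch, loosely modeled on CompressAI's `MaskedConv2d`; the channel sizes and latent shape below are placeholders, not the library's actual configuration):

```python
import torch
import torch.nn as nn


class MaskedConv2d(nn.Conv2d):
    """Type-'A' masked convolution: the weight at the centre position and
    everything to its right / below is zeroed, so the output at (h, w) only
    depends on positions above or to the left of (h, w)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2:] = 0  # centre pixel and everything to its right
        mask[:, :, kh // 2 + 1:, :] = 0    # all rows below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Re-apply the mask so gradient updates never leak "future" context
        self.weight.data *= self.mask
        return super().forward(x)


# Placeholder sizes, just for illustration
context_prediction = MaskedConv2d(192, 384, kernel_size=5, padding=2)
y_hat = torch.round(torch.randn(1, 192, 16, 16))  # quantized latent
ctx = context_prediction(y_hat)  # each (h, w) only sees earlier positions
```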
Replies: 1 comment
For `decompress()`, sequential decoding of each pixel is necessary due to causality. For `compress()`, as you note, it is not necessary to run a sequential pixel-by-pixel computation, since one may simply use the masked convolution to generate the exact `y_hat` that is available to the decoder in a single GPU call, à la `forward()`. This can probably be optimized in the CompressAI implementation. The only issue I see is if the `y_hat` that is generated is slightly different due to e.g. floating-point errors or other optimizations. The safest method is simply to repeat the exact sequence of computations for encoding as for decoding, which always works on the same hardware, if we assume that the hardware…
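
To make the asymmetry concrete, here is a rough sketch of the two directions. This is not the actual CompressAI code; `context_prediction`, `entropy_parameters`, `hyper_params`, and `decode_symbol` are placeholders standing in for the context model, the entropy parameters network, the hyperprior output, and the range decoder.

```python
import torch


def compress_parallel(y_hat, context_prediction, entropy_parameters, hyper_params):
    """Encoding: y_hat is fully known, so the masked convolution can run over
    the whole latent in a single call, exactly as in forward()."""
    ctx = context_prediction(y_hat)                                   # one GPU call
    gaussian_params = entropy_parameters(torch.cat((hyper_params, ctx), dim=1))
    scales, means = gaussian_params.chunk(2, dim=1)
    return scales, means                                              # feed to the entropy coder


def decompress_sequential(bitstream, context_prediction, entropy_parameters,
                          hyper_params, shape, decode_symbol):
    """Decoding: y_hat is only partially known, so each location must be
    decoded before it can serve as context for the next one."""
    B, C, H, W = shape
    y_hat = torch.zeros(shape)
    for h in range(H):
        for w in range(W):
            ctx = context_prediction(y_hat)                           # sees only decoded pixels
            params = entropy_parameters(torch.cat((hyper_params, ctx), dim=1))
            scales, means = params.chunk(2, dim=1)
            # decode_symbol is a stand-in for the range decoder
            y_hat[:, :, h, w] = decode_symbol(
                bitstream, scales[:, :, h, w], means[:, :, h, w]
            )
    return y_hat
```

In practice an implementation can avoid re-running the masked convolution over the full latent at every step (e.g. by operating on a small patch around the current position), but the pixel-by-pixel ordering of the decode loop cannot be removed.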