About Pixel Shuffle #21
Hi @chautuankien, thanks for your interest in our work.
PixelShuffle (in its down-shuffle direction, i.e. pixel unshuffle) downscales the spatial resolution while packing the displaced pixels into extra channels. For instance, if you apply a 3x3(xC) kernel to a feature map down-shuffled by a factor of 2, each kernel position covers a 6x6 area at the original resolution, so the receptive field is enlarged without discarding any information.
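A minimal sketch of the down-shuffle operation and its receptive-field effect (pure Python and illustrative only; the names and shapes here are assumptions, not the paper's implementation):

```python
def pixel_unshuffle(x, r):
    """Down-shuffle: H x W x C -> (H//r) x (W//r) x (C*r*r).

    Each output position packs an r x r neighbourhood of input pixels
    into the channel dimension, so no values are discarded. A 3x3 kernel
    applied to the down-shuffled map therefore "sees" a (3*r) x (3*r)
    area of the original resolution.
    """
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H, r):
        row = []
        for j in range(0, W, r):
            cell = []
            for di in range(r):
                for dj in range(r):
                    cell.extend(x[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

# A 4x4 single-channel map becomes 2x2 with 4 channels.
x = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
y = pixel_unshuffle(x, 2)
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 4
```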
From what I've understood, you're talking about optical-flow-based models that use the input images for warping. Our model focuses on direct synthesis without flow-based warping, so the method is very different. There are pros and cons to each approach, but flow-based works are more popular these days, to be frank.
Thank you so much for your reply.

For the first question: does PS work like a pooling layer? For example, max pooling with stride 2 chooses the maximum value in each 2x2 grid to down-sample.

For the second question, from what I've understood, your method is CNN-based, right? You use CNNs to directly synthesize the intermediate frame.

Another question: why did you choose to down-shuffle only once and not more, just like an encoder-decoder network, where pooling layers are applied more than once to down-sample the data?
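The pooling analogy above is close but not exact: max pooling selects one value per window and discards the rest, whereas down-shuffle rearranges every value into channels and is fully invertible. A small illustrative comparison (pure Python, single channel, hypothetical helper names):

```python
def max_pool2(x):
    """2x2 max pooling, stride 2: keeps 1 of every 4 values."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def pixel_unshuffle2(x):
    """2x down-shuffle: keeps all 4 values, moved into channels."""
    return [[[x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]]
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

pooled = max_pool2(x)           # [[6, 8], [14, 16]] -- 12 values lost
shuffled = pixel_unshuffle2(x)  # every value survives, just rearranged

flat = [v for row in shuffled for cell in row for v in cell]
print(sorted(flat) == list(range(1, 17)))  # True
```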
It is very interesting that you use Pixel Shuffle and Channel Attention for motion estimation without estimating optical flow.
In the paper you said that Pixel Shuffle is used to maintain a large receptive field, so I want to ask how PS can do that.
One more question: in VFI, I usually see that people reuse the input images to reconstruct the color of the middle frame. How can you synthesize the middle frame just by applying Up Shuffle?
Thank you.
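On the up-shuffle question: the idea is that the network predicts a low-resolution map with C*r*r channels, and up-shuffle (PixelShuffle) rearranges those channels into an r-times-larger image, so an output frame can be emitted directly rather than assembled by warping the input frames. A sketch in pure Python; the channel ordering below is one common convention and an assumption, not necessarily the one used in the paper:

```python
def pixel_shuffle(y, r):
    """Up-shuffle: h x w x (C*r*r) -> (h*r) x (w*r) x C.

    Channel group (di*r + dj) at each low-res position fills the
    (di, dj) offset inside the corresponding r x r output patch.
    """
    h, w = len(y), len(y[0])
    C = len(y[0][0]) // (r * r)
    out = [[[0.0] * C for _ in range(w * r)] for _ in range(h * r)]
    for i in range(h):
        for j in range(w):
            for di in range(r):
                for dj in range(r):
                    for c in range(C):
                        out[i * r + di][j * r + dj][c] = \
                            y[i][j][(di * r + dj) * C + c]
    return out

# A 2x2 map with 12 channels (C=3 RGB, r=2) becomes a 4x4 RGB image.
y = [[[0.5] * 12 for _ in range(2)] for _ in range(2)]
frame = pixel_shuffle(y, 2)
print(len(frame), len(frame[0]), len(frame[0][0]))  # 4 4 3
```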