Question: Video vs Frames? #24
Hi, I'm running the code on my machine and it works fine. I notice that you are sampling 1 frame per second. Is there no smarter way of sampling frames only when needed, for example at a scene change? I have read the paper and know that such things are done within the model, but is the initial sampling really just 1 frame per second?

Comments
Yes, we initially sample the video at 1 fps and then decide which frames to reduce based on the extracted feature representations.
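For reference, a minimal sketch of what uniform 1 fps sampling can look like with OpenCV; the repository may use a different video decoder, and the function name and FPS fallback here are illustrative assumptions:

```python
import cv2


def sample_frames_1fps(video_path):
    """Return roughly one frame per second of video as a list of BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # assumed fallback if FPS metadata is missing
    step = max(int(round(fps)), 1)         # keep every `step`-th frame, i.e. ~1 fps

    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1

    cap.release()
    return frames
```

The sampled frames would then be passed to the visual encoder, which decides which of them to merge or reduce.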
Thank you! Some more questions on how the chat works: in the provided app.py with the chat interface in the browser, I load one video and ask multiple questions. Each turn takes about the same time for inference. So the visual elements are processed every time, and the previous turns' intermediate outputs are not reused? Also, is the text chat history (questions and answers) reused for later queries?
Hi @cbasavaraj, the main processing time is taken by frame feature extraction. In theory, you can save the features for multi-round questions about the same video and use the KV cache to reuse the previous turns' token caches; we simply did not implement this in our demo.
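As a rough illustration of the first suggestion, the extracted features could be cached per video so that later turns skip the expensive visual step. `extract_frame_features` below is a hypothetical stand-in for the model's visual encoder, not a function from this repository:

```python
# Illustrative sketch: reuse visual features across chat turns for the same video.
_feature_cache = {}


def get_video_features(video_path, extract_frame_features):
    """Extract frame features once per video and reuse them on later turns."""
    if video_path not in _feature_cache:
        _feature_cache[video_path] = extract_frame_features(video_path)
    return _feature_cache[video_path]
```

With this in place, each subsequent question on the same video pays only for text generation; reusing the language model's KV cache for earlier turns would be a separate change inside the generation loop.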
Got it, thank you for the response!