Question: Video vs Frames? #24
Hi, I'm running the code on my machine and it works fine. I notice that you are sampling 1 frame per second. Is there no smarter way of sampling frames only when needed, for example at a scene change? I have read the paper and know that such things are done within the model, but is the initial sampling really just 1 frame per second?

Comments
Yes, we initially sample the video at 1 fps and then decide which frames to reduce based on the extracted feature representations.
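For reference, a minimal sketch of what uniform 1 fps sampling can look like with OpenCV; the repository may use a different video decoder, and the function name and FPS fallback here are illustrative assumptions:

```python
import cv2


def sample_frames_1fps(video_path):
    """Return roughly one frame per second of video as a list of BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # assumed fallback if FPS metadata is missing
    step = max(int(round(fps)), 1)         # keep every `step`-th frame, i.e. ~1 fps

    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1

    cap.release()
    return frames
```

The sampled frames would then be passed to the visual encoder, which decides which of them to merge or reduce.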
Thank you! Some more questions on how the chat works: in the provided app.py with the chat interface in the browser, I load one video and ask multiple questions. Each turn takes about the same time for inference. So the visual elements are processed every time, and the previous turns' intermediate outputs are not reused? Also, is the text chat history (questions and answers) reused for later queries?
Hi @cbasavaraj, the main processing time is taken by frame feature extraction. In theory, you can save the features for multi-round questions about the same video and use the KV cache to reuse the previous turns' token caches; we simply did not implement this in our demo.
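As a rough illustration of the first suggestion, the extracted features could be cached per video so that later turns skip the expensive visual step. `extract_frame_features` below is a hypothetical stand-in for the model's visual encoder, not a function from this repository:

```python
# Illustrative sketch: reuse visual features across chat turns for the same video.
_feature_cache = {}


def get_video_features(video_path, extract_frame_features):
    """Extract frame features once per video and reuse them on later turns."""
    if video_path not in _feature_cache:
        _feature_cache[video_path] = extract_frame_features(video_path)
    return _feature_cache[video_path]
```

With this in place, each subsequent question on the same video pays only for text generation; reusing the language model's KV cache for earlier turns would be a separate change inside the generation loop.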
Got it, thank you for the response!