Hi, thank you for sharing your code! It's of great value to the community.
I am interested in using your model for online video understanding, working from a stream of images rather than batch-processing a video file. Is there a way to achieve this without retraining your model? For example, by decomposing the sequence and forwarding latent tokens as context? Essentially, I want to be able to comment on live video streams with TTS.
Cheers,
Theo
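To make the idea concrete, here is a minimal sketch of what I have in mind: keep a rolling cache of per-frame latent tokens and condition each new comment on the accumulated context. The functions `encode_frame` and `decode_comment` are hypothetical stand-ins, not part of your repository's API.

```python
from collections import deque

WINDOW = 4  # number of past frames whose latents are kept as context

def encode_frame(frame):
    # Placeholder frame encoder: a real implementation would run the
    # model's visual backbone and return that frame's latent tokens.
    return [f"tok({frame})"]

def decode_comment(context_tokens):
    # Placeholder decoder: a real implementation would condition the
    # language head (and then TTS) on the accumulated latent tokens.
    return f"comment over {len(context_tokens)} tokens"

def stream_comments(frames, window=WINDOW):
    # Rolling latent-token cache: old frame latents are evicted
    # automatically once the window is full.
    cache = deque(maxlen=window)
    for frame in frames:
        cache.append(encode_frame(frame))
        context = [tok for latents in cache for tok in latents]
        yield decode_comment(context)

comments = list(stream_comments(range(6)))
```

Is something along these lines feasible with the released checkpoints, or does the model assume it sees the full clip at once?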